Overview
This example demonstrates how to test AI chat responses using LLMJudge for semantic validation. Instead of exact text matching, we validate that responses meet behavioral expectations.

Complete Test
test.json
Setup
You’ll need an OpenAI API key:

.env
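A minimal `.env` file; the key value below is a placeholder, not a real credential:

```
OPENAI_API_KEY=sk-your-key-here
```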
Running the Test
Breaking It Down
Good Response Test
1. Mock AI Response
Simulate a successful AI response with tool calls:
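As a sketch of what such a mock might look like — the response shape follows the OpenAI chat-completion format, but the tool name and arguments are illustrative assumptions:

```python
# A mocked AI response in the OpenAI chat-completion shape.
# The tool name and arguments are illustrative assumptions.
mock_response = {
    "role": "assistant",
    "content": "It is 18°C and sunny in Paris right now.",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "get_weather",           # hypothetical tool name
                "arguments": '{"city": "Paris"}',
            },
        }
    ],
}

def used_tool(response: dict, tool_name: str) -> bool:
    """Return True if the response invoked the named tool."""
    return any(
        call["function"]["name"] == tool_name
        for call in response.get("tool_calls", [])
    )

print(used_tool(mock_response, "get_weather"))  # True
```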
2. Semantic Validation
Use LLMJudge to validate the response semantically:
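Conceptually, LLMJudge implements the LLM-as-judge pattern: send the response and your criteria to a judge model and parse a numeric verdict. A minimal sketch of that pattern, not the tool’s exact API — the criteria wording and JSON verdict shape are assumptions:

```python
import json

# Illustrative criteria; LLMJudge's real criteria schema may differ.
CRITERIA = [
    "Uses the required tools",
    "Provides the requested information",
    "Completes the user's task",
]

def build_judge_prompt(user_message: str, ai_response: str) -> str:
    """Ask a judge model to score the response against each criterion."""
    bullet_list = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "Score the AI response from 0.0 to 1.0 against these criteria:\n"
        f"{bullet_list}\n\n"
        f"User message: {user_message}\n"
        f"AI response: {ai_response}\n\n"
        'Reply with JSON: {"score": <float>, "reasoning": "<why>"}'
    )

def parse_verdict(judge_reply: str) -> float:
    """Extract the numeric score from the judge model's JSON reply."""
    return float(json.loads(judge_reply)["score"])

# Simulate a judge reply here instead of calling the API.
verdict = parse_verdict('{"score": 0.92, "reasoning": "Tool used, task done."}')
print(verdict)  # 0.92
```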
3. Assert High Score
Verify the AI response meets quality threshold:
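A sketch of the threshold check, assuming a 0.0–1.0 score scale and an illustrative 0.8 threshold:

```python
SCORE_THRESHOLD = 0.8  # illustrative threshold, tune per test suite

def assert_quality(score: float, threshold: float = SCORE_THRESHOLD) -> None:
    """Fail the test when the judge score falls below the threshold."""
    assert score >= threshold, f"score {score:.2f} below threshold {threshold}"

assert_quality(0.92)    # passes silently
# assert_quality(0.45)  # would raise AssertionError
```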
Bad Response Detection
The second test verifies that poor responses get low scores, ensuring your LLMJudge criteria are working correctly. A deliberately bad response:

- Doesn’t use required tools
- Doesn’t provide requested information
- Fails to complete the task
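The inverse check can be sketched the same way: a known-bad response should stay under a low ceiling, and a high score means the judging criteria are too lenient. The 0.3 ceiling is an illustrative assumption:

```python
LOW_SCORE_CEILING = 0.3  # illustrative ceiling for a known-bad response

def assert_bad_response_detected(judge_score: float,
                                 ceiling: float = LOW_SCORE_CEILING) -> None:
    """A known-bad response should score low; a high score means the
    judging criteria are too lenient."""
    assert judge_score <= ceiling, (
        f"bad response scored {judge_score:.2f}; criteria may be too lenient"
    )

assert_bad_response_detected(0.15)  # criteria correctly flagged the response
```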
Custom Criteria
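One way custom weights can combine per-criterion judge scores into a single result; the criterion names and weight values below are illustrative assumptions, not a fixed schema:

```python
# Illustrative weights; keys and values are assumptions, not a fixed schema.
weights = {
    "uses_required_tools": 0.5,   # tool usage matters most in this example
    "provides_information": 0.3,
    "completes_task": 0.2,
}

def weighted_score(per_criterion: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine per-criterion judge scores into one weighted score."""
    total = sum(weights.values())
    return sum(per_criterion[k] * w for k, w in weights.items()) / total

scores = {"uses_required_tools": 1.0,
          "provides_information": 0.8,
          "completes_task": 0.9}
print(round(weighted_score(scores, weights), 2))  # 0.92
```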
Adjust scoring weights based on what matters most.

Real-World Usage
Testing OpenAI Chat API
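A sketch of wiring a live OpenAI call into the same flow, using the official `openai` Python SDK; the model name, system prompt, and helper names are assumptions you would adapt:

```python
# Requires OPENAI_API_KEY in the environment (e.g. loaded from .env).
import os

def build_messages(user_message: str) -> list[dict]:
    """Assemble the chat messages sent to the model under test."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]

def get_chat_response(user_message: str) -> str:
    """Call the model under test and return its reply text."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(user_message),
    )
    return completion.choices[0].message.content

if __name__ == "__main__" and os.getenv("OPENAI_API_KEY"):
    print(get_chat_response("What's the weather in Paris?"))
```

The returned text can then be passed to LLMJudge exactly as the mocked response was in the tests above.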
Cost Considerations
LLMJudge uses GPT-4o-mini by default, which costs approximately:

- ~$0.0001 per evaluation
- ~$0.001 for 10 tests
- ~$0.10 for 1000 tests
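These figures are simple multiplication; a quick back-of-envelope helper using the per-evaluation estimate above:

```python
# Back-of-envelope cost check using the ~$0.0001-per-evaluation figure.
COST_PER_EVAL = 0.0001  # USD, approximate for GPT-4o-mini judging

def suite_cost(num_tests: int, evals_per_test: int = 1) -> float:
    """Estimated judge cost in USD for a test suite."""
    return num_tests * evals_per_test * COST_PER_EVAL

print(f"${suite_cost(10):.4f}")    # $0.0010 for 10 tests
print(f"${suite_cost(1000):.2f}")  # $0.10 for 1000 tests
```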
You can keep costs down by:

- Testing a sample of responses (not every one)
- Using faster traditional validation where possible
- Running full semantic tests nightly