What is Semantic Validation?
Semantic validation means checking whether a response is correct in meaning, not just in exact wording.
The Problem
// User asks: "Schedule a meeting for tomorrow at 2pm"
// All these responses are semantically correct:
"✅ I've scheduled your meeting for 2:00 PM tomorrow"
"✅ Done! Your 2pm meeting is set for tomorrow"
"✅ Meeting booked: Tomorrow at 14:00"
"✅ Your meeting tomorrow at 2 is confirmed"
// Traditional exact matching fails on all but one ❌
The Solution: LLMJudge
LLMJudge uses OpenAI models to evaluate whether responses meet your semantic requirements:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse.text}" ,
"expected" : {
"expectedBehavior" : "Confirms meeting scheduled for 2pm tomorrow"
}
},
"output" : "judgement"
}
Output:
{
score : 0.95 ,
passed : true ,
reasoning : "Response clearly confirms meeting time (2pm) and date (tomorrow)"
}
When to Use LLMJudge
✅ Use LLMJudge for non-deterministic text content:
AI chat responses
Generated summaries
Creative content
Explanations
Natural language descriptions
Example:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"expected" : {
"expectedBehavior" : "Explains the concept clearly in simple terms"
}
}
}
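❌ Don't use LLMJudge for deterministic checks (exact status codes, required keywords, length limits). The traditional validation blocks covered later in this guide are free and instant. A minimal sketch using the ValidateContent options shown in the examples further down; the field names and output name here are illustrative:
{
  "block": "ValidateContent",
  "input": {
    "from": "data.reply",
    "as": "text"
  },
  "config": {
    "minLength": 10,
    "contains": ["password", "reset"]
  },
  "output": "structureCheck"
}
Rule of thumb: if a plain string or structure check can decide pass/fail, skip the LLM call and save the cost.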
Setup
1. Get OpenAI API Key
Sign up at platform.openai.com and create an API key.
2. Set Environment Variable
export OPENAI_API_KEY="sk-..."
Or use a .env file:
# .env
OPENAI_API_KEY=sk-...
Cost: LLMJudge uses GPT-4o-mini by default (~$0.0001 per evaluation). A typical test with 10 AI responses costs ~$0.001.
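Optional: to sanity-check that the key works before running a suite, you can call OpenAI's standard models endpoint directly (this is a plain OpenAI API call, not part of the test runner):
# Lists the models your key can access; a 401 means the key is missing or invalid
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"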
Basic Usage
Simple Validation
{
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/chat" ,
"method" : "POST" ,
"body" : {
"message" : "What's the capital of France?"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.message}" ,
"expected" : {
"expectedBehavior" : "States that Paris is the capital of France"
}
},
"output" : "judgement"
}
],
"assertions" : {
"judgement.score" : { "gt" : 0.8 }
}
}
Be Specific
{
"expectedBehavior" : "Good response"
}
Too vague: what makes a response “good”?
Include Key Details
{
"expectedBehavior" : "Explains the difference between REST and GraphQL APIs, mentioning at least: request/response structure, over-fetching, and type safety"
}
Use Action Verbs
Good verbs: confirms, explains, lists, mentions, asks, provides, includes, summarizes
{
"expectedBehavior" : "Lists 3-5 main advantages of TypeScript over JavaScript"
}
Custom Scoring Criteria
Adjust how much each scoring dimension counts toward the overall score:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"expected" : {
"expectedBehavior" : "Helpful customer service response"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.5 ,
"completeness" : 0.3 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
Output:
{
score : 0.87 ,
reasoning : "Response is accurate and complete..." ,
details : {
accuracy : 0.9 ,
completeness : 0.85 ,
relevance : 0.87
}
}
Available Criteria
LLMJudge supports three scoring criteria. You can adjust their weights:
{
"config" : {
"criteria" : {
"accuracy" : 0.4 , // How correct is the response? (default: 0.4)
"completeness" : 0.3 , // Does it cover all required points? (default: 0.3)
"relevance" : 0.3 // Is it relevant to the question? (default: 0.3)
}
}
}
Weights must sum to 1.0. The three criteria are hardcoded and cannot be customized, but you can adjust their relative importance by changing the weights.
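If it helps to reason about score thresholds, treat the overall score as a weighted average of the per-criterion scores. This is an assumption based on the example output above rather than a documented formula, and the numbers below are made up:
// Hypothetical per-criterion scores, as they appear in judgement.details
const details = { accuracy: 0.9, completeness: 0.8, relevance: 0.9 };

// Default weights from the config above
const weights = { accuracy: 0.4, completeness: 0.3, relevance: 0.3 };

// Weighted average: 0.4*0.9 + 0.3*0.8 + 0.3*0.9 = 0.87
const overall = Object.keys(weights)
  .reduce((sum, key) => sum + weights[key] * details[key], 0);

console.log(overall.toFixed(2)); // "0.87"
Knowing the weights also makes it easier to choose per-criterion assertions, like the judgement.details.accuracy check used later in this guide.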
Example Weight Adjustments
Prioritize Accuracy
{
"criteria" : {
"accuracy" : 0.6 ,
"completeness" : 0.2 ,
"relevance" : 0.2
}
}
Use when correctness is most important (e.g., factual questions).
Balanced
Keep the defaults (accuracy 0.4, completeness 0.3, relevance 0.3) for general-purpose checks.
Prioritize Completeness
Raise the completeness weight (and lower the others so the total stays at 1.0) when covering every required point matters most, such as summaries or step-by-step instructions.
Model Selection
Choose the right model for your use case:
{
"config" : {
"model" : "gpt-4o-mini" // or "gpt-4o", "gpt-4-turbo"
}
}
Model Comparison
| Model | Speed | Cost | Accuracy | Use Case |
|---|---|---|---|---|
| gpt-4o-mini | ⚡ Fast | 💰 $0.0001 | ⭐⭐⭐ Good | Recommended: fast, cheap, accurate |
| gpt-4o | ⚡ Fast | 💰💰 $0.0025 | ⭐⭐⭐⭐ Better | Complex evaluations |
| gpt-4-turbo | 🐌 Slower | 💰💰💰 $0.01 | ⭐⭐⭐⭐⭐ Best | Critical validations |
Start with gpt-4o-mini. It’s 25x cheaper than gpt-4-turbo and handles 95% of use cases well.
Real-World Examples
1. Testing ChatGPT-Style Interface
{
"name" : "Chatbot Quality Test" ,
"context" : {
"CHAT_URL" : "${env.CHAT_API_URL}"
},
"tests" : [{
"id" : "test-helpful-response" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${CHAT_URL}/chat" ,
"method" : "POST" ,
"body" : {
"message" : "How do I reset my password?"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.reply}" ,
"expected" : {
"expectedBehavior" : "Provides clear step-by-step instructions to reset password, mentioning the settings or account page, being polite and helpful, and not asking for current password in chat"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.4 ,
"completeness" : 0.4 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
],
"assertions" : {
"response.status" : 200 ,
"judgement.score" : { "gte" : 0.8 },
"judgement.details.completeness" : { "gte" : 0.7 }
}
}]
}
2. Testing Content Summarization
{
"tests" : [{
"id" : "test-article-summary" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/summarize" ,
"method" : "POST" ,
"body" : {
"text" : "Long article text here..." ,
"maxLength" : 100
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "ValidateContent" ,
"input" : {
"from" : "data.summary" ,
"as" : "text"
},
"config" : {
"maxLength" : 120
},
"output" : "lengthCheck"
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.summary}" ,
"expected" : {
"expectedBehavior" : "Captures the main points: company revenue growth, new product launch, and market expansion plans"
}
},
"output" : "judgement"
}
],
"assertions" : {
"lengthCheck.passed" : true ,
"judgement.score" : { "gt" : 0.85 }
}
}]
}
3. Testing Code Generation
{
"tests" : [{
"id" : "test-code-explanation" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/explain-code" ,
"method" : "POST" ,
"body" : {
"code" : "const sum = arr => arr.reduce((a,b) => a+b, 0)" ,
"language" : "javascript"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.explanation}" ,
"expected" : {
"expectedBehavior" : "Explains that the code sums array elements using reduce, mentioning the accumulator pattern in a beginner-friendly way, explaining the reduce function and the initial value (0)"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.5 ,
"completeness" : 0.3 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
],
"assertions" : {
"judgement.details.accuracy" : { "gte" : 0.9 }
}
}]
}
Cost Optimization
LLMJudge makes API calls to OpenAI. Here’s how to minimize costs:
1. Use gpt-4o-mini
{
"config" : {
"model" : "gpt-4o-mini" // 25x cheaper than gpt-4-turbo
}
}
Savings: ~$0.99 per 100 evaluations vs $25 with gpt-4-turbo.
2. Combine with Traditional Validation
Don’t use LLMJudge for everything:
{
"pipeline" : [
// Fast, free structural checks first
{
"block" : "ValidateContent" ,
"config" : {
"minLength" : 10 ,
"contains" : [ "order" , "confirmed" ]
}
},
// Then semantic validation
{
"block" : "LLMJudge" ,
"input" : {
"expected" : {
"expectedBehavior" : "Professional and reassuring tone"
}
}
}
]
}
3. Cache Common Validations
If you're testing the same AI responses repeatedly during development:
# Run once, save results
npx semtest tests.json --html --output results.html
# Review results.html instead of re-running
4. Test Samples in CI/CD
{
"tests" : [
{
"id" : "quick-smoke-test" ,
"pipeline" : [
// Test 1 representative example
]
}
]
}
Run the full suite less frequently (e.g., in nightly builds), as sketched below.
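A minimal sketch of that split, assuming two hypothetical test files (tests-smoke.json, tests-full.json); only the npx semtest invocation and the --html/--output flags come from this guide:
# Every pull request: one representative example per feature
npx semtest tests-smoke.json

# Nightly build: full suite with an HTML report
npx semtest tests-full.json --html --output nightly-results.html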
Best Practices
1. Be Specific in Expected Behavior
// ❌ Too vague
"expectedBehavior" : "Good answer"
// ✅ Specific and measurable
"expectedBehavior" : "Lists 3-5 benefits of exercise, mentions both physical and mental health"
2. Spell Out Requirements and Prohibitions
{
"expected" : {
"expectedBehavior" : "Provides medical information including a disclaimer to consult a doctor. Must not diagnose conditions or prescribe medications, only provide general health information."
}
}
3. Set Appropriate Score Thresholds
{
"assertions" : {
// Strict for critical features
"criticalCheck.score" : { "gte" : 0.95 },
// Lenient for nice-to-haves
"toneCheck.score" : { "gte" : 0.7 }
}
}
4. Combine Multiple Validation Types
{
"pipeline" : [
// Structure check (free, fast)
{ "block" : "ValidateContent" , "config" : { "contains" : "order" } },
// Tool validation (free, fast)
{ "block" : "ValidateTools" , "config" : { "expected" : [ "send_email" ] } },
// Semantic validation (costs $, slower)
{ "block" : "LLMJudge" , "input" : { "expected" : { ... } } }
]
}
5. Include Context When Needed
For conversations, pass the conversation history:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"history" : [
{ "role" : "user" , "content" : "I'm looking for a laptop" },
{ "role" : "assistant" , "content" : "What's your budget?" },
{ "role" : "user" , "content" : "Around $1000" }
],
"expected" : {
"expectedBehavior" : "Recommends laptops in $800-1200 range"
}
}
}
Troubleshooting
Low Scores Despite Good Responses
Problem: judgement.score is 0.6 but response looks good
Solutions:
Make expectedBehavior more specific
Lower the score threshold: { "gte": 0.7 } instead of { "gte": 0.9 }
Check whether the response actually meets the stated expectations
High Costs
Problem: Test suite costs $5+ per run
Solutions:
Switch to gpt-4o-mini
Use traditional validation where possible
Test fewer examples in CI
Cache results during development
Inconsistent Scores
Problem: Same response gets different scores (0.85, then 0.78, then 0.92)
Solution: Some variance is normal with LLM-based scoring. Set thresholds with headroom:
{
"assertions" : {
"judgement.score" : { "gte" : 0.75 } // Allow some variance
}
}
Missing Context
Problem: LLMJudge doesn’t understand multi-turn conversation
Solution: Pass conversation history:
{
"input" : {
"text" : "${aiResponse}" ,
"history" : [ ...previous messages... ]
}
}
Next Steps