The Challenge of Testing AI
Traditional API testing uses exact matching:
{
  "assertions": {
    "response.message": "User created successfully"
  }
}
This breaks immediately with AI responses. Consider asking an AI to schedule a meeting:
Expected: “I’ve scheduled your meeting for 2 PM on Tuesday.”
Actual responses:
✅ “Your meeting is set for Tuesday at 2:00 PM”
✅ “Done! Meeting scheduled for 14:00 on Tuesday”
✅ “I’ve booked your Tuesday 2pm slot”
❌ All fail with exact matching!
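The failure is easy to reproduce: string equality has no notion of meaning, so every rewording above fails even though each one answers correctly. A minimal demonstration:

```javascript
// Exact matching treats every rewording as a failure.
const expected = "I've scheduled your meeting for 2 PM on Tuesday.";

const responses = [
  "Your meeting is set for Tuesday at 2:00 PM",
  "Done! Meeting scheduled for 14:00 on Tuesday",
  "I've booked your Tuesday 2pm slot",
];

// Every response conveys the right answer, yet none passes.
const results = responses.map((r) => r === expected);
console.log(results); // [false, false, false]
```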
Why AI Testing is Different
1. Non-Deterministic Responses
Same input → different outputs every time:
Input: "Summarize this article"
Output 1: "This article discusses..."
Output 2: "The piece explores..."
Output 3: "Key points include..."
2. Semantic Equivalence
These mean the same thing:
“Cannot find user”
“User not found”
“That user doesn’t exist”
“I couldn’t locate that user”
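For intuition only (this is not how LLMJudge works — it uses an LLM, not a heuristic), even a crude token-overlap measure hints that these phrasings are related, while exact equality sees nothing:

```javascript
// Crude illustration only: token overlap between two phrasings.
// LLMJudge uses an LLM judge, not this heuristic.
function tokens(s) {
  return new Set(s.toLowerCase().match(/[a-z']+/g) ?? []);
}

function overlap(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  const shared = [...ta].filter((t) => tb.has(t)).length;
  return shared / Math.min(ta.size, tb.size);
}

console.log("User not found" === "Cannot find user"); // false
console.log(overlap("User not found", "Cannot find user") > 0); // true — shares "user"
```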
3. Unpredictable Tool Calls
AI agents call tools unpredictably:
// Both valid for "book a meeting"
[{ name: "search_calendar" }, { name: "create_event" }]
[{ name: "check_availability" }, { name: "create_event" }, { name: "send_confirmation" }]
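Both sequences satisfy the same intent, so a useful check asserts that the required tools appear in order while tolerating extra calls. A minimal sketch of that idea (the function name is illustrative, not part of SemanticTest):

```javascript
// Sketch: required tools must appear, in order; extra calls are allowed.
function calledInOrder(toolCalls, required) {
  let i = 0;
  for (const call of toolCalls) {
    if (i < required.length && call.name === required[i]) i++;
  }
  return i === required.length;
}

const runA = [{ name: "search_calendar" }, { name: "create_event" }];
const runB = [
  { name: "check_availability" },
  { name: "create_event" },
  { name: "send_confirmation" },
];

console.log(calledInOrder(runA, ["create_event"])); // true
console.log(calledInOrder(runB, ["create_event"])); // true
```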
4. Streaming Responses
Responses arrive as chunks over Server-Sent Events (SSE):
data: {"content":"Hello"}
data: {"content":" there"}
data: {"content":"!"}
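A test therefore has to reassemble the chunks before judging the message. A minimal sketch of that reassembly, assuming each event carries a JSON payload with a `content` field as shown above:

```javascript
// Minimal sketch: join SSE `data:` payloads back into one message.
// Assumes each event line is `data: {"content":"..."}` as above.
function assembleSSE(raw) {
  return raw
    .split("\n")
    .filter((line) => line.startsWith("data: "))
    .map((line) => JSON.parse(line.slice("data: ".length)).content)
    .join("");
}

const stream = [
  'data: {"content":"Hello"}',
  'data: {"content":" there"}',
  'data: {"content":"!"}',
].join("\n");

console.log(assembleSSE(stream)); // "Hello there!"
```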
SemanticTest’s Approach
SemanticTest provides 4 specialized approaches for AI testing:
1. Semantic Validation (LLMJudge)
Use AI to judge AI responses:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Should confirm meeting time and location"
    }
  },
  "output": "judgement"
}
Result:
{
  score: 0.95,
  passed: true,
  reasoning: "Response confirms meeting time (2 PM) and location (Conference Room A)"
}
2. Tool Call Validation (ValidateTools)
Validate which tools the AI calls and with what arguments:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    "order": ["search_calendar", "create_event"]
  }
}
3. Streaming Response Parsing
Parse SSE streams from AI APIs:
{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-openai"
  },
  "output": {
    "text": "aiMessage",
    "toolCalls": "aiTools"
  }
}
4. Multi-Turn Conversation Testing
Test conversational flows with context:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi! How can I help?" },
      { "role": "user", "content": "Book a meeting" }
    ],
    "expected": {
      "expectedBehavior": "Should remember context and ask for meeting details"
    }
  }
}
Traditional vs Semantic Testing
❌ Traditional (Brittle)
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    }
  ],
  "assertions": {
    "response.body": "I've scheduled your meeting for 2:00 PM tomorrow."
  }
}
Problem: Fails if the AI says “Done! Meeting set for 2pm tomorrow” or any other variation.
✅ Semantic (Robust)
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    },
    {
      "block": "StreamParser",
      "input": "${response.body}",
      "config": { "format": "sse-openai" },
      "output": { "text": "aiMessage", "toolCalls": "tools" }
    },
    {
      "block": "ValidateTools",
      "input": { "from": "tools", "as": "toolCalls" },
      "config": {
        "expected": ["create_event"]
      },
      "output": "toolValidation"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${aiMessage}",
        "expected": {
          "expectedBehavior": "Confirms meeting scheduled for 2pm tomorrow"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "toolValidation.passed": true,
    "judgement.score": { "gt": 0.8 }
  }
}
Result: Passes for any semantically correct response.
What You’ll Learn
Semantic Validation: Use LLMJudge to validate AI responses semantically instead of with exact matches
Tool Call Validation: Test AI agents that call tools/functions with ValidateTools
Streaming Responses: Parse and test Server-Sent Events (SSE) streams from AI APIs
Multi-Turn Conversations: Test conversational AI with context and memory
Quick Start
1. Install SemanticTest
npm install @blade47/semantic-test
2. Set OpenAI API Key (for LLMJudge)
export OPENAI_API_KEY="sk-..."
Note: only LLMJudge requires an OpenAI API key. All other blocks work without it!
3. Create Your First AI Test
{
  "name": "AI Chat Test",
  "context": {
    "AI_URL": "https://api.openai.com/v1/chat/completions",
    "API_KEY": "${env.OPENAI_API_KEY}"
  },
  "tests": [{
    "id": "test-greeting",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_URL}",
          "method": "POST",
          "headers": {
            "Authorization": "Bearer ${API_KEY}",
            "Content-Type": "application/json"
          },
          "body": {
            "model": "gpt-4o-mini",
            "messages": [
              { "role": "user", "content": "Hello!" }
            ]
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.choices[0].message.content}",
          "expected": {
            "expectedBehavior": "Friendly greeting response"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "response.status": 200,
      "judgement.score": { "gt": 0.8 }
    }
  }]
}
4. Run Tests
When to Use Each Approach
| Use Case | Approach | Block |
| --- | --- | --- |
| Validate response quality/meaning | Semantic Validation | LLMJudge |
| Check exact tools called | Tool Validation | ValidateTools |
| Parse streaming AI responses | Stream Parsing | StreamParser |
| Test conversation flows | Multi-Turn Testing | LLMJudge + history |
| Verify specific keywords | Traditional | ValidateContent |
| Check response structure | Traditional | Assertions |
Best Practices
Combine Multiple Validation Approaches
Use structural validation (ValidateTools) plus semantic validation (LLMJudge):
{
  "pipeline": [
    { "block": "ValidateTools", "config": { "expected": ["search"] } },
    { "block": "LLMJudge", "input": { "expected": { "expectedBehavior": "Polite response" } } }
  ]
}
Use Traditional Assertions Where Possible
Exact assertions are faster and cheaper:
{
  "assertions": {
    "response.status": 200,
    "data.id": { "gt": 0 }
  }
}
Reserve LLMJudge for truly non-deterministic content.
Set Clear Expected Behavior
Be specific in LLMJudge expectations:
// ❌ Vague
"expectedBehavior": "Good response"
// ✅ Specific
"expectedBehavior": "Confirms order #12345 is shipped and provides tracking number"
Test Edge Cases
AI systems need edge case testing:
Ambiguous user input
Missing context
Conflicting instructions
Tool call failures
Stream interruptions
Real-World Examples
AI Chat API: Complete example testing the OpenAI chat API
Calendar Agent: Test an AI agent that manages calendar events
Basic API Test: Traditional API testing patterns
Error Handling: Handle errors and retries
Next Steps
Semantic Validation with LLMJudge: Start with semantic validation, the core of AI testing