The Challenge of Testing AI

Traditional API testing uses exact matching:
{
  "assertions": {
    "response.message": "User created successfully"
  }
}
This breaks immediately with AI responses. Consider asking an AI to schedule a meeting.

Expected: “I’ve scheduled your meeting for 2 PM on Tuesday.”

Actual responses:
  • ✅ “Your meeting is set for Tuesday at 2:00 PM”
  • ✅ “Done! Meeting scheduled for 14:00 on Tuesday”
  • ✅ “I’ve booked your Tuesday 2pm slot”
  • ❌ All fail with exact matching!

Why AI Testing is Different

1. Non-Deterministic Responses

Same input → different outputs every time:
Input: "Summarize this article"
Output 1: "This article discusses..."
Output 2: "The piece explores..."
Output 3: "Key points include..."

2. Semantic Equivalence

These all mean the same thing (a semantic check that treats them as equivalent is sketched after this list):
  • “Cannot find user”
  • “User not found”
  • “That user doesn’t exist”
  • “I couldn’t locate that user”
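
With LLMJudge (introduced below), a single behavioral expectation accepts every one of these variants. A minimal sketch, using the same ${aiResponse.text} reference as the later examples:

{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Indicates that the requested user could not be found"
    }
  },
  "output": "judgement"
}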

3. Tool/Function Calling

AI agents call tools unpredictably:
// Both valid for "book a meeting"
[{ name: "search_calendar" }, { name: "create_event" }]
[{ name: "check_availability" }, { name: "create_event" }, { name: "send_confirmation" }]

4. Streaming Responses

Responses arrive as chunks over Server-Sent Events (SSE):
data: {"content":"Hello"}
data: {"content":" there"}
data: {"content":"!"}

SemanticTest’s Approach

SemanticTest provides 4 specialized approaches for AI testing:

1. Semantic Validation (LLMJudge)

Use AI to judge AI responses:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Should confirm meeting time and location"
    }
  },
  "output": "judgement"
}
Result:
{
  score: 0.95,
  passed: true,
  reasoning: "Response confirms meeting time (2 PM) and location (Conference Room A)"
}
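
In a full test, the stored judgement is then checked with an ordinary assertion, using the same pattern as the Quick Start below:

"assertions": {
  "judgement.score": { "gt": 0.8 }
}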

2. Tool Call Validation

Validate which tools the AI calls and with what arguments:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    "order": ["search_calendar", "create_event"]
  }
}

3. Streaming Response Parsing

Parse SSE streams from AI APIs:
{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-openai"
  },
  "output": {
    "text": "aiMessage",
    "toolCalls": "aiTools"
  }
}
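
The named outputs can feed later blocks in the same pipeline. A minimal sketch that judges the assembled text and validates the captured tool calls, assuming a stored string can be referenced directly as ${aiMessage} (the expected tool name is illustrative):

{
  "pipeline": [
    {
      "block": "StreamParser",
      "input": "${response.body}",
      "config": { "format": "sse-openai" },
      "output": { "text": "aiMessage", "toolCalls": "aiTools" }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${aiMessage}",
        "expected": { "expectedBehavior": "Answers the user's question" }
      },
      "output": "judgement"
    },
    {
      "block": "ValidateTools",
      "input": { "from": "aiTools", "as": "toolCalls" },
      "config": { "expected": ["create_event"] }
    }
  ]
}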

4. Multi-Turn Conversation Testing

Test conversational flows with context:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi! How can I help?" },
      { "role": "user", "content": "Book a meeting" }
    ],
    "expected": {
      "expectedBehavior": "Should remember context and ask for meeting details"
    }
  }
}

Traditional vs Semantic Testing

The traditional, exact-match version of an AI test looks like this:
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    }
  ],
  "assertions": {
    "response.body": "I've scheduled your meeting for 2:00 PM tomorrow."
  }
}
Problem: this test fails if the AI says “Done! Meeting set for 2pm tomorrow” or any other valid variation.
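
The semantic version replaces the exact-match assertion with an LLMJudge step, so any phrasing that confirms the booking passes. A sketch reusing the blocks shown above (the data.message path is an assumption about this API's response shape):

{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": { "message": "Schedule meeting for tomorrow at 2pm" }
      },
      "output": "response"
    },
    {
      "block": "JsonParser",
      "input": "${response.body}",
      "output": { "parsed": "data" }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${data.message}",
        "expected": {
          "expectedBehavior": "Confirms a meeting was scheduled for 2 PM tomorrow"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "response.status": 200,
    "judgement.score": { "gt": 0.8 }
  }
}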

What You’ll Learn

Semantic Validation

Use LLMJudge to validate AI responses semantically instead of by exact match

Tool Call Validation

Test AI agents that call tools/functions with ValidateTools

Streaming Responses

Parse and test Server-Sent Events (SSE) streams from AI APIs

Multi-Turn Conversations

Test conversational AI with context and memory

Quick Start

1. Install SemanticTest

npm install @blade47/semantic-test

2. Set OpenAI API Key (for LLMJudge)

export OPENAI_API_KEY="sk-..."
Optional: only LLMJudge requires an OpenAI API key. All other blocks work without it!

3. Create Your First AI Test

test.json
{
  "name": "AI Chat Test",
  "context": {
    "AI_URL": "https://api.openai.com/v1/chat/completions",
    "API_KEY": "${env.OPENAI_API_KEY}"
  },
  "tests": [{
    "id": "test-greeting",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_URL}",
          "method": "POST",
          "headers": {
            "Authorization": "Bearer ${API_KEY}",
            "Content-Type": "application/json"
          },
          "body": {
            "model": "gpt-4o-mini",
            "messages": [
              { "role": "user", "content": "Hello!" }
            ]
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.choices[0].message.content}",
          "expected": {
            "expectedBehavior": "Friendly greeting response"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "response.status": 200,
      "judgement.score": { "gt": 0.8 }
    }
  }]
}

4. Run Tests

npx semtest test.json

When to Use Each Approach

Use Case | Approach | Block
Validate response quality/meaning | Semantic Validation | LLMJudge
Check exact tools called | Tool Validation | ValidateTools
Parse streaming AI responses | Stream Parsing | StreamParser
Test conversation flows | Multi-Turn Testing | LLMJudge + history
Verify specific keywords | Traditional | ValidateContent
Check response structure | Traditional | Assertions

Best Practices

Use structural validation (ValidateTools) + semantic validation (LLMJudge):
{
  "pipeline": [
    { "block": "ValidateTools", "config": { "expected": ["search"] } },
    { "block": "LLMJudge", "input": { "expected": { "expectedBehavior": "Polite response" } } }
  ]
}
Prefer exact assertions for deterministic fields; they are faster and cheaper:
{
  "assertions": {
    "response.status": 200,
    "data.id": { "gt": 0 }
  }
}
Reserve LLMJudge for truly non-deterministic content.
Be specific in LLMJudge expectations:
// ❌ Vague
"expectedBehavior": "Good response"

// ✅ Specific
"expectedBehavior": "Confirms order #12345 is shipped and provides tracking number"
AI systems need edge-case testing (one such test is sketched after this list):
  • Ambiguous user input
  • Missing context
  • Conflicting instructions
  • Tool call failures
  • Stream interruptions
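
For instance, an ambiguous request should produce a clarifying question rather than a silent guess. A minimal sketch of such a check, reusing the same blocks as above (the data.message path is again an assumption about the API's response shape):

{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": { "message": "Book it for sometime next week" }
      },
      "output": "response"
    },
    {
      "block": "JsonParser",
      "input": "${response.body}",
      "output": { "parsed": "data" }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${data.message}",
        "expected": {
          "expectedBehavior": "Asks a clarifying question about the exact day and time instead of booking immediately"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "judgement.score": { "gt": 0.8 }
  }
}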

Real-World Examples

AI Chat API

Complete example testing OpenAI chat API

Calendar Agent

Test an AI agent that manages calendar events

Basic API Test

Traditional API testing patterns

Error Handling

Handle errors and retries

Next Steps

Semantic Validation with LLMJudge

Start with semantic validation - the core of AI testing