
The Challenge of Testing AI

Traditional API testing uses exact matching:
{
  "assertions": {
    "response.message": "User created successfully"
  }
}
This breaks immediately with AI responses. Consider asking an AI to schedule a meeting:
Expected: “I’ve scheduled your meeting for 2 PM on Tuesday.”
Actual responses:
  • ✅ “Your meeting is set for Tuesday at 2:00 PM”
  • ✅ “Done! Meeting scheduled for 14:00 on Tuesday”
  • ✅ “I’ve booked your Tuesday 2pm slot”
All three are correct, yet every one fails an exact match!

Why AI Testing is Different

1. Non-Deterministic Responses

Same input → different outputs every time:
Input: "Summarize this article"
Output 1: "This article discusses..."
Output 2: "The piece explores..."
Output 3: "Key points include..."

2. Semantic Equivalence

These mean the same thing:
  • “Cannot find user”
  • “User not found”
  • “That user doesn’t exist”
  • “I couldn’t locate that user”
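
A semantic check accepts all four phrasings against a single expectation. As a preview of the LLMJudge block introduced below:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "States that the requested user could not be found"
    }
  },
  "output": "judgement"
}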

3. Tool/Function Calling

AI agents call tools unpredictably:
// Both valid for "book a meeting"
[{ name: "search_calendar" }, { name: "create_event" }]
[{ name: "check_availability" }, { name: "create_event" }, { name: "send_confirmation" }]

4. Streaming Responses

Responses arrive as chunks over Server-Sent Events (SSE):
data: {"content":"Hello"}
data: {"content":" there"}
data: {"content":"!"}

SemanticTest’s Approach

SemanticTest provides four specialized approaches for AI testing:

1. Semantic Validation (LLMJudge)

Use AI to judge AI responses:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Should confirm meeting time and location"
    }
  },
  "output": "judgement"
}
Result:
{
  score: 0.95,
  passed: true,
  reasoning: "Response confirms meeting time (2 PM) and location (Conference Room A)"
}
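
You can then gate the test on the judgement, mirroring the assertion style used in the Quick Start below:
{
  "assertions": {
    "judgement.passed": true,
    "judgement.score": { "gt": 0.8 }
  }
}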

2. Tool Call Validation

Validate which tools the AI calls, and with what arguments:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    "order": ["search_calendar", "create_event"]
  }
}
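
The config above checks tool names and order. The section title also promises argument validation; this guide doesn't show that schema, so treat the args field below as a hypothetical illustration, not the block's confirmed API:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    // hypothetical: per-tool argument expectations
    "args": {
      "create_event": { "start": "14:00" }
    }
  }
}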

3. Streaming Response Parsing

Parse SSE streams from AI APIs:
{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-openai"
  },
  "output": {
    "text": "aiMessage",
    "toolCalls": "aiTools"
  }
}
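
The parsed outputs can feed the other blocks directly. A sketch that validates the collected tool calls and judges the assembled text (assuming outputs are referenced by the names declared in output, as elsewhere in this guide):
{
  "pipeline": [
    {
      "block": "StreamParser",
      "input": "${response.body}",
      "config": { "format": "sse-openai" },
      "output": { "text": "aiMessage", "toolCalls": "aiTools" }
    },
    {
      "block": "ValidateTools",
      "input": { "from": "aiTools", "as": "toolCalls" },
      "config": { "expected": ["create_event"] }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${aiMessage}",
        "expected": { "expectedBehavior": "Confirms the meeting was created" }
      },
      "output": "judgement"
    }
  ]
}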

4. Multi-Turn Conversation Testing

Test conversational flows with context:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi! How can I help?" },
      { "role": "user", "content": "Book a meeting" }
    ],
    "expected": {
      "expectedBehavior": "Should remember context and ask for meeting details"
    }
  }
}
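
To test the next turn, append the assistant's reply and the new user message to history and judge again. A sketch (aiResponse2 and judgement2 are illustrative names, not special identifiers):
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse2.text}",
    "history": [
      { "role": "user", "content": "Book a meeting" },
      { "role": "assistant", "content": "Sure! What day and time?" },
      { "role": "user", "content": "Tomorrow at 2pm" }
    ],
    "expected": {
      "expectedBehavior": "Confirms a meeting for tomorrow at 2 PM without re-asking for details"
    }
  },
  "output": "judgement2"
}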

Traditional vs Semantic Testing

❌ Traditional (Brittle):
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    }
  ],
  "assertions": {
    "response.body": "I've scheduled your meeting for 2:00 PM tomorrow."
  }
}
Problem: Fails if AI says “Done! Meeting set for 2pm tomorrow” or any variation.
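
✅ Semantic (Robust):
A sketch of the semantic counterpart, assembled from the blocks shown above (wiring details may differ in your setup):
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${response.body}",
        "expected": {
          "expectedBehavior": "Confirms a meeting scheduled for 2 PM tomorrow"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "response.status": 200,
    "judgement.score": { "gt": 0.8 }
  }
}
Any phrasing that confirms the 2 PM meeting now passes.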

Quick Start

1. Install SemanticTest

npm install @blade47/semantic-test

2. Set OpenAI API Key (for LLMJudge)

export OPENAI_API_KEY="sk-..."
This step is optional: only LLMJudge requires an OpenAI API key. All other blocks work without it!

3. Create Your First AI Test

test.json
{
  "name": "AI Chat Test",
  "context": {
    "AI_URL": "https://api.openai.com/v1/chat/completions",
    "API_KEY": "${env.OPENAI_API_KEY}"
  },
  "tests": [{
    "id": "test-greeting",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_URL}",
          "method": "POST",
          "headers": {
            "Authorization": "Bearer ${API_KEY}",
            "Content-Type": "application/json"
          },
          "body": {
            "model": "gpt-4o-mini",
            "messages": [
              { "role": "user", "content": "Hello!" }
            ]
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.choices[0].message.content}",
          "expected": {
            "expectedBehavior": "Friendly greeting response"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "response.status": 200,
      "judgement.score": { "gt": 0.8 }
    }
  }]
}

4. Run Tests

npx semtest test.json

When to Use Each Approach

Use Case                          | Approach            | Block
----------------------------------|---------------------|--------------------
Validate response quality/meaning | Semantic Validation | LLMJudge
Check exact tools called          | Tool Validation     | ValidateTools
Parse streaming AI responses      | Stream Parsing      | StreamParser
Test conversation flows           | Multi-Turn Testing  | LLMJudge + history
Verify specific keywords          | Traditional         | ValidateContent
Check response structure          | Traditional         | Assertions

Best Practices

Combine structural validation (ValidateTools) with semantic validation (LLMJudge):
{
  "pipeline": [
    { "block": "ValidateTools", "config": { "expected": ["search"] } },
    { "block": "LLMJudge", "input": { "expected": { "expectedBehavior": "Polite response" } } }
  ]
}
For deterministic fields, exact assertions are faster and cheaper:
{
  "assertions": {
    "response.status": 200,
    "data.id": { "gt": 0 }
  }
}
Reserve LLMJudge for truly non-deterministic content.
Be specific in LLMJudge expectations:
// ❌ Vague
"expectedBehavior": "Good response"

// ✅ Specific
"expectedBehavior": "Confirms order #12345 is shipped and provides tracking number"
AI systems need edge-case testing (a sketch of one such test follows this list):
  • Ambiguous user input
  • Missing context
  • Conflicting instructions
  • Tool call failures
  • Stream interruptions
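
A sketch of one such edge-case test, checking that ambiguous input triggers a clarifying question (the expected behavior here is an example, not a prescription):
{
  "id": "test-ambiguous-input",
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_URL}",
        "method": "POST",
        "body": { "message": "Book it for sometime next week, whenever" }
      },
      "output": "response"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${response.body}",
        "expected": {
          "expectedBehavior": "Asks a clarifying question instead of guessing a date"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "judgement.score": { "gt": 0.8 }
  }
}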

Next Steps

Semantic Validation with LLMJudge

Start with semantic validation, the core of AI testing.