
The Challenge of Testing AI

Traditional API testing uses exact matching:
{
  "assertions": {
    "response.message": "User created successfully"
  }
}
This breaks immediately with AI responses. Consider asking an AI to schedule a meeting:
Expected: “I’ve scheduled your meeting for 2 PM on Tuesday.”
Actual responses:
  • ✅ “Your meeting is set for Tuesday at 2:00 PM”
  • ✅ “Done! Meeting scheduled for 14:00 on Tuesday”
  • ✅ “I’ve booked your Tuesday 2pm slot”
All three are correct, yet every one fails an exact match!

Why AI Testing is Different

1. Non-Deterministic Responses

Same input → different outputs every time:
Input: "Summarize this article"
Output 1: "This article discusses..."
Output 2: "The piece explores..."
Output 3: "Key points include..."

2. Semantic Equivalence

These mean the same thing:
  • “Cannot find user”
  • “User not found”
  • “That user doesn’t exist”
  • “I couldn’t locate that user”
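
A semantic check accepts all four phrasings against a single expectation. As a preview of the LLMJudge block introduced below:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "States that the requested user could not be found"
    }
  },
  "output": "judgement"
}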

3. Tool/Function Calling

AI agents call tools unpredictably:
// Both valid for "book a meeting"
[{ name: "search_calendar" }, { name: "create_event" }]
[{ name: "check_availability" }, { name: "create_event" }, { name: "send_confirmation" }]

4. Streaming Responses

Responses arrive as chunks over Server-Sent Events (SSE):
data: {"content":"Hello"}
data: {"content":" there"}
data: {"content":"!"}

SemanticTest’s Approach

SemanticTest provides four specialized approaches for AI testing:

1. Semantic Validation (LLMJudge)

Use AI to judge AI responses:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Should confirm meeting time and location"
    }
  },
  "output": "judgement"
}
Result:
{
  score: 0.95,
  passed: true,
  reasoning: "Response confirms meeting time (2 PM) and location (Conference Room A)"
}
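
You can then gate the test on the judgement, mirroring the assertion style used in the Quick Start below:
{
  "assertions": {
    "judgement.passed": true,
    "judgement.score": { "gt": 0.8 }
  }
}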

2. Tool Call Validation

Validate which tools the AI calls, and with what arguments:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    "order": ["search_calendar", "create_event"]
  }
}
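
The config above checks tool names and order. The section title also promises argument validation; this guide doesn't show that schema, so treat the args field below as a hypothetical illustration, not the block's confirmed API:
{
  "block": "ValidateTools",
  "input": {
    "from": "aiResponse.toolCalls",
    "as": "toolCalls"
  },
  "config": {
    "expected": ["search_calendar", "create_event"],
    // hypothetical: per-tool argument expectations
    "args": {
      "create_event": { "start": "14:00" }
    }
  }
}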

3. Streaming Response Parsing

Parse SSE streams from AI APIs:
{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-openai"
  },
  "output": {
    "text": "aiMessage",
    "toolCalls": "aiTools"
  }
}
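
The parsed outputs can feed the other blocks directly. A sketch that validates the collected tool calls and judges the assembled text (assuming outputs are referenced by the names declared in output, as elsewhere in this guide):
{
  "pipeline": [
    {
      "block": "StreamParser",
      "input": "${response.body}",
      "config": { "format": "sse-openai" },
      "output": { "text": "aiMessage", "toolCalls": "aiTools" }
    },
    {
      "block": "ValidateTools",
      "input": { "from": "aiTools", "as": "toolCalls" },
      "config": { "expected": ["create_event"] }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${aiMessage}",
        "expected": { "expectedBehavior": "Confirms the meeting was created" }
      },
      "output": "judgement"
    }
  ]
}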

4. Multi-Turn Conversation Testing

Test conversational flows with context:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi! How can I help?" },
      { "role": "user", "content": "Book a meeting" }
    ],
    "expected": {
      "expectedBehavior": "Should remember context and ask for meeting details"
    }
  }
}
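
To test the next turn, append the assistant's reply and the new user message to history and judge again. A sketch (aiResponse2 and judgement2 are illustrative names, not special identifiers):
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse2.text}",
    "history": [
      { "role": "user", "content": "Book a meeting" },
      { "role": "assistant", "content": "Sure! What day and time?" },
      { "role": "user", "content": "Tomorrow at 2pm" }
    ],
    "expected": {
      "expectedBehavior": "Confirms a meeting for tomorrow at 2 PM without re-asking for details"
    }
  },
  "output": "judgement2"
}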

Traditional vs Semantic Testing

❌ Traditional (Brittle):
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    }
  ],
  "assertions": {
    "response.body": "I've scheduled your meeting for 2:00 PM tomorrow."
  }
}
Problem: Fails if AI says “Done! Meeting set for 2pm tomorrow” or any variation.
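
✅ Semantic (Robust):
A sketch of the semantic counterpart, assembled from the blocks shown above (wiring details may differ in your setup):
{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "Schedule meeting for tomorrow at 2pm"
        }
      },
      "output": "response"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${response.body}",
        "expected": {
          "expectedBehavior": "Confirms a meeting scheduled for 2 PM tomorrow"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "response.status": 200,
    "judgement.score": { "gt": 0.8 }
  }
}
Any phrasing that confirms the 2 PM meeting now passes.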

Quick Start

1. Install SemanticTest

npm install @blade47/semantic-test

2. Set OpenAI API Key (for LLMJudge)

export OPENAI_API_KEY="sk-..."
This step is optional: only LLMJudge requires an OpenAI API key. All other blocks work without it!

3. Create Your First AI Test

test.json
{
  "name": "AI Chat Test",
  "context": {
    "AI_URL": "https://api.openai.com/v1/chat/completions",
    "API_KEY": "${env.OPENAI_API_KEY}"
  },
  "tests": [{
    "id": "test-greeting",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_URL}",
          "method": "POST",
          "headers": {
            "Authorization": "Bearer ${API_KEY}",
            "Content-Type": "application/json"
          },
          "body": {
            "model": "gpt-4o-mini",
            "messages": [
              { "role": "user", "content": "Hello!" }
            ]
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.choices[0].message.content}",
          "expected": {
            "expectedBehavior": "Friendly greeting response"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "response.status": 200,
      "judgement.score": { "gt": 0.8 }
    }
  }]
}

4. Run Tests

npx semtest test.json

When to Use Each Approach

Use Case                          | Approach            | Block
----------------------------------|---------------------|--------------------
Validate response quality/meaning | Semantic Validation | LLMJudge
Check exact tools called          | Tool Validation     | ValidateTools
Parse streaming AI responses      | Stream Parsing      | StreamParser
Test conversation flows           | Multi-Turn Testing  | LLMJudge + history
Verify specific keywords          | Traditional         | ValidateContent
Check response structure          | Traditional         | Assertions

Best Practices

Combine structural validation (ValidateTools) with semantic validation (LLMJudge):
{
  "pipeline": [
    { "block": "ValidateTools", "config": { "expected": ["search"] } },
    { "block": "LLMJudge", "input": { "expected": { "expectedBehavior": "Polite response" } } }
  ]
}
For deterministic fields, exact assertions are faster and cheaper:
{
  "assertions": {
    "response.status": 200,
    "data.id": { "gt": 0 }
  }
}
Reserve LLMJudge for truly non-deterministic content.
Be specific in LLMJudge expectations:
// ❌ Vague
"expectedBehavior": "Good response"

// ✅ Specific
"expectedBehavior": "Confirms order #12345 is shipped and provides tracking number"
AI systems need edge-case testing (a sketch of one such test follows this list):
  • Ambiguous user input
  • Missing context
  • Conflicting instructions
  • Tool call failures
  • Stream interruptions
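
A sketch of one such edge-case test, checking that ambiguous input triggers a clarifying question (the expected behavior here is an example, not a prescription):
{
  "id": "test-ambiguous-input",
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_URL}",
        "method": "POST",
        "body": { "message": "Book it for sometime next week, whenever" }
      },
      "output": "response"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${response.body}",
        "expected": {
          "expectedBehavior": "Asks a clarifying question instead of guessing a date"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "judgement.score": { "gt": 0.8 }
  }
}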

Next Steps

Semantic Validation with LLMJudge

Start with semantic validation, the core of AI testing.