
Overview

This example demonstrates how to test AI chat responses using LLMJudge for semantic validation. Instead of exact text matching, we validate that responses meet behavioral expectations.

Complete Test

test.json
{
  "name": "AI Chat Quality Tests",
  "version": "1.0.0",
  "tests": [
    {
      "id": "meeting-scheduler",
      "name": "Test Meeting Scheduler Response",
      "pipeline": [
        {
          "id": "mock-ai-response",
          "block": "MockData",
          "config": {
            "data": {
              "text": "I've successfully scheduled your team meeting for tomorrow at 2 PM in Conference Room A. The meeting has been added to your calendar and invitations have been sent to john@example.com.",
              "toolCalls": [
                {
                  "name": "check_user_conflicts",
                  "args": {
                    "date": "2024-01-20",
                    "startTime": "14:00",
                    "endTime": "15:00"
                  }
                },
                {
                  "name": "create_event",
                  "args": {
                    "summary": "Team Meeting",
                    "start": "2024-01-20T14:00:00",
                    "end": "2024-01-20T15:00:00",
                    "location": "Conference Room A",
                    "attendees": ["john@example.com"]
                  }
                }
              ]
            }
          },
          "output": "aiResponse"
        },
        {
          "id": "judge-response",
          "block": "LLMJudge",
          "input": {
            "text": "${aiResponse.text}",
            "toolCalls": "${aiResponse.toolCalls}",
            "expected": {
              "expectedBehavior": "The AI should check for conflicts using check_user_conflicts, then create a meeting using create_event. The response should confirm the meeting is scheduled for tomorrow at 2 PM in Conference Room A and mention that invitations were sent."
            }
          },
          "config": {
            "model": "gpt-4o-mini",
            "criteria": {
              "accuracy": 0.4,
              "completeness": 0.3,
              "relevance": 0.3
            }
          },
          "output": "judgement"
        }
      ],
      "assertions": {
        "judgement.score": { "gt": 0.7 }
      }
    },
    {
      "id": "bad-response-detection",
      "name": "Detect Poor AI Response",
      "pipeline": [
        {
          "id": "mock-bad-response",
          "block": "MockData",
          "config": {
            "data": {
              "text": "I cannot help with that request.",
              "toolCalls": []
            }
          },
          "output": "badResponse"
        },
        {
          "id": "judge-bad",
          "block": "LLMJudge",
          "input": {
            "text": "${badResponse.text}",
            "toolCalls": "${badResponse.toolCalls}",
            "expected": {
              "expectedBehavior": "The AI should check for conflicts and create a meeting with confirmation. It should mention scheduling details and confirm the action was successful."
            }
          },
          "config": {
            "model": "gpt-4o-mini"
          },
          "output": "badJudgement"
        }
      ],
      "assertions": {
        "badJudgement.score": { "lt": 0.4 }
      }
    }
  ]
}

Setup

You’ll need an OpenAI API key:
.env
OPENAI_API_KEY=sk-proj-your-key-here

Running the Test

npx semtest test.json
Expected output:
✅ AI Chat Quality Tests
  ✅ meeting-scheduler: Test Meeting Scheduler Response (1.2s)
     ✅ judgement.score = 0.92 (> 0.7)
     💡 AI used correct tools and provided complete confirmation

  ✅ bad-response-detection: Detect Poor AI Response (0.8s)
     ✅ badJudgement.score = 0.15 (< 0.4)
     💡 Response failed to perform required actions

2 tests passed, 0 failed (2.0s total)

Breaking It Down

Good Response Test

1. Mock AI Response

Simulate a successful AI response with tool calls:
{
  "block": "MockData",
  "config": {
    "data": {
      "text": "I've successfully scheduled...",
      "toolCalls": [...]
    }
  }
}

2. Semantic Validation

Use LLMJudge to validate the response semantically:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "toolCalls": "${aiResponse.toolCalls}",
    "expected": {
      "expectedBehavior": "Should check conflicts and create meeting..."
    }
  }
}

3. Assert High Score

Verify the AI response meets the quality threshold:
{
  "assertions": {
    "judgement.score": { "gt": 0.7 }
  }
}

Bad Response Detection

The second test verifies that poor responses get low scores, ensuring your LLMJudge criteria are working correctly:
{
  "text": "I cannot help with that request.",
  "toolCalls": []
}
This incomplete response should score below 0.4 because it:
  • Doesn’t use required tools
  • Doesn’t provide requested information
  • Fails to complete the task

Custom Criteria

Adjust scoring weights based on what matters most:
{
  "config": {
    "criteria": {
      "accuracy": 0.5,      // Most important: is it correct?
      "completeness": 0.3,  // Did it cover everything?
      "relevance": 0.2      // Is it on-topic?
    }
  }
}
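Both the weights in the complete test above (accuracy 0.4, completeness 0.3, relevance 0.3) and this example sum to 1.0. Assuming LLMJudge combines the criteria as a weighted average, keeping the weights normalized keeps the overall score on the same 0–1 scale that the assertion thresholds (0.7 and 0.4) expect.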

Real-World Usage

Testing OpenAI Chat API

{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "https://api.openai.com/v1/chat/completions",
        "method": "POST",
        "headers": {
          "Authorization": "Bearer ${env.OPENAI_API_KEY}",
          "Content-Type": "application/json"
        },
        "body": {
          "model": "gpt-4o-mini",
          "messages": [
            {
              "role": "user",
              "content": "Schedule a team meeting for tomorrow at 2pm"
            }
          ],
          "tools": [...]
        }
      },
      "output": "response"
    },
    {
      "block": "JsonParser",
      "input": "${response.body}",
      "output": { "parsed": "aiData" }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${aiData.choices[0].message.content}",
        "toolCalls": "${aiData.choices[0].message.tool_calls}",
        "expected": {
          "expectedBehavior": "Should confirm meeting is scheduled..."
        }
      },
      "output": "validation"
    }
  ]
}
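As in the mocked tests above, you would typically finish by asserting on the judge's output. A minimal sketch, reusing the validation output name from the pipeline above (the 0.7 threshold is illustrative):
{
  "assertions": {
    "validation.score": { "gt": 0.7 }
  }
}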

Cost Considerations

LLMJudge uses GPT-4o-mini by default, which costs approximately:
  • $0.0001 per evaluation
  • $0.001 for 10 tests
  • $0.10 for 1,000 tests
For production CI/CD, consider:
  1. Testing a sample of responses (not every one)
  2. Using faster traditional validation where possible
  3. Running full semantic tests nightly
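For example, a common split (a sketch; the file names are illustrative, and the scheduling itself lives in your CI system):
# every commit: fast, deterministic checks only
npx semtest smoke-tests.json

# nightly job: full semantic suite with LLMJudge
npx semtest semantic-tests.json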

Next Steps
