What is Semantic Validation?
Semantic validation means checking whether a response is correct in meaning, not just in exact wording.
The Problem
// User asks: "Schedule a meeting for tomorrow at 2pm"
// All these responses are semantically correct:
"✅ I've scheduled your meeting for 2:00 PM tomorrow"
"✅ Done! Your 2pm meeting is set for tomorrow"
"✅ Meeting booked: Tomorrow at 14:00"
"✅ Your meeting tomorrow at 2 is confirmed"
// Traditional exact matching fails on all but one ❌
The Solution: LLMJudge
LLMJudge uses OpenAI models to evaluate whether responses meet your semantic requirements:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse.text}" ,
"expected" : {
"expectedBehavior" : "Confirms meeting scheduled for 2pm tomorrow"
}
},
"output" : "judgement"
}
Output:
{
score : 0.95 ,
passed : true ,
reasoning : "Response clearly confirms meeting time (2pm) and date (tomorrow)"
}
When to Use LLMJudge
✅ Use LLMJudge for non-deterministic text content:
AI chat responses
Generated summaries
Creative content
Explanations
Natural language descriptions
Example:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"expected" : {
"expectedBehavior" : "Explains the concept clearly in simple terms"
}
}
}
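❌ Don't use LLMJudge for deterministic checks (exact status codes, required keywords, length limits). The traditional validation blocks covered later in this guide are free and instant. A minimal sketch using the ValidateContent options shown in the examples further down; the field names and output name here are illustrative:
{
  "block": "ValidateContent",
  "input": {
    "from": "data.reply",
    "as": "text"
  },
  "config": {
    "minLength": 10,
    "contains": ["password", "reset"]
  },
  "output": "structureCheck"
}
Rule of thumb: if a plain string or structure check can decide pass/fail, skip the LLM call and save the cost.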
Setup
1. Get OpenAI API Key
Sign up at platform.openai.com and create an API key.
2. Set Environment Variable
export OPENAI_API_KEY="sk-..."
Or use a .env file:
# .env
OPENAI_API_KEY=sk-...
Cost: LLMJudge uses GPT-4o-mini by default (~$0.0001 per evaluation). A typical test with 10 AI responses costs ~$0.001.
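Optional: to sanity-check that the key works before running a suite, you can call OpenAI's standard models endpoint directly (this is a plain OpenAI API call, not part of the test runner):
# Lists the models your key can access; a 401 means the key is missing or invalid
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"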
Basic Usage
Simple Validation
{
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/chat" ,
"method" : "POST" ,
"body" : {
"message" : "What's the capital of France?"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.message}" ,
"expected" : {
"expectedBehavior" : "States that Paris is the capital of France"
}
},
"output" : "judgement"
}
],
"assertions" : {
"judgement.score" : { "gt" : 0.8 }
}
}
Be Specific
{
"expectedBehavior" : "Good response"
}
Too vague: what makes a response “good”?
Include Key Details
{
"expectedBehavior" : "Explains the difference between REST and GraphQL APIs, mentioning at least: request/response structure, over-fetching, and type safety"
}
Use Action Verbs
Good verbs: confirms, explains, lists, mentions, asks, provides, includes, summarizes
{
"expectedBehavior" : "Lists 3-5 main advantages of TypeScript over JavaScript"
}
Custom Scoring Criteria
Adjust how much each scoring dimension counts toward the overall score:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"expected" : {
"expectedBehavior" : "Helpful customer service response"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.5 ,
"completeness" : 0.3 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
Output:
{
score : 0.87 ,
reasoning : "Response is accurate and complete..." ,
details : {
accuracy : 0.9 ,
completeness : 0.85 ,
relevance : 0.87
}
}
Available Criteria
LLMJudge supports three scoring criteria. You can adjust their weights:
{
"config" : {
"criteria" : {
"accuracy" : 0.4 , // How correct is the response? (default: 0.4)
"completeness" : 0.3 , // Does it cover all required points? (default: 0.3)
"relevance" : 0.3 // Is it relevant to the question? (default: 0.3)
}
}
}
Weights must sum to 1.0. The three criteria are hardcoded and cannot be customized, but you can adjust their relative importance by changing the weights.
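If it helps to reason about score thresholds, treat the overall score as a weighted average of the per-criterion scores. This is an assumption based on the example output above rather than a documented formula, and the numbers below are made up:
// Hypothetical per-criterion scores, as they appear in judgement.details
const details = { accuracy: 0.9, completeness: 0.8, relevance: 0.9 };

// Default weights from the config above
const weights = { accuracy: 0.4, completeness: 0.3, relevance: 0.3 };

// Weighted average: 0.4*0.9 + 0.3*0.8 + 0.3*0.9 = 0.87
const overall = Object.keys(weights)
  .reduce((sum, key) => sum + weights[key] * details[key], 0);

console.log(overall.toFixed(2)); // "0.87"
Knowing the weights also makes it easier to choose per-criterion assertions, like the judgement.details.accuracy check used later in this guide.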
Example Weight Adjustments
Prioritize Accuracy
{
"criteria" : {
"accuracy" : 0.6 ,
"completeness" : 0.2 ,
"relevance" : 0.2
}
}
Use when correctness is most important (e.g., factual questions).
Balanced
Keep the defaults (accuracy 0.4, completeness 0.3, relevance 0.3) for general-purpose checks.
Prioritize Completeness
Raise the completeness weight (and lower the others so the total stays at 1.0) when covering every required point matters most, such as summaries or step-by-step instructions.
Model Selection
Choose the right model for your use case:
{
"config" : {
"model" : "gpt-4o-mini" // or "gpt-4o", "gpt-4-turbo"
}
}
Model Comparison
| Model | Speed | Cost | Accuracy | Use Case |
|---|---|---|---|---|
| gpt-4o-mini | ⚡ Fast | 💰 $0.0001 | ⭐⭐⭐ Good | Recommended: fast, cheap, accurate |
| gpt-4o | ⚡ Fast | 💰💰 $0.0025 | ⭐⭐⭐⭐ Better | Complex evaluations |
| gpt-4-turbo | 🐌 Slower | 💰💰💰 $0.01 | ⭐⭐⭐⭐⭐ Best | Critical validations |
Start with gpt-4o-mini. It’s 25x cheaper than gpt-4-turbo and handles 95% of use cases well.
Real-World Examples
1. Testing ChatGPT-Style Interface
{
"name" : "Chatbot Quality Test" ,
"context" : {
"CHAT_URL" : "${env.CHAT_API_URL}"
},
"tests" : [{
"id" : "test-helpful-response" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${CHAT_URL}/chat" ,
"method" : "POST" ,
"body" : {
"message" : "How do I reset my password?"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.reply}" ,
"expected" : {
"expectedBehavior" : "Provides clear step-by-step instructions to reset password, mentioning the settings or account page, being polite and helpful, and not asking for current password in chat"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.4 ,
"completeness" : 0.4 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
],
"assertions" : {
"response.status" : 200 ,
"judgement.score" : { "gte" : 0.8 },
"judgement.details.completeness" : { "gte" : 0.7 }
}
}]
}
2. Testing Content Summarization
{
"tests" : [{
"id" : "test-article-summary" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/summarize" ,
"method" : "POST" ,
"body" : {
"text" : "Long article text here..." ,
"maxLength" : 100
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "ValidateContent" ,
"input" : {
"from" : "data.summary" ,
"as" : "text"
},
"config" : {
"maxLength" : 120
},
"output" : "lengthCheck"
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.summary}" ,
"expected" : {
"expectedBehavior" : "Captures the main points: company revenue growth, new product launch, and market expansion plans"
}
},
"output" : "judgement"
}
],
"assertions" : {
"lengthCheck.passed" : true ,
"judgement.score" : { "gt" : 0.85 }
}
}]
}
3. Testing Code Generation
{
"tests" : [{
"id" : "test-code-explanation" ,
"pipeline" : [
{
"block" : "HttpRequest" ,
"input" : {
"url" : "${AI_API}/explain-code" ,
"method" : "POST" ,
"body" : {
"code" : "const sum = arr => arr.reduce((a,b) => a+b, 0)" ,
"language" : "javascript"
}
},
"output" : "response"
},
{
"block" : "JsonParser" ,
"input" : "${response.body}" ,
"output" : { "parsed" : "data" }
},
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${data.explanation}" ,
"expected" : {
"expectedBehavior" : "Explains that the code sums array elements using reduce, mentioning the accumulator pattern in a beginner-friendly way, explaining the reduce function and the initial value (0)"
}
},
"config" : {
"criteria" : {
"accuracy" : 0.5 ,
"completeness" : 0.3 ,
"relevance" : 0.2
}
},
"output" : "judgement"
}
],
"assertions" : {
"judgement.details.accuracy" : { "gte" : 0.9 }
}
}]
}
Cost Optimization
LLMJudge makes API calls to OpenAI. Here’s how to minimize costs:
1. Use gpt-4o-mini
{
"config" : {
"model" : "gpt-4o-mini" // 25x cheaper than gpt-4-turbo
}
}
Savings: ~$0.99 per 100 evaluations vs $25 with gpt-4-turbo.
2. Combine with Traditional Validation
Don’t use LLMJudge for everything:
{
"pipeline" : [
// Fast, free structural checks first
{
"block" : "ValidateContent" ,
"config" : {
"minLength" : 10 ,
"contains" : [ "order" , "confirmed" ]
}
},
// Then semantic validation
{
"block" : "LLMJudge" ,
"input" : {
"expected" : {
"expectedBehavior" : "Professional and reassuring tone"
}
}
}
]
}
3. Cache Common Validations
If you're testing the same AI responses repeatedly during development:
# Run once, save results
npx semtest tests.json --html --output results.html
# Review results.html instead of re-running
4. Test Samples in CI/CD
{
"tests" : [
{
"id" : "quick-smoke-test" ,
"pipeline" : [
// Test 1 representative example
]
}
]
}
Run the full suite less frequently (e.g., in nightly builds), as sketched below.
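A minimal sketch of that split, assuming two hypothetical test files (tests-smoke.json, tests-full.json); only the npx semtest invocation and the --html/--output flags come from this guide:
# Every pull request: one representative example per feature
npx semtest tests-smoke.json

# Nightly build: full suite with an HTML report
npx semtest tests-full.json --html --output nightly-results.html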
Best Practices
1. Be Specific in Expected Behavior
// ❌ Too vague
"expectedBehavior" : "Good answer"
// ✅ Specific and measurable
"expectedBehavior" : "Lists 3-5 benefits of exercise, mentions both physical and mental health"
2. Spell Out Requirements and Prohibitions
{
"expected" : {
"expectedBehavior" : "Provides medical information including a disclaimer to consult a doctor. Must not diagnose conditions or prescribe medications, only provide general health information."
}
}
3. Set Appropriate Score Thresholds
{
"assertions" : {
// Strict for critical features
"criticalCheck.score" : { "gte" : 0.95 },
// Lenient for nice-to-haves
"toneCheck.score" : { "gte" : 0.7 }
}
}
4. Combine Multiple Validation Types
{
"pipeline" : [
// Structure check (free, fast)
{ "block" : "ValidateContent" , "config" : { "contains" : "order" } },
// Tool validation (free, fast)
{ "block" : "ValidateTools" , "config" : { "expected" : [ "send_email" ] } },
// Semantic validation (costs $, slower)
{ "block" : "LLMJudge" , "input" : { "expected" : { ... } } }
]
}
5. Include Context When Needed
For conversations, pass the conversation history:
{
"block" : "LLMJudge" ,
"input" : {
"text" : "${aiResponse}" ,
"history" : [
{ "role" : "user" , "content" : "I'm looking for a laptop" },
{ "role" : "assistant" , "content" : "What's your budget?" },
{ "role" : "user" , "content" : "Around $1000" }
],
"expected" : {
"expectedBehavior" : "Recommends laptops in $800-1200 range"
}
}
}
Troubleshooting
Low Scores Despite Good Responses
Problem: judgement.score is 0.6 but response looks good
Solutions:
Make expectedBehavior more specific
Lower the score threshold: { "gte": 0.7 } instead of { "gte": 0.9 }
Check whether the response actually meets the stated expectations
High Costs
Problem: Test suite costs $5+ per run
Solutions:
Switch to gpt-4o-mini
Use traditional validation where possible
Test fewer examples in CI
Cache results during development
Inconsistent Scores
Problem: Same response gets different scores (0.85, then 0.78, then 0.92)
Solution: Some variance is normal with LLM-based scoring. Set thresholds with headroom:
{
"assertions" : {
"judgement.score" : { "gte" : 0.75 } // Allow some variance
}
}
Missing Context
Problem: LLMJudge doesn’t understand multi-turn conversation
Solution: Pass conversation history:
{
"input" : {
"text" : "${aiResponse}" ,
"history" : [ ...previous messages... ]
}
}
Next Steps