
What is Semantic Validation?

Semantic validation means checking if a response is correct in meaning, not just in exact wording.

The Problem

// User asks: "Schedule a meeting for tomorrow at 2pm"

// All these responses are semantically correct:
"✅ I've scheduled your meeting for 2:00 PM tomorrow"
"✅ Done! Your 2pm meeting is set for tomorrow"
"✅ Meeting booked: Tomorrow at 14:00"
"✅ Your meeting tomorrow at 2 is confirmed"

// Traditional exact matching fails on all but one ❌

The Solution: LLMJudge

LLMJudge uses OpenAI models to evaluate whether responses meet your semantic requirements:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse.text}",
    "expected": {
      "expectedBehavior": "Confirms meeting scheduled for 2pm tomorrow"
    }
  },
  "output": "judgement"
}
Output:
{
  score: 0.95,
  passed: true,
  reasoning: "Response clearly confirms meeting time (2pm) and date (tomorrow)"
}

When to Use LLMJudge

✅ Use LLMJudge for non-deterministic text content (for deterministic checks, see the sketch after the example below):
  • AI chat responses
  • Generated summaries
  • Creative content
  • Explanations
  • Natural language descriptions
Example:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse}",
    "expected": {
      "expectedBehavior": "Explains the concept clearly in simple terms"
    }
  }
}
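❌ Don't use LLMJudge for deterministic checks — exact values, required keywords, status codes, or response structure — where the free validation blocks covered later on this page are faster. A minimal sketch of that split (the data.reply field and keyword are illustrative):
{
  "pipeline": [
    // Deterministic check: free and instant, no LLM call
    {
      "block": "ValidateContent",
      "input": { "from": "data.reply", "as": "text" },
      "config": { "minLength": 10, "contains": ["password"] },
      "output": "structureCheck"
    },
    // Semantic check: reserve the LLM call for wording that varies
    {
      "block": "LLMJudge",
      "input": {
        "text": "${data.reply}",
        "expected": {
          "expectedBehavior": "Explains how to reset a password in a polite, helpful tone"
        }
      },
      "output": "judgement"
    }
  ]
}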

Setup

1. Get OpenAI API Key

Sign up at platform.openai.com and create an API key.

2. Set Environment Variable

export OPENAI_API_KEY="sk-..."
Or use a .env file:
# .env
OPENAI_API_KEY=sk-...
Cost: LLMJudge uses GPT-4o-mini by default (~$0.0001 per evaluation). A typical test with 10 AI responses costs ~$0.001.

Basic Usage

Simple Validation

{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "${AI_API}/chat",
        "method": "POST",
        "body": {
          "message": "What's the capital of France?"
        }
      },
      "output": "response"
    },
    {
      "block": "JsonParser",
      "input": "${response.body}",
      "output": { "parsed": "data" }
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${data.message}",
        "expected": {
          "expectedBehavior": "States that Paris is the capital of France"
        }
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "judgement.score": { "gt": 0.8 }
  }
}

Expected Behavior Format

Be Specific

❌ Vague:
{
  "expectedBehavior": "Good response"
}
Too vague - what makes a response “good”?
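✅ Specific — an illustrative rewrite that gives the judge something concrete to score:
{
  "expectedBehavior": "Confirms the order has shipped, includes the tracking number, and apologizes for the delay"
}
Each clause is something the judge can verify as present or absent.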

Include Key Details

{
  "expectedBehavior": "Explains the difference between REST and GraphQL APIs, mentioning at least: request/response structure, over-fetching, and type safety"
}

Use Action Verbs

Good verbs: confirms, explains, lists, mentions, asks, provides, includes, summarizes
{
  "expectedBehavior": "Lists 3-5 main advantages of TypeScript over JavaScript"
}

Custom Scoring Criteria

Define your own scoring dimensions:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse}",
    "expected": {
      "expectedBehavior": "Helpful customer service response"
    }
  },
  "config": {
    "criteria": {
      "accuracy": 0.5,
      "completeness": 0.3,
      "relevance": 0.2
    }
  },
  "output": "judgement"
}
Output:
{
  score: 0.87,
  reasoning: "Response is accurate and complete...",
  details: {
    accuracy: 0.9,
    completeness: 0.85,
    relevance: 0.87
  }
}
Weights must sum to 1.0

Available Criteria

LLMJudge supports three scoring criteria. You can adjust their weights:
{
  "config": {
    "criteria": {
      "accuracy": 0.4,      // How correct is the response? (default: 0.4)
      "completeness": 0.3,  // Does it cover all required points? (default: 0.3)
      "relevance": 0.3      // Is it relevant to the question? (default: 0.3)
    }
  }
}
Weights must sum to 1.0. The three criteria are hardcoded and cannot be customized, but you can adjust their relative importance by changing the weights.

Example Weight Adjustments

Prioritize accuracy (the other presets are sketched below):
{
  "criteria": {
    "accuracy": 0.6,
    "completeness": 0.2,
    "relevance": 0.2
  }
}
Use when correctness is most important (e.g., factual questions).
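Sketches of the other two presets — the balanced split simply restates the defaults, and the completeness-heavy weights are an illustrative assumption:
// Balanced (the documented defaults)
{
  "criteria": {
    "accuracy": 0.4,
    "completeness": 0.3,
    "relevance": 0.3
  }
}
Use when no single dimension dominates.

// Prioritize completeness (illustrative weights)
{
  "criteria": {
    "accuracy": 0.2,
    "completeness": 0.6,
    "relevance": 0.2
  }
}
Use when the response must cover every required point (e.g., summaries or checklists).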

Model Selection

Choose the right model for your use case:
{
  "config": {
    "model": "gpt-4o-mini"  // or "gpt-4o", "gpt-4-turbo"
  }
}

Model Comparison

Model         Speed       Cost           Accuracy          Use Case
gpt-4o-mini   ⚡ Fast      💰 $0.0001     ⭐⭐⭐ Good          Recommended - Fast, cheap, accurate
gpt-4o        ⚡ Fast      💰💰 $0.0025    ⭐⭐⭐⭐ Better       Complex evaluations
gpt-4-turbo   🐌 Slower    💰💰💰 $0.01     ⭐⭐⭐⭐⭐ Best        Critical validations
Start with gpt-4o-mini. It’s 25x cheaper than gpt-4-turbo and handles 95% of use cases well.

Real-World Examples

1. Testing ChatGPT-Style Interface

{
  "name": "Chatbot Quality Test",
  "context": {
    "CHAT_URL": "${env.CHAT_API_URL}"
  },
  "tests": [{
    "id": "test-helpful-response",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${CHAT_URL}/chat",
          "method": "POST",
          "body": {
            "message": "How do I reset my password?"
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.reply}",
          "expected": {
            "expectedBehavior": "Provides clear step-by-step instructions to reset password, mentioning the settings or account page, being polite and helpful, and not asking for current password in chat"
          }
        },
        "config": {
          "criteria": {
            "accuracy": 0.4,
            "completeness": 0.4,
            "relevance": 0.2
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "response.status": 200,
      "judgement.score": { "gte": 0.8 },
      "judgement.details.completeness": { "gte": 0.7 }
    }
  }]
}

2. Testing Content Summarization

{
  "tests": [{
    "id": "test-article-summary",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_API}/summarize",
          "method": "POST",
          "body": {
            "text": "Long article text here...",
            "maxLength": 100
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "ValidateContent",
        "input": {
          "from": "data.summary",
          "as": "text"
        },
        "config": {
          "maxLength": 120
        },
        "output": "lengthCheck"
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.summary}",
          "expected": {
            "expectedBehavior": "Captures the main points: company revenue growth, new product launch, and market expansion plans"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "lengthCheck.passed": true,
      "judgement.score": { "gt": 0.85 }
    }
  }]
}

3. Testing Code Generation

{
  "tests": [{
    "id": "test-code-explanation",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${AI_API}/explain-code",
          "method": "POST",
          "body": {
            "code": "const sum = arr => arr.reduce((a,b) => a+b, 0)",
            "language": "javascript"
          }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.explanation}",
          "expected": {
            "expectedBehavior": "Explains that the code sums array elements using reduce, mentioning the accumulator pattern in a beginner-friendly way, explaining the reduce function and the initial value (0)"
          }
        },
        "config": {
          "criteria": {
            "accuracy": 0.5,
            "completeness": 0.3,
            "relevance": 0.2
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "judgement.details.accuracy": { "gte": 0.9 }
    }
  }]
}

Cost Optimization

LLMJudge makes API calls to OpenAI. Here’s how to minimize costs:

1. Use gpt-4o-mini

{
  "config": {
    "model": "gpt-4o-mini"  // 25x cheaper than gpt-4-turbo
  }
}
Savings: ~$0.99 per 100 evaluations vs $25 with gpt-4-turbo.

2. Combine with Traditional Validation

Don’t use LLMJudge for everything:
{
  "pipeline": [
    // Fast, free structural checks first
    {
      "block": "ValidateContent",
      "config": {
        "minLength": 10,
        "contains": ["order", "confirmed"]
      }
    },
    // Then semantic validation
    {
      "block": "LLMJudge",
      "input": {
        "expected": {
          "expectedBehavior": "Professional and reassuring tone"
        }
      }
    }
  ]
}

3. Cache Common Validations

If you're testing the same AI responses repeatedly during development:
# Run once, save results
npx semtest tests.json --html --output results.html

# Review results.html instead of re-running

4. Test Samples in CI/CD

{
  "tests": [
    {
      "id": "quick-smoke-test",
      "pipeline": [
        // Test 1 representative example
      ]
    }
  ]
}
Run full suite less frequently (nightly builds).
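Filling in the placeholder, a minimal smoke test might look like this — it reuses the chatbot example above, so the endpoint, field names, and threshold are illustrative:
{
  "tests": [{
    "id": "quick-smoke-test",
    "pipeline": [
      {
        "block": "HttpRequest",
        "input": {
          "url": "${CHAT_URL}/chat",
          "method": "POST",
          "body": { "message": "How do I reset my password?" }
        },
        "output": "response"
      },
      {
        "block": "JsonParser",
        "input": "${response.body}",
        "output": { "parsed": "data" }
      },
      {
        "block": "LLMJudge",
        "input": {
          "text": "${data.reply}",
          "expected": {
            "expectedBehavior": "Provides clear password reset instructions"
          }
        },
        "output": "judgement"
      }
    ],
    "assertions": {
      "judgement.score": { "gte": 0.8 }
    }
  }]
}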

Best Practices

Write expected behaviors that are specific and measurable:
// ❌ Too vague
"expectedBehavior": "Good answer"

// ✅ Specific and measurable
"expectedBehavior": "Lists 3-5 benefits of exercise, mentions both physical and mental health"
Spell out both what the response must include and what it must not do:
{
  "expected": {
    "expectedBehavior": "Provides medical information including a disclaimer to consult a doctor. Must not diagnose conditions or prescribe medications, only provide general health information."
  }
}
Match score thresholds to how critical each check is:
{
  "assertions": {
    // Strict for critical features
    "criticalCheck.score": { "gte": 0.95 },

    // Lenient for nice-to-haves
    "toneCheck.score": { "gte": 0.7 }
  }
}
Run free, fast checks before the paid semantic check:
{
  "pipeline": [
    // Structure check (free, fast)
    { "block": "ValidateContent", "config": { "contains": "order" } },

    // Tool validation (free, fast)
    { "block": "ValidateTools", "config": { "expected": ["send_email"] } },

    // Semantic validation (costs $, slower)
    { "block": "LLMJudge", "input": { "expected": {...} } }
  ]
}
For conversations, pass history:
{
  "block": "LLMJudge",
  "input": {
    "text": "${aiResponse}",
    "history": [
      { "role": "user", "content": "I'm looking for a laptop" },
      { "role": "assistant", "content": "What's your budget?" },
      { "role": "user", "content": "Around $1000" }
    ],
    "expected": {
      "expectedBehavior": "Recommends laptops in $800-1200 range"
    }
  }
}

Troubleshooting

Low Scores Despite Good Responses

Problem: judgement.score is 0.6 but the response looks good.
Solutions:
  1. Make expectedBehavior more specific
  2. Lower score threshold: { "gte": 0.7 } instead of { "gte": 0.9 }
  3. Check if response actually meets expectations

High Costs

Problem: Test suite costs $5+ per run.
Solutions:
  1. Switch to gpt-4o-mini
  2. Use traditional validation where possible
  3. Test fewer examples in CI
  4. Cache results during development

Inconsistent Scores

Problem: The same response gets different scores (0.85, then 0.78, then 0.92).
Solution: Some variance is normal with LLM-based scoring. Set a threshold that tolerates it:
{
  "assertions": {
    "judgement.score": { "gte": 0.75 }  // Allow some variance
  }
}

Missing Context

Problem: LLMJudge doesn't understand the multi-turn conversation.
Solution: Pass the conversation history:
{
  "input": {
    "text": "${aiResponse}",
    "history": [...previous messages...]
  }
}
