
Overview

The LLMJudge block uses an OpenAI GPT model to evaluate responses semantically instead of by exact string matching.
It requires the OPENAI_API_KEY environment variable. All other blocks work without it!

Why LLM Judge?

AI responses are non-deterministic. These responses all mean the same thing:
  • “The meeting is at 2:00 PM”
  • “Your meeting is scheduled for 2 PM”
  • “I’ve set your meeting for 14:00”
  • “Meeting confirmed for two in the afternoon”
Exact string matching would fail on all but one. LLM Judge understands semantic meaning.
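
For instance, a plain equality assertion pins a test to one exact phrasing (the equals operator below is illustrative; your runner's assertion syntax may differ):

// Brittle: passes for exactly one of the four phrasings above
"assertions": {
  "response.text": { "equals": "The meeting is at 2:00 PM" }
}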

Basic Usage

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "expected": {
      "expectedBehavior": "Should confirm the meeting time and location"
    }
  },
  "config": {
    "model": "gpt-4o-mini"
  },
  "output": "judgement"
}
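
The judged result is stored under the key named by output (judgement here), so later assertions can reference it, as in the Examples section below:

"assertions": {
  "judgement.score": { "gt": 0.7 }
}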

Input Parameters

text (string, required)
The AI response text to evaluate.

toolCalls (array)
Tool calls made by the AI (if any).

expected (object, required)
What the AI should do.

history (array)
Conversation history for multi-turn conversations:
"history": [
  { "role": "user", "content": "Hello" },
  { "role": "assistant", "content": "Hi! How can I help?" },
  { "role": "user", "content": "Book a meeting" }
]

Configuration

model (string, default: "gpt-4o-mini")
OpenAI model to use for judging. Options:
  • gpt-4o-mini (recommended: fast and cheap)
  • gpt-4o
  • gpt-4-turbo

criteria (object)
Custom scoring criteria with weights (must sum to 1.0):
"criteria": {
  "accuracy": 0.4,
  "completeness": 0.3,
  "relevance": 0.2,
  "helpfulness": 0.1
}
Default criteria:
  • accuracy: 0.4
  • completeness: 0.3
  • relevance: 0.3

continueConversation (boolean, default: false)
Whether the judge should suggest a next prompt for multi-turn testing.

maxTurns (number)
Maximum number of conversation turns when using continueConversation.

Output

score (number)
Overall score from 0 to 1.

reasoning (string)
Explanation of the score.

shouldContinue (boolean)
Whether to continue the conversation (when using continueConversation).

nextPrompt (string)
Suggested next user message (when using continueConversation).

details (object)
Detailed breakdown by criteria:
{
  "accuracy": 0.9,
  "completeness": 0.8,
  "relevance": 0.95
}

Examples

Basic Semantic Validation

{
  "pipeline": [
    {
      "block": "HttpRequest",
      "input": {
        "url": "https://api.example.com/chat",
        "method": "POST",
        "body": {
          "message": "What's the weather like?"
        }
      },
      "output": "response"
    },
    {
      "block": "LLMJudge",
      "input": {
        "text": "${response.body.message}",
        "expected": {
          "expectedBehavior": "Should provide weather information or explain how to get it"
        }
      },
      "config": {
        "model": "gpt-4o-mini"
      },
      "output": "judgement"
    }
  ],
  "assertions": {
    "judgement.score": { "gt": 0.7 }
  }
}

Validating Tool Calls

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "toolCalls": "${response.toolCalls}",
    "expected": {
      "expectedBehavior": "Should search for flights and show results",
      "constraints": [
        "Must use search_flights tool",
        "Must confirm dates before booking",
        "Must not book without user confirmation"
      ]
    }
  },
  "config": {
    "model": "gpt-4o-mini"
  },
  "output": "validation"
}
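
For reference, here is a minimal shape that ${response.toolCalls} might resolve to; the exact field names depend on your chat API and are purely illustrative:

"toolCalls": [
  {
    "name": "search_flights",
    "arguments": { "origin": "NYC", "destination": "LAX", "date": "2025-03-15" }
  }
]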

Multi-turn Conversation

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "toolCalls": "${response.toolCalls}",
    "history": [
      { "role": "user", "content": "I need to schedule a meeting" },
      { "role": "assistant", "content": "I'd be happy to help! What time works for you?" },
      { "role": "user", "content": "Tomorrow at 2 PM" }
    ],
    "expected": {
      "expectedBehavior": "Should confirm the time and ask about attendees or location"
    }
  },
  "config": {
    "model": "gpt-4o-mini",
    "continueConversation": true,
    "maxTurns": 5
  },
  "output": "judgement"
}
If judgement.shouldContinue is true, you can use judgement.nextPrompt to continue testing:
{
  "block": "HttpRequest",
  "input": {
    "url": "https://api.example.com/chat",
    "body": {
      "message": "${judgement.nextPrompt}"
    }
  }
}
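
Repeat this loop until judgement.shouldContinue comes back false or the configured maxTurns limit is reached.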

Custom Scoring Criteria

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "expected": {
      "expectedBehavior": "Should provide a detailed technical explanation"
    }
  },
  "config": {
    "model": "gpt-4o-mini",
    "criteria": {
      "accuracy": 0.5,        // Most important
      "completeness": 0.3,    // Somewhat important
      "clarity": 0.2          // Less important
    }
  }
}
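
The weights here (0.5 + 0.3 + 0.2) sum to 1.0, as the criteria option requires, and the named criteria stand in for the defaults. The // annotations are for readability; drop them if your runner parses strict JSON.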

Cost Considerations

LLM Judge uses OpenAI API calls:
Model       | Cost per 1K tokens | Typical cost per evaluation
gpt-4o-mini | $0.00015           | ~$0.001
gpt-4o      | $0.005             | ~$0.02
gpt-4-turbo | $0.01              | ~$0.04
Use gpt-4o-mini for most tests. It’s 30x cheaper than GPT-4 and works great for semantic validation.
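
At those rates, a suite of 1,000 evaluations costs on the order of $1 with gpt-4o-mini, versus roughly $40 on gpt-4-turbo.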

Best Practices

Write specific, detailed expected behaviors:

// Good
"expectedBehavior": "Should search for flights from NYC to LAX on March 15th and present top 3 options sorted by price"

// Bad
"expectedBehavior": "Should search flights"
"expected": {
  "expectedBehavior": "Should book the flight",
  "constraints": [
    "Must confirm price before booking",
    "Must not exceed $500 budget",
    "Must offer travel insurance"
  ]
}
Set score thresholds that match how critical the feature is:

// For critical features
"assertions": {
  "judgement.score": { "gt": 0.8 }
}

// For nice-to-have features
"assertions": {
  "judgement.score": { "gt": 0.6 }
}
Use LLM Judge for semantic meaning, other validators for structure:
{
  "pipeline": [
    { "block": "ValidateTools" },     // Tool calls are correct
    { "block": "ValidateContent" },   // Response has required text
    { "block": "LLMJudge" }           // Response makes semantic sense
  ]
}
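
If your runner stops a pipeline at the first failing block, this ordering also means the cheap structural checks fail fast before any OpenAI call is spent on the judge.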

Understanding the Reasoning

The judge provides detailed reasoning:
{
  "score": 0.85,
  "reasoning": "The response correctly identifies the user's intent to book a flight and uses the search_flights tool with appropriate parameters. It confirms the dates and asks for user confirmation before proceeding. However, it could be more specific about the departure time preferences.",
  "details": {
    "accuracy": 0.9,      // AI understood intent correctly
    "completeness": 0.75, // Missing some details
    "relevance": 0.9      // Response is highly relevant
  }
}
Use this to:
  • Debug why tests fail
  • Understand AI behavior
  • Improve your prompts
