{ "block": "LLMJudge", "input": { "text": "${response.text}", "toolCalls": "${response.toolCalls}", "history": [ { "role": "user", "content": "I need to schedule a meeting" }, { "role": "assistant", "content": "I'd be happy to help! What time works for you?" }, { "role": "user", "content": "Tomorrow at 2 PM" } ], "expected": { "expectedBehavior": "Should confirm the time and ask about attendees or location" } }, "config": { "model": "gpt-4o-mini", "continueConversation": true, "maxTurns": 5 }, "output": "judgement"}
If `judgement.shouldContinue` is true, you can use `judgement.nextPrompt` to continue testing:
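For example, an intermediate judgement might look like the sketch below. The `shouldContinue` and `nextPrompt` fields come from the sentence above; the other fields mirror the output format shown at the end of this page, and the exact shape is an assumption:

```json
{
  "score": 0.7,
  "reasoning": "The assistant confirmed the time but has not yet asked about attendees or a location.",
  "shouldContinue": true,
  "nextPrompt": "Invite Alice and Bob, and book the usual conference room."
}
```

The idea is to feed `nextPrompt` back as the next user message, repeating until `shouldContinue` is false or `maxTurns` is reached.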
// Good"expectedBehavior": "Should search for flights from NYC to LAX on March 15th and present top 3 options sorted by price"// Bad"expectedBehavior": "Should search flights"
Use Constraints for Important Rules
"expected": { "expectedBehavior": "Should book the flight", "constraints": [ "Must confirm price before booking", "Must not exceed $500 budget", "Must offer travel insurance" ]}
Set Reasonable Score Thresholds
```json
// For critical features
"assertions": {
  "judgement.score": { "gt": 0.8 }
}

// For nice-to-have features
"assertions": {
  "judgement.score": { "gt": 0.6 }
}
```
Combine with Other Validators
Use LLM Judge for semantic meaning, other validators for structure:
{ "pipeline": [ { "block": "ValidateTools" }, // Tool calls are correct { "block": "ValidateContent" }, // Response has required text { "block": "LLMJudge" } // Response makes semantic sense ]}
{ "score": 0.85, "reasoning": "The response correctly identifies the user's intent to book a flight and uses the search_flights tool with appropriate parameters. It confirms the dates and asks for user confirmation before proceeding. However, it could be more specific about the departure time preferences.", "details": { "accuracy": 0.9, // AI understood intent correctly "completeness": 0.75, // Missing some details "relevance": 0.9 // Response is highly relevant }}