evaluator(). This guide covers how to design judge prompts, pick grading scales, avoid common pitfalls, and structure evaluators that give your workflows useful signal.
Start with Failure Modes, Not Metrics
The most common mistake is building evaluators before understanding what actually goes wrong. Don’t start with “I need a quality score.” Start by looking at 20-50 real outputs from your workflow and identifying specific failure patterns. For a lead enrichment summary, you might find:
- The summary hallucinates details not in the source data
- It’s too generic — could describe any company
- It misses the prospect’s industry or key product
- It’s too long and buries the useful information
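Some of those failure modes can already be expressed as plain checks. A hedged `evaluators.ts` sketch, where the function names and the 120-word limit are illustrative rather than any framework's API:

```typescript
// Sketch: turning observed failure modes into individual checks. The
// function names and the 120-word limit are illustrative, not part of
// any specific framework.

interface Summary {
  text: string;
}

// Failure mode: "too long and buries the useful information"
function isConcise(summary: Summary, maxWords = 120): boolean {
  return summary.text.trim().split(/\s+/).length <= maxWords;
}

// Failure mode: "misses the prospect's industry"
function mentionsIndustry(summary: Summary, industry: string): boolean {
  return summary.text.toLowerCase().includes(industry.toLowerCase());
}

// Failure modes like hallucination or genericness can't be caught with
// string checks; those need the LLM judges covered later in this guide.
```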
Prefer Pass/Fail Over Numeric Scales
When an LLM rates something 6 out of 10, what does that mean? Is it good enough? Almost good? The number sounds precise but it’s actually vague — the LLM isn’t calibrated, and a 6 from one run might be a 7 from another. Binary pass/fail forces clarity. Either the summary mentions the company’s industry or it doesn’t. Either the email has a clear call-to-action or it doesn’t. This makes your evaluators more consistent and your workflow logic simpler. If you genuinely need gradations, use a three-level scale: pass, borderline, fail. This gives you a middle ground without the false precision of 1-10 scores.
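As an `evaluators.ts` sketch of the difference (the verdict types and the call-to-action regex are illustrative):

```typescript
// Sketch: verdict types for a pass/fail judge, plus an optional ternary
// middle ground. Names are illustrative, not a specific framework's API.

type BinaryVerdict = "pass" | "fail";
type TernaryVerdict = "pass" | "borderline" | "fail";

// A binary check is just a predicate: no calibration needed.
function hasCallToAction(email: string): BinaryVerdict {
  return /\b(book|schedule|reply|call|sign up)\b/i.test(email)
    ? "pass"
    : "fail";
}

// Ternary verdicts still map to concrete workflow decisions, not scores.
function shouldRetry(verdict: TernaryVerdict): boolean {
  return verdict === "fail"; // borderline ships, fail regenerates
}
```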
Writing Judge Prompts
The judge prompt is everything. A vague prompt produces vague evaluations. Here’s how to write prompts that give consistent, useful results.
Be Specific About Criteria
Don’t ask “is this good?” — define what “good” means for your use case.
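For illustration, a vague prompt like “Is this a good summary? Answer pass or fail.” leaves the model to guess your standards. A specific `judge_summary@v1.prompt` might instead read as follows; the criteria and the `{{...}}` template variables are assumptions for this sketch, so adapt them to your own templating syntax:

```
You are evaluating a summary of lead enrichment data.

Mark it "pass" only if ALL of the following hold:
1. Every claim in the summary appears in the source data below.
2. It names the prospect's industry and at least one key product.
3. It is specific to this company, not boilerplate that could
   describe any competitor.
4. It is under 100 words.

Otherwise mark it "fail".

Source data:
{{source_data}}

Summary:
{{summary}}
```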
Ask for Reasoning Before the Verdict
LLMs produce better judgments when they think through the evaluation before committing to a score. Structure your schema so the reasoning comes first.
Use Temperature 0
Judge prompts should use `temperature: 0` for maximum consistency. You want the same input to produce the same evaluation every time. Save creativity for the generation step, not the judging step.
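A sketch that combines both points, a reasoning-first result shape and a temperature-0 judge call; the JSON-Schema-style object and the commented-out `generate()` call are assumptions, not a specific SDK's API:

```typescript
// Sketch: a structured-output schema where reasoning precedes the verdict.
// The JSON-Schema-style object and the generate() call mentioned below are
// assumptions; use your framework's structured-output mechanism.

interface Judgment {
  reasoning: string; // the model works through the criteria first...
  verdict: "pass" | "fail"; // ...and only then commits to a verdict
}

// LLMs emit JSON token by token, so putting `reasoning` before `verdict`
// means the verdict is conditioned on the written analysis.
const judgmentSchema = {
  type: "object",
  properties: {
    reasoning: { type: "string" },
    verdict: { type: "string", enum: ["pass", "fail"] },
  },
  required: ["reasoning", "verdict"],
};

// Hypothetical judge call: note temperature 0 for repeatability.
// const judgment = await generate<Judgment>({
//   prompt: judgePrompt,
//   schema: judgmentSchema,
//   temperature: 0,
// });
```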
Define Each Label
If you’re using categories like pass/borderline/fail, define exactly what each one means in the prompt:
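A sketch of what those definitions might look like inside the judge prompt (the exact wording and criteria are illustrative):

```
Respond with exactly one label:

pass: every claim is supported by the source data, the industry is
  named, and the summary is under 100 words
borderline: accurate and grounded, but generic or slightly over the
  length limit; usable after light editing
fail: contains claims not supported by the source data, or omits the
  industry entirely
```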
One Evaluator, One Concern
Don’t build a single evaluator that judges accuracy, tone, length, and relevance all at once. Split them up. Each evaluator should check one specific thing. This matters for two reasons:
- Debugging — When a summary fails, you know why it failed. “Failed hallucination check” is actionable. “Got a quality score of 4” is not.
- Flow control — Different failures need different responses. A hallucinated summary should be regenerated. A summary that’s too short might just need a different prompt variant.
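The `workflow.ts` wiring might look like this sketch, where each evaluator result carries a name and different failures route to different actions (the check names and actions are illustrative):

```typescript
// Sketch: each evaluator checks one concern, and failures route to
// different responses. Check names and actions are illustrative.

type CheckName = "hallucination" | "specificity" | "length";
type Action = "regenerate" | "switch-prompt-variant" | "accept";

interface CheckResult {
  name: CheckName;
  pass: boolean;
}

// Different failures need different responses: a hallucinated summary is
// regenerated, while other misses just get a different prompt variant.
function routeFailure(results: CheckResult[]): Action {
  const failed = results.filter((r) => !r.pass).map((r) => r.name);
  if (failed.includes("hallucination")) return "regenerate";
  if (failed.length > 0) return "switch-prompt-variant";
  return "accept";
}
```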
Layer Deterministic Checks Before LLM Judges
LLM judges cost money and take time. Don’t burn an LLM call on something you can check with code. Run cheap deterministic evaluators first, and only call the LLM judge if the basic checks pass.
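A `workflow.ts` sketch of that layering; the word-count bounds and the boilerplate regex are illustrative stand-ins for your own cheap checks, and a real judge would be an async model call (synchronous here to keep the sketch self-contained):

```typescript
// Sketch: cheap deterministic checks run first; the (injected) LLM judge
// is only consulted when they pass. A real judge would be async.

type Stage = "deterministic" | "llm";

function evaluateSummary(
  summary: string,
  llmJudge: (s: string) => boolean
): { pass: boolean; stage: Stage } {
  // Layer 1: free, instant checks in plain code.
  const words = summary.trim().split(/\s+/).length;
  if (words < 10 || words > 150) {
    return { pass: false, stage: "deterministic" };
  }
  if (/as an ai language model/i.test(summary)) {
    return { pass: false, stage: "deterministic" };
  }
  // Layer 2: only now spend an LLM call.
  return { pass: llmJudge(summary), stage: "llm" };
}
```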
Evaluators in Agent Loops
The generate-evaluate-retry pattern is the foundation of agentic workflows. An agent that can check its own work and improve is fundamentally more reliable than one that generates a single output and hopes for the best. The pattern scales beyond simple retry loops. You can use evaluator results to:
- Switch strategies — if a summary keeps failing the hallucination check, try a different prompt or a more conservative model
- Accumulate context — pass the evaluator’s reasoning back into the next generation as feedback
- Gate progression — only move to the next stage of a pipeline when the current output meets quality thresholds
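A minimal `workflow.ts` sketch of such a loop with feedback accumulation. The generator and evaluator are injected, and are synchronous here for brevity; real ones would be async model calls, and the three-attempt cap is illustrative:

```typescript
// Sketch of the generate-evaluate-retry loop with feedback accumulation.
// Generator and evaluator are injected; names are illustrative.

interface EvalResult {
  pass: boolean;
  reasoning: string;
}

function generateWithRetry(
  generate: (feedback: string[]) => string,
  evaluate: (output: string) => EvalResult,
  maxAttempts = 3
): { output: string; attempts: number; passed: boolean } {
  const feedback: string[] = []; // evaluator reasoning, fed back each retry
  let output = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    output = generate(feedback);
    const result = evaluate(output);
    if (result.pass) return { output, attempts: attempt, passed: true };
    feedback.push(result.reasoning); // accumulate context for the next try
  }
  // Gate downstream pipeline stages on `passed`.
  return { output, attempts: maxAttempts, passed: false };
}
```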
Weighted Rubric Evaluators
Sometimes you need more nuance than a single pass/fail but want to avoid the inconsistency of asking an LLM for a numeric score. The solution: ask the LLM a series of specific yes/no questions, then compute a weighted score from the boolean answers yourself. This gives you the best of both worlds — the LLM makes simple binary decisions it’s good at, and you control the scoring logic deterministically.
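One way the `evaluators.ts` rubric might be sketched; the four criteria and their weights are invented for illustration:

```typescript
// Sketch: the LLM answers specific yes/no questions; the weighted score
// is computed in code. Criteria and weights are illustrative.

interface Criterion {
  id: string;
  question: string;
  weight: number;
}

const rubric: Criterion[] = [
  { id: "grounded", question: "Is every claim supported by the source data?", weight: 0.4 },
  { id: "industry", question: "Does it name the company's industry?", weight: 0.3 },
  { id: "specific", question: "Is it specific to this company?", weight: 0.2 },
  { id: "concise", question: "Is it under 100 words?", weight: 0.1 },
];

// `answers` comes back from the LLM as booleans keyed by criterion id.
function scoreRubric(answers: Record<string, boolean>): number {
  const total = rubric.reduce((sum, c) => sum + c.weight, 0);
  const earned = rubric.reduce(
    (sum, c) => sum + (answers[c.id] ? c.weight : 0),
    0
  );
  return earned / total; // always in [0, 1], deterministic given the answers
}
```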
Given the same answers, the score always comes out the same, because the scoring logic itself is deterministic — it’s the LLM’s yes/no answers that carry the uncertainty, and those are simple enough to be reliable. The weights and criteria live in your code, so you can tune them without touching the prompt.
Use the score in your workflow to set thresholds:
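A sketch of that threshold logic; the 0.8 and 0.5 cutoffs and the action names are illustrative and should be tuned against labeled examples:

```typescript
// Sketch: map the rubric score onto workflow decisions with explicit
// thresholds. Cutoffs and action names are illustrative.

type Decision = "accept" | "retry-with-feedback" | "escalate-to-human";

function decide(score: number): Decision {
  if (score >= 0.8) return "accept";
  if (score >= 0.5) return "retry-with-feedback"; // salvageable
  return "escalate-to-human"; // badly off; a retry likely won't fix it
}
```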
Common Pitfalls
Using the Same Model to Generate and Judge
If Claude generates a summary and Claude judges it, there’s a self-enhancement bias — the model tends to rate its own output more favorably. When possible, use a different model for judging, or at minimum validate your judge against human-labeled examples.
Trusting High Pass Rates
If your evaluator passes 100% of outputs, it’s not checking hard enough. A useful evaluator should catch real failures. If everything passes, either your generator is perfect (unlikely) or your judge criteria are too loose.
Numeric Scales Without Clear Anchors
If you must use numeric scores, anchor every point on the scale with a concrete example or definition. “7 out of 10” means nothing without context. “7 = meets all criteria but lacks specific financial data” is actionable.
Evaluating Too Many Things at Once
A prompt that asks “rate this on accuracy, tone, completeness, and relevance” will produce muddled scores. The LLM trades off attention between criteria and the results become inconsistent. Split into separate evaluators.
Skipping the Manual Review Phase
Before trusting an evaluator in production, run it against 30-50 examples and compare its judgments to your own. If you disagree with the evaluator more than 10-15% of the time, the judge prompt needs work.
Checklist
When building a new evaluator:
- Identify the failure mode — What specific problem are you catching?
- Can you check it with code? — If yes, write a deterministic evaluator first
- Define clear criteria — Write them down before writing the prompt
- Pick the simplest scale — Pass/fail first, ternary if needed, numeric only as a last resort
- Ask for reasoning first — Structure the schema so the LLM thinks before judging
- Use temperature 0 — Consistency over creativity for judges
- Validate against your own judgment — Run 30-50 examples and check alignment
- Test in your workflow — Does the evaluator’s signal actually improve the output?