Build evaluators that score LLM output with confidence levels and use the results to control workflow behavior
Evaluators are a special type of step that scores content — usually LLM output. They return an EvaluationResult with a value, a confidence score, and optional reasoning. The real power is using that score in your workflow to control what happens next: retry if quality is low, skip a step if confidence is high, or branch to a different path. This generate-evaluate-retry loop is how you build self-correcting workflows — and it's a core pattern when building AI agents.

Evaluators can be deterministic (rule-based checks) or use an LLM to judge quality. Deterministic evaluators are great for structured checks like length, format, or required fields. But more often than not, you'll want LLM-as-a-judge — using an LLM to evaluate things that are hard to check with rules, like whether a summary is accurate, an email sounds natural, or a classification makes sense.
A simple evaluator that checks whether a company summary meets basic structural requirements:
evaluators.ts
```typescript
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { CheckSummaryStructureInput } from './types.js';

export const checkSummaryStructure = evaluator({
  name: 'checkSummaryStructure',
  description: 'Check if a summary meets minimum structural requirements',
  inputSchema: CheckSummaryStructureInput,
  fn: async (input) => {
    const hasMinLength = input.summary.length >= 100;
    const mentionsCompany = input.summary
      .toLowerCase()
      .includes(input.companyName.toLowerCase());
    const passes = hasMinLength && mentionsCompany;
    return new EvaluationBooleanResult({
      value: passes,
      confidence: 1.0,
      reasoning: !hasMinLength
        ? 'Summary is too short'
        : !mentionsCompany
          ? 'Summary does not mention the company name'
          : 'Meets structural requirements'
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const CheckSummaryStructureInput = z.object({
//   summary: z.string(),
//   companyName: z.string()
// });
```
Deterministic evaluators are fast and predictable — confidence is always 1.0 because there’s no ambiguity. Use them for checks where the rules are clear-cut.
For subjective quality — accuracy, tone, relevance — you need an LLM to evaluate. This is the more common pattern:
evaluators.ts
```typescript
import { evaluator, EvaluationBooleanResult, z } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { JudgeSummaryInput } from './types.js';

export const judgeSummaryQuality = evaluator({
  name: 'judgeSummaryQuality',
  description: 'Judge whether a company summary is accurate and useful',
  inputSchema: JudgeSummaryInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'judge_summary@v1',
      variables: {
        summary: input.summary,
        companyName: input.companyName
      },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          passes: z.boolean(),
          confidence: z.number()
        })
      })
    });
    return new EvaluationBooleanResult({
      value: output.passes,
      confidence: output.confidence,
      reasoning: output.reasoning
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const JudgeSummaryInput = z.object({
//   summary: z.string(),
//   companyName: z.string()
// });
```
Evaluators are called from workflows like regular async functions — `await judgeSummaryQuality({ summary, companyName })`. The workflow decides what to do with the result.
Use numeric scores when you need more granularity than pass/fail. In most cases, pass/fail or a three-tier scale (pass/borderline/fail) gives more consistent results — see Evaluator Best Practices. But numeric scores are useful when you need to rank or compare outputs, or when you have well-defined anchors for each score level.
evaluators.ts
```typescript
import { evaluator, EvaluationNumberResult, z } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { ScoreEmailInput } from './types.js';

export const scoreEmailDraft = evaluator({
  name: 'scoreEmailDraft',
  description: 'Score a sales email draft on a 1-10 scale',
  inputSchema: ScoreEmailInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'score_email@v1',
      variables: {
        email: input.emailBody,
        recipientRole: input.recipientRole,
        companyName: input.companyName
      },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          score: z.number().min(1).max(10),
          confidence: z.number()
        })
      })
    });
    return new EvaluationNumberResult({
      value: output.score,
      confidence: output.confidence,
      reasoning: output.reasoning
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const ScoreEmailInput = z.object({
//   emailBody: z.string(),
//   recipientRole: z.string(),
//   companyName: z.string()
// });
```
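A numeric score only matters once the workflow maps it to a decision. As a minimal sketch (the thresholds and the `decideNextStep` helper are illustrative choices, not part of the SDK), a workflow might route on both the score and the judge's confidence:

```typescript
// Illustrative thresholds -- tune these for your own quality bar.
const SEND_THRESHOLD = 8;
const REVISE_THRESHOLD = 5;

type EmailDecision = 'send' | 'revise' | 'rewrite';

// Map a 1-10 score (plus the judge's confidence) to a workflow action.
// Low-confidence scores fall back to revision rather than auto-sending.
function decideNextStep(score: number, confidence: number): EmailDecision {
  if (score >= SEND_THRESHOLD && confidence >= 0.7) return 'send';
  if (score >= REVISE_THRESHOLD) return 'revise';
  return 'rewrite';
}
```

Gating on confidence as well as score means an uncertain "8" still gets a human or revision pass instead of going out automatically.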
For more detailed evaluations, you can attach feedback (specific issues and suggestions) and dimensions (sub-scores that break down the overall result).
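The exact field names for feedback and dimensions aren't shown on this page, so the following is a hypothetical shape for illustration only — check the SDK's result types for the real API:

```typescript
// Hypothetical shape of a detailed evaluation result; the actual
// fields in @outputai/core may differ.
interface DetailedEvaluation {
  value: number;      // overall score
  confidence: number;
  reasoning: string;
  // Specific issues paired with concrete suggestions.
  feedback: { issue: string; suggestion: string }[];
  // Sub-scores that break down the overall result.
  dimensions: Record<string, number>;
}

const example: DetailedEvaluation = {
  value: 6,
  confidence: 0.8,
  reasoning: 'Solid opening, but the call to action is weak.',
  feedback: [
    { issue: 'Vague call to action', suggestion: 'Propose a specific next step' }
  ],
  dimensions: { clarity: 8, personalization: 7, callToAction: 4 }
};
```

Structured feedback like this is what makes retries effective: the next generation attempt can be prompted with the specific issues rather than just "try again".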
Evaluators are called like regular async functions. The typical pattern is: generate something, evaluate it, then decide what to do based on the score.
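That loop can be sketched as follows, with the generation step and evaluator stubbed out as plain async functions (the `generateWithRetry` helper and its retry cap of 3 are illustrative, not SDK APIs):

```typescript
// Stand-ins for a real generation step and evaluator call.
type EvalResult = { value: boolean; confidence: number; reasoning: string };
type Generate = (feedback?: string) => Promise<string>;
type Evaluate = (draft: string) => Promise<EvalResult>;

// Generate, evaluate, and retry -- feeding the judge's reasoning back
// into the next attempt -- up to a fixed attempt cap.
async function generateWithRetry(
  generate: Generate,
  evaluate: Evaluate,
  maxAttempts = 3
): Promise<string> {
  let feedback: string | undefined;
  let draft = '';
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    draft = await generate(feedback);
    const result = await evaluate(draft);
    if (result.value) return draft; // passed -- use this draft
    feedback = result.reasoning;    // feed the critique into the retry
  }
  return draft; // best effort after exhausting retries
}
```

Passing the evaluator's reasoning back into the generator is the key design choice: each retry is informed by what failed, rather than being a blind re-roll.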
Shared evaluators can only be imported by workflows, not by other evaluators or steps. This enforces the activity isolation rule — evaluators are activities and activities can’t call other activities.
The evaluators on this page run inside your workflows — they power generate-evaluate-retry loops in production. For testing workflow quality across datasets without modifying your workflow code, see Evaluation Workflow. It covers verify() for creating typed evaluators, Verdict helpers for deterministic assertions, LLM judge functions, datasets, and running evals from the CLI.
LLM-as-a-Judge Best Practices — Writing effective judge prompts, choosing grading scales, avoiding common pitfalls, and patterns for production evaluators
Evaluation Workflow — Test workflow quality across datasets with verify(), Verdict, and the CLI