Evaluators are a special type of step that scores content — usually LLM output. They return an EvaluationResult with a value, a confidence score, and optional reasoning. The real power is using that score in your workflow to control what happens next: retry if quality is low, skip a step if confidence is high, or branch to a different path. This generate-evaluate-retry loop is how you build self-correcting workflows — and it’s a core pattern when building AI agents.

Evaluators can be deterministic (rule-based checks) or use an LLM to judge quality. Deterministic evaluators are great for structured checks like length, format, or required fields. But more often than not, you’ll want LLM-as-a-judge — using an LLM to evaluate things that are hard to check with rules, like whether a summary is accurate, an email sounds natural, or a classification makes sense.

Deterministic Evaluator

A simple evaluator that checks whether a company summary meets basic structural requirements:
evaluators.ts
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { CheckSummaryStructureInput } from './types.js';

export const checkSummaryStructure = evaluator({
  name: 'checkSummaryStructure',
  description: 'Check if a summary meets minimum structural requirements',
  inputSchema: CheckSummaryStructureInput,
  fn: async (input) => {
    const hasMinLength = input.summary.length >= 100;
    const mentionsCompany = input.summary.toLowerCase().includes(input.companyName.toLowerCase());
    const passes = hasMinLength && mentionsCompany;

    return new EvaluationBooleanResult({
      value: passes,
      confidence: 1.0,
      reasoning: !hasMinLength
        ? 'Summary is too short'
        : !mentionsCompany
        ? 'Summary does not mention the company name'
        : 'Meets structural requirements'
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const CheckSummaryStructureInput = z.object({
//   summary: z.string(),
//   companyName: z.string()
// });
Deterministic evaluators are fast and predictable — confidence is always 1.0 because there’s no ambiguity. Use them for checks where the rules are clear-cut.
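For instance, a minimal sketch of calling this evaluator directly and reading the result (the inputs are placeholders; value and confidence are read the same way in the workflow examples below, and reasoning is assumed readable as well):
const result = await checkSummaryStructure({
  summary: 'Acme Corp builds industrial robots for mid-size manufacturers, focusing on assembly-line automation and predictive maintenance.',
  companyName: 'Acme Corp'
});
if (!result.value) {
  console.log(result.reasoning); // e.g. 'Summary is too short'
}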

LLM-as-a-Judge Evaluator

For subjective quality — accuracy, tone, relevance — you need an LLM to evaluate. This is the more common pattern:
evaluators.ts
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { JudgeSummaryInput } from './types.js';

export const judgeSummaryQuality = evaluator({
  name: 'judgeSummaryQuality',
  description: 'Judge whether a company summary is accurate and useful',
  inputSchema: JudgeSummaryInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'judge_summary@v1',
      variables: {
        summary: input.summary,
        companyName: input.companyName
      },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          passes: z.boolean(),
          confidence: z.number()
        })
      })
    });

    return new EvaluationBooleanResult({
      value: output.passes,
      confidence: output.confidence,
      reasoning: output.reasoning
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const JudgeSummaryInput = z.object({
//   summary: z.string(),
//   companyName: z.string()
// });
Evaluators are called from workflows like regular async functions — await judgeSummaryQuality({ summary, companyName }). The workflow decides what to do with the result.
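A short sketch of gating on that result inside a workflow (the 0.8 threshold is an arbitrary choice for illustration, and summary and companyName are assumed to be in scope):
const quality = await judgeSummaryQuality({ summary, companyName });
if (quality.value === true && quality.confidence >= 0.8) {
  // high-confidence pass: continue to the next step
} else {
  // failed or low confidence: retry, branch, or flag for review
}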

Evaluator Properties

Property      Type        Description
name          string      Unique identifier for the evaluator
description   string      Human-readable description
inputSchema   ZodSchema   Zod schema for input validation
options       object      Optional workflow options (see Options)
fn            function    The evaluator implementation (must return an EvaluationResult)

EvaluationResult Types

Evaluators must return an EvaluationResult. Three types are available depending on what you’re scoring:

EvaluationBooleanResult

  • Pass/fail judgments
  • value: boolean
  • Quality gates, compliance checks

EvaluationNumberResult

  • Numeric scores
  • value: number
  • Ratings (1-10), percentages

EvaluationStringResult

  • Category assignments
  • value: string
  • Classifications, labels

All three types share the same base fields:

Field        Required   Description
value        Yes        The evaluation result (boolean, number, or string)
confidence   Yes        Confidence score between 0 and 1
reasoning    No         Explanation of the evaluation
name         No         Name for this specific result
feedback     No         Array of EvaluationFeedback objects with issues and suggestions
dimensions   No         Array of nested EvaluationResult instances for multi-dimensional scoring
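A sketch of a fully populated result using these base fields (the values are illustrative; the constructors match the examples elsewhere on this page):
import { EvaluationNumberResult, EvaluationFeedback } from '@outputai/core';

const result = new EvaluationNumberResult({
  value: 7,                 // required: the score itself
  confidence: 0.85,         // required: between 0 and 1
  reasoning: 'Strong opening, weak call to action',  // optional explanation
  name: 'email_quality',    // optional label for this result
  feedback: [
    new EvaluationFeedback({
      issue: 'Vague call to action',
      suggestion: 'Name a concrete next step',
      priority: 'high'
    })
  ]
});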

Boolean Evaluator

Pass/fail checks. Use these as quality gates — does this output meet the bar?
evaluators.ts
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { CheckFactualityInput } from './types.js';

export const checkFactuality = evaluator({
  name: 'checkFactuality',
  description: 'Check whether a summary contains only factual claims',
  inputSchema: CheckFactualityInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'check_factuality@v1',
      variables: {
        summary: input.summary,
        sourceData: input.sourceData
      },
      output: Output.object({
        schema: z.object({
          issues: z.array(z.string()),
          isFactual: z.boolean(),
          confidence: z.number()
        })
      })
    });

    return new EvaluationBooleanResult({
      value: output.isFactual,
      confidence: output.confidence,
      reasoning: output.issues.length > 0
        ? `Issues found: ${output.issues.join(', ')}`
        : 'No factual issues detected'
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const CheckFactualityInput = z.object({
//   summary: z.string(),
//   sourceData: z.string()
// });

Number Evaluator

Numeric scores for when you need more granularity than pass/fail. In most cases, pass/fail or a three-tier scale (pass/borderline/fail) gives more consistent results — see Evaluator Best Practices. But numeric scores are useful when you need to rank or compare outputs, or when you have well-defined anchors for each score level.
evaluators.ts
import { evaluator, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { ScoreEmailInput } from './types.js';

export const scoreEmailDraft = evaluator({
  name: 'scoreEmailDraft',
  description: 'Score a sales email draft on a 1-10 scale',
  inputSchema: ScoreEmailInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'score_email@v1',
      variables: {
        email: input.emailBody,
        recipientRole: input.recipientRole,
        companyName: input.companyName
      },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          score: z.number().min(1).max(10),
          confidence: z.number()
        })
      })
    });

    return new EvaluationNumberResult({
      value: output.score,
      confidence: output.confidence,
      reasoning: output.reasoning
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const ScoreEmailInput = z.object({
//   emailBody: z.string(),
//   recipientRole: z.string(),
//   companyName: z.string()
// });
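When you use numeric scores to rank outputs, a best-of-n selection is the natural pattern. A sketch using the scorer above (generateEmailDraft and lead are hypothetical placeholders, not part of the example):
// Sketch: generate several drafts, score each, keep the highest.
const drafts = await Promise.all([1, 2, 3].map(() => generateEmailDraft(lead)));
const scored = await Promise.all(
  drafts.map(async (emailBody) => ({
    emailBody,
    score: (await scoreEmailDraft({
      emailBody,
      recipientRole: lead.role,
      companyName: lead.company
    })).value
  }))
);
scored.sort((a, b) => b.score - a.score);
const best = scored[0];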

String Evaluator

Category assignments — classifying content into labels.
evaluators.ts
import { evaluator, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { ClassifyIntentInput } from './types.js';

export const classifyLeadIntent = evaluator({
  name: 'classifyLeadIntent',
  description: 'Classify the buying intent of a lead based on their activity',
  inputSchema: ClassifyIntentInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'classify_intent@v1',
      variables: {
        recentActivity: input.recentActivity,
        companySize: input.companySize
      },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          intent: z.enum(['high', 'medium', 'low', 'unknown']),
          confidence: z.number()
        })
      })
    });

    return new EvaluationStringResult({
      value: output.intent,
      confidence: output.confidence,
      reasoning: output.reasoning
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const ClassifyIntentInput = z.object({
//   recentActivity: z.string(),
//   companySize: z.number()
// });
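Classification results typically drive branching, the "branch to a different path" case from the intro. A sketch (routeToSalesRep, addToNurtureSequence, and lead are hypothetical):
const intent = await classifyLeadIntent({ recentActivity, companySize });
switch (intent.value) {
  case 'high':
    await routeToSalesRep(lead);       // hypothetical step
    break;
  case 'medium':
  case 'low':
    await addToNurtureSequence(lead);  // hypothetical step
    break;
  default:
    // 'unknown': leave for manual review
}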

Feedback and Dimensions

For more detailed evaluations, you can attach feedback (specific issues and suggestions) and dimensions (sub-scores that break down the overall result).
evaluators.ts
import { evaluator, EvaluationStringResult, EvaluationNumberResult, EvaluationFeedback } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { ReviewProposalInput } from './types.js';

export const reviewProposal = evaluator({
  name: 'reviewProposal',
  description: 'Review a sales proposal across multiple dimensions',
  inputSchema: ReviewProposalInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'review_proposal@v1',
      variables: { proposal: input.proposal, prospect: input.prospectName },
      output: Output.object({
        schema: z.object({
          reasoning: z.string(),
          overall: z.enum(['ready', 'needs_revision', 'major_issues']),
          confidence: z.number(),
          clarity: z.number().min(0).max(1),
          relevance: z.number().min(0).max(1),
          issues: z.array(z.object({
            issue: z.string(),
            suggestion: z.string(),
            priority: z.enum(['high', 'medium', 'low'])
          }))
        })
      })
    });

    return new EvaluationStringResult({
      value: output.overall,
      confidence: output.confidence,
      dimensions: [
        new EvaluationNumberResult({
          value: output.clarity,
          confidence: output.confidence,
          name: 'clarity'
        }),
        new EvaluationNumberResult({
          value: output.relevance,
          confidence: output.confidence,
          name: 'relevance'
        })
      ],
      feedback: output.issues.map(i => new EvaluationFeedback({
        issue: i.issue,
        suggestion: i.suggestion,
        priority: i.priority
      }))
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const ReviewProposalInput = z.object({
//   proposal: z.string(),
//   prospectName: z.string()
// });
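A workflow can then act on individual dimensions or feedback items. A sketch, assuming dimensions and feedback are readable on the returned result just as value and confidence are in the examples below:
const review = await reviewProposal({ proposal, prospectName });
const clarity = review.dimensions?.find((d) => d.name === 'clarity');
const urgent = review.feedback?.filter((f) => f.priority === 'high') ?? [];

if (review.value !== 'ready' || (clarity && Number(clarity.value) < 0.5)) {
  // needs another pass: surface the urgent feedback to the revision step
}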

Using Evaluators in Workflows

Evaluators are called like regular async functions. The typical pattern is: generate something, evaluate it, then decide what to do based on the score.
workflow.ts
import { workflow } from '@outputai/core';
import { lookupCompany, generateSummary } from './steps.js';
import { judgeSummaryQuality } from './evaluators.js';
import { LeadEnrichmentInput, LeadEnrichmentOutput } from './types.js';

export default workflow({
  name: 'lead_enrichment',
  inputSchema: LeadEnrichmentInput,
  outputSchema: LeadEnrichmentOutput,
  fn: async (input) => {
    const company = await lookupCompany(input.companyDomain);

    let summary;
    let attempts = 0;
    const maxAttempts = 3;

    while (attempts < maxAttempts) {
      summary = await generateSummary(company);
      const quality = await judgeSummaryQuality({
        summary,
        companyName: company.name
      });

      if (quality.value === true && quality.confidence >= 0.7) break;
      attempts++;
    }

    // If no attempt passes the gate, the last generated summary is returned as-is.
    return { company: company.name, summary };
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const LeadEnrichmentInput = z.object({
//   companyDomain: z.string()
// });
//
// export const LeadEnrichmentOutput = z.object({
//   company: z.string(),
//   summary: z.string()
// });
This pattern — generate, evaluate, retry if needed — is the core of LLM-as-a-judge. The evaluator gives your workflow a way to check its own work.
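A common refinement is to feed the judge's critique into the next attempt, so each retry improves rather than rolling the dice again. A sketch of the loop above, where the feedback parameter on generateSummary is hypothetical:
let feedback: string | undefined;
while (attempts < maxAttempts) {
  summary = await generateSummary(company, feedback); // hypothetical second argument
  const quality = await judgeSummaryQuality({
    summary,
    companyName: company.name
  });
  if (quality.value === true && quality.confidence >= 0.7) break;
  feedback = quality.reasoning; // pass the critique into the retry
  attempts++;
}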

Options

Evaluators support the same options as steps for configuring retry behavior and timeouts. See Step Options for the full list.
evaluators.ts
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { JudgeSummaryInput } from './types.js';

export const judgeSummaryQuality = evaluator({
  name: 'judgeSummaryQuality',
  description: 'Judge whether a company summary is accurate and useful',
  inputSchema: JudgeSummaryInput,
  options: {
    retry: {
      maximumAttempts: 5,
      initialInterval: '1s',
      backoffCoefficient: 2
    }
  },
  fn: async (input) => {
    // ...
  }
});

Shared Evaluators

When multiple workflows need the same evaluator, put it in src/shared/evaluators/:
src/shared/evaluators/check_factuality.ts
import { evaluator, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { z } from '@outputai/core';
import { CheckFactualityInput } from './types.js';

export const checkFactuality = evaluator({
  name: 'checkFactuality',
  description: 'Check whether content contains only factual claims',
  inputSchema: CheckFactualityInput,
  fn: async (input) => {
    const { output } = await generateText({
      prompt: 'check_factuality@v1',
      variables: { summary: input.summary, sourceData: input.sourceData },
      output: Output.object({
        schema: z.object({
          confidence: z.number(),
          isFactual: z.boolean()
        })
      })
    });

    return new EvaluationBooleanResult({
      value: output.isFactual,
      confidence: output.confidence
    });
  }
});

// types.ts
// import { z } from '@outputai/core';
//
// export const CheckFactualityInput = z.object({
//   summary: z.string(),
//   sourceData: z.string()
// });
Import shared evaluators in any workflow:
workflow.ts
import { workflow } from '@outputai/core';
import { checkFactuality } from '../../shared/evaluators/check_factuality.js';
import { generateSummary, lookupCompany } from './steps.js';
import { LeadEnrichmentInput, LeadEnrichmentOutput } from './types.js';

export default workflow({
  name: 'lead_enrichment',
  inputSchema: LeadEnrichmentInput,
  outputSchema: LeadEnrichmentOutput,
  fn: async (input) => {
    const company = await lookupCompany(input.companyDomain);
    const summary = await generateSummary(company);

    const check = await checkFactuality({
      summary,
      sourceData: JSON.stringify(company)
    });

    return {
      company: company.name,
      summary,
      factualityConfidence: check.confidence
    };
  }
});
Shared evaluators can only be imported by workflows, not by other evaluators or steps. This enforces the activity isolation rule — evaluators are activities and activities can’t call other activities.

Evaluation Workflows

The evaluators on this page run inside your workflows — they power generate-evaluate-retry loops in production. For testing workflow quality across datasets without modifying your workflow code, see Evaluation Workflow. It covers verify() for creating typed evaluators, Verdict helpers for deterministic assertions, LLM judge functions, datasets, and running evals from the CLI.
