Evaluators score content — usually LLM output — and return a result with a value, a confidence score, and optional reasoning. Output uses evaluators in two distinct ways:

Evaluator Step — An evaluator that runs inside your workflow as a step. Your workflow generates something, evaluates it, and decides what to do next: retry if quality is low, accept the result if confidence is high, or branch to a different path. This generate-evaluate-retry loop is how you build self-correcting workflows.
workflow.ts
const summary = await generateSummary(company);
const quality = await judgeSummaryQuality({ summary, companyName: company.name });

if (quality.value === true && quality.confidence >= 0.7) {
  return summary;
}
// retry or take a different path...
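The snippet above elides the retry itself. A minimal self-contained sketch of the full generate-evaluate-retry loop is below; `generateSummary` and `judgeSummaryQuality` are stubbed stand-ins for your own workflow steps, and the `JudgeResult` shape and 0.7 threshold are illustrative assumptions, not a fixed API.

```typescript
// Shape of an evaluator result: a value, a confidence score, optional reasoning.
interface JudgeResult {
  value: boolean;       // did the output pass?
  confidence: number;   // 0..1
  reasoning?: string;   // optional explanation from the judge
}

// Stand-in for an LLM generation step (hypothetical).
async function generateSummary(company: { name: string }): Promise<string> {
  return `Summary for ${company.name}`;
}

// Stand-in for an LLM-as-judge evaluator step (hypothetical).
async function judgeSummaryQuality(input: {
  summary: string;
  companyName: string;
}): Promise<JudgeResult> {
  return { value: input.summary.includes(input.companyName), confidence: 0.9 };
}

// Generate, evaluate, and retry up to maxAttempts times;
// accept only confident passes, then fall back to the last attempt.
async function summarizeWithRetry(
  company: { name: string },
  maxAttempts = 3,
): Promise<string> {
  let last = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await generateSummary(company);
    const quality = await judgeSummaryQuality({
      summary: last,
      companyName: company.name,
    });
    if (quality.value === true && quality.confidence >= 0.7) {
      return last;
    }
  }
  return last; // or throw / branch to a different path instead
}
```

The fallback on exhausted retries is a design choice: returning the last attempt keeps the workflow moving, while throwing lets a caller route to a human review step.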
Evaluation Workflow — A separate workflow that tests another workflow’s quality across a dataset of test cases. You define evaluators with verify(), wire them into an eval workflow, and run them from the CLI. Use this for regression testing, CI/CD quality gates, and systematic quality monitoring — without modifying your production workflow code.
npx output workflow test my_workflow --dataset golden_set
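Under the hood, an evaluation workflow amounts to running an evaluator over each case in a dataset and aggregating the scores. The sketch below is a generic illustration of that loop, not Output's `verify()` API; the `TestCase` shape and the exact-match evaluator are assumptions for the example.

```typescript
// One test case in a dataset: an input and the expected output (hypothetical shape).
interface TestCase {
  input: string;
  expected: string;
}

// Hypothetical evaluator: exact-match scoring with full confidence.
function exactMatch(output: string, expected: string) {
  return { value: output === expected, confidence: 1 };
}

// Run the workflow under test over every case and aggregate a pass rate.
function runEval(
  cases: TestCase[],
  workflow: (input: string) => string,
): { passed: number; total: number; passRate: number } {
  let passed = 0;
  for (const c of cases) {
    const result = exactMatch(workflow(c.input), c.expected);
    if (result.value) passed++;
  }
  return { passed, total: cases.length, passRate: passed / cases.length };
}
```

A CI/CD quality gate is then just a threshold check on the aggregate, e.g. fail the build when `passRate < 0.9`.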

Which one do I need?

| I want to… | Use |
| --- | --- |
| Have my workflow check and improve its own output | Evaluator Step |
| Test my workflow against a set of known inputs | Evaluation Workflow |
| Add quality gates that retry on failure | Evaluator Step |
| Run evals in CI/CD before deploying | Evaluation Workflow |
| Use both in the same project | Start with evaluator steps in your workflow, then add an evaluation workflow for testing |
Both approaches use evaluators under the hood — the difference is where they run and what they control.

What’s Next

Evaluator Step — Build evaluators and use them inside your workflows for self-correction

Evaluation Workflow — Test workflow quality across datasets from the CLI

LLM-as-a-Judge Best Practices — Writing effective judge prompts, grading scales, and common pitfalls