Your workflows have evaluator steps that power generate-evaluate-retry loops in production. Evaluation workflows answer a different question: “across a set of known inputs, does my workflow still produce acceptable output?”

Workflow evaluators run outside your workflow. You define evaluators with verify(), write test cases as YAML datasets, and run them with the CLI. The framework feeds each dataset case through your evaluators and reports pass/partial/fail verdicts. Nothing in your production workflow changes.

This is how you catch regressions when you update a prompt, swap a model, or refactor a step. Run evals before deploying and you’ll know if quality dropped.

Evaluator Step vs Evaluation Workflow

These serve different purposes and live in different places:
|               | Evaluator Step                        | Evaluation Workflow                             |
|---------------|---------------------------------------|-------------------------------------------------|
| Purpose       | Control workflow flow at runtime      | Test quality across datasets                    |
| Runs          | Inside production workflows           | Outside, via CLI                                |
| Created with  | evaluator() from @outputai/core       | verify() from @outputai/evals                   |
| File location | src/workflows/<name>/evaluators.ts    | src/workflows/<name>/tests/evals/evaluators.ts  |
| Affects output| Yes — retry, branch, gate             | No — reports results only                       |
| Uses          | Quality gates, self-correcting agents | Regression testing, prompt comparison, CI/CD    |
Evaluator steps are documented in Evaluator Step. This section covers evaluation workflows.

Directory Structure

Eval files live alongside your workflow in a tests/ directory:
src/workflows/
  blog_generator/
    workflow.ts
    steps.ts
    evaluators.ts              # Inline evaluators (production)
    types.ts
    prompts/
      generate_blog@v1.prompt
    tests/
      evals/
        workflow.ts            # evalWorkflow() definition
        evaluators.ts          # Workflow evaluators (verify + Verdict)
        judge_topic@v1.prompt  # Judge prompt files
        judge_quality@v1.prompt
      datasets/
        happy_path.yml         # Test cases
        edge_case.yml
The eval workflow file (tests/evals/workflow.ts) is discovered automatically by the worker alongside your regular workflow.

Writing Evaluators with verify()

verify() creates a typed evaluator that receives the workflow’s input, output, and optional ground truth from the dataset. It wraps evaluator() from @outputai/core, so it integrates with both the eval workflow and the Temporal worker.
tests/evals/evaluators.ts
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';

export const evaluateSum = verify(
  {
    name: 'evaluate_sum',
    input: z.object( { values: z.array( z.number() ) } ),
    output: z.object( { result: z.number() } )
  },
  ( { input, output } ) =>
    Verdict.equals( output.result, input.values.reduce( ( a, b ) => a + b, 0 ) )
);
The check function receives a CheckContext with three fields:
| Field                | Type                      | Description                                 |
|----------------------|---------------------------|---------------------------------------------|
| input                | TInput                    | The workflow input from the dataset         |
| output               | TOutput                   | The workflow output (cached or freshly executed) |
| context.ground_truth | Record<string, unknown>   | Ground truth values from the dataset YAML   |
The input and output schemas are optional — they default to z.any() if omitted. When provided, they give you type safety inside the check function.
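To illustrate the z.any() default, here's a minimal sketch of an evaluator with no schemas. The evaluator name and field access are hypothetical; without schemas, output is untyped inside the check function:

```typescript
import { verify, Verdict } from '@outputai/evals';

// No input/output schemas: both default to z.any(), so there is
// no type safety on the fields you read inside the check.
export const hasResult = verify(
  { name: 'has_result' },
  ( { output } ) => Verdict.isTrue( output?.result !== undefined )
);
```

Providing schemas is usually worth it: a typo like output.reslut becomes a compile error instead of a silent undefined.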

Using Ground Truth

Evaluators can read expected values from the dataset’s ground_truth field. This lets you define per-case expectations without hardcoding them:
tests/evals/evaluators.ts
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';

export const lengthOfOutput = verify(
  {
    name: 'length_of_output',
    input: z.object( { topic: z.string() } ),
    output: z.object( { title: z.string(), blog_post: z.string() } )
  },
  ( { output, context } ) =>
    Verdict.gte( output.blog_post.length, Number( context.ground_truth.min_length ?? 100 ) )
);
The min_length value comes from the dataset YAML — see Ground Truth Structure for how this works.
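As a rough sketch, a dataset case supplying min_length might look like the following. The exact dataset schema is covered in Ground Truth Structure, so the top-level layout here (cases, input, ground_truth keys) is an assumption for illustration only:

```yaml
# tests/datasets/happy_path.yml (hypothetical structure)
cases:
  - input:
      topic: "TypeScript generics"
    ground_truth:
      min_length: 500
```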

Verdict Helpers

The Verdict object provides helpers for returning evaluation results. There are three categories:
  • Deterministic assertions — equals, gte, contains, matches, etc. Return EvaluationBooleanResult with confidence: 1.0.
  • Manual verdicts — pass(), partial(), fail(). Return EvaluationVerdictResult for custom logic.
  • LLM result wrappers — fromJudge(), score(), label(). Wrap LLM output with confidence: 0.9.
tests/evals/evaluators.ts
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';

export const evaluateContent = verify(
  {
    name: 'evaluate_content',
    input: z.object( { topic: z.string() } ),
    output: z.object( { title: z.string(), blog_post: z.string() } )
  },
  ( { output, context } ) => {
    const required = String( context.ground_truth.required_content ?? '' );
    if ( !required ) {
      return Verdict.isTrue( true );
    }
    return Verdict.contains( output.blog_post, required );
  }
);
For the complete reference with signatures, examples, and edge cases for every Verdict method, see Verdict Helpers.
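When no single assertion fits, the manual verdicts let you express graded outcomes from arbitrary logic. A sketch, assuming pass(), partial(), and fail() take no arguments as listed above (the evaluator name and thresholds are hypothetical):

```typescript
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';

export const evaluateTitleLength = verify(
  {
    name: 'evaluate_title_length',
    output: z.object( { title: z.string() } )
  },
  ( { output } ) => {
    const len = output.title.length;
    // Graded outcome: ideal range passes, anything non-empty is partial.
    if ( len >= 20 && len <= 80 ) return Verdict.pass();
    if ( len > 0 ) return Verdict.partial();
    return Verdict.fail();
  }
);
```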

LLM Judge Functions

For subjective evaluation — “is this blog post on-topic?”, “rate the quality” — use the judge functions. They load a .prompt file, call the LLM, and return a typed evaluation result. Four judge functions are available:
| Function     | Returns               | Use for                   |
|--------------|-----------------------|---------------------------|
| judgeVerdict | pass / partial / fail | Subjective quality gates  |
| judgeScore   | number                | Numeric ratings           |
| judgeBoolean | true / false          | Binary subjective checks  |
| judgeLabel   | string                | Classifications           |
tests/evals/evaluators.ts
import { verify, judgeVerdict, judgeScore, judgeLabel } from '@outputai/evals';
import { z } from '@outputai/core';

const blogInput = z.object( { topic: z.string() } );
const blogOutput = z.object( { title: z.string(), blog_post: z.string() } );

export const evaluateTopic = verify(
  { name: 'evaluate_topic', input: blogInput, output: blogOutput },
  async ( { input, output, context } ) =>
    judgeVerdict( {
      prompt: 'judge_topic@v1',
      variables: {
        blog_title: output.title,
        blog_post: output.blog_post,
        required_topic: String( context.ground_truth.required_topic ?? input.topic )
      }
    } )
);

export const evaluateQuality = verify(
  { name: 'evaluate_quality', input: blogInput, output: blogOutput },
  async ( { input, output } ) =>
    judgeScore( {
      prompt: 'judge_quality@v1',
      variables: {
        blog_title: output.title,
        blog_post: output.blog_post,
        topic: input.topic
      }
    } )
);

export const evaluateTone = verify(
  { name: 'evaluate_tone', input: blogInput, output: blogOutput },
  async ( { output } ) =>
    judgeLabel( {
      prompt: 'judge_tone@v1',
      variables: {
        blog_title: output.title,
        blog_post: output.blog_post
      }
    } )
);
All judge functions accept a JudgeArgs object:
| Field     | Type                                        | Description                                       |
|-----------|---------------------------------------------|---------------------------------------------------|
| prompt    | string                                      | Prompt filename (e.g., 'judge_topic@v1')          |
| variables | Record<string, string \| number \| boolean> | Template variables                                |
| schema    | ZodType                                     | Optional custom output schema (overrides the default) |
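judgeBoolean is the one judge function not shown in the examples above. A sketch of a binary subjective check, assuming a hypothetical judge_cta@v1 prompt file exists alongside the others:

```typescript
import { verify, judgeBoolean } from '@outputai/evals';
import { z } from '@outputai/core';

export const evaluateCallToAction = verify(
  {
    name: 'evaluate_call_to_action',
    output: z.object( { title: z.string(), blog_post: z.string() } )
  },
  async ( { output } ) =>
    // Binary subjective check: does the post end with a call to action?
    judgeBoolean( {
      prompt: 'judge_cta@v1',
      variables: { blog_post: output.blog_post }
    } )
);
```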

Writing Judge Prompts

Judge .prompt files live in the same tests/evals/ directory as your evaluators. Each judge function expects a specific output schema from the LLM:
| Function     | Expected JSON fields                                          |
|--------------|---------------------------------------------------------------|
| judgeVerdict | { verdict: 'pass' \| 'partial' \| 'fail', reasoning: string } |
| judgeScore   | { score: number, reasoning: string }                          |
| judgeBoolean | { result: boolean, reasoning: string }                        |
| judgeLabel   | { label: string, reasoning: string }                          |
Here’s a judge prompt for topic evaluation:
tests/evals/judge_topic@v1.prompt
---
provider: anthropic
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1000
---

<system>
You are an evaluation judge. Assess whether a blog post is faithfully about the required topic.

Return a JSON object with:
- verdict: "pass" if the blog clearly focuses on the topic, "partial" if it mentions the topic but lacks depth, "fail" if it is not about the topic
- reasoning: a brief explanation of your judgment
</system>

<user>
Required topic: {{ required_topic }}

Blog title: {{ blog_title }}

Blog post:
{{ blog_post }}

Judge whether this blog post is faithfully about the required topic.
</user>
Use temperature: 0 for judge prompts — you want consistency, not creativity. For best practices on writing effective judge prompts, see LLM-as-a-Judge Best Practices.
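A judgeScore prompt follows the same shape but must return the { score, reasoning } fields. A sketch of what judge_quality@v1.prompt might contain; the 1–10 scale and rubric wording here are assumptions, not part of the framework:

```
---
provider: anthropic
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1000
---

<system>
You are an evaluation judge. Rate the overall quality of a blog post.

Return a JSON object with:
- score: a number from 1 to 10, where 10 means publishable as-is
- reasoning: a brief explanation of your rating
</system>

<user>
Topic: {{ topic }}

Blog title: {{ blog_title }}

Blog post:
{{ blog_post }}

Rate the quality of this blog post.
</user>
```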

LLM Result Wrappers

If you’re calling the LLM yourself instead of using the judge functions, the Verdict object provides wrappers to convert raw LLM output into evaluation results (with confidence 0.9). See LLM Result Wrappers for the full reference.
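As a sketch of that pattern: here callMyJudgeModel is a hypothetical helper you would write yourself, and the assumption that Verdict.score() accepts a raw number should be checked against the LLM Result Wrappers reference:

```typescript
import { verify, Verdict } from '@outputai/evals';
// Hypothetical helper that calls your own LLM and returns a number.
import { callMyJudgeModel } from './my_llm_client';

export const evaluateClarity = verify(
  { name: 'evaluate_clarity' },
  async ( { output } ) => {
    // Call the model yourself, then wrap the raw rating so the
    // framework records it as an LLM-derived result (confidence 0.9).
    const rating = await callMyJudgeModel( output );
    return Verdict.score( rating );
  }
);
```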

What’s Next