Once you have evaluators built with `verify()` and `Verdict`, you need an eval workflow that ties them together. The eval workflow defines which evaluators to run, how to interpret their results, and whether each one is required or informational.
## Creating an Eval Workflow
`evalWorkflow()` connects your evaluators and tells the framework how to interpret their results. The eval workflow file lives at `tests/evals/workflow.ts` inside your workflow directory.
A minimal example with one evaluator:
tests/evals/workflow.ts
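As a sketch of what this file might contain — the default export, the shape of the `evalWorkflow()` options object, and the `hasMinimumLength` evaluator are all assumptions; only the `evals` entry fields (`evaluator`, `criticality`, `interpret`) are documented below:

```typescript
// tests/evals/workflow.ts
// Sketch only: `hasMinimumLength` is a hypothetical evaluator
// created with verify() in tests/evals/evaluators.ts.
import { evalWorkflow } from '@outputai/evals';
import { hasMinimumLength } from './evaluators';

export default evalWorkflow({
  evals: [
    {
      evaluator: hasMinimumLength,     // created with verify()
      criticality: 'required',         // failure fails the whole case
      interpret: { type: 'boolean' },  // true => pass, false => fail
    },
  ],
});
```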
Each entry in the `evals` array has three fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `evaluator` | Function | — | An evaluator created with `verify()` |
| `criticality` | `'required'` \| `'informational'` | `'required'` | Whether failure should fail the case |
| `interpret` | `InterpretConfig` | — | How to convert the raw result to a verdict |
### Criticality
- `required` (default): If this evaluator fails, the entire case fails. Use for checks that gate quality — topic relevance, minimum length, factual accuracy.
- `informational`: Failure is reported but doesn’t affect the case verdict. Use for metrics you want to track without gating on — tone classification, style scores, auxiliary checks.
### Interpret Types
Your evaluators return raw values (booleans, numbers, strings, verdicts). The `interpret` config tells the framework how to convert those into pass/partial/fail:
| Type | Config | Pass | Partial | Fail |
|---|---|---|---|---|
| `boolean` | `{ type: 'boolean' }` | `value === true` | — | `value === false` |
| `verdict` | `{ type: 'verdict' }` | `value === 'pass'` | `value === 'partial'` | `value === 'fail'` |
| `number` | `{ type: 'number', pass: 0.7, partial: 0.4 }` | `value >= 0.7` | `value >= 0.4` | `value < 0.4` |
| `string` | `{ type: 'string', pass: ['a', 'b'], partial: ['c'] }` | value in `['a', 'b']` | value in `['c']` | otherwise |
The `partial` threshold is optional for both `number` and `string` types — omit it to have only pass and fail.
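The threshold semantics in the table above can be sketched as a standalone function. This is an illustration of the documented rules, not the framework's actual implementation:

```typescript
type Verdict = 'pass' | 'partial' | 'fail';

type InterpretConfig =
  | { type: 'boolean' }
  | { type: 'verdict' }
  | { type: 'number'; pass: number; partial?: number }
  | { type: 'string'; pass: string[]; partial?: string[] };

// Convert a raw evaluator result into a verdict, per the table above.
function interpret(value: unknown, config: InterpretConfig): Verdict {
  switch (config.type) {
    case 'boolean':
      return value === true ? 'pass' : 'fail';
    case 'verdict':
      return value as Verdict;
    case 'number': {
      const n = value as number;
      if (n >= config.pass) return 'pass';
      if (config.partial !== undefined && n >= config.partial) return 'partial';
      return 'fail';
    }
    case 'string': {
      const s = value as string;
      if (config.pass.includes(s)) return 'pass';
      if (config.partial?.includes(s)) return 'partial'; // partial list is optional
      return 'fail';
    }
  }
}
```

Note how omitting `partial` simply skips the middle branch, leaving a binary pass/fail check.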
## Case Verdict Aggregation
Each dataset case runs all evaluators. The case-level verdict follows these rules:

- If any required evaluator fails, the case fails
- Else if any required evaluator is partial, the case is partial
- Otherwise, the case passes
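The aggregation rules above amount to a short function. Again, this is a sketch of the documented behavior rather than the framework's own code:

```typescript
type Verdict = 'pass' | 'partial' | 'fail';

interface EvalResult {
  verdict: Verdict;
  criticality: 'required' | 'informational';
}

// Roll per-evaluator results up into a case-level verdict.
// Informational evaluators are reported but never affect the outcome.
function caseVerdict(results: EvalResult[]): Verdict {
  const required = results.filter(r => r.criticality === 'required');
  if (required.some(r => r.verdict === 'fail')) return 'fail';
  if (required.some(r => r.verdict === 'partial')) return 'partial';
  return 'pass';
}
```

A failing informational evaluator alongside passing required ones still yields a passing case.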
## Running Evals from the CLI
The `output workflow test` command runs your eval workflow against datasets.
### Common Commands
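These invocations are illustrative, combining `output workflow test` with the flags documented below (the `stripe_blog` dataset name is just an example):

```shell
# Run the eval workflow against every dataset
output workflow test

# Iterate on evaluators quickly using cached output (skips the workflow)
output workflow test --cached

# Run fresh and save output and eval results back to the dataset files
output workflow test --save

# Restrict to specific datasets and emit JSON
output workflow test --dataset stripe_blog --format json
```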
### Flags
| Flag | Default | Description |
|---|---|---|
| `--cached` | `false` | Use cached output from `last_output` in datasets, skipping workflow execution |
| `--save` | `false` | Run the workflow fresh and save output/eval results back to dataset files |
| `--dataset` | all | Comma-separated list of dataset names to run |
| `--format` | `text` | Output format (`text` or `json`) |
Use `--cached` during development when iterating on evaluators — it’s fast because it skips the workflow entirely. Use `--save` when you want to capture fresh output and eval results.
## Putting It All Together
Here’s the complete setup for a blog generator workflow:

1. Write evaluators — mix deterministic checks and LLM judges: `tests/evals/evaluators.ts`
2. Define the eval workflow that wires them together: `tests/evals/workflow.ts`
3. Add dataset cases with inputs and ground truth: `tests/datasets/stripe_blog.yml`
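As a hedged sketch of step 2, the workflow file might combine a deterministic gate, an LLM judge, and a non-gating tracker. The evaluator names and import path are hypothetical; only the entry fields and `interpret` shapes come from the reference above:

```typescript
// tests/evals/workflow.ts
// Sketch only: hasMinimumLength, topicRelevanceJudge, and
// toneClassifier are hypothetical verify() evaluators.
import { evalWorkflow } from '@outputai/evals';
import { hasMinimumLength, topicRelevanceJudge, toneClassifier } from './evaluators';

export default evalWorkflow({
  evals: [
    // Deterministic gate: case fails if the post is too short.
    { evaluator: hasMinimumLength, criticality: 'required', interpret: { type: 'boolean' } },
    // LLM judge scoring relevance 0..1, with a partial band.
    { evaluator: topicRelevanceJudge, criticality: 'required', interpret: { type: 'number', pass: 0.7, partial: 0.4 } },
    // Tone label is tracked but never gates the case verdict.
    { evaluator: toneClassifier, criticality: 'informational', interpret: { type: 'string', pass: ['professional'] } },
  ],
});
```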
## What’s Next
- Workflow Evaluators — Writing evaluators with `verify()`, `Verdict` helpers, and judge functions
- Verdict Helpers — Complete reference for deterministic assertions and manual verdicts
- Datasets — Defining test cases with inputs and ground truth
- LLM-as-a-Judge Best Practices — Writing effective judge prompts and choosing grading scales
- `@outputai/evals` API Reference — Complete package reference
- CLI Commands — Full CLI reference for eval and dataset commands