verify(), write test cases as YAML datasets, and run them with the CLI. The framework feeds each dataset case through your evaluators and reports pass/partial/fail verdicts. Nothing in your production workflow changes.
This is how you catch regressions when you update a prompt, swap a model, or refactor a step. Run evals before deploying and you’ll know if quality dropped.
## Evaluator Step vs Evaluation Workflow
These serve different purposes and live in different places:

| | Evaluator Step | Evaluation Workflow |
|---|---|---|
| Purpose | Steer workflow control flow at runtime | Test quality across datasets |
| Runs | Inside production workflows | Outside, via CLI |
| Created with | `evaluator()` from `@outputai/core` | `verify()` from `@outputai/evals` |
| File location | `src/workflows/<name>/evaluators.ts` | `src/workflows/<name>/tests/evals/evaluators.ts` |
| Affects output | Yes — retry, branch, gate | No — reports results only |
| Uses | Quality gates, self-correcting agents | Regression testing, prompt comparison, CI/CD |
## Directory Structure
Eval files live alongside your workflow in a `tests/` directory:
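A plausible layout, assembled from the paths referenced on this page (file names such as `dataset.yaml` are assumptions):

```
src/workflows/<name>/
├── workflow.ts
├── evaluators.ts              # evaluator steps (runtime)
└── tests/
    └── evals/
        ├── workflow.ts        # eval workflow, discovered by the worker
        ├── evaluators.ts      # verify() evaluators
        ├── dataset.yaml       # test cases (name is an assumption)
        └── judge_topic@v1.prompt
```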
The eval workflow (`tests/evals/workflow.ts`) is discovered automatically by the worker alongside your regular workflow.
## Writing Evaluators with `verify()`

`verify()` creates a typed evaluator that receives the workflow’s input, output, and optional ground truth from the dataset. It wraps `evaluator()` from `@outputai/core`, so it integrates with both the eval workflow and the Temporal worker.
tests/evals/evaluators.ts
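As a sketch of what such an evaluator can look like, the following uses local stand-ins for `verify()`, `Verdict`, and `CheckContext` whose shapes are inferred from this section; the real `@outputai/evals` signatures may differ:

```typescript
// Local stand-ins mirroring the shapes described in this section; the real
// verify(), Verdict, and CheckContext come from @outputai/evals and their
// exact signatures are assumptions here.
type CheckContext<I, O> = {
  input: I;                              // workflow input from the dataset
  output: O;                             // workflow output
  ground_truth: Record<string, unknown>; // expected values from the dataset YAML
};

type EvaluationResult = { verdict: 'pass' | 'partial' | 'fail'; confidence: number };

// Stand-in for verify(): pairs a name with a check function.
function verify<I, O>(opts: {
  name: string;
  check: (ctx: CheckContext<I, O>) => EvaluationResult;
}) {
  return opts;
}

// Stand-in for a deterministic Verdict helper (confidence: 1.0).
const Verdict = {
  contains: (haystack: string, needle: string): EvaluationResult => ({
    verdict: haystack.includes(needle) ? 'pass' : 'fail',
    confidence: 1.0,
  }),
};

// An evaluator that checks the generated post mentions the requested topic.
export const topicMentioned = verify<{ topic: string }, { post: string }>({
  name: 'topic_mentioned',
  check: ({ input, output }) =>
    Verdict.contains(output.post.toLowerCase(), input.topic.toLowerCase()),
});
```

With the real package you would import `verify` and `Verdict` from `@outputai/evals` instead of defining them locally.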
The check function receives a `CheckContext` with three fields:

| Field | Type | Description |
|---|---|---|
| `input` | `TInput` | The workflow input from the dataset |
| `output` | `TOutput` | The workflow output (cached or freshly executed) |
| `ground_truth` | `Record<string, unknown>` | Ground truth values from the dataset YAML |
The `input` and `output` schemas are optional — they default to `z.any()` if omitted. When provided, they give you type safety inside the check function.
## Using Ground Truth

Evaluators can read expected values from the dataset’s `ground_truth` field. This lets you define per-case expectations without hardcoding them:
tests/evals/evaluators.ts
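A sketch of the idea, with the check written as a plain function so the ground-truth lookup stands out; the `min_length` key and the fallback default are illustrative, and in the real package this logic would sit inside a `verify()` check:

```typescript
// Plain-function sketch of a ground-truth-driven check; in the real package
// this body would live inside a verify() check, reading from CheckContext.
// The min_length key and the fallback default are illustrative.
type Result = { verdict: 'pass' | 'fail'; confidence: number };

function minLengthCheck(
  output: { post: string },
  ground_truth: Record<string, unknown>,
): Result {
  // Per-case expectation from the dataset YAML, with a default when absent.
  const min =
    typeof ground_truth.min_length === 'number' ? ground_truth.min_length : 100;
  return {
    verdict: output.post.length >= min ? 'pass' : 'fail',
    confidence: 1.0,
  };
}
```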
The `min_length` value comes from the dataset YAML — see Ground Truth Structure for how this works.
## Verdict Helpers

The `Verdict` object provides helpers for returning evaluation results. There are three categories:
- Deterministic assertions — `equals`, `gte`, `contains`, `matches`, etc. Return `EvaluationBooleanResult` with `confidence: 1.0`.
- Manual verdicts — `pass()`, `partial()`, `fail()`. Return `EvaluationVerdictResult` for custom logic.
- LLM result wrappers — `fromJudge()`, `score()`, `label()`. Wrap LLM output with `confidence: 0.9`.
tests/evals/evaluators.ts
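A sketch of the three categories, assuming the confidence conventions listed above (these are local stand-ins for illustration, not the package's implementations):

```typescript
// Local stand-ins illustrating the three categories of Verdict helpers.
type Verdict3 = 'pass' | 'partial' | 'fail';
type Result = { verdict: Verdict3; confidence: number };

// Deterministic assertion: confidence is always 1.0.
const gte = (actual: number, expected: number): Result => ({
  verdict: actual >= expected ? 'pass' : 'fail',
  confidence: 1.0,
});

// LLM result wrapper: wraps a judge's raw verdict with confidence 0.9.
const fromJudge = (verdict: Verdict3): Result => ({ verdict, confidence: 0.9 });

// Manual verdicts: custom three-way logic, here grading word count into
// bands (the 1.0 confidence on manual verdicts is an assumption).
function gradeLength(words: number): Result {
  if (words >= 500) return { verdict: 'pass', confidence: 1.0 };
  if (words >= 200) return { verdict: 'partial', confidence: 1.0 };
  return { verdict: 'fail', confidence: 1.0 };
}
```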
## LLM Judge Functions

For subjective evaluation — “is this blog post on-topic?”, “rate the quality” — use the judge functions. They load a `.prompt` file, call the LLM, and return a typed evaluation result.
Four judge functions are available:
| Function | Returns | Use for |
|---|---|---|
| `judgeVerdict` | pass / partial / fail | Subjective quality gates |
| `judgeScore` | number | Numeric ratings |
| `judgeBoolean` | true / false | Binary subjective checks |
| `judgeLabel` | string | Classifications |
tests/evals/evaluators.ts
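A hedged sketch of a judge-backed evaluator: the `judgeVerdict` fields follow the `JudgeArgs` table below, while the surrounding `verify()` shape, the import, and the variable names are assumptions rather than the package's confirmed API.

```typescript
import { verify, judgeVerdict } from '@outputai/evals';

// On-topic check delegated to an LLM judge. The prompt and variables fields
// follow the JudgeArgs table in this section; everything else is assumed.
export const onTopic = verify({
  name: 'on_topic',
  check: async ({ input, output }) =>
    judgeVerdict({
      prompt: 'judge_topic@v1',
      variables: { topic: input.topic, post: output.post },
    }),
});
```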
Each judge function takes a `JudgeArgs` object:

| Field | Type | Description |
|---|---|---|
| `prompt` | `string` | Prompt filename (e.g., `'judge_topic@v1'`) |
| `variables` | `Record<string, string \| number \| boolean>` | Template variables |
| `schema` | `ZodType` | Optional custom output schema (overrides the default) |
## Writing Judge Prompts

Judge `.prompt` files live in the same `tests/evals/` directory as your evaluators. Each judge function expects a specific output schema from the LLM:
| Function | Expected JSON fields |
|---|---|
| `judgeVerdict` | `{ verdict: 'pass' \| 'partial' \| 'fail', reasoning: string }` |
| `judgeScore` | `{ score: number, reasoning: string }` |
| `judgeBoolean` | `{ result: boolean, reasoning: string }` |
| `judgeLabel` | `{ label: string, reasoning: string }` |
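For example, a `judgeVerdict` prompt needs the model to answer with JSON like (the reasoning text is illustrative):

```json
{ "verdict": "pass", "reasoning": "The post stays on the requested topic throughout." }
```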
tests/evals/judge_topic@v1.prompt
Use `temperature: 0` for judge prompts — you want consistency, not creativity. For best practices on writing effective judge prompts, see LLM-as-a-Judge Best Practices.
## LLM Result Wrappers

If you’re calling the LLM yourself instead of using the judge functions, the `Verdict` object provides wrappers to convert raw LLM output into evaluation results (with confidence 0.9). See LLM Result Wrappers for the full reference.
## What’s Next
- Verdict Helpers — Complete reference for all Verdict methods with signatures and examples
- Datasets — Defining test cases with inputs and ground truth
- Running Eval Workflows — Wire evaluators into an eval workflow and run from the CLI
- LLM-as-a-Judge Best Practices — Writing effective judge prompts and choosing grading scales
- `@outputai/evals` API Reference — Complete package reference