The @outputai/evals package lets you test workflow quality across datasets — without modifying your workflow code. You define evaluators with verify(), write datasets in YAML, and run them with the CLI. Each dataset case feeds a saved workflow input/output pair through your evaluators, and the framework reports pass/partial/fail verdicts per case.
This is the complement to evaluator steps that run inside workflows. Those evaluators power generate-evaluate-retry loops in production. Evaluation workflows answer a different question: “across a set of known inputs, does my workflow still produce acceptable output?” For the full guide, see Evaluation Workflow.
What’s in the Package
| Export | Description |
|---|---|
evalWorkflow | Define an eval workflow that tests datasets against evaluators |
verify | Create a typed evaluator with Zod schemas for input/output |
Verdict | Deterministic assertion helpers (equals, gte, contains, etc.) and LLM result wrappers |
judgeVerdict | LLM judge that returns pass/partial/fail |
judgeScore | LLM judge that returns a numeric score |
judgeBoolean | LLM judge that returns true/false |
judgeLabel | LLM judge that returns a string label |
interpretResult | Convert an evaluator result to a pass/partial/fail verdict |
aggregateCaseVerdict | Combine multiple evaluator results into a single case verdict |
renderEvalOutput | Format eval results for CLI output |
computeExitCode | Return 1 if any case failed, 0 otherwise |
Creating Evaluators with verify()
verify() creates a typed evaluator that receives the workflow’s input, output, and optional ground truth from the dataset. It wraps evaluator() from @outputai/core so it integrates with both the eval workflow and the Temporal worker.
input and output schemas are optional — they default to z.any() if omitted. The check function receives a CheckContext:
| Field | Type | Description |
|---|---|---|
input | TInput | The workflow input from the dataset |
output | TOutput | The workflow output (cached or freshly executed) |
context.ground_truth | Record<string, unknown> | Ground truth values from the dataset YAML |
Basic Example
A deterministic evaluator that checks a sum calculation:
tests/evals/evaluators.ts
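A self-contained sketch of what such an evaluator could look like, assuming the verify()/Verdict API shape described on this page. The tiny verify and Verdict stand-ins below are local mocks, not the real @outputai/evals exports; in a real project you would import them from the package instead.

```typescript
// Local stand-ins (mocks) for the package exports, so this sketch runs
// offline. The shapes follow the CheckContext table above.
type EvalResult = { verdict: 'pass' | 'fail'; confidence: number; reasoning?: string };
type CheckContext<I, O> = {
  input: I;                                           // workflow input from the dataset
  output: O;                                          // workflow output (cached or fresh)
  context: { ground_truth: Record<string, unknown> };
};

const Verdict = {
  // Deterministic helpers always report confidence 1.0.
  equals: (actual: unknown, expected: unknown): EvalResult =>
    actual === expected
      ? { verdict: 'pass', confidence: 1.0 }
      : { verdict: 'fail', confidence: 1.0, reasoning: `expected ${String(expected)}, got ${String(actual)}` },
};

const verify = <I, O>(opts: { check: (ctx: CheckContext<I, O>) => EvalResult }) => opts.check;

// The evaluator itself: does the workflow's reported sum match a + b?
const sumIsCorrect = verify<{ a: number; b: number }, { sum: number }>({
  check: ({ input, output }) => Verdict.equals(output.sum, input.a + input.b),
});
```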
Ground Truth Example
Evaluators can read per-evaluator ground truth from the dataset. The framework merges global ground truth with evaluator-specific overrides:
tests/evals/evaluators.ts
tests/datasets/stripe_blog.yml
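A hypothetical reconstruction of what such a dataset might contain. The field names (notes, topic) and the evaluator name (topic_judge) are illustrative, not taken from the real dataset:

```yaml
# Hypothetical sketch — field names and values are illustrative.
name: stripe_blog
input:
  topic: "Stripe"
ground_truth:
  notes: "Should read like a company blog post"    # global: visible to all evaluators
  evals:
    topic_judge:
      notes: "Judge strictly on topical relevance" # overrides the global for topic_judge
```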
Top-level ground_truth fields (such as notes) are available to all evaluators. Fields under evals.&lt;evaluator_name&gt; are merged in for that specific evaluator, overriding globals with the same key.
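The merge rule can be sketched as follows. This mirrors the documented behavior; it is not the package's actual implementation:

```typescript
// Top-level fields apply to every evaluator; fields under evals.<name>
// override globals with the same key for that evaluator only.
type GroundTruth = Record<string, unknown> & {
  evals?: Record<string, Record<string, unknown>>;
};

function groundTruthFor(gt: GroundTruth, evaluatorName: string): Record<string, unknown> {
  const { evals, ...globals } = gt; // strip the evals map from the globals
  return { ...globals, ...(evals?.[evaluatorName] ?? {}) };
}
```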
Verdict Helpers
The Verdict object provides deterministic assertion helpers and LLM result wrappers. All deterministic helpers return results with confidence 1.0.
Deterministic Assertions
| Helper | Arguments | Passes when |
|---|---|---|
Verdict.equals(actual, expected) | any, any | actual === expected |
Verdict.closeTo(actual, expected, tolerance) | number, number, number | \|actual - expected\| <= tolerance
Verdict.gt(actual, threshold) | number, number | actual > threshold |
Verdict.gte(actual, threshold) | number, number | actual >= threshold |
Verdict.lt(actual, threshold) | number, number | actual < threshold |
Verdict.lte(actual, threshold) | number, number | actual <= threshold |
Verdict.inRange(actual, min, max) | number, number, number | min <= actual <= max |
Verdict.contains(haystack, needle) | string, string | haystack.includes(needle) |
Verdict.matches(value, pattern) | string, RegExp | pattern.test(value) |
Verdict.includesAll(actual, expected) | array, array | actual contains every element of expected |
Verdict.includesAny(actual, expected) | array, array | actual contains at least one element of expected |
Verdict.isTrue(value) | boolean | value === true |
Verdict.isFalse(value) | boolean | value === false |
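As an illustration, two of these helpers could be sketched like this. The sketch mirrors the semantics in the table above (including the fixed confidence of 1.0); it is not the package's internal code:

```typescript
type EvalResult = { verdict: 'pass' | 'fail'; confidence: number };

// Passes when |actual - expected| <= tolerance.
const closeTo = (actual: number, expected: number, tolerance: number): EvalResult => ({
  verdict: Math.abs(actual - expected) <= tolerance ? 'pass' : 'fail',
  confidence: 1.0,
});

// Passes when min <= actual <= max.
const inRange = (actual: number, min: number, max: number): EvalResult => ({
  verdict: min <= actual && actual <= max ? 'pass' : 'fail',
  confidence: 1.0,
});
```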
Manual Verdicts
| Helper | Arguments | Result |
|---|---|---|
Verdict.pass(reasoning?) | string? | Pass with confidence 1.0 |
Verdict.partial(confidence, reasoning?, feedback?) | number, string?, FeedbackArg[]? | Partial with custom confidence |
Verdict.fail(reasoning, feedback?) | string, FeedbackArg[]? | Fail with confidence 0.0 |
LLM Result Wrappers
These wrap LLM judge output into evaluation results with confidence 0.9:
| Helper | Arguments | Result type |
|---|---|---|
Verdict.fromJudge({ verdict, reasoning }) | object | Verdict (pass/partial/fail) |
Verdict.score(value, reasoning?) | number, string? | Number |
Verdict.label(value, reasoning?) | string, string? | String |
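Verdict.fromJudge could be sketched like this, mirroring the table above (a stand-in, not the package source):

```typescript
type JudgeOutput = { verdict: 'pass' | 'partial' | 'fail'; reasoning: string };

// Wraps judge output into an evaluation result; LLM-derived results
// carry confidence 0.9 rather than the 1.0 of deterministic helpers.
const fromJudge = ({ verdict, reasoning }: JudgeOutput) => ({
  verdict,
  reasoning,
  confidence: 0.9,
});
```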
LLM Judge Functions
For subjective evaluation — “is this blog post on-topic?”, “rate the quality 0-100” — use the judge functions. They load a .prompt file, call the LLM, and return a typed evaluation result.
tests/evals/evaluators.ts
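A hypothetical sketch of a judge-based check. The judgeVerdict call shape is assumed from the JudgeArgs table below, and the stub here returns a canned result so the snippet runs without an LLM; in a real project judgeVerdict comes from @outputai/evals and actually loads the .prompt file:

```typescript
type JudgeResult = { verdict: 'pass' | 'partial' | 'fail'; reasoning: string; confidence: number };

// Stand-in for the real judgeVerdict, which loads the .prompt file,
// renders the template variables, and calls the LLM.
async function judgeVerdict(args: {
  prompt: string;
  variables: Record<string, string | number | boolean>;
}): Promise<JudgeResult> {
  return { verdict: 'pass', reasoning: `judged with ${args.prompt}`, confidence: 0.9 };
}

// An on-topic check that delegates the subjective judgment to the LLM.
async function checkOnTopic(post: string, topic: string): Promise<JudgeResult> {
  return judgeVerdict({
    prompt: 'judge_topic@v1',
    variables: { post, topic },
  });
}
```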
| Function | Expected schema | Returns |
|---|---|---|
judgeVerdict | { verdict: 'pass' \| 'partial' \| 'fail', reasoning: string } | EvaluationVerdictResult
judgeScore | { score: number, reasoning: string } | EvaluationNumberResult |
judgeBoolean | { result: boolean, reasoning: string } | EvaluationBooleanResult |
judgeLabel | { label: string, reasoning: string } | EvaluationStringResult |
JudgeArgs object:
| Field | Type | Description |
|---|---|---|
prompt | string | Prompt filename (e.g., 'judge_topic@v1') |
variables | Record<string, string \| number \| boolean> | Template variables
schema | ZodType | Custom output schema (overrides the default) |
.prompt files live in the same tests/evals/ directory as your evaluators:
tests/evals/judge_topic@v1.prompt
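A sketch of what such a prompt might contain. The {{variable}} interpolation syntax and the response-format instructions here are assumptions, not the package's documented template format:

```
You are judging whether a blog post stays on topic.

Topic: {{topic}}
Post: {{post}}

Respond with JSON matching:
{ "verdict": "pass" | "partial" | "fail", "reasoning": "<one sentence>" }
```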
Creating an Eval Workflow
evalWorkflow() ties your evaluators together into a workflow that the CLI can run against datasets:
tests/evals/workflow.ts
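A self-contained sketch of the workflow config shape. evalWorkflow below is a one-line stand-in that simply returns its config so the entry shape can be inspected offline; in a real project it comes from @outputai/evals, and the evaluators come from verify(). The placeholder evaluators are illustrative:

```typescript
// Shapes follow the evals-array and interpret tables on this page.
type InterpretConfig =
  | { type: 'boolean' }
  | { type: 'verdict' }
  | { type: 'number'; pass: number; partial?: number }
  | { type: 'string'; pass: string[]; partial?: string[] };

type EvalEntry = {
  evaluator: Function;                        // an evaluator created with verify()
  criticality?: 'required' | 'informational'; // defaults to 'required'
  interpret: InterpretConfig;
};

const evalWorkflow = (config: { evals: EvalEntry[] }) => config;

const sumIsCorrect = () => true; // placeholder deterministic evaluator
const onTopicScore = () => 0.8;  // placeholder scoring evaluator

const workflow = evalWorkflow({
  evals: [
    { evaluator: sumIsCorrect, interpret: { type: 'boolean' } },
    {
      evaluator: onTopicScore,
      criticality: 'informational',
      interpret: { type: 'number', pass: 0.7, partial: 0.4 },
    },
  ],
});
```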
Each entry in the evals array defines:
| Field | Type | Default | Description |
|---|---|---|---|
evaluator | Function | — | An evaluator created with verify() |
criticality | 'required' \| 'informational' | 'required' | Whether failure should fail the case
interpret | InterpretConfig | — | How to convert the evaluator result to a verdict |
tests/evals/workflow.ts
Criticality
- required (default): If this evaluator fails, the entire case fails.
- informational: Failure is reported but doesn’t affect the case verdict. Use for metrics you want to track without gating on.
Interpret Types
The interpret config tells the framework how to convert the raw evaluator result into a pass/partial/fail verdict:
| Type | Config | Pass when | Partial when | Fail when |
|---|---|---|---|---|
boolean | { type: 'boolean' } | value === true | — | value === false |
verdict | { type: 'verdict' } | value === 'pass' | value === 'partial' | value === 'fail' |
number | { type: 'number', pass: 0.7, partial: 0.4 } | value >= pass | value >= partial | otherwise |
string | { type: 'string', pass: ['a', 'b'], partial: ['c'] } | value in pass | value in partial | otherwise |
partial threshold is optional for both number and string types — omit it to have only pass and fail.
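The number and string rules above can be sketched as follows. This mirrors the documented thresholds; it is not the package's interpretResult implementation:

```typescript
type V = 'pass' | 'partial' | 'fail';

// value >= pass → 'pass'; value >= partial → 'partial'; else 'fail'.
// Omitting partial leaves only pass/fail.
function interpretNumber(value: number, cfg: { pass: number; partial?: number }): V {
  if (value >= cfg.pass) return 'pass';
  if (cfg.partial !== undefined && value >= cfg.partial) return 'partial';
  return 'fail';
}

// Membership in the pass list → 'pass'; in the partial list → 'partial'.
function interpretString(value: string, cfg: { pass: string[]; partial?: string[] }): V {
  if (cfg.pass.includes(value)) return 'pass';
  if (cfg.partial?.includes(value)) return 'partial';
  return 'fail';
}
```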
Case Verdict Aggregation
Each dataset case runs all evaluators. The case-level verdict is determined by:
- If any required evaluator fails → case fails
- Else if any required evaluator is partial → case is partial
- Otherwise → case passes
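This aggregation rule can be sketched as a small function. It mirrors the documented behavior, not aggregateCaseVerdict's source; note that informational results never affect the case verdict:

```typescript
type V = 'pass' | 'partial' | 'fail';
type Outcome = { verdict: V; criticality: 'required' | 'informational' };

function caseVerdict(results: Outcome[]): V {
  // Informational evaluators are reported but ignored here.
  const required = results.filter((r) => r.criticality === 'required');
  if (required.some((r) => r.verdict === 'fail')) return 'fail';
  if (required.some((r) => r.verdict === 'partial')) return 'partial';
  return 'pass';
}
```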
Datasets
Datasets are YAML files that live in tests/datasets/ within your workflow directory. Each file defines one test case:
tests/datasets/basic_input.yml
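A hypothetical reconstruction of a minimal dataset file. The input and ground_truth field names are illustrative:

```yaml
# Hypothetical sketch — field names and values are illustrative.
name: basic_input
input:
  a: 2
  b: 3
ground_truth:
  expected_sum: 5
```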
| Field | Required | Description |
|---|---|---|
name | Yes | Unique name for this test case |
input | Yes | The workflow input |
ground_truth | No | Expected values for evaluators to check against |
last_output | No | Cached workflow output (used with --cached flag) |
last_eval | No | Cached evaluation results from the last run |
Ground Truth Structure
Ground truth supports global values and per-evaluator overrides, following the merge rules described in the Ground Truth Example above.
Directory Structure
Eval files live alongside your workflow in a tests/ directory:
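A plausible layout, assembled from the file paths cited on this page (the top-level workflow filename is an assumption):

```
my-workflow/
├── workflow.ts                  # your regular workflow (name assumed)
└── tests/
    ├── evals/
    │   ├── workflow.ts          # the eval workflow
    │   ├── evaluators.ts        # evaluators created with verify()
    │   └── judge_topic@v1.prompt
    └── datasets/
        ├── basic_input.yml
        └── stripe_blog.yml
```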
The eval workflow (tests/evals/workflow.ts) is discovered automatically by the worker alongside your regular workflow.