`tests/datasets/` within your workflow directory:
## Basic Dataset
The simplest dataset has a `name`, `input`, and cached output:

tests/datasets/basic_input.yml
`last_output` contains the workflow's cached result. When you run evals with `--cached`, the framework uses this output instead of re-executing the workflow.
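The example file itself was not included above, so here is a minimal sketch of what `tests/datasets/basic_input.yml` might contain. The top-level fields (`name`, `input`, `last_output`) come from this page; the input and output values are hypothetical and must match your own workflow's input schema.

```yaml
# Hypothetical basic dataset file. Only name, input, and last_output
# are documented fields; the keys under input/last_output are examples.
name: basic_input
input:
  url: https://example.com/article   # assumed input shape
last_output:
  summary: A short summary of the article.   # cached result, reused with --cached
```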
## Dataset with Ground Truth
For evaluators that need expected values, add a `ground_truth` section:
tests/datasets/stripe_blog.yml
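The original example file is missing here, so the following is a hedged sketch of what `tests/datasets/stripe_blog.yml` could look like. The `name`, `input`, `ground_truth`, and `last_output` fields are documented above; every value shown is illustrative.

```yaml
# Hypothetical dataset with ground truth. Field values are examples only.
name: stripe_blog
input:
  url: https://stripe.com/blog/example-post   # assumed input shape
ground_truth:
  topic: payments              # expected value for evaluators to check
last_output:
  summary: A post about payments infrastructure.   # cached workflow output
```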
## Dataset Fields
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Unique name for this test case |
| `input` | Yes | The workflow input (must match your workflow's input schema) |
| `ground_truth` | No | Expected values for evaluators to check against |
| `last_output` | No | Cached workflow output (used with the `--cached` flag) |
| `last_eval` | No | Cached evaluation results from the last run |
## Ground Truth Structure
Ground truth supports global values and per-evaluator overrides, and is exposed to evaluators as `context.ground_truth`.
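The example that illustrated this structure was lost in extraction. As a hedged sketch, one plausible shape is a set of global keys plus a key named after an evaluator that overrides them for that evaluator only; the override syntax and the evaluator name below are assumptions, not confirmed by this page.

```yaml
# Hypothetical ground_truth section showing global values plus a
# per-evaluator override. The "accuracy_judge" key and the override
# mechanism are assumed for illustration.
ground_truth:
  topic: payments              # global value, visible to all evaluators
  accuracy_judge:              # assumed per-evaluator override block
    topic: payments infrastructure
```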
## Managing Datasets with the CLI
### Listing Datasets
### Generating Datasets
You can generate datasets from scenario files, trace files, or production traces.

## What's Next
- Running Eval Workflows — Wire evaluators into an eval workflow and run them from the CLI
- Workflow Evaluators — Writing evaluators with `verify()`, Verdict helpers, and judge functions