Testing Steps
Import the step and call it with input that matches its inputSchema. The return value is validated against outputSchema automatically.
steps.spec.ts
If the input doesn’t match inputSchema, or the return value doesn’t match outputSchema, the call throws a ValidationError — see Error Handling.
This test calls the real step implementation, including any API calls it makes. For fast, isolated tests, mock the external dependencies — see Mocking Steps and Dependencies below.
Testing Evaluators
Same pattern. Import the evaluator and call it with input. The result is an EvaluationResult with value, confidence, and optional reasoning.
Deterministic evaluators are especially good candidates for unit tests since they have predictable outputs:
evaluators.spec.ts
Testing Workflows
When a workflow runs outside Temporal (like in Vitest), it executes as a plain async function with a mock context. You’ll typically want to mock your steps so the workflow logic runs without real API calls — this is where deterministic testing shines. You’re testing your branching logic, retry loops, and data flow, not the LLM output.
workflow.spec.ts
Mock Context
Outside Temporal, the framework injects a default mock context:
- context.info.workflowId → 'test-workflow'
- context.control.continueAsNew → no-op async function
- context.control.isContinueAsNewSuggested() → false
You can override the defaults by passing a partial context in the second argument. This is useful when your workflow logic depends on the workflow ID or other context values:
The context override only applies outside Temporal. When running in production, the real Temporal context is used.
Mocking Steps and Dependencies
In workflow tests, mock your step modules so no real I/O happens. If your workflow imports sleep or other utilities from @outputai/core, mock those too:
Testing Retry Logic
If your workflow has a generate-evaluate-retry loop, you can test the branching by controlling what the mocked evaluator returns:
workflow.spec.ts
What to Test vs. What to Evaluate
| What you’re checking | Best tool |
|---|---|
| Data transformations, schema validation | Unit tests |
| Workflow branching and retry logic | Unit tests (mock steps + evaluators) |
| Step calls the right API with the right params | Unit tests (mock the client) |
| LLM output quality (accuracy, tone, relevance) | Evaluators (LLM-as-a-judge) |
| End-to-end behavior across multiple LLM calls | Tracing + human review |
| Regressions in LLM output over time | Offline evals with saved datasets |
Offline Evaluation with Datasets
Unit tests verify deterministic logic. But how do you catch regressions in LLM output quality across a set of known inputs? That’s what offline evaluation is for. The @outputai/evals package lets you run your workflow against saved datasets and check output quality using typed evaluators — without modifying your workflow code. You define evaluators with verify(), write dataset YAML files with expected inputs and ground truth, and run them via the CLI:
See the @outputai/evals package docs for the full guide on creating evaluators, writing datasets, and configuring eval workflows. See CLI Evaluation Commands for the command reference.