writing-evals
Writing Evals
You write evaluations that prove AI capabilities work. Evals are the test suite for non-deterministic systems — they measure whether a capability still behaves correctly after every change.
If the task function uses the Vercel AI SDK, load the ai-sdk skill for correct generateText/streamText patterns.
Philosophy
- Evals are tests for AI. Every eval answers: "does this capability still work?"
- Scorers are assertions. Each scorer checks one property of the output.
- Data drives coverage. Happy path, adversarial, boundary, and negative cases.
- Read code first, ask later. Inspect the codebase and infer everything you can before asking.
How to Start
When the user asks you to write evals for an AI feature, read the code first.
Step 1: Understand the feature
- Find the AI function — search for the function the user mentioned. Read it fully.
- Trace the inputs — what data goes in? A string prompt, structured object, conversation history?
- Trace the outputs — what comes back? A string, category label, structured object, tool calls?
- Identify the model call — which LLM/model is used? What parameters?
- Check for existing evals — search for
*.eval.tsfiles. Don't duplicate what exists.
Step 2: Determine eval type
| Output type | Eval type | Scorer pattern |
|---|---|---|
| String category/label | Classification | exactMatch |
| Free-form text | Text quality | levenshtein or LLM-as-judge |
| Array of items | Retrieval/RAG | Set match + faithfulness |
| Structured object | Structured output | Field-by-field validation |
| Tool calls / agent result | Tool use | toolCallAccuracy |
Step 3: Choose scorers
Every eval needs at least 2 scorers. Layer them:
- Correctness (required) — Does the output match expected? Pick from the eval type table above.
- Quality (recommended) — Is the output well-formed? Check format, completeness, confidence.
- Reference-free (for user-facing text) — Is the output coherent, relevant? Use LLM-as-judge via custom scorer.
| Output type | Minimum scorers |
|---|---|
| Category label | exactMatch + confidence threshold |
| Free-form text | levenshtein + coherence (LLM-as-judge) |
| Structured object | Field match + field completeness |
| Tool calls | toolCallAccuracy + argument validation |
| Retrieval results | Set match + faithfulness |
Step 4: Generate
- Create the
.eval.tsfile colocated next to the source file - Import the actual function — do not create a stub
- Write scorers (minimum 2, see step 3)
- Generate test data (see Data Design Guidelines)
- Ensure
evaliteandvitestare dev dependencies in the package
Only ask if you cannot determine:
- What "correct" means for ambiguous outputs (e.g., summarization quality)
- Whether the user wants pass/fail or partial credit scoring
Reference
For detailed patterns and type signatures, read these on demand:
references/api-reference.md— Full type signatures, import paths, built-in scorers, CLIreferences/templates/— Ready-to-use eval file templates (see below)
Templates
| Template | File | Use case |
|---|---|---|
| Minimal | references/templates/minimal.md |
Simplest starting point |
| Classification | references/templates/classification.md |
Category labels |
| Retrieval/RAG | references/templates/retrieval.md |
Document retrieval, RAG |
| Structured output | references/templates/structured-output.md |
JSON object validation |
| Tool use | references/templates/tool-use.md |
Agent tool-call validation |
Data Design Guidelines
Step 1: Check for existing data
Before generating test data:
- Search the codebase — look for JSON/CSV files, seed data, test fixtures, or existing
data:arrays - Ask the user — "Do you have an eval dataset or example inputs/outputs?"
If the user has data, use it directly or load with dynamic data: data: async () => [...].
Step 2: Generate test data from code
If no data exists, generate it by reading the AI feature's code:
- Read the system prompt — it defines valid outputs. Extract categories, labels, expected behavior.
- Read the input type — generate realistic examples of that shape.
- Read any validation/parsing — tells you what correct output looks like.
- Look at constants/enums — if the feature classifies, use those as expected values.
Step 3: Cover all categories
| Category | What to generate |
|---|---|
| Happy path | Clear, unambiguous inputs with obvious correct answers |
| Adversarial | Prompt injection, misleading inputs |
| Boundary | Empty input, ambiguous intent, mixed signals |
| Negative | Inputs that should return empty/unknown/no-tool |
Minimum: 5-8 cases for a basic eval. 15-20 for production coverage.
Always add metadata for categorization:
{ input: "...", expected: "...", metadata: { purpose: "happy-path" } }
Guardrails
- Don't guess import paths — read
node_modules/evalitetypes if unsure - Don't create stubs — import the real function being evaluated
- Don't skip scorers — minimum 2 per eval, no exceptions
- Don't duplicate evals — check for existing
*.eval.tsfiles first - Do colocate —
.eval.tsnext to the source file it tests - Do read code first — never write evals from assumptions
- Do ensure deps —
evaliteandvitestmust be dev dependencies in the target package
More from maxmurr/skills
index-knowledge
Generate hierarchical AGENTS.md knowledge base for a codebase (root + complexity-scored subdirs), then align CLAUDE.md symlinks so Cursor/Claude see the same content. Use when user runs /index-knowledge, asks to regenerate AGENTS.md hierarchy, or refresh codebase knowledge docs.
8prompt-master
Generates effective, well-structured prompts for LLMs using the Anthropic Prompt Template technique. Use when the user wants to create a new LLM prompt, restructure an existing prompt, or improve prompt quality. Do not use for general text writing, non-LLM content generation, prompt debugging, prompt evaluation, or running/testing prompts.
2write-a-skill
Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when user wants to create, write, or build a new skill.
1