eval-writer
Eval Writer
Create new eval suites for the deepagentsjs monorepo. Each eval is an
independent workspace package under evals/ that uses the @deepagents/evals
harness, runs via vitest, and reports results to LangSmith.
Before you start
Read the existing eval infrastructure to understand current patterns:
internal/eval-harness/src/index.ts # EvalRunner, RunAgentParams, matchers
internal/eval-harness/src/deepagent.ts # DeepAgentEvalRunner, extend()
internal/eval-harness/src/setup.ts # Registered runners
evals/README.md # User-facing docs
internal/eval-harness/README.md # Harness internals
Scan existing evals for conventions:
evals/basic/index.test.ts # Simple: system prompt, reasoning
evals/files/index.test.ts # File ops: read, write, edit, glob, grep
evals/subagents/index.test.ts # Delegation: task tool, named subagents
Workflow
1. Understand the eval requirements
Clarify with the user:
- What capability is being evaluated? (file ops, tool use, multi-turn reasoning, memory, code generation, etc.)
- Where do test cases come from? Options:
- Inline — hardcoded in the test file (simple, good for <20 cases)
- JSON/JSONL fixture — checked into the eval package (good for 20-200 cases)
- External dataset — downloaded at setup time (good for published benchmarks)
- LangSmith dataset — pulled from LangSmith API (good for collaborative curation)
- How should results be scored? Options:
- Trajectory matchers — step count, tool calls, final text (built-in)
- Exact/fuzzy match — compare output to reference (simple)
- LLM-as-judge — use a model to grade the output (complex evals)
- Code execution — run generated code and check results (SWE-bench style)
- Custom evaluator — domain-specific scoring function
- Does the agent need special configuration? (custom tools, subagents, system prompt, initial files)
2. Create the eval package
Every eval is a workspace package under evals/<name>/.
Directory structure
evals/<name>/
├── package.json
├── vitest.config.ts
├── index.test.ts
├── README.md
└── (optional) fixtures/ # JSON/JSONL test data
└── (optional) vitest.setup.ts # Dataset loading, custom setup
└── (optional) evaluators.ts # Custom scoring functions
package.json
{
"name": "@deepagents/eval-<name>",
"private": true,
"type": "module",
"scripts": {
"test:eval": "vitest run"
},
"dependencies": {
"@deepagents/evals": "workspace:*",
"deepagents": "workspace:*",
"langsmith": "^0.5.4",
"vitest": "^4.0.18"
}
}
Add extra dependencies as needed (e.g. zod for tool schemas, langchain for the tool() helper, dataset-specific packages).
vitest.config.ts
import { defineConfig } from "vitest/config";
export default defineConfig({
test: {
environment: "node",
globals: false,
testTimeout: 120_000,
hookTimeout: 60_000,
teardownTimeout: 60_000,
include: ["**/*.test.ts"],
setupFiles: ["@deepagents/evals/setup"],
reporters: ["default", "langsmith/vitest/reporter"],
},
});
Adjust testTimeout for long-running evals (multi-turn, code execution).
Add "./vitest.setup.ts" to setupFiles if the eval needs custom setup (dataset loading, etc.).
README.md
# <name>
<One-line description of what this eval tests.>
Verify workspace registration
Check that pnpm-workspace.yaml includes "evals/*". It should already
be there — if not, add it.
3. Design test cases
Pattern A: Inline test cases (simple evals)
Best for small, hand-crafted test suites. Each test is an ls.test() call
with inputs and optional referenceOutputs.
ls.test(
"descriptive test name",
{
inputs: { query: "What is 2+2?" },
referenceOutputs: { expectedAnswer: "4" },
},
async ({ inputs, referenceOutputs }) => {
const result = await runner.run({ query: inputs.query });
// assertions...
},
);
Pattern B: Data-driven with ls.test.each (medium evals)
Best for 10-200 cases from a fixture file. Load the data and iterate:
import testCases from "./fixtures/cases.json";
// testCases = [{ inputs: { query: "..." }, referenceOutputs: { answer: "..." } }, ...]
ls.test.each(testCases)(
"case: ${inputs.query}",
async ({ inputs, referenceOutputs }) => {
const result = await runner.run({ query: inputs.query });
// assertions using referenceOutputs...
},
);
The fixture JSON must be an array of objects with at minimum { inputs: {...} }.
Optional fields: referenceOutputs, id, metadata, split.
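A TypeScript sketch of the row shape implied by those fields (the value types here are assumptions — adjust them to your eval). The loadDataset() loaders in Patterns C and D below are typed against this same shape:
// Assumed shape of a fixture row / dataset case.
interface TestCase {
  inputs: Record<string, unknown>; // required
  referenceOutputs?: Record<string, unknown>;
  id?: string; // stable example ID
  metadata?: Record<string, unknown>; // e.g. source, category
  split?: string; // e.g. "easy" | "hard"
}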
Pattern C: External dataset (published benchmarks)
For published benchmarks (oolong, AgentBench, SWE-bench, etc.), download and cache the dataset in a setup file.
Create vitest.setup.ts:
import { existsSync, mkdirSync, writeFileSync, readFileSync } from "fs";
import { join } from "path";
const CACHE_DIR = join(import.meta.dirname, ".cache");
const DATA_PATH = join(CACHE_DIR, "dataset.json");
export async function loadDataset(): Promise<TestCase[]> {
if (existsSync(DATA_PATH)) {
return JSON.parse(readFileSync(DATA_PATH, "utf-8"));
}
mkdirSync(CACHE_DIR, { recursive: true });
// Download from source — adapt to the specific benchmark
const response = await fetch("https://example.com/dataset.json");
const data = await response.json();
// Transform into eval format
const cases = data.map((item: any) => ({
inputs: { query: item.question },
referenceOutputs: { answer: item.gold_answer },
metadata: { source: item.id, category: item.category },
}));
writeFileSync(DATA_PATH, JSON.stringify(cases, null, 2));
return cases;
}
Add .cache/ to .gitignore in the eval package.
Then register it as a vitest setup file in vitest.config.ts:
setupFiles: ["@deepagents/evals/setup", "./vitest.setup.ts"],
And in the test file:
import { loadDataset } from "./vitest.setup.js";
const dataset = await loadDataset();
ls.describe(runner.name, () => {
ls.test.each(dataset)(
"${metadata.source}: ${inputs.query}",
async ({ inputs, referenceOutputs }) => {
// ...
},
);
}, { projectName: "deepagents-js-<name>", upsert: true });
Pattern D: LangSmith dataset
Pull test cases from a LangSmith dataset. Useful for collaborative curation where non-engineers add examples via the LangSmith UI.
import { Client } from "langsmith";
const client = new Client();
export async function loadDataset(): Promise<TestCase[]> {
const examples = [];
for await (const example of client.listExamples({
datasetName: "my-dataset-name",
})) {
examples.push({
id: example.id,
inputs: example.inputs,
referenceOutputs: example.outputs ?? {},
});
}
return examples;
}
4. Write scoring logic
Built-in trajectory matchers
The harness provides vitest matchers that also log LangSmith feedback scores. Use these as the primary building blocks:
// Exact step count
expect(result).toHaveAgentSteps(3);
// Exact tool-call count across all steps
expect(result).toHaveToolCallRequests(2);
// Check a specific tool call in step N (1-indexed)
expect(result).toHaveToolCallInStep(1, {
name: "write_file",
argsContains: { file_path: "/out.txt" }, // partial match
argsEquals: { file_path: "/out.txt" }, // exact match
});
// Final response text
expect(result).toHaveFinalTextContaining("hello", true /* caseInsensitive */);
// Extract final text for custom assertions
import { getFinalText } from "@deepagents/evals";
const text = getFinalText(result);
expect(text.trim()).toBe("4");
// File system assertions
expect(result.files["/output.md"]).toContain("expected content");
expect(Object.keys(result.files)).toHaveLength(3);
Custom feedback logging
Log additional LangSmith feedback scores beyond what matchers provide:
import * as ls from "langsmith/vitest";
// Numeric score
ls.logFeedback({ key: "accuracy", score: 0.95 });
// Boolean score
ls.logFeedback({ key: "correct", score: 1 });
// With comment
ls.logFeedback({ key: "quality", score: 0.8, comment: "Minor formatting issue" });
LLM-as-judge evaluators
For subjective quality, use ls.wrapEvaluator() to create a traced evaluator
that logs feedback automatically:
import * as ls from "langsmith/vitest";
import { ChatAnthropic } from "@langchain/anthropic";
const judge = new ChatAnthropic({ model: "claude-sonnet-4-5-20250929" });
const evaluateHelpfulness = ls.wrapEvaluator(
async ({ inputs, outputs, referenceOutputs }) => {
const response = await judge.invoke([
{
role: "system",
content: `Rate the helpfulness of the assistant's response on a scale of 0-1.
Respond with JSON: { "score": <number>, "reasoning": "<explanation>" }`,
},
{
role: "user",
content: `Question: ${inputs.query}\nExpected: ${referenceOutputs.answer}\nActual: ${outputs.response}`,
},
]);
const parsed = JSON.parse(response.content as string);
return {
key: "helpfulness",
score: parsed.score,
comment: parsed.reasoning,
};
},
);
// In a test:
const result = await runner.run({ query: inputs.query });
const text = getFinalText(result);
await evaluateHelpfulness({
inputs: { query: inputs.query },
outputs: { response: text },
referenceOutputs: referenceOutputs ?? {},
});
5. Wire up the test file
Minimal template
import * as ls from "langsmith/vitest";
import { expect } from "vitest";
import { getDefaultRunner, getFinalText } from "@deepagents/evals";
const runner = getDefaultRunner();
ls.describe(
runner.name,
() => {
ls.test(
"test name",
{
inputs: { query: "..." },
referenceOutputs: { answer: "..." },
},
async ({ inputs, referenceOutputs }) => {
const result = await runner.run({ query: inputs.query });
expect(result).toHaveAgentSteps(1);
expect(result).toHaveFinalTextContaining(referenceOutputs.answer);
},
);
},
{ projectName: "deepagents-js-<name>", upsert: true },
);
Key conventions
- `getDefaultRunner()` — reads the `EVAL_RUNNER` env var. Throws if not set.
- `runner.name` — used as the `ls.describe` name → becomes the LangSmith dataset name.
- `runner.run({ query, initialFiles? })` — pure invocation. Returns an `AgentTrajectory`.
- `runner.extend({ systemPrompt?, tools?, subagents?, ... })` — returns a new runner with agent config overrides. Use for tests that need custom agent setup.
- `projectName` in the `ls.describe` config — sets the LangSmith project for tracing. Convention: `"deepagents-js-<eval-name>"`.
- `upsert: true` — reuse the existing dataset/project instead of creating new ones each run.
- Always import `expect` from `vitest` — the harness extends it with custom matchers at import time.
- `ls.logOutputs()` is called inside the runner — do NOT call it in test code.
Using extend() for custom agent config
// Custom system prompt
const result = await runner
.extend({ systemPrompt: "You are a code reviewer." })
.run({ query: inputs.query });
// Custom tools
const result = await runner
.extend({ tools: [myCustomTool] })
.run({ query: inputs.query });
// Custom subagents
const result = await runner
.extend({
subagents: [{
name: "researcher",
description: "Research assistant",
systemPrompt: "You help with research.",
tools: [searchTool],
}],
})
.run({ query: inputs.query });
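myCustomTool and searchTool above are placeholders. A minimal sketch of defining one — assuming the tool() helper from @langchain/core/tools and zod, the optional dependencies mentioned in step 2:
import { tool } from "@langchain/core/tools";
import { z } from "zod";
// Canned-response tool: a deterministic output keeps trajectory matchers stable.
const searchTool = tool(
  async ({ query }) => `Top result for "${query}": example.com`,
  {
    name: "search",
    description: "Search the web for a query.",
    schema: z.object({ query: z.string() }),
  },
);
Canned responses are also the pattern the tool-use benchmarks rely on (see "Common benchmark patterns" below).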
Using initialFiles for seeded state
const result = await runner.run({
query: "Read /data.csv and count the rows.",
initialFiles: {
"/data.csv": "name,age\nAlice,30\nBob,25\n",
},
});
Sandbox-backed evals (containerized execution)
The default eval runners use the in-memory StateBackend — the agent can
read/write files but cannot execute shell commands, install packages, or
interact with a real OS. This is fine for testing tool selection, reasoning,
and file operations.
For evals that need real execution (SWE-bench, code generation, agentic benchmarks), the agent must run against a sandbox backend. Available sandbox providers:
| Provider | Package | Use case |
|---|---|---|
| Modal | `@deepagents/modal` | Remote containers, GPU support |
| Daytona | `@deepagents/daytona` | Cloud dev environments |
| Deno | `@deepagents/deno` | Lightweight local sandboxes |
| Node VFS | `@deepagents/node-vfs` | In-process virtual filesystem + shell |
Pass the sandbox via extend({ backend }). Manage its lifecycle with
beforeAll / afterAll (suite-level) or beforeEach / afterEach
(per-test isolation):
import * as ls from "langsmith/vitest";
import { expect, beforeAll, afterAll } from "vitest";
import { getDefaultRunner, getFinalText } from "@deepagents/evals";
import { ModalSandbox } from "@deepagents/modal";
const runner = getDefaultRunner();
let sandbox: ModalSandbox;
beforeAll(async () => {
sandbox = await ModalSandbox.create({
image: "python:3.12-slim",
timeout: 600,
});
});
afterAll(async () => {
await sandbox?.terminate();
});
ls.describe(
runner.name,
() => {
ls.test(
"agent can run python",
{ inputs: { query: "Write a Python script that prints 'hello' and run it." } },
async ({ inputs }) => {
const result = await runner
.extend({ backend: sandbox })
.run({ query: inputs.query });
expect(result).toHaveFinalTextContaining("hello");
},
);
},
{ projectName: "deepagents-js-sandbox-eval", upsert: true },
);
For per-test isolation (each test gets a fresh sandbox):
import { beforeEach, afterEach } from "vitest";
let sandbox: ModalSandbox;
beforeEach(async () => {
sandbox = await ModalSandbox.create({ image: "python:3.12-slim" });
});
afterEach(async () => {
await sandbox?.terminate();
});
When to containerize:
- The eval requires `execute()` (shell commands)
- The eval involves installing packages or modifying system state
- The eval runs untrusted or generated code
- Tests need filesystem isolation from each other
When in-memory is fine:
- Testing tool selection, reasoning, or response quality
- File read/write/edit operations (handled by `StateBackend` + `initialFiles`)
- System prompt adherence, subagent routing
Add the sandbox provider to package.json dependencies:
{
"dependencies": {
"@deepagents/modal": "workspace:*"
}
}
And increase testTimeout in vitest.config.ts — sandbox creation adds
overhead:
testTimeout: 300_000, // 5 minutes for sandbox evals
hookTimeout: 120_000, // sandbox setup/teardown
6. Install and verify
# From repo root
pnpm install
# Build the harness (if you changed it)
cd internal/eval-harness && pnpm build && cd ../..
# Run the new eval
EVAL_RUNNER=sonnet-4-5 pnpm --filter @deepagents/eval-<name> test:eval
7. Update documentation
Add the new eval to evals/README.md in the "Available eval suites" table:
| [`<name>/`](./<name>/) | <one-line description> |
Parity with Python deepagents evals
The Python deepagents package has eval suites in
libs/deepagents/tests/evals/. The JS evals should maintain parity.
When creating a new eval, check the Python source at
https://github.com/langchain-ai/deepagents/blob/v0.5/libs/deepagents/tests/evals/
for the reference implementation.
Current parity status
| Python eval | JS eval | Status |
|---|---|---|
| `test_system_prompt.py` | `evals/basic/` | ✅ Covered |
| `test_file_operations.py` | `evals/files/` | ✅ Covered |
| `test_subagents.py` | `evals/subagents/` | ✅ Covered |
| `test_memory.py` | `evals/memory/` | ✅ Covered |
| `test_hitl.py` | `evals/hitl/` | ✅ Covered |
| `test_skills.py` | `evals/skills/` | ✅ Covered |
| `test_summarization.py` | `evals/summarization/` | ❌ Missing |
Notes on HITL evals
HITL evals require multi-step invocation (invoke → check interrupts → resume
with Command). The eval runner's run() does a single invocation, so HITL
tests construct agents directly via createDeepAgent() with a checkpointer
and interruptOn config. See evals/hitl/index.test.ts for the pattern.
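A rough sketch of that shape — the option values and resume payload here are assumptions for illustration only; treat evals/hitl/index.test.ts as the source of truth:
import { MemorySaver, Command } from "@langchain/langgraph";
import { createDeepAgent } from "deepagents";
// Checkpointer + interruptOn as described above; the exact interruptOn value
// shape and resume payload are illustrative, not confirmed API.
const agent = createDeepAgent({
  checkpointer: new MemorySaver(),
  interruptOn: { write_file: true },
});
const config = { configurable: { thread_id: "hitl-case-1" } };
// 1. Initial invocation — the agent pauses at the configured interrupt
const first = await agent.invoke(
  { messages: [{ role: "user", content: "Write a plan to /plan.md." }] },
  config,
);
// 2. Inspect the interrupt in `first`, then 3. resume with a Command
const resumed = await agent.invoke(new Command({ resume: { decision: "approve" } }), config);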
Notes on summarization evals
Summarization evals need SummarizationMiddleware with low token thresholds,
a checkpointer, a real/virtual filesystem backend, and multi-turn invocations.
These tests would bypass the standard EvalRunner and construct agents
directly, similar to HITL.
Reference: ls.test.each API
ls.test.each is the most powerful pattern for data-driven evals. The table
must be an array of objects with at least { inputs }:
ls.test.each([
{ inputs: { query: "Q1" }, referenceOutputs: { answer: "A1" } },
{ inputs: { query: "Q2" }, referenceOutputs: { answer: "A2" } },
// Optional additional fields: id, metadata, split
{ id: "custom-id", inputs: { query: "Q3" }, referenceOutputs: { answer: "A3" }, split: "hard" },
])(
"case: ${inputs.query}", // Name template — interpolates from row
async ({ inputs, referenceOutputs, testMetadata }) => {
// testMetadata.exampleId, testMetadata.datasetId, etc.
const result = await runner.run({ query: inputs.query });
// ...
},
);
Reference: LangSmith integration
How datasets map
| Concept | LangSmith entity |
|---|---|
| `ls.describe(name, ...)` | Dataset (name = dataset name) |
| `ls.test(name, { inputs, referenceOutputs }, fn)` | Example in dataset |
| Running the test suite | Experiment on the dataset |
| `ls.logFeedback(...)` | Feedback on the experiment run |
| `ls.logOutputs(...)` | Experiment output (called by runner) |
Environment variables
| Variable | Purpose |
|---|---|
| `EVAL_RUNNER` | Which model runner to use (e.g. `sonnet-4-5`) |
| `LANGSMITH_API_KEY` | LangSmith auth |
| `LANGSMITH_PROJECT` | Override tracing project (normally set via `projectName`) |
| `LANGSMITH_TEST_TRACKING` | Set to `"false"` to disable LangSmith reporting |
| `ANTHROPIC_API_KEY` | For Anthropic model runners |
| `OPENAI_API_KEY` | For OpenAI model runners |
Available runners
Defined in internal/eval-harness/src/setup.ts:
| Runner name | Model |
|---|---|
| `sonnet-4-5` | Claude Sonnet 4.5 |
| `sonnet-4-5-thinking` | Claude Sonnet 4.5 with extended thinking |
| `opus-4-6` | Claude Opus 4.6 |
| `gpt-4.1` | GPT-4.1 |
| `gpt-4.1-mini` | GPT-4.1 Mini |
| `o3-mini` | o3-mini |
Reference: Implementing published benchmarks
When implementing an existing benchmark, follow these attribution and methodology guidelines.
Attribution
Always credit the original benchmark authors. In the eval's README.md:
# <benchmark-name>
Implementation of [<Benchmark Name>](<paper-or-repo-url>) by <authors> (<year>).
> <brief description from the paper abstract>
## Citation
\`\`\`bibtex
@article{...}
\`\`\`
## Adaptations
<Describe any differences from the original benchmark methodology:>
- <e.g. "Subset of N cases selected for cost efficiency">
- <e.g. "Adapted for agentic tool-use evaluation rather than direct QA">
- <e.g. "Uses LLM-as-judge instead of human annotation">
Common benchmark patterns
QA / Factual benchmarks (e.g. MMLU, TriviaQA)
- Test cases: question + gold answer
- Scoring: exact match or fuzzy match on final text
- Pattern: `ls.test.each` with fixture JSON
Multi-turn / Conversational (e.g. MT-Bench)
- Test cases: conversation turns that build on each other
- Scoring: LLM-as-judge on each turn
- Pattern: sequential `runner.run()` calls sharing state, or multi-message queries
Tool-use benchmarks (e.g. ToolBench, API-Bank)
- Test cases: task requiring specific tool calls
- Scoring: trajectory matchers (correct tools called, correct args)
- Pattern: `runner.extend({ tools: [...] })` with custom tools that return canned responses
Code generation (e.g. HumanEval, SWE-bench)
- Test cases: problem description + test suite
- Scoring: execute generated code, check test results
- Requires sandbox: yes — agent must run code and observe output
- Pattern: `runner.extend({ backend: sandbox })`, write code to a file, execute tests via the sandbox
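An illustrative scoring step for this pattern — it assumes the sandbox's execute() returns an exit code (adapt to the actual provider's return shape) and a hypothetical fixture row named task:
// Run the agent in the sandbox, then run the reference tests inside it.
const result = await runner
  .extend({ backend: sandbox })
  .run({ query: task.prompt }); // `task` is a hypothetical fixture row
const check = await sandbox.execute("python -m pytest /workspace/tests -q");
ls.logFeedback({ key: "tests_pass", score: check.exitCode === 0 ? 1 : 0 });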
Memory / Long-context (e.g. oolong, needle-in-haystack)
- Test cases: large context + retrieval question
- Scoring: whether the agent finds the target information
- Requires sandbox: no — seed files via `initialFiles`, check final text
- Pattern: in-memory `StateBackend` is sufficient
Agentic benchmarks (e.g. AgentBench, WebArena)
- Test cases: multi-step tasks requiring planning
- Scoring: combination of trajectory analysis + final state checking
- Requires sandbox: yes — agent needs shell access, package installs, environment interaction
- Pattern: `runner.extend({ backend: sandbox })` with per-test sandbox isolation
Handling large datasets
For benchmarks with thousands of cases:
- Subset selection — pick a representative subset (e.g. 100 cases per category). Document the selection criteria.
- Split support — use the `split` field in test cases to categorise (e.g. `"easy"`, `"hard"`). Run subsets via vitest filtering.
- Caching — download once, cache in `.cache/` (gitignored).
- Cost awareness — estimate API cost before running. Log it in the README. Consider a `--dry-run` that validates fixtures without calling the LLM.
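One lightweight way to implement that check (an assumption, not an existing harness feature) is a separate vitest file that validates fixtures without instantiating a runner or calling any model:
// fixtures.test.ts — plain vitest, no LLM calls.
import { describe, expect, it } from "vitest";
import cases from "./fixtures/cases.json";
describe("fixture sanity", () => {
  it("every case has inputs", () => {
    expect(Array.isArray(cases)).toBe(true);
    for (const c of cases) {
      expect(c).toHaveProperty("inputs");
    }
  });
});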