# Test-Driven Development (TDD)

## Overview
This skill implements Canon TDD with AI-specific guardrails:
- Build or update a scenario list.
- Execute exactly one scenario as a runnable test.
- Prove RED.
- Implement minimum change for GREEN.
- Optionally refactor.
- Repeat until scenario list is empty.
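As a sketch, the loop above can be expressed as a small driver. Everything here is illustrative (the skill defines no `runCanonLoop` API); the callbacks stand in for the real test-writing, test-running, and implementation steps:

```javascript
// Minimal sketch of the Canon loop as a driver. Names are illustrative,
// not part of this skill's required API.
function runCanonLoop(backlog, { writeTest, prove, implement, refactor }) {
  const evidence = [];
  while (backlog.length > 0) {
    const scenario = backlog.shift();   // exactly one scenario per cycle
    writeTest(scenario);                // Step 1: one runnable test
    if (prove(scenario) !== 'red') {    // Step 2: must fail, for the right reason
      throw new Error(`Scenario ${scenario.id} did not start RED`);
    }
    implement(scenario);                // Step 3: minimum change
    if (prove(scenario) !== 'green') {  // Step 4: must now pass
      throw new Error(`Scenario ${scenario.id} did not reach GREEN`);
    }
    if (refactor) refactor(scenario);   // Step 5: optional, tests stay green
    evidence.push({ id: scenario.id, status: 'green' });
  }
  return evidence;                      // Step 6: backlog empty
}
```

The point of the sketch is the ordering: RED is proved before any implementation, and GREEN is proved before refactoring.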
## When to Use

Use for:
- New features
- Bug fixes
- Behavior changes
- Repository-scale patching driven by tests
- AI-assisted code generation where tests are executable specifications
Ask for human approval before bypassing TDD; bypassing is acceptable only for:
- Throwaway prototypes
- Purely declarative config edits with no execution path
- One-off migration scripts that will not be maintained
## The Iron Law

**NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST**

If code was written first, discard it and restart from RED.
## Canon Loop

### Step 0: Create/refresh scenario backlog
Before building the backlog, query memory for past failure signatures and reusable test templates:
```javascript
Skill({ skill: 'memory-search' }); // query: "<feature-name> test failure signatures"
```

Read `.claude/context/memory/learnings.md` for recurring anti-patterns relevant to this task.
Then:
- Keep a short ordered list of test scenarios for this task.
- Prioritize by design signal and risk, not by implementation convenience.
- Add discovered scenarios during execution.
- Reuse templates from memory — do not repeat failure patterns already documented.
### Step 1: Pick exactly one scenario and write one runnable test
- One behavior per cycle.
- Use clear behavior names.
- Favor real collaborators; mock only external boundaries.
### Step 2: Prove RED
- Run the narrowest test command.
- Failure must be due to missing behavior, not syntax or setup errors.
- Record red evidence (test file and failing assertion message).
### Step 3: Implement minimum GREEN patch

- Implement only what the current red test requires.
- No speculative APIs or unrelated cleanup.
- Keep patch bounded to current scenario.
### Step 4: Prove GREEN
- Re-run narrow test command.
- Run impacted suite (or package-level test set).
- Confirm no regressions.
#### Flakiness Gate (mandatory for async, hook, or nondeterministic tests)

For tests that involve async I/O, stop hooks, timers, or file-system operations, a single pass is insufficient. Require 3 consecutive passes before declaring GREEN:

```bash
# Run 3 times — all 3 must pass
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs && \
node --test tests/hooks/routing-guard.test.cjs
```

A test that passes once and fails on the second run is RED, not GREEN. Do not advance to Step 5 until 3 consecutive passes are confirmed.
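The gate can be sketched as a helper, assuming a `runOnce` callback that executes the narrow test command and returns true on a pass (the helper name is illustrative):

```javascript
// Sketch: require N consecutive passes before declaring GREEN.
// `runOnce` stands in for the narrow test command (e.g. an execSync wrapper
// that returns false on a nonzero exit code).
function requireConsecutivePasses(runOnce, n = 3) {
  for (let i = 1; i <= n; i++) {
    if (!runOnce()) {
      return { green: false, failedOnRun: i }; // any failure means still RED
    }
  }
  return { green: true, runs: n };
}
```

A flaky test that passes on run 1 but fails on run 2 reports `failedOnRun: 2` and stays RED.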
### Step 5: Optional refactor
- Refactor only with green tests.
- Re-run the same test set after refactor.
### Step 6: Repeat until backlog empty
## AI-Assisted Guardrails
- Use tests as executable prompt context; keep prompts short and test-focused.
- Prefer deterministic tests (stable fixtures, no nondeterministic ordering).
- Use bounded repair loops: max 3 repair attempts per scenario before redesign.
- Run anti-test-hacking checks:
  - Verify changed assertions still express the original requirement.
  - Add at least one negative test for bug-fix tasks.
  - Ensure code does not branch on test-only artifacts.
## Memory Acceleration Layer
Use lightweight memory only to reduce repeated setup and triage:
- preferred repo-local test/lint/format commands
- recurring failure signatures and short fix summaries
- recurring anti-pattern reminders
- reusable scenario templates
Reference: `references/tdd-memory-profile.md`
Hard rules:
- memory never bypasses RED proof
- memory never changes Canon sequence
- keep profile bounded and low-noise
## Test-Driven Prompting (TDP) — 2026 Standard Pattern
TDP is the dominant 2026 pattern for multi-agent TDD: inject the verbatim failing test output into the developer agent spawn prompt. This eliminates interpretation errors — the developer sees exactly what the test runner sees.
### Pattern

Instead of describing the failure in prose, capture stdout/stderr and inject it directly:
```javascript
// Step 1: Run the test and capture raw output
const { execSync } = require('child_process');
let testOutput = '';
try {
  execSync('node --test tests/hooks/routing-guard.test.cjs', { encoding: 'utf-8' });
} catch (e) {
  testOutput = e.stdout + e.stderr; // Verbatim failure output
}

// Step 2: Inject verbatim into the developer spawn prompt (no paraphrasing)
Task({
  task_id: 'task-impl',
  subagent_type: 'developer',
  prompt: `## FAILING TEST (verbatim — do NOT modify the test file)\n\`\`\`\n${testOutput}\n\`\`\`\nImplement ONLY what is needed to make this pass.`,
});
```
### Why TDP Works
- Eliminates paraphrased failure descriptions (telephone game effect)
- Developer has the full assertion context: line number, actual vs expected values
- Forces minimal implementation — developer can only implement what the test demands
- Prevents specification drift between QA agent's test intent and developer's interpretation
### TDP + Multi-Agent TDD Decomposition
| Step | Agent | Action |
|---|---|---|
| 1 | `qa` | Write failing test, commit test-only, capture raw output |
| 2 | Router | Extract test output, build TDP spawn prompt |
| 3 | `developer` | Implement to GREEN using verbatim test output as spec |
| 4 | `reflection-agent` | Verify no test assertions were modified (git diff check) |
Source: Simon Willison (2026) — "Red/Green TDD for agents: failing test output IS the specification"; TDFlow arXiv:2510.23761.
## Autonomous TDD with ralph-loop (Session-Persistent Iteration)

For repository-scale TDD where sessions may be interrupted, wire ralph-loop (Mode 2 — router-managed) to maintain the TDD scenario backlog across interruptions.
### TDD State Schema

Maintain a TDD-specific state file at `.claude/context/runtime/tdd-state.json`:
```json
{
  "scenarios": [
    {
      "id": "sc-001",
      "description": "routing-guard blocks Write on creator paths",
      "status": "pending"
    },
    { "id": "sc-002", "description": "spawn-token-guard warns at 80K tokens", "status": "green" }
  ],
  "completedScenarios": [
    {
      "id": "sc-002",
      "evidenceCommand": "node --test tests/hooks/spawn-token-guard.test.cjs",
      "passedAt": "2026-03-12T10:00:00Z"
    }
  ],
  "currentScenario": "sc-001",
  "evidenceLog": [
    {
      "scenarioId": "sc-001",
      "phase": "red",
      "output": "AssertionError: expected exit code 2, got 0",
      "timestamp": "..."
    }
  ]
}
```
### Resume Pattern

At the start of each iteration, read the TDD state file:
```javascript
// Step 0 — before building/refreshing the backlog
const fs = require('fs');
const statePath = '.claude/context/runtime/tdd-state.json';
const state = fs.existsSync(statePath)
  ? JSON.parse(fs.readFileSync(statePath, 'utf-8') || '{}')
  : {};
const completedIds = (state.completedScenarios || []).map(s => s.id);
const remaining = (state.scenarios || []).filter(s => !completedIds.includes(s.id));
// Pick the next scenario from `remaining` — never re-run completed ones
```
### Integration with ralph-loop Mode 2
1. Router spawns `qa` with `{ task_id, subagent_type: 'qa', prompt: TDP_PROMPT + verbatim state }`
2. `qa` writes the test → runs it → captures output → updates `tdd-state.json` (phase: red)
3. Router spawns `developer` with the TDP prompt (verbatim test output injected)
4. `developer` implements → updates `tdd-state.json` (phase: green)
5. Router checks `remaining.length === 0` → emits `RALPH_AUDIT_COMPLETE_NO_FINDINGS`
6. If `remaining.length > 0` → loop back to step 1 with the next scenario

Anti-pattern: Never re-run scenarios already marked green in state — this wastes iterations and may corrupt evidence logs.
## Repository-Scale and Class-Level Guidance
- For repository-scale work, decompose by failing test cluster and assign one cluster per loop.
- For class-level synthesis, derive a method dependency order and implement one method at a time with method-level public tests.
- Keep long-context pressure low by limiting each loop to one scenario and one patch objective.
## Verification Checklist
- Scenario backlog exists and was updated during work
- Every production change maps to at least one failing-then-passing test
- RED evidence captured (command + failure summary)
- GREEN evidence captured (command + pass summary)
- No unresolved failing tests in touched scope
- Lint/format/test commands completed or explicitly reported as blocked
- No detected test-hacking pattern
## Pre-Completion Commands (Project-Scoped)

Use the project's actual commands. Typical sequence:
```bash
# 1) targeted test
pnpm test <target>

# 2) impacted suite
pnpm test

# 3) lint
pnpm lint

# 4) format check
pnpm format:check
```
If the repo uses different scripts, replace these with local equivalents and report exactly what ran.
## Rationalization Countermeasures
- "I will add tests later" -> stop and write current red test.
- "This is too small to test" -> write one minimal behavior test.
- "I already manually tested" -> manual runs do not replace executable regression tests.
- "I spent too long to delete pre-test code" -> sunk cost; restart from RED.
## Related Files

- `references/research-requirements.md`
- `references/tdd-memory-profile.md`
- `testing-anti-patterns.md`
- `rules/tdd.md`
- `templates/implementation-template.md`
## Research Basis

This skill is aligned with:
- Martin Fowler TDD (Dec 11, 2023)
- Kent Beck Canon TDD (Dec 11, 2023)
- Rafique & Misic meta-analysis, IEEE TSE DOI:10.1109/TSE.2012.28
- LLM4TDD (arXiv:2312.04687)
- Test-Driven Development for Code Generation (arXiv:2402.13521)
- Tests as Prompt (arXiv:2505.09027)
- SWE-Flow (arXiv:2506.09003)
- TDFlow (arXiv:2510.23761)
- Scaling TDD from Functions to Classes (arXiv:2602.03557)
## Memory Protocol

Before starting:

- Read `.claude/context/memory/learnings.md`

After completing:

- New pattern → `.claude/context/memory/learnings.md`
- Issue found → `.claude/context/memory/issues.md`
- Decision made → `.claude/context/memory/decisions.md`
Assume interruption: if it is not in memory, it did not happen.
## Agent-Studio TDD Extensions (2026)

### Hook Testing Pattern

Hooks use a stdin/stdout JSON protocol:
```javascript
const assert = require('node:assert');
const { spawn } = require('child_process');

const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
proc.stdin.write(JSON.stringify({ tool_name: 'Write', tool_input: {} }));
proc.stdin.end();
proc.on('close', code => {
  // Exit 0 = allow, 2 = block
  assert.ok(code === 0 || code === 2, `unexpected exit code ${code}`);
});
```
### Memory TDD

Mock `MemoryRecord`. Test the confidence gate (threshold 0.7). Use atomic writes.
### Property-Based Testing

Use fast-check (and `@fast-check/vitest` for Vitest integration) for any function with invariants — not just routing. fast-check 3.x (2025) adds improved unicode, date, and bigint arbitraries.

Routing invariant (existing):
```javascript
import fc from 'fast-check';

fc.assert(
  fc.property(fc.string(), intent => {
    return typeof routeIntent(intent) === 'string';
  })
);
```
Memory serialization roundtrip (new):
```javascript
// Property: deserialize(serialize(x)) === x for all JSON-serializable values
fc.assert(
  fc.property(fc.jsonValue(), value => {
    const serialized = serializeMemoryRecord(value);
    const deserialized = deserializeMemoryRecord(serialized);
    return JSON.stringify(deserialized) === JSON.stringify(value);
  })
);
```
Hook validation invariant (new):
// Property: for any tool input, isValidInput(x) === !isBlocked(x)
// (validation and blocking must be inverses)
fc.assert(
fc.property(fc.record({ tool_name: fc.string(), tool_input: fc.object() }), input => {
const valid = isValidInput(input);
const blocked = wouldBlock(input);
return valid !== blocked || (!valid && blocked); // blocked implies invalid
})
);
Path normalization idempotency (new):
// Property: normalize(normalize(path)) === normalize(path) (idempotent)
fc.assert(
fc.property(fc.string(), rawPath => {
const once = normalizePath(rawPath);
const twice = normalizePath(once);
return once === twice;
})
);
Schema validation stability (new):
// Property: validate(schema, x) never throws uncaught exception for any input
fc.assert(
fc.property(fc.anything(), input => {
try {
validateSchema(schema, input);
return true;
} catch (e) {
return e instanceof ValidationError;
} // Only ValidationError allowed
})
);
### Contract Testing

Validate TaskUpdate metadata schemas (`processedReflectionIds: string[]`).
## Multi-Agent TDD Decomposition (2026 Standard)

TDFlow (arXiv:2510.23761) reports 94.3% on SWE-Bench Verified with specialized sub-agents, while monolithic TDD agents score 60–70%. Split the loop into specialized sub-agents:
| Role | Agent | Responsibility |
|---|---|---|
| Test Author | `qa` | Write failing test, commit test-only |
| Implementer | `developer` | Implement to green — MUST NOT modify tests |
| Verifier | `reflection-agent` | Detect test-hacking, verify RED→GREEN evidence |
Pattern:

1. QA agent writes test → commits test file alone (no implementation)
2. Developer agent implements → runs tests → commits implementation
3. Reflection agent reviews diff: if test assertions changed → FAIL (test-hacking)
Test-hacking detection: the reflection-agent checks `git diff HEAD~1 HEAD -- '*.test.*'` — any assertion change after the implementation commit = REJECT.
When to use: repository-scale TDD, complex features with multiple behaviors, any task where a single agent might rationalize test changes.
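The diff check can be sketched as a pure classifier over the patch text (the function name is illustrative); feed it the output of `git diff HEAD~1 HEAD -- '*.test.*'`:

```javascript
// Sketch: classify a git diff restricted to test files. Any added or removed
// content line after the implementation commit counts as test-hacking.
function detectTestHacking(diffText) {
  const changed = diffText
    .split('\n')
    // keep +/- content lines, drop the "+++"/"---" file header lines
    .filter(l => /^[+-]/.test(l) && !/^(\+\+\+|---)/.test(l));
  return { testHacking: changed.length > 0, changedLines: changed };
}
```

An empty diff (test files untouched) passes; any changed assertion line is surfaced as evidence for the rejection message.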
## TDAID Phase Mapping (Test-Driven AI-Assisted Development, 2025–2026)

TDAID extends classic TDD with explicit Planning and Validation gates:
| Phase | TDAID Label | Agent-Studio Owner | Description |
|---|---|---|---|
| 0 | Plan | `planner` | Thinking-model generates structured TDD plan with explicit test checkpoints before any code is written |
| 1 | Red | `qa` | Write failing test expressing desired behavior; human verifies failure is expected |
| 2 | Green | `developer` | Minimal implementation to pass test; MUST NOT modify test assertions |
| 3 | Refactor | `developer` | Improve code quality with all tests green |
| 4 | Validate | `reflection-agent` + verification-before-completion skill | Detect specification gaming; confirm implementation matches plan; human gate |
Key TDAID anti-patterns to detect in Validate phase:
- Deleting test assertions to make tests pass
- Hardcoding expected values
- Mocking away the behavior being tested
- Making implementation superficially compliant without satisfying the specification intent
Research basis: TDAID (awesome-testing.com, 2025), TDAD agent-to-agent variant (arXiv:2603.08806, 2026), TDFlow (arXiv:2510.23761, 2025)
## LSP Pre-RED Type Verification

Before writing a failing test, verify that the API contract exists, so RED means "fails due to missing behavior" rather than "fails due to a wrong API":
```
# Step 1: Find the target function's file + line
pnpm search:code "functionName"

# Step 2: Verify signature with LSP hover
lsp_hover({ filePath: "/abs/path/to/file.ts", line: 42, character: 10 })
# Returns: function signature, parameter types, return type

# Step 3: Write test using VERIFIED signature
# Now RED is guaranteed to fail due to missing behavior, not API mismatch
```
Rule: If `lsp_hover` returns empty (CJS file or LSP not active) → fall back to ripgrep (`rg -n "functionName" --type ts`) to read the actual signature.
When NOT needed: trivially new functions that don't exist yet (LSP has nothing to return).
## Contract Testing (Hook Boundaries — Expanded)

Hook contracts define the stdin/stdout JSON protocol. Test at the boundary:
```javascript
const { spawn } = require('child_process');

// Hook contract test pattern
const proc = spawn('node', ['.claude/hooks/routing/routing-guard.cjs'], { shell: false });
const input = JSON.stringify({
  tool_name: 'Edit',
  tool_input: { file_path: '.claude/agents/core/developer.md' },
});
proc.stdin.write(input);
proc.stdin.end();
// Assert: exit code 2 (block) for protected paths
// Assert: stdout JSON contains { allow: false, message: /Gate 4/ }
```
TaskUpdate metadata contract:

```javascript
// Validate the processedReflectionIds schema
const schema = {
  type: 'object',
  required: ['processedReflectionIds'],
  properties: { processedReflectionIds: { type: 'array', items: { type: 'string' } } },
  additionalProperties: false,
};
```
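A minimal hand-rolled check of this contract (in the repo a schema library such as Ajv could enforce it instead; `validateTaskUpdateMetadata` is an illustrative name):

```javascript
// Enforces the schema above: processedReflectionIds is required, must be an
// array of strings, and no additional properties are allowed.
function validateTaskUpdateMetadata(metadata) {
  if (typeof metadata !== 'object' || metadata === null || Array.isArray(metadata)) {
    return { valid: false, errors: ['metadata must be an object'] };
  }
  const errors = [];
  if (!Array.isArray(metadata.processedReflectionIds)) {
    errors.push('processedReflectionIds must be an array');
  } else if (!metadata.processedReflectionIds.every(x => typeof x === 'string')) {
    errors.push('processedReflectionIds items must be strings');
  }
  const extras = Object.keys(metadata).filter(k => k !== 'processedReflectionIds');
  if (extras.length > 0) errors.push(`unexpected keys: ${extras.join(', ')}`);
  return { valid: errors.length === 0, errors };
}
```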
Agent-Studio hook contracts to test:

- `routing-guard.cjs`: blocks Task without `task_id` (exit 2)
- `unified-creator-guard.cjs`: blocks Write to `.claude/skills/**/SKILL.md` (exit 2)
- `spawn-token-guard.cjs`: warns at 80K tokens (exit 0 + message)
## Test Runner Selection (node --test vs Vitest 4)

Agent Studio uses `node --test` (the built-in Node.js test runner) as the default for all `.cjs` CommonJS files (hooks, lib, scripts). Vitest 4 is the recommended runner for ESM/TypeScript files.
| Runner | Use When | Command |
|---|---|---|
| `node --test` | `.cjs` hooks, lib, CommonJS scripts — current Agent Studio standard | `node --test tests/**/*.test.cjs` |
| `vitest` | `.ts`, `.mts`, ESM `.js` files — use when migrating to TypeScript | `pnpm vitest run` |
Why `node --test` for `.cjs`: Vitest requires Vite configuration and ESM-compatible modules. Agent Studio hooks use `require()` and CommonJS — `node --test` works without transpilation.

Why Vitest 4 for `.ts`/ESM: boot time drops from ~8s (Jest) to ~1.2s (Vitest). First-class TypeScript + ESM support, Browser Mode (stable in v4), and a Jest-compatible `describe`/`it`/`expect` API (migration = config change only).

Anti-pattern: Do NOT use Jest for new files. Vitest is the 2025–2026 standard for ESM/TypeScript.
```bash
# Current Agent Studio pattern (CJS hooks and lib)
node --test tests/lib/routing/routing-table.test.cjs

# Future ESM/TypeScript pattern
pnpm vitest run tests/lib/routing/routing-table.test.ts
```
## AI Output Evaluation Testing (Non-Deterministic Agents)

LLM/agent outputs are non-deterministic — binary pass/fail assertions are insufficient. Use score-based evaluation and tool-call sequence validation instead.
### Score-Based Assertion Pattern
```javascript
// Agent output evaluation — score dimensions 0.0-1.0
function evaluateAgentOutput(output, expectations) {
  const scores = {
    relevance: scoreRelevance(output, expectations.topic), // 0.0-1.0
    safety: scoreSafety(output), // 0.0-1.0
    faithfulness: scoreFaithfulness(output, expectations.facts), // 0.0-1.0
    format: scoreFormat(output, expectations.schema), // 0.0-1.0
  };
  const overall = Object.values(scores).reduce((a, b) => a + b, 0) / Object.keys(scores).length;
  return { scores, overall, pass: overall >= 0.75 };
}

// Test: agent output meets quality threshold
test('researcher agent output is relevant and safe', () => {
  const result = evaluateAgentOutput(agentOutput, { topic: 'TDD patterns', facts: knownFacts });
  expect(result.scores.safety).toBeGreaterThanOrEqual(0.9); // Hard floor for safety
  expect(result.overall).toBeGreaterThanOrEqual(0.75); // 75% overall threshold
});
```
### Tool-Call Sequence Validation

For agent tests, validate the sequence and count of tool calls, not just the final output:
```javascript
// Spy on tool calls and assert ordering
const toolCallLog = [];
const mockTaskUpdate = jest.fn(args => {
  toolCallLog.push({ tool: 'TaskUpdate', args });
});
const mockBash = jest.fn(args => {
  toolCallLog.push({ tool: 'Bash', args });
});

// Run the agent under test with mocked tools
await runAgent({ TaskUpdate: mockTaskUpdate, Bash: mockBash });

// Assert: TaskUpdate(in_progress) called BEFORE TaskUpdate(completed)
const inProgressIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'in_progress'
);
const completedIdx = toolCallLog.findIndex(
  c => c.tool === 'TaskUpdate' && c.args.status === 'completed'
);
expect(inProgressIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(completedIdx).toBeGreaterThanOrEqual(0); // Must have been called
expect(inProgressIdx).toBeLessThan(completedIdx); // Ordering enforced
```
Rule: Never test the text content of LLM-generated prose. Test structure, schema validity, tool-call sequences, and score thresholds.
Reference: Simon Willison (2025) — "Red/Green TDD for agents: write assertions on tool-call sequences and structured outputs."
## MSW v2 HTTP Mocking (API Boundary Testing)

Use MSW (Mock Service Worker) v2 to test skills and agents that make external HTTP calls. MSW intercepts at the network level — no monkey-patching of `fetch`, no code changes in production.

```bash
pnpm add -D msw@2
```
### Setup Pattern (Node.js / Vitest)
```javascript
import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';

// Define handlers — these describe the expected API contract
const handlers = [
  http.get('https://api.example.com/search', ({ request }) => {
    const url = new URL(request.url);
    return HttpResponse.json({
      results: [{ id: 1, title: `Result for: ${url.searchParams.get('q')}` }],
    });
  }),
];

const server = setupServer(...handlers);

beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

// Test: researcher skill makes an HTTP call and processes the response
test('researcher skill fetches and parses search results', async () => {
  const results = await researcherSkill.search('TDD patterns 2026');
  expect(results).toHaveLength(1);
  expect(results[0].title).toContain('TDD patterns');
});
```
### Override Per-Test for Error Cases

```javascript
test('researcher skill handles 503 gracefully', async () => {
  server.use(
    http.get('https://api.example.com/search', () => HttpResponse.json({}, { status: 503 }))
  );
  const results = await researcherSkill.search('TDD patterns');
  expect(results).toEqual([]); // Graceful empty fallback
});
```
Key benefits over manual mocking:

- Tests exercise real HTTP client code paths (not mocked abstractions)
- `onUnhandledRequest: 'error'` catches unintentional external calls during tests
- Handlers define request/response contracts — doubles as documentation
Agent-Studio targets for MSW boundary tests:

- `researcher` skill → WebSearch/WebFetch HTTP calls
- `github-ops` skill → GitHub API calls
- Any agent using `mcp__Exa__web_search_exa` or `WebFetch`
## Mutation Testing (Stryker JS)

Mutation testing validates test QUALITY, not just coverage. Run it after achieving 100% line coverage.
### Stryker + Vitest (2026 Standard — ESM/TypeScript projects)

```bash
# Install (once per project) — use vitest-runner for ESM/TypeScript
pnpm add -D @stryker-mutator/core @stryker-mutator/vitest-runner vitest
```
```javascript
// stryker.config.mjs — working configuration for Vitest projects
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  testRunner: 'vitest',
  vitest: {
    configFile: 'vitest.config.ts', // optional: path to your vitest config
    related: true, // default: run only tests related to the mutated file
  },
  thresholds: { high: 80, low: 60, break: 50 },
  reporters: ['html', 'progress'],
};
```
```bash
# Run mutation tests (use incremental to speed up local loops)
pnpm stryker run --incremental

# Target threshold: >80% mutation score
# Score = (killed mutations / total mutations) × 100
```
Vitest runner limitations (StrykerJS 7.x):

- Browser Mode not supported — threads mode only
- Always uses `perTest` coverage analysis (ignores the `coverageAnalysis` config)
- For `.cjs` files using `node --test`, use `@stryker-mutator/jest-runner` as a fallback
### Stryker + node:test (CommonJS/.cjs projects)

```bash
pnpm add -D @stryker-mutator/core @stryker-mutator/jest-runner
```
Interpret results:
- Killed — test suite caught the mutation ✓
- Survived — test suite MISSED this code path (add assertion)
- No coverage — no test exercises this line at all (add test)
When to run: after completing a TDD cycle for security-critical code (hooks, validators, routing logic). Not required for all code — prioritize by risk.
Agent-Studio priority targets for mutation testing:

- `.claude/hooks/routing/routing-guard.cjs`
- `.claude/hooks/safety/unified-creator-guard.cjs`
- `.claude/lib/routing/routing-table.cjs`