Test-Driven Development (TDD)
Strict Red-Green-Refactor workflow for robust, self-documenting, production-ready code.
Quick Navigation
| Situation | Go To |
|---|---|
| New to this codebase | Step 1: Explore Environment |
| Know the framework, starting work | Step 2: Select Mode |
| Need the core loop reference | Step 3: Core TDD Loop |
| Complex edge cases to cover | Property-Based Testing |
| Tests are flaky/unreliable | Flaky Test Management |
| Need isolated test environment | Hermetic Testing |
| Measuring test quality | Mutation Testing |
The Three Rules (Robert C. Martin)
- Write no production code except to make a failing test pass
- Write only enough of a test to fail (compilation errors count as failures)
- Write only enough production code to pass the failing test (no optimizations yet)
The Loop: 🔴 RED (write failing test) → 🟢 GREEN (minimal code to pass) → 🔵 REFACTOR (clean up) → Repeat
Step 1: Explore Test Environment
Do NOT assume anything. Explore the codebase first.
Checklist:
- Search for test files: `glob("**/*.test.*")`, `glob("**/*.spec.*")`, `glob("**/test_*.py")`
- Check `package.json` scripts, `Makefile`, or CI workflows
- Look for config files: `vitest.config.*`, `jest.config.*`, `pytest.ini`, `Cargo.toml`
Framework Detection:
| Language | Config Files | Test Command |
|---|---|---|
| Node.js | `package.json`, `vitest.config.*` | `npm test`, `bun test` |
| Python | `pyproject.toml`, `pytest.ini` | `pytest` |
| Go | `go.mod`, `*_test.go` | `go test ./...` |
| Rust | `Cargo.toml` | `cargo test` |
Step 2: Select Mode
| Mode | When | First Action |
|---|---|---|
| New Feature | Adding functionality | Read existing module tests, confirm green baseline |
| Bug Fix | Reproducing issue | Write failing reproduction test FIRST |
| Refactor | Cleaning code | Ensure ≥80% coverage on target code |
| Legacy | No tests exist | Add characterization tests before changing |
Tie-breaker: If coverage <20% or tests absent → use Legacy Mode first.
Mode: New Feature
- Read existing tests for the module
- Run tests to confirm green baseline
- Enter Core Loop for new behavior
- Commits: `test(module): add test for X` → `feat(module): implement X`
Mode: Bug Fix
- Write failing reproduction test (MUST fail before fix)
- Confirm failure is assertion error, not syntax error
- Write minimal fix
- Run full test suite
- Commits: `test: add failing test for bug #123` → `fix: description (#123)`
Mode: Refactor
- Run coverage on the specific function you'll refactor
- If coverage <80% → add characterization tests first
- Refactor in small steps (ONE change → run tests → repeat)
- Never change behavior during a refactor
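A characterization test pins current behavior before any refactoring begins. A minimal sketch, using a hypothetical `legacy_discount` function:

```python
# Characterization-test sketch: pin down what the code does TODAY,
# including boundaries, before touching its structure.
def legacy_discount(total):
    if total > 100:          # note: > not >= -- a boundary worth pinning
        return total * 0.5
    return total

assert legacy_discount(100) == 100     # exactly 100: no discount (current behavior)
assert legacy_discount(101) == 50.5    # just over the boundary: half price
assert legacy_discount(50) == 50
# Refactor in small steps; these assertions must stay green after each step.
```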
Mode: Legacy Code
- Find Seams: places where behavior can be altered for testing without editing the code in place (used for sensing and for separating dependencies)
- Break Dependencies - use Sprout Method or Wrap Method
- Add characterization tests (capture current behavior)
- Build safety net: happy path + error cases + boundaries
- Then apply TDD for your changes
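The Sprout Method can be shown in a few lines. The `normalize_email` and `legacy_register_user` names below are hypothetical: instead of editing untested legacy code in place, "sprout" the new behavior into a new, fully tested function and call it from the legacy code with a one-line change.

```python
# Sprout Method sketch (hypothetical legacy registration flow).
def normalize_email(raw):
    # Sprouted function: small, pure, developed with TDD.
    return raw.strip().lower()

def legacy_register_user(raw_email, db):
    email = normalize_email(raw_email)   # the only change to legacy code
    db.append(email)

# The sprouted function is directly testable in isolation:
assert normalize_email("  Ada@Example.COM ") == "ada@example.com"

# And the legacy path still works end to end:
db = []
legacy_register_user(" Ada@Example.COM ", db)
assert db == ["ada@example.com"]
```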
→ See references/examples.md for full code examples of each mode.
Step 3: The Core TDD Loop
Before Starting: Scenario List
List all behaviors to cover:
- Happy path cases
- Edge cases and boundaries
- Error/failure cases
- Pessimism: 3 ways this could fail (network, null, invalid state)
🔴 RED Phase
- Write ONE test (single behavior or edge case)
- Use AAA: Arrange → Act → Assert
- Run test, verify it FAILS for expected reason
Checks:
- Is the failure an assertion error? (Not a `SyntaxError`/`ModuleNotFoundError`)
- Can I explain why this test should fail?
- If test passes immediately → STOP. Test is broken or feature exists.
🟢 GREEN Phase
- Write minimal code to pass
- Do NOT implement "perfect" solution
- Verify test passes
Checks:
- Is this the simplest solution?
- Can I delete any of this code and still pass?
🔵 REFACTOR Phase
- Look for duplication, unclear names, magic values
- Clean up without changing behavior
- Verify tests still pass
Repeat
Select next scenario, return to RED.
Triangulation: If implementation is too specific (hardcoded), write another test with different inputs to force generalization.
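Triangulation in miniature, using a hypothetical `add()` feature: the first GREEN may legitimately be hardcoded; a second test with different inputs forces the general implementation.

```python
# Triangulation sketch (hypothetical add() feature).
def add_v1(a, b):
    return 4                  # minimal GREEN: passes the only test so far

assert add_v1(2, 2) == 4      # first test: green, but suspiciously specific

# A second test with different inputs would fail add_v1,
# forcing generalization:
def add(a, b):
    return a + b

assert add(2, 2) == 4
assert add(1, 5) == 6         # triangulation: the hardcoded answer no longer works
```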
Stop Conditions
| Signal | Response |
|---|---|
| Test passes immediately | Check assertions, verify feature isn't already built |
| Test fails for wrong reason | Fix setup/imports first |
| Flaky test | STOP. Fix non-determinism immediately |
| Slow feedback (>5s) | Optimize or mock external calls |
| Coverage decreased | Add tests for uncovered paths |
Test Distribution: The Testing Trophy
The Testing Trophy (Kent C. Dodds) reflects modern testing reality: integration tests give the best confidence-to-effort ratio.
```
       _______________
      /    System     \     ← Few, slow, high confidence; brittle (E2E)
     /_________________\
    /                   \
   /     Integration     \  ← Real interactions between units — BEST ROI
   \                     /
    \___________________/
      \     Unit      /     ← Fast & cheap, but tests in isolation
       \_____________/
       /    Static    \     ← Typecheck, linting — typos/types
      /_______________\
```
Layer Breakdown
| Layer | What | Tools | When |
|---|---|---|---|
| Static | Type errors, syntax, linting | TypeScript, ESLint | Always on, catches 50%+ of bugs for free |
| Unit | Pure functions, algorithms, utilities | vitest, jest, pytest | Isolated logic with no dependencies |
| Integration | Components + hooks + services together | Testing Library, MSW, Testcontainers | Real user flows, real(ish) data |
| E2E | Full app in browser | Playwright, Cypress | Critical paths only (login, checkout) |
Why Integration Tests Win
Unit tests prove code works in isolation. Integration tests prove code works together.
| Concern | Unit Test | Integration Test |
|---|---|---|
| Component renders | ✅ | ✅ |
| Component + hook works | ❌ | ✅ |
| Component + API works | ❌ | ✅ |
| User flow works | ❌ | ✅ |
| Catches real bugs | Sometimes | Usually |
The insight: Most bugs live in the seams between modules, not inside pure functions. Integration tests catch seam bugs; unit tests don't.
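A tiny illustration of a seam bug, using hypothetical `price_in_cents` and `format_price` functions: each unit passes its own tests, but they disagree about units where they meet.

```python
# Seam-bug sketch (hypothetical cents/dollars mismatch at a module boundary).
def price_in_cents(item):
    return {"book": 1250}[item]        # contract: returns CENTS

def format_price(dollars):
    return f"${dollars:.2f}"           # contract: expects DOLLARS

# Unit tests: both pass in isolation.
assert price_in_cents("book") == 1250
assert format_price(12.50) == "$12.50"

# Integration test: exercises the seam. Naive wiring --
# format_price(price_in_cents("book")) -- would print "$1250.00".
assert format_price(price_in_cents("book") / 100) == "$12.50"
```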
Practical Guidance
- Start with integration tests - Test the way users use your code
- Drop to unit tests for complex algorithms or edge cases
- Use E2E sparingly - Slow, flaky, expensive to maintain
- Let static analysis do the heavy lifting - TypeScript catches more bugs than most unit tests
- Prefer fakes over mocks - Fakes have real behavior; mocks just return canned data
- SMURF quality (Google Testing Blog): Speed, Maintainability, Utilization, Reliability, Fidelity
Anti-Patterns
| Pattern | Problem | Fix |
|---|---|---|
| Mirror Blindness | Same agent writes test AND code | State test intent before GREEN |
| Happy Path Bias | Only success scenarios | Include errors in Scenario List |
| Refactoring While Red | Changing structure with failing tests | Get to GREEN first |
| The Mockery | Over-mocking hides bugs | Prefer fakes or real implementations |
| Coverage Theater | Tests without meaningful assertions | Assert behavior, not lines |
| Multi-Test Step | Multiple tests before implementing | One test at a time |
| Verification Trap 🤖 | AI tests what the code does, not what it should do | State intent in plain language; separate agent review |
| Test Exploitation 🤖 | LLMs exploit weak assertions or overload operators | Use PBT alongside examples; strict equality |
| Assertion Omission 🤖 | Missing edge cases (null, undefined, boundaries) | Scenario list with errors; test.each |
| Hallucinated Mock 🤖 | AI generates fake mocks without proper setup | Testcontainers for integration; real Fakes for unit |
Critical: Verify tests by (1) running them, (2) having separate agent review, (3) never trusting generated tests blindly.
Advanced Techniques
Use these techniques at specific points in your workflow:
| Technique | Use During | Purpose |
|---|---|---|
| Test Doubles | 🔴 RED phase | Isolate dependencies when writing tests |
| Property-Based Testing | 🔴 RED phase | Cover edge cases for complex logic |
| Contract Testing | 🔴 RED phase | Define API expectations between services |
| Snapshot Testing | 🔴 RED phase | Capture UI/response structure |
| Hermetic Testing | 🔵 Setup | Ensure test isolation and determinism |
| Mutation Testing | ✅ After GREEN | Validate test suite effectiveness |
| Coverage Analysis | ✅ After GREEN | Find untested code paths |
| Flaky Test Management | 🔧 Maintenance | Fix unreliable tests blocking CI |
Test Doubles (Use: Writing Tests with Dependencies)
When: Your code depends on something slow, unreliable, or complex (DB, API, filesystem).
| Type | Purpose | When |
|---|---|---|
| Stub | Returns canned answers | Need specific return values |
| Mock | Verifies interactions | Need to verify calls made |
| Fake | Simplified implementation | Need real behavior without cost |
| Spy | Records calls | Need to observe without changing |
Decision: Dependency slow/unreliable? → Fake (complex) or Stub (simple). Need to verify calls? → Mock/Spy. Otherwise → real implementation.
→ See references/examples.md → Test Double Examples
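A stdlib-only sketch of the double types above, using `unittest.mock`; the `Clock` dependency and `FakeKeyValueStore` are hypothetical examples.

```python
# Test-double sketches (stdlib unittest.mock; names are illustrative).
from unittest.mock import Mock

# Stub: canned answer for a specific return value.
clock = Mock()
clock.now.return_value = "2024-01-01T00:00:00Z"
assert clock.now() == "2024-01-01T00:00:00Z"

# Mock/Spy: verify the interaction actually happened.
clock.now.assert_called_once()

# Fake: a simplified but genuinely working implementation.
class FakeKeyValueStore:
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

store = FakeKeyValueStore()
store.put("user:1", "Ada")
assert store.get("user:1") == "Ada"
```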
Hermetic Testing (Use: Test Environment Setup)
When: Setting up test infrastructure. Tests must be isolated and deterministic.
Principles:
- Isolation: Unique temp directories/state per test
- Reset: Clean up in setUp/tearDown
- Determinism: No time-based logic or shared mutable state
Database Strategies:
| Strategy | Speed | Fidelity | Use When |
|---|---|---|---|
| In-memory (SQLite) | Fast | Low | Unit tests, simple queries |
| Testcontainers | Medium | High | Integration tests |
| Transactional Rollback | Fast | High | Tests sharing schema (80x faster than TRUNCATE) |
→ See references/examples.md → Hermetic Testing Examples
Property-Based Testing (Use: Writing Tests for Complex Logic)
When: Writing tests for algorithms, state machines, serialization, or code with many edge cases.
Tools: fast-check (JS/TS), Hypothesis (Python), proptest (Rust)
Properties to Test:
- Commutativity: `f(a, b) == f(b, a)`
- Associativity: `f(f(a, b), c) == f(a, f(b, c))`
- Identity: `f(a, identity) == a`
- Round-trip: `decode(encode(x)) == x`
- Metamorphic: If input changes by X, output changes by Y (useful when you don't know the expected output)
How: Replace multiple example-based tests with one property test that generates random inputs.
Critical: Always log the seed on failure. Without it, you cannot reproduce the failing case.
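A hand-rolled round-trip property, stdlib only. Hypothesis and fast-check automate generation and shrinking; this sketch just shows the idea, including logging a reproducible seed on failure.

```python
# Round-trip property sketch: decode(encode(x)) == x for generated inputs.
import json
import random

SEED = 1234                    # logged in the failure message for reproduction
rng = random.Random(SEED)

for case in range(100):
    value = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 10))]
    assert json.loads(json.dumps(value)) == value, f"seed={SEED} case={case}"
```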
→ See references/examples.md → Property-Based Testing Examples
Mutation Testing (Use: Validating Test Quality)
When: After tests pass, to verify they actually catch bugs. Use for critical code (auth, payments) or before major refactors.
Tools: Stryker (JS/TS), PIT (Java), mutmut (Python)
How: The tool mutates your code (e.g., changes `>` to `>=`). If tests still pass → your tests are weak.
Interpretation:
- >80% mutation score = good test suite
- Survived mutants = tests don't catch those changes → add tests for these
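The intuition, with a hand-applied mutant (real tools like Stryker or mutmut generate these automatically); `is_adult` is a hypothetical example.

```python
# Mutation-testing intuition: does any test distinguish code from mutant?
def is_adult(age):
    return age > 17

def is_adult_mutant(age):
    return age >= 17           # mutant: boundary operator changed

# A test far from the boundary cannot tell them apart -- mutant "survives":
assert is_adult(30) == is_adult_mutant(30)

# A boundary test "kills" the mutant -- behaviors diverge at exactly 17:
assert is_adult(17) is False
assert is_adult_mutant(17) is True
```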
Equivalent Mutant Problem: Some mutants change syntax but not behavior (e.g., `i < 10` → `i != 10` in a loop where `i` only increments). These can't be killed, so a 100% score is often impossible. Focus on surviving mutants in critical paths, not chasing perfect scores.
When NOT to use: Tool-generated code (OpenAPI clients, Protobuf stubs, ORM models), simple DTOs/getters, legacy code with slow tests, or CI pipelines that must finish in <5 minutes. Use `--incremental --since main` for PR-focused runs. Note: This does NOT mean skip mutation testing on code you (the agent) wrote: always validate your own work.
→ See references/examples.md → Mutation Testing Examples
Flaky Test Management (Use: CI/CD Maintenance)
When: Tests fail intermittently, blocking CI or eroding trust in the test suite.
Root Causes:
| Cause | Fix |
|---|---|
| Timing (`setTimeout`, races) | Fake timers, `await` properly |
| Shared state | Isolate per test |
| Randomness | Seed or mock |
| Network | Use MSW or fakes |
| Order dependency | Make tests independent |
| Parallel transaction conflicts | Isolate DB connections per worker |
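The "seed or mock randomness" fix in the table above, sketched with a hypothetical `shuffle_deck` function: production keeps its nondeterministic default, while tests inject a seeded RNG.

```python
# Deterministic-randomness sketch: inject a seeded RNG so the test cannot flake.
import random

def shuffle_deck(rng=None):
    rng = rng or random.Random()       # production: nondeterministic default
    deck = list(range(52))
    rng.shuffle(deck)
    return deck

# Tests pass an explicit seed: same seed, same shuffle, every run.
assert shuffle_deck(random.Random(42)) == shuffle_deck(random.Random(42))
```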
How: Detect (`--repeat 10`) → Quarantine (separate suite) → Fix root cause → Restore
Quarantine Rules:
- Issue-linked: Every quarantined test MUST link to a tracking issue. Prevents "quarantine-and-forget."
- Mute, don't skip: Prefer muting (runs but doesn't fail build) over skipping. You still collect failure data.
- Reintroduction criteria: Test must pass N consecutive runs (e.g., 100) on main before leaving quarantine.
→ See references/examples.md → Flaky Test Examples
Contract Testing (Use: Writing Tests for Service Boundaries)
When: Writing tests for code that calls or exposes APIs. Prevents integration breakage.
How (Pact): Consumer defines expected interactions → Contract published → Provider verifies → CI fails if contract broken.
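The idea behind the Pact flow, as a hand-rolled sketch (this is NOT Pact's real API): the consumer pins the response shape it depends on; the provider verifies its responses still satisfy that contract in CI.

```python
# Contract-testing intuition (hypothetical consumer/provider pair).
consumer_contract = {"id": int, "name": str}   # fields the consumer reads

def provider_response():
    return {"id": 1, "name": "Ada", "extra": "consumer ignores this"}

def satisfies(contract, response):
    # Every contracted field must exist with the expected type.
    return all(field in response and isinstance(response[field], expected)
               for field, expected in contract.items())

assert satisfies(consumer_contract, provider_response())
# Renaming "name" to "full_name" on the provider side would fail this check,
# flagging the breakage before any consumer sees it in production.
```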
→ See references/examples.md → Contract Testing Examples
Coverage Analysis (Use: Finding Gaps After Tests Pass)
When: After writing tests, to find untested code paths. NOT a goal in itself.
| Metric | Measures | Threshold |
|---|---|---|
| Line | Lines executed | 70-80% |
| Branch | Decision paths | 60-70% |
| Mutation | Test effectiveness | >80% |
Risk-Based Prioritization: P0 (auth, payments) → P1 (core logic) → P2 (helpers) → P3 (config)
Warning: High coverage ≠ good tests. Tests must assert meaningful behavior.
Snapshot Testing (Use: Writing Tests for UI/Output Structure)
When: Writing tests for UI components, API responses, or error message formats.
Appropriate: UI structure, API response shapes, error formats. Avoid: Behavior testing, dynamic content, entire pages.
How: Capture output once, verify it doesn't change unexpectedly. Always review diffs carefully.
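The capture-then-compare mechanism in miniature, with a hypothetical `render_error` function; snapshot frameworks persist the captured value to disk and show a reviewable diff on mismatch.

```python
# Snapshot intuition sketch: serialize once, fail later runs if structure drifts.
import json

def render_error(code):
    return {"error": {"code": code, "retryable": code >= 500}}

# First run: capture and store the snapshot.
snapshot = json.dumps(render_error(503), sort_keys=True)

# Later runs: compare fresh output against the stored snapshot.
assert json.dumps(render_error(503), sort_keys=True) == snapshot
```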
→ See references/examples.md → Snapshot Testing Examples
Integration with Other Skills
| Task | Skill | Usage |
|---|---|---|
| Committing | git-commit | `test:` for RED, `feat:` for GREEN |
| Code Quality | code-quality | Run during REFACTOR phase |
| Documentation | docs-check | Check if behavior changes need docs |
References
Foundational:
- Three Rules of TDD - Robert C. Martin
- Test Pyramid - Martin Fowler
- Testing Trophy - Kent C. Dodds
- Working Effectively with Legacy Code - Michael Feathers
Tools: Testcontainers | fast-check | Stryker | MSW | Pact