systematic-debugging
Systematic Debugging
You are investigating a bug using a structured four-phase process. The goal is to find and fix the root cause, not just suppress the symptom.
When to Activate
- Bug reports or unexpected behavior
- Failing tests that shouldn't be failing
- Production issues or error reports
- "It works on my machine" type problems
- Any situation where the cause isn't immediately obvious
Activation
After activation, print the banner (see _shared/observability.md):
---
**Systematic Debugging** activated
Trigger: [describe the symptom in your own words — e.g., "test failure in auth flow" or "unexpected 500 on endpoint"]
Produces: root cause analysis, regression test, fix
---
Phase 1: Reproduce
Narrate: Phase 1/4: Reproducing the problem...
Before anything else, reliably reproduce the problem.
Steps
- Understand the report — What's expected? What's happening instead? When did it start?
- Reproduce locally — Run the exact steps that trigger the bug
- Document the reproduction:
## Reproduction Steps: [exact steps] Expected: [what should happen] Actual: [what happens instead] Frequency: [always / intermittent / specific conditions] Environment: [OS, Node version, browser, etc.] - If intermittent: Identify conditions that increase frequency (load, timing, specific data)
If You Can't Reproduce
- Check environment differences (versions, config, data)
- Add logging to narrow down when/where it occurs
- Ask the reporter for more details
- Do NOT proceed to fixing without reproduction — you'll likely fix the wrong thing
If reproduction fails after reasonable effort, use error recovery (see _shared/observability.md). AskUserQuestion with options: "Add more logging and retry / Get more info from reporter / Proceed with best guess (risky)."
Narrate: Phase 1/4: Reproducing the problem... done
Phase 2: Isolate
Narrate: Phase 2/4: Isolating the bug...
Narrow down where the bug lives.
Binary Search Strategy
- Identify the boundaries — Which systems/modules are involved in the failing flow?
- Add assertions at boundaries — Verify inputs and outputs at each layer
- Bisect — Is the problem in the frontend or backend? In the handler or the model? In the query or the transformation?
- Narrow until you find the exact location — The specific function, line, or condition
Isolation Techniques
- Git bisect — If the bug is a regression, find the commit that introduced it:
git bisect start,git bisect bad,git bisect good [known-good-commit] - Minimal reproduction — Strip away unrelated code until you have the smallest case that fails
- Logging — Add targeted logging (not scattershot
console.logeverywhere) - Assertion insertion — Add
assertstatements at key points to catch violations
What to Record
## Isolation
Location: [file:line or module]
Narrowed from: [original scope] → [isolated location]
Method: [how you narrowed it down]
Narrate: Phase 2/4: Isolating the bug... done
Phase 3: Analyze Root Cause
Narrate: Phase 3/4: Analyzing root cause...
Understand why the bug exists, not just where.
Root Cause Categories
- Logic error — Code does the wrong thing (off-by-one, wrong operator, missing case)
- State corruption — Data gets into an invalid state (race condition, missing validation, stale cache)
- Integration mismatch — Two components disagree on interface, format, or protocol
- Environment issue — Config, version, or platform difference
- Missing handling — Edge case, error path, or boundary not covered
Analysis Questions
- Why does this code exist? What was the original intent?
- What changed recently? (Check git log for the area)
- Are there similar patterns elsewhere that work correctly? What's different?
- Is this the only place this bug manifests, or is it a systemic issue?
Log the root cause decision (see _shared/observability.md Decision Log format):
Decision: Root cause is [category] — [specific cause] Reason: [evidence from isolation phase] Alternatives: [other hypotheses that were ruled out]
Defense in Depth
After identifying the root cause, ask:
- Why wasn't this caught by tests?
- Why wasn't this caught by types?
- Why wasn't this caught by code review?
- What other similar bugs might exist?
## Root Cause
Cause: [precise description]
Category: [logic / state / integration / environment / missing handling]
Why it wasn't caught: [gap in testing, types, or review]
Related risks: [other places with similar patterns]
Narrate: Phase 3/4: Analyzing root cause... done
Phase 4: Fix
Narrate: Phase 4/4: Applying fix...
Fix the root cause and prevent regression.
Fix Process
- Write a failing test that reproduces the bug (TDD red phase)
- Verify the test fails — It should fail for the same reason as the bug
- Apply the minimal fix — Change as little as possible
- Verify the test passes — The fix resolves the bug
- Run the full test suite — The fix doesn't break anything else
- Check for related issues — Fix similar patterns elsewhere if applicable
Log defense-in-depth decisions:
Decision: [which defense-in-depth measures to add] Reason: [what class of bug this prevents] Alternatives: [other measures considered]
Defense-in-Depth Fixes
Beyond the immediate fix, consider:
- Add type constraints that would prevent this class of bug
- Add validation at the boundary where bad data enters
- Add assertions that would catch this sooner
- Improve error messages so the next person gets a clearer signal
Condition-Based Waiting
If the bug involves timing or async behavior:
- Never use arbitrary delays (
sleep 5,setTimeout(5000)) - Wait for conditions: poll for the expected state, with a timeout
- Example: Instead of
sleep 2 && check_result, usei=0; while ! check_result && [ $i -lt 100 ]; do sleep 0.1; i=$((i+1)); done
Fix Report
## Fix
Change: [what was changed and why]
Files: [list of modified files]
Test: [the regression test that was added]
Defense: [additional protective measures added]
Related fixes: [similar patterns fixed elsewhere, or "none"]
Narrate: Phase 4/4: Applying fix... done
Completion
Present the full debugging report:
**Debugging summary.**
Root cause: [one-line root cause]
Fix: [one-line fix description]
Bug: [one-line description]
Reproduction → Isolation → Analysis → Fix
[Link to each section above]
Regression test: [test name and how to run it]
Tests: All passing
Build: Clean
Handoff
After the completion report, print this completion marker exactly:
**Debugging complete.**
Artifacts:
- Regression test: [test file path]
- Files changed: [list]
- Defense-in-depth: [additions, or "none"]
Compound learning: [If this revealed a pattern, it should be captured via compound-learnings]
Returning to the calling workflow.
Rules
- Never fix a bug you can't reproduce — you're guessing
- Never suppress a symptom without understanding the root cause
- Always add a regression test — bugs that come back are morale killers
- Log your debugging process — it helps others (and future you) with similar bugs
- Don't scatter
console.log— add targeted, informative logging - Remove debugging artifacts (extra logs, temporary assertions) before shipping, unless they improve production observability
- If the bug is in a dependency, document the workaround and file an upstream issue