systematic-debugging
Systematic Debugging
Overview
Debugging is investigation, not experimentation. This skill enforces a rigorous 4-phase process — root cause investigation, pattern analysis, hypothesis testing, and architecture questioning — that prevents shotgun debugging and ensures every fix is understood before it is applied.
Announce at start: "I'm using the systematic-debugging skill to investigate this issue."
Core Principle
┌─────────────────────────────────────────────────────────────────┐
│ HARD-GATE: NEVER GUESS. NEVER SHOTGUN DEBUG. │
│ NEVER CHANGE CODE WITHOUT UNDERSTANDING WHY IT IS BROKEN. │
│ │
│ You are a detective gathering evidence, not a gambler trying │
│ random fixes. If you are changing code without understanding │
│ the root cause, STOP immediately. │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Root Cause Investigation
Goal: Understand exactly WHAT is happening, not what you think is happening.
Actions
- Read the error message carefully. The entire message. Every line. Including the stack trace.
- Reproduce the bug. If you cannot reproduce it, you cannot fix it. Find the exact steps.
- Gather evidence. Collect:
- Full error message and stack trace
- Input that triggers the bug
- Expected behavior vs actual behavior
- Environment details (versions, config, OS)
- Check recent changes. What changed since this last worked?
- Recent commits (`git log`, `git diff`)
- Dependency updates
- Configuration changes
- Environment changes
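Evidence capture can be partly automated so nothing is truncated or forgotten. A minimal sketch (the failing dictionary lookup is a hypothetical stand-in for whatever operation actually breaks):

```python
import platform
import sys
import traceback


def capture_evidence(exc: BaseException) -> dict:
    """Bundle the raw facts Phase 1 asks for into one record."""
    return {
        # Full error message -- never truncated
        "error": repr(exc),
        # Complete stack trace; read it bottom to top when analyzing
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        # Environment details that often explain "works on my machine"
        "python": sys.version,
        "platform": platform.platform(),
    }


try:
    {}["missing"]  # hypothetical stand-in for the failing operation
except KeyError as err:
    evidence = capture_evidence(err)
```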
Evidence Gathering Checklist
- Full error message captured (not truncated)
- Stack trace read from bottom to top
- Bug reproduced reliably with specific steps
- Expected vs actual behavior documented
- Recent changes reviewed (`git log --oneline -20`)
- Relevant logs examined
STOP — HARD-GATE: Do NOT proceed to Phase 2 until:
- You can reproduce the bug consistently
- You have the full error message and stack trace
- You know what changed recently
- You can describe the bug precisely (not vaguely)
Phase 2: Pattern Analysis
Goal: Narrow down WHERE the problem lives and WHEN it occurs.
Actions
- Find working examples. Does this feature work in other contexts? With other inputs? In other environments?
- Compare working vs broken. What is different between the case that works and the case that does not?
- Check dependencies. Are all required services/libraries/configs present and correct?
- Isolate the scope. Can you reproduce with a minimal example? Strip away everything non-essential.
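Stripping a failing case down to the smallest reproducible example can be semi-automated with a greedy shrink loop. A sketch, where `fails` is a hypothetical predicate standing in for your reproduction check:

```python
def shrink(items, fails):
    """Greedily drop elements while the failure still reproduces.

    `fails` returns True when the bug occurs on the given input --
    a stand-in for running your reproduction script.
    """
    i = 0
    while i < len(items):
        candidate = items[:i] + items[i + 1:]
        if fails(candidate):
            items = candidate  # still broken without this element: drop it
        else:
            i += 1  # element is needed to reproduce: keep it
    return items


# Toy example: the "bug" triggers whenever 13 is present in the input.
minimal = shrink(list(range(20)), lambda xs: 13 in xs)
# minimal is now [13]: the smallest input that still reproduces
```

The same idea applies to lines of a config file, steps of a workflow, or fields of a request body.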
Comparison Matrix
Fill this out to identify the pattern:
| Factor | Working Case | Broken Case | Different? |
|---|---|---|---|
| Input data | | | |
| Environment | | | |
| Configuration | | | |
| Dependencies | | | |
| Timing/order | | | |
| User/permissions | | | |
| State/context | | | |
STOP — HARD-GATE: Do NOT proceed to Phase 3 until:
- You have identified at least one working case for comparison
- You have compared working vs broken and identified differences
- You have isolated the scope to the smallest reproducible case
- Dependencies have been verified (versions, availability, config)
Phase 3: Hypothesis and Testing
Goal: Form ONE specific, testable hypothesis and verify it with the smallest possible change.
Actions
- Form ONE hypothesis. Based on evidence from Phases 1-2, what is the single most likely cause?
- State it explicitly: "The bug occurs because [specific cause]"
- If you cannot state it specifically, go back to Phase 1 or 2
- Design a minimal test. What is the smallest change to confirm or deny this hypothesis?
- Prefer adding a test case over modifying production code
- Prefer logging/assertions over code changes
- Prefer reverting a change over writing new code
- Apply the change and test.
- Make ONLY the change needed to test the hypothesis
- Run the test suite
- Observe the result
- Evaluate.
- If CONFIRMED: proceed with the fix, write a regression test
- If DENIED: record what you learned, form a new hypothesis, return to step 1
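The loop above can be sketched as code. `parse_amount` is a hypothetical function under investigation (shown post-fix); note that the hypothesis is confirmed by running a new test case, not by editing production code or rereading it:

```python
def parse_amount(text: str) -> int:
    """Hypothetical function under investigation (shown after the fix)."""
    return int(text.strip() or 0)


# Hypothesis #1: "The bug occurs because empty strings reach int()."
# Minimal test: a new test case, not a production-code change.
def test_regression_empty_input_returns_zero():
    assert parse_amount("") == 0
    assert parse_amount("   ") == 0
    assert parse_amount(" 7 ") == 7


# CONFIRMED or DENIED by observing the test result, not by intuition.
test_regression_empty_input_returns_zero()
```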
Hypothesis Log Template
Hypothesis #1: [description]
Test: [what you did]
Result: CONFIRMED / DENIED
Learning: [what this taught you]
Hypothesis #2: ...
Decision Table: Hypothesis Testing Approach
| Hypothesis Type | Testing Method | Example |
|---|---|---|
| Recent code change caused it | `git bisect` or revert commit | "The bug was introduced in commit abc123" |
| Data shape mismatch | Add logging/assertion | "The API returns null instead of array" |
| Race condition | Add timing logs or serialize | "Request B completes before request A" |
| Configuration error | Compare configs across environments | "Production uses different DB host" |
| Dependency version issue | Lock to known-good version | "Library 2.0 changed the API surface" |
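For the data-shape row above, an assertion placed at the boundary turns the hypothesis into an immediate pass/fail instead of a confusing failure several layers deeper. A sketch with a hypothetical `check_items` helper and response shape:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("debug")


def check_items(response: dict) -> list:
    """Hypothesis: 'The API returns None instead of a list.'"""
    items = response.get("items")
    log.debug("items type=%s value=%r", type(items).__name__, items)
    # Fails loudly at the exact point the shape diverges, rather than
    # as an opaque TypeError three call frames later.
    assert isinstance(items, list), f"expected list, got {type(items).__name__}"
    return items


check_items({"items": [1, 2, 3]})   # passes
# check_items({"items": None})      # fails here, confirming the hypothesis
```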
STOP — HARD-GATE: Do NOT proceed to Phase 4 unless:
- You have tested at least 3 hypotheses and ALL were denied
- Each hypothesis was specific and testable
- Each test was minimal (one change at a time)
- You recorded learnings from each failed hypothesis
Phase 4: Architecture Questioning
Goal: If 3+ hypotheses have failed, the problem may be structural. Step back and question assumptions.
This phase is triggered ONLY after Phase 3 has been attempted at least 3 times without success.
Actions
- Question your assumptions. What have you been assuming is true that might not be?
- Is the data shaped the way you think it is?
- Is the control flow what you expect?
- Are the types what you think they are?
- Is the API contract what you assumed?
- Question the design. Is the current approach fundamentally flawed?
- Is there a race condition in the design?
- Is there a state management problem?
- Is there an incorrect abstraction?
- Are responsibilities misplaced?
- Consider redesign. Sometimes the fix is not a patch but a restructuring.
- Can you simplify the design to eliminate the bug class entirely?
- Is there a pattern that handles this case better?
- Should you replace rather than fix?
- Seek external input. If you are stuck:
- Explain the problem to someone else (rubber duck debugging)
- Search for known issues in dependencies
- Check if others have encountered similar problems
STOP — HARD-GATE: Do NOT continue without:
- Written list of assumptions that were questioned
- Explicit decision: patch the current design OR redesign
- If redesigning: a plan before implementing
- If patching: a new hypothesis informed by the assumption review
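The questioned assumptions can be recorded as executable checks rather than prose, so the assumption review produces evidence instead of opinions. A sketch with hypothetical field names:

```python
def verify_assumptions(record: dict) -> list:
    """Each questioned assumption becomes one explicit, named check."""
    assumptions = {
        "id is an int, not a str": isinstance(record.get("id"), int),
        "tags is always a list": isinstance(record.get("tags"), list),
        "status is a known value": record.get("status") in {"open", "closed"},
    }
    failed = [name for name, holds in assumptions.items() if not holds]
    return failed  # empty list means every assumption held


# A record that silently violates one assumption:
failed = verify_assumptions({"id": "42", "tags": [], "status": "open"})
# failed == ["id is an int, not a str"]
```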
Debugging Decision Flowchart
Error encountered
|
v
Can you reproduce it?
|
+-- NO --> Gather more information (logs, user reports, monitoring)
| Try different inputs, environments, timing
| Do NOT proceed until reproducible
|
+-- YES -> Read the FULL error message and stack trace
|
v
Is the cause obvious from the error?
|
+-- YES -> Form hypothesis, test it (Phase 3)
| Still write a regression test
|
+-- NO --> Complete Phase 1 evidence gathering
|
v
Find working case for comparison (Phase 2)
|
v
Identify differences
|
v
Form and test hypotheses (Phase 3)
|
+-- Fixed --> Write regression test, verify
|
+-- 3+ failed hypotheses --> Phase 4
Red Flags Table
| Red Flag | What It Means | Action |
|---|---|---|
| Changing code without understanding the bug | Shotgun debugging | Go back to Phase 1 |
| Fix works but you do not know why | Accidental fix, likely to regress | Investigate until you understand |
| Same bug keeps coming back | Root cause not addressed | Go to Phase 4, question design |
| Fix causes new bugs elsewhere | Unexpected coupling | Map dependencies before proceeding |
| "It works on my machine" | Environment difference | Go to Phase 2, comparison matrix |
| Fix requires more than 20 lines | Might be a design issue | Go to Phase 4 |
| Debugging for 30+ minutes | Tunnel vision | Take a break, re-read evidence from Phase 1 |
| Reading the same code repeatedly | Missing something fundamental | Get a fresh perspective, explain aloud |
| Multiple causes seem equally likely | Insufficient investigation | Go back to Phase 1, gather more evidence |
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Changing random things to see if bug goes away | Wastes time, introduces new bugs | Form a hypothesis first |
| Adding try/catch to suppress the error | Hides the real problem | Fix the root cause |
| Rewriting the feature from scratch | Nuclear option is rarely needed | Isolate and fix the specific issue |
| Blaming the framework/library without evidence | Usually your code is wrong | Prove the framework bug with minimal repro |
| Skipping the regression test after fixing | Bug will return | Write the test, always |
| Fixing symptoms instead of root causes | Patches accumulate, system degrades | Trace to the actual cause |
| Debugging for 45+ minutes without stepping back | Tunnel vision reduces effectiveness | Take a break, re-read Phase 1 evidence |
| Ignoring error messages or stack traces | The answer is often in the error | Read every line of the error |
Integration Points
| Skill | Relationship |
|---|---|
| `test-driven-development` | Every bug fix MUST include a regression test (RED-GREEN cycle) |
| `verification-before-completion` | After fixing a bug, verify with fresh evidence |
| `resilient-execution` | When debugging during task execution, pause the task, complete debugging, resume |
| `code-review` | Review the fix for completeness and side effects |
| `self-learning` | Record new debugging patterns in learned-patterns.md |
| `acceptance-testing` | Verify the fix does not break acceptance criteria |
Quick Reference: What NOT To Do
- Do NOT change random things and see if the bug goes away
- Do NOT add try/catch to suppress the error
- Do NOT rewrite the feature from scratch as a first resort
- Do NOT blame the framework/library without evidence
- Do NOT skip writing a regression test after fixing
- Do NOT fix symptoms instead of root causes
- Do NOT debug for more than 45 minutes without stepping back
- Do NOT ignore error messages or stack traces
Skill Type
RIGID — The 4-phase process is mandatory and must be followed in order. Each phase has a HARD-GATE that must be satisfied before proceeding. Never change code without understanding why it is broken.