Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: Root cause investigation must precede any fix attempt. Symptom fixes represent process failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
Fixes cannot be proposed without completing Phase 1.
When to Apply
Systematic debugging applies to ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Especially valuable when:
- Time pressure creates temptation to guess
- "Quick fix" seems obvious
- Multiple fixes have already been attempted
- Previous fixes failed
- Issue is not fully understood
Process should not be skipped even when:
- Issue appears simple (simple bugs have root causes too)
- Time is tight (systematic approach is faster than thrashing)
- Urgency exists (investigation is faster than rework)
The Four Phases
Each phase must be completed before proceeding to the next.
Phase 1: Root Cause Investigation
Before attempting any fix:
- Read Error Messages Carefully
  - Error messages and warnings often contain solutions
  - Read stack traces completely
  - Note line numbers, file paths, error codes
- Reproduce Consistently
  - Determine if the issue triggers reliably
  - Identify exact steps
  - Confirm reproducibility
  - If not reproducible, gather more data instead of guessing
- Check Recent Changes
  - Git diff and recent commits
  - New dependencies, config changes
  - Environmental differences
- Gather Evidence in Multi-Component Systems
  For systems with multiple components (CI -> build -> signing, API -> service -> database), add diagnostic instrumentation before proposing fixes.
  For EACH component boundary:
  - Log what data enters the component
  - Log what data exits the component
  - Verify environment/config propagation
  - Check state at each layer
  Run once to gather evidence showing WHERE it breaks, THEN analyze the evidence to identify the failing component, THEN investigate that specific component.
  Multi-layer system example:
  ```shell
  # Layer 1: Workflow
  echo "=== Secrets available in workflow: ==="
  # Print only SET/UNSET so the secret value never reaches the log
  if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

  # Layer 2: Build script
  echo "=== Env vars in build script: ==="
  # Match the variable name only; a plain `env | grep IDENTITY` would print the value
  env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

  # Layer 3: Signing script
  echo "=== Keychain state: ==="
  security list-keychains
  security find-identity -v

  # Layer 4: Actual signing
  codesign --sign "$IDENTITY" --verbose=4 "$APP"
  ```
  This reveals which layer fails.
- Trace Data Flow
  When the error is deep in the call stack, see ./references/root-cause-tracing.md for the complete backward tracing technique.
  Quick approach:
  - Identify where the bad value originates
  - Determine what called this with the bad value
  - Continue tracing up until the source is found
  - Fix at the source, not at the symptom
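As a rough illustration of the Reproduce Consistently step, the sketch below runs a command several times and reports how many runs failed, which helps separate a reliable failure from an intermittent one. The helper name `count_failures` and the stand-in commands `true` and `false` are assumptions for illustration; substitute the real test invocation.

```shell
# Run a command repeatedly and report how often it fails.
# `count_failures` is a hypothetical helper, not part of any tool.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded here; rerun by hand to read the actual error.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# Stand-in commands in place of a real test invocation
count_failures 5 false   # prints "5/5 runs failed"
count_failures 5 true    # prints "0/5 runs failed"
```

A result such as 2/20 suggests a timing or environment dependency, which is a cue to gather more data rather than guess.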
Phase 2: Pattern Analysis
Pattern identification should precede any fix:
- Find Working Examples
  - Locate similar working code in the same codebase
  - Identify working code similar to what's broken
- Compare Against References
  - If implementing a pattern, read the reference implementation completely
  - Read every line, do not skim
  - Understand the pattern fully before applying it
- Identify Differences
  - List every difference between working and broken code
  - Do not dismiss small differences as irrelevant
- Understand Dependencies
  - Other components required by this operation
  - Settings, config, environment needed
  - Assumptions made by the pattern
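The Identify Differences step can be made mechanical with `diff`. The following is a minimal sketch, assuming the working and broken variants live in two files; the helper name `list_differences` and the config contents are made up for illustration.

```shell
# Print every line-level difference between a working and a broken file.
# `list_differences` is a hypothetical helper name.
list_differences() {
    # diff exits 0 when the files match and 1 when they differ; both are
    # acceptable here. Any other status (e.g. a missing file) is reported
    # as a failure.
    diff -u "$1" "$2" || [ "$?" -eq 1 ]
}

# Made-up config files for the demonstration
demo_dir=$(mktemp -d)
printf 'timeout=30\nretries=3\nregion=us-east-1\n' > "$demo_dir/working.conf"
printf 'timeout=30\nretries=0\nregion=us-east-1\n' > "$demo_dir/broken.conf"

# The output includes the lines "-retries=3" and "+retries=0"
list_differences "$demo_dir/working.conf" "$demo_dir/broken.conf"
rm -rf "$demo_dir"
```

Reading the full diff, rather than scanning for the expected culprit, keeps small differences from being dismissed.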
Phase 3: Hypothesis and Testing
Scientific method application:
- Form Single Hypothesis
  - State clearly: "X is the root cause because Y"
  - Be specific, not vague
- Test Minimally
  - Make the smallest possible change to test the hypothesis
  - One variable at a time
  - Do not fix multiple things simultaneously
- Verify Before Continuing
  - If hypothesis confirmed: proceed to Phase 4
  - If not confirmed: form new hypothesis
  - Do not add more fixes on top
- When Understanding is Missing
  - Acknowledge lack of understanding
  - Ask for help
  - Research more
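One way to apply "one variable at a time" is to run the same command twice, changing exactly one environment variable between runs. This is a sketch under stated assumptions: `test_one_variable` and `DEMO_MODE` are hypothetical names, and the `sh -c` command stands in for a real failing command.

```shell
# Compare a command's outcome with and without one environment override.
# `test_one_variable` and `DEMO_MODE` are hypothetical names.
test_one_variable() {
    override=$1          # a single NAME=value pair
    shift
    # The command must be an external program: `env` cannot run shell functions.
    if "$@" >/dev/null 2>&1; then baseline=pass; else baseline=fail; fi
    if env "$override" "$@" >/dev/null 2>&1; then changed=pass; else changed=fail; fi
    echo "baseline=$baseline, with $override: $changed"
}

# Stand-in for a failing command: succeeds only when DEMO_MODE=fixed.
test_one_variable DEMO_MODE=fixed sh -c '[ "${DEMO_MODE:-}" = "fixed" ]'
# prints "baseline=fail, with DEMO_MODE=fixed: pass" when DEMO_MODE is not already set
```

A fail-to-pass flip supports the hypothesis that the variable is involved; no change means forming a new hypothesis rather than stacking another change on top.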
Phase 4: Implementation
Fix the root cause, not the symptom:
- Create Failing Test Case
  - Simplest possible reproduction
  - Automated test if possible
  - One-off test script if no framework
  - Test must exist before fixing
- Implement Single Fix
  - Address the identified root cause
  - One change at a time
  - No "while I'm here" improvements
  - No bundled refactoring
- Verify Fix
  - Test now passes?
  - No other tests broken?
  - Issue actually resolved?
- If Fix Doesn't Work
  - Stop
  - Count attempted fixes
  - If < 3: Return to Phase 1, re-analyze with new information
  - If >= 3: Question architecture
- Architecture Questioning After 3+ Failed Fixes
  Patterns indicating an architectural problem:
  - Each fix reveals new shared state/coupling/problem in a different place
  - Fixes require "massive refactoring"
  - Each fix creates new symptoms elsewhere
  Stop and question fundamentals:
  - Is the pattern fundamentally sound?
  - Is the approach continuing through inertia?
  - Should the architecture be refactored vs. fixing symptoms?
  Discuss with human partner before attempting more fixes.
  This is not a failed hypothesis - this is wrong architecture.
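For the Create Failing Test Case step, when no test framework is available, a one-off script can be as small as a single assertion helper. The sketch below is illustrative only: `assert_eq` and `trim_slash` are hypothetical names, and `trim_slash` stands in for the code under test.

```shell
# Minimal assertion helper for a one-off reproduction script.
# `assert_eq` is a hypothetical name, not part of any framework.
assert_eq() {
    expected=$1
    actual=$2
    label=$3
    if [ "$expected" = "$actual" ]; then
        echo "PASS: $label"
    else
        echo "FAIL: $label (expected '$expected', got '$actual')"
        return 1
    fi
}

# Stand-in for the code under test: removes one trailing slash from a path.
trim_slash() { printf '%s\n' "${1%/}"; }

assert_eq "/tmp/build" "$(trim_slash "/tmp/build/")" "trailing slash is removed"
# prints "PASS: trailing slash is removed"
```

Written before the fix, the same script prints FAIL and returns non-zero, which is the evidence that it reproduces the bug.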
Complex Bugs and Planning
For complex bugs, planning must precede any code changes:
When Bug is Complex
A bug requires EnterPlanMode before making changes when ANY of these apply:
- Multi-component involvement - Issue spans multiple files, modules, or subsystems
- Architecture implications - Fix may affect system design, contracts, or interfaces
- Multiple potential approaches - Several valid implementation paths exist
- Side-effect risk - Change could impact unrelated functionality
- Requires refactoring - Fix needs structural changes beyond minimal patch
- Not fully understood - After Phase 1 investigation, root cause is still unclear
Planning Process
- Complete Phase 1 (Root Cause Investigation)
  - Must understand WHAT is broken and WHY before planning
  - Gather all evidence first
- Use EnterPlanMode tool
  - This signals to the user that approval is needed before proceeding
  - Allows the user to review the approach before implementation
- Write implementation plan covering:
  - Root cause summary (from Phase 1)
  - Proposed fix strategy
  - Files that will be modified
  - Tests to be created/modified
  - Potential risks and mitigation
  - Alternative approaches considered
- Wait for user approval
  - User may suggest a different approach
  - User may provide additional context
  - User may approve as-is
Why Planning for Complex Bugs
- Prevents expensive rework from wrong architectural choices
- Ensures alignment with user preferences and constraints
- Catches overlooked dependencies early
- Provides visibility into proposed changes before execution
For simple bugs: Continue with Phases 2-4 directly, without planning.
Red Flags
These mental patterns indicate process violation and require returning to Phase 1:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
If 3+ fixes failed: Question the architecture.
Human Partner Signals
Watch for these redirections:
- "Is that not happening?" - Indicates assumption without verification
- "Will it show us...?" - Indicates missing evidence gathering
- "Stop guessing" - Indicates proposing fixes without understanding
- "Ultrathink this" - Indicates need to question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Indicates current approach isn't working
When encountering these signals: Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms != understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
When Process Reveals No Root Cause
If systematic investigation reveals issue is environmental, timing-dependent, or external:
- Process has been completed
- Document what was investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
Note: 95% of "no root cause" cases represent incomplete investigation.
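Where investigation does point to an external or timing-dependent cause, "appropriate handling" might look like a bounded retry that logs each attempt. This is a minimal sketch, not a recommended default; the helper name `retry` and its argument order are assumptions.

```shell
# Retry a command up to max_attempts times with a fixed delay between tries.
# `retry` is a hypothetical helper; each failed attempt is logged to stderr
# so the failures stay visible for later investigation.
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
}

# `false` never succeeds, so this gives up after 2 attempts.
retry 2 0 false || echo "handled failure"
```

The retry is bounded and logged on purpose: an unbounded or silent retry would hide the very evidence that future investigation needs.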
Supporting Techniques
These techniques are part of systematic debugging:
- ./references/root-cause-tracing.md - Trace bugs backward through call stack to find original trigger
- ./references/defense-in-depth.md - Add validation at multiple layers after finding root cause
- ./references/condition-based-waiting.md - Replace arbitrary timeouts with condition polling
- ./references/condition-based-waiting-example.ts - Example implementation of condition-based waiting
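To give a sense of what condition polling means in practice, here is a small shell sketch that waits for a condition instead of sleeping for a fixed time. It is not the referenced TypeScript example; the helper name `wait_for` and the marker file are assumptions for illustration.

```shell
# Poll a condition once per second until it succeeds or the timeout elapses.
# `wait_for` is a hypothetical helper. It counts sleep intervals, so the
# timeout is approximate when the condition itself is slow to evaluate.
wait_for() {
    timeout_seconds=$1
    shift
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout_seconds" ]; then
            echo "timed out after ${timeout_seconds}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demonstration: a background job creates a marker file after about one second.
demo_dir=$(mktemp -d)
( sleep 1; : > "$demo_dir/ready" ) &
wait_for 5 test -e "$demo_dir/ready" && echo "ready"
rm -rf "$demo_dir"
```

Compared with a fixed `sleep 5`, this returns as soon as the condition holds and fails loudly when it never does.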
Related skills:
- superpowers:behavior-driven-development - BDD principles including Gherkin scenarios for test design
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common