systematic-debugging
Systematic Debugging Skill
Overview
This skill provides a structured four-phase debugging framework that puts root cause discovery ahead of any attempted fix. Core principle: "Random fixes waste time and create new bugs. Quick patches mask underlying issues."
Quick Start
- Investigate - Gather evidence, reproduce consistently
- Analyze - Compare with working patterns
- Hypothesize - Form and test specific theories
- Implement - Fix with test coverage
When to Use
- Bug reports requiring investigation
- Test failures with unclear causes
- Production incidents
- Performance regressions
- Integration failures
- Any debugging that requires more than 5 minutes
The Four Phases
Phase 1: Root Cause Investigation
Objective: Understand the problem completely before attempting any fix.
Steps:
- Examine error messages thoroughly
- Reproduce the issue consistently
- Review recent changes (commits, configs, dependencies)
- Gather diagnostic evidence (logs, traces, metrics)
- For multi-component systems, add instrumentation at each boundary
Questions to answer:
- What exactly is failing?
- When did it start failing?
- What changed recently?
- Can I reproduce it reliably?
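The question "When did it start failing?" can often be answered mechanically by bisecting an ordered history. The sketch below is a minimal illustration, assuming a list of revisions ordered oldest to newest where the first is known good, the last is known bad, and the failure persists once introduced. The names `first_bad` and `is_bad` are hypothetical; in a real repository, `git bisect` performs the same search.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(revisions: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions[0] is good, revisions[-1] is bad, and every
    revision after the first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # failure is already present at mid
        else:
            lo = mid  # failure was introduced after mid
    return revisions[hi]


# Hypothetical history in which the bug was introduced in revision "r5".
history = [f"r{i}" for i in range(1, 9)]
culprit = first_bad(history, lambda rev: int(rev[1:]) >= 5)
print(culprit)
```

In practice `is_bad` would check out the revision and run the reproduction steps, which is why a reliable reproduction has to come first.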
Phase 2: Pattern Analysis
Objective: Find working examples and understand differences.
Steps:
- Locate working examples in the codebase
- Compare against reference implementations in full, not only the parts that look relevant
- Identify differences systematically
- Understand all dependencies
Key comparisons:
- Working vs. broken code paths
- Expected vs. actual behavior
- Known good state vs. current state
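The "known good state vs. current state" comparison can be made systematic by diffing two snapshots key by key. This is a minimal sketch, assuming both states can be represented as flat dictionaries with string keys; `diff_state` and the sample settings are hypothetical.

```python
from typing import Any, Dict, Tuple

_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def diff_state(good: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to (known-good value, current value)."""
    diffs: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(good) | set(current)):
        g = good.get(key, _MISSING)
        c = current.get(key, _MISSING)
        if g != c:
            diffs[key] = (
                "<absent>" if g is _MISSING else g,
                "<absent>" if c is _MISSING else c,
            )
    return diffs


# Hypothetical configuration snapshots.
known_good = {"timeout_s": 30, "retries": 3, "tls": True}
current = {"timeout_s": 5, "retries": 3, "cache": "on"}

for key, (good_value, current_value) in diff_state(known_good, current).items():
    print(f"{key}: {good_value!r} -> {current_value!r}")
```

Each reported key is a candidate difference to carry into the hypothesis phase, not yet a confirmed cause.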
Phase 3: Hypothesis and Testing
Objective: Form and validate theories before changing code.
Steps:
- Formulate a specific hypothesis
- Design a test for the hypothesis
- Test with minimal changes (one variable at a time)
- Verify results before proceeding
Hypothesis format: "The bug occurs because [condition] when [trigger], which causes [symptom]."
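One way to keep hypotheses specific is to record them in the format above and to test candidate variables one at a time. The following is a sketch under stated assumptions: the `Hypothesis` class, the `isolate_variable` helper, and the configuration keys are all hypothetical, and `fails` stands in for whatever reproduction check the investigation produced.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Hypothesis:
    condition: str
    trigger: str
    symptom: str

    def statement(self) -> str:
        return (
            f"The bug occurs because {self.condition} "
            f"when {self.trigger}, which causes {self.symptom}."
        )


def isolate_variable(
    good: Dict[str, Any],
    bad: Dict[str, Any],
    fails: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Apply each differing setting to the good config on its own.

    Returns the keys that trigger the failure when changed alone.
    """
    culprits = []
    for key, bad_value in bad.items():
        if good.get(key) != bad_value:
            trial = {**good, key: bad_value}  # exactly one variable changed
            if fails(trial):
                culprits.append(key)
    return culprits


good_cfg = {"pool_size": 10, "timeout_s": 30}
bad_cfg = {"pool_size": 1, "timeout_s": 5}
# Stand-in for a real reproduction: here the failure depends only on pool_size.
culprits = isolate_variable(good_cfg, bad_cfg, lambda cfg: cfg["pool_size"] < 2)

hypothesis = Hypothesis(
    condition="the connection pool holds a single connection",
    trigger="two requests arrive concurrently",
    symptom="the second request to time out",
)
print(culprits)
print(hypothesis.statement())
```

If no single key reproduces the failure, the cause may be an interaction between settings, which calls for a revised hypothesis rather than changing several things at once.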
Phase 4: Implementation
Objective: Fix the root cause with proper verification.
Steps:
- Create a failing test case reproducing the bug
- Implement a single fix addressing the root cause
- Verify the test passes
- Verify no other tests broke
- Document the fix
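As an illustration of the first four steps, the sketch below pairs a fix with the regression test that reproduces the original report. The `chunk` function and its bug are hypothetical; the tests are written as plain `test_*` functions with bare asserts so that a runner which collects such functions (pytest, for example) can execute them.

```python
def chunk(items, size):
    """Split items into lists of at most `size` elements.

    Hypothetical fixed version: the buggy one iterated over
    range(len(items) // size) and silently dropped a trailing partial chunk.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def test_trailing_partial_chunk_is_kept():
    # Reproduces the reported bug: 5 items in chunks of 2 lost the last item.
    # This test is written first and must fail against the buggy version.
    assert chunk([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]


def test_exact_multiple_still_works():
    # Guards behavior that already worked before the fix.
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]
```

The test names describe the behavior, so the test file itself serves as part of the documentation of the fix.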
Critical Safeguards
Hard Stop Rule
If >= 3 fixes fail: STOP and question the architecture.
Repeated failed fixes usually point to a deeper structural problem that calls for discussion, not further symptom-patching.
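The rule is easier to follow when failed attempts are counted explicitly. A minimal sketch, assuming attempts are recorded by hand during a debugging session; `FixAttempts` and `HardStop` are hypothetical names.

```python
from typing import List


class HardStop(Exception):
    """Signals that fixing must pause for an architecture discussion."""


class FixAttempts:
    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.failed: List[str] = []

    def record_failure(self, description: str) -> None:
        """Record a failed fix; raise HardStop once the limit is reached."""
        self.failed.append(description)
        if len(self.failed) >= self.limit:
            raise HardStop(
                f"{len(self.failed)} fixes failed: " + "; ".join(self.failed)
            )


attempts = FixAttempts()
attempts.record_failure("increase timeout")
attempts.record_failure("add retry")
try:
    attempts.record_failure("clear cache on start")
except HardStop as stop:
    print(stop)
```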
Red Flags (Restart Process)
- Proposing solutions before investigation
- Attempting multiple simultaneous fixes
- Assuming without verification
- Skipping reproduction step
- "It should work" without evidence
Debugging Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| Shotgun debugging | Random changes hoping something works | Systematic investigation |
| Printf debugging only | Incomplete picture | Structured instrumentation |
| Blame the framework | Avoids understanding | Verify framework behavior |
| "Works on my machine" | Environment assumptions | Document exact repro steps |
| Quick patch | Hides root cause | Find and fix actual cause |
Instrumentation Strategies
Logging Strategy
1. Entry/exit of suspected functions
2. Input/output values at boundaries
3. State changes at key points
4. Timing information for performance issues
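Items 1, 2, and 4 of this list can be covered by a single decorator built on the standard `logging` module. This is a sketch, not a prescribed implementation: the logger name `debug.trace`, the `traced` decorator, and the `apply_discount` function are assumptions made for illustration. State changes (item 3) still need explicit log calls at the points where state is mutated.

```python
import functools
import logging
import time

logger = logging.getLogger("debug.trace")


def traced(func):
    """Log entry, arguments, result or exception, and elapsed time of func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("raise %s after %.3f ms", func.__name__, elapsed_ms)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.debug(
            "exit %s result=%r elapsed=%.3f ms", func.__name__, result, elapsed_ms
        )
        return result

    return wrapper


@traced
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    apply_discount(80.0, 25)
```

Because the decorator re-raises exceptions unchanged, it can be added to and removed from suspected functions without altering their behavior.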
Boundary Tracing
For multi-component systems:
[Input] -> [Component A] -> [Component B] -> [Output]
   ^             ^                ^             ^
   |             |                |             |
Check 1       Check 2          Check 3       Check 4
Add verification at each boundary to isolate the failure point.
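Boundary checks can be expressed as a small harness that runs data through the components and reports the first boundary where a check fails. The sketch below assumes each component is a plain function and places one check on the input and one on the output of every component; `first_failing_boundary` and the sample stages are hypothetical.

```python
from typing import Any, Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]
Check = Callable[[Any], bool]


def first_failing_boundary(
    value: Any, stages: List[Stage], checks: List[Check]
) -> Optional[str]:
    """Run value through stages, validating the data at every boundary.

    checks[0] validates the input; checks[i + 1] validates the output of
    stages[i]. Returns a label for the first failing boundary, or None.
    """
    if len(checks) != len(stages) + 1:
        raise ValueError("need len(stages) + 1 checks, one per boundary")
    if not checks[0](value):
        return "input"
    for (name, stage), check in zip(stages, checks[1:]):
        value = stage(value)
        if not check(value):
            return f"output of {name}"
    return None


# Hypothetical two-component pipeline: split a CSV line, then convert to ints.
stages = [
    ("parse", lambda line: line.split(",")),
    ("to_int", lambda parts: [int(p) if p.strip().isdigit() else None for p in parts]),
]
checks = [
    lambda line: isinstance(line, str),
    lambda parts: len(parts) == 3,
    lambda nums: all(n is not None for n in nums),
]

print(first_failing_boundary("1,2,x", stages, checks))
```

The returned label narrows the investigation to one component: everything upstream of the failing boundary produced data that passed its check.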
Best Practices
Do
- Reproduce before investigating
- Document investigation steps
- Test one hypothesis at a time
- Write regression test for every bug fix
- Share findings with team
- Update documentation when the cause is environment-related
Don't
- Jump to conclusions
- Make multiple changes at once
- Fix symptoms instead of causes
- Skip the hypothesis step
- Merge fixes without tests
- Ignore intermittent failures
Error Handling
| Situation | Action |
|---|---|
| Cannot reproduce | Gather more context, check environment differences |
| Multiple potential causes | Isolate and test each separately |
| Fix breaks other things | Revert, investigate dependencies |
| Root cause unclear after investigation | Add more instrumentation; escalate if still unclear |
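For the "Cannot reproduce" case, environment differences are easier to check when each machine records the same snapshot. This is a minimal sketch using only the standard library; the function names, the key names, and the default list of environment variables are assumptions, and a real project would likely add dependency versions and configuration values.

```python
import os
import platform
import sys
from typing import Dict, Iterable, List


def environment_snapshot(env_vars: Iterable[str] = ("PATH", "LANG", "TZ")) -> Dict[str, str]:
    """Collect basic facts about the current runtime environment."""
    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }
    for name in env_vars:
        snapshot[f"env:{name}"] = os.environ.get(name, "<unset>")
    return snapshot


def differing_keys(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Return the sorted keys whose values differ between two snapshots."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

Attaching the snapshot from both the failing and the working machine to the bug report turns "works on my machine" into a concrete list of differences to test.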
Metrics
| Metric | Target | Description |
|---|---|---|
| First-fix success rate | >80% | Fixes that resolve issue first time |
| Regression rate | <5% | Bug fixes causing new bugs |
| Investigation time ratio | >60% | Share of debugging time spent investigating rather than coding |
| Documentation rate | 100% | Bugs documented with root cause |
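The four metrics can be computed from simple per-bug records. The sketch below is one possible interpretation, not a definition from the source: the `BugRecord` fields are hypothetical, and the investigation time ratio is assumed to mean investigation hours divided by the sum of investigation and coding hours.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BugRecord:
    fix_attempts: int            # attempts needed before the issue was resolved
    caused_regression: bool      # the fix introduced a new bug
    investigation_hours: float
    coding_hours: float
    root_cause_documented: bool


def debugging_metrics(records: List[BugRecord]) -> Dict[str, float]:
    """Return each metric as a percentage over the given records."""
    n = len(records)
    investigating = sum(r.investigation_hours for r in records)
    total_hours = investigating + sum(r.coding_hours for r in records)
    if n == 0 or total_hours == 0:
        raise ValueError("need at least one record with recorded time")
    return {
        "first_fix_success_rate": 100 * sum(r.fix_attempts == 1 for r in records) / n,
        "regression_rate": 100 * sum(r.caused_regression for r in records) / n,
        "investigation_time_ratio": 100 * investigating / total_hours,
        "documentation_rate": 100 * sum(r.root_cause_documented for r in records) / n,
    }


# Hypothetical sample data for four resolved bugs.
sample = [
    BugRecord(1, False, 3.0, 1.0, True),
    BugRecord(1, False, 2.0, 2.0, True),
    BugRecord(2, True, 1.0, 3.0, True),
    BugRecord(1, False, 6.0, 2.0, False),
]
metrics = debugging_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

With this sample, three of four bugs were fixed on the first attempt (75%), below the >80% target, while 12 of 20 recorded hours went to investigation (60%).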
Debugging Checklist
- Issue reproduced consistently
- Recent changes reviewed
- Error messages fully understood
- Working comparison found
- Hypothesis documented
- Single-variable test performed
- Root cause identified
- Failing test written
- Fix implemented
- All tests pass
- Fix documented
Related Skills
- tdd-obra - Test-first development
- writing-plans - Plan implementations
- code-reviewer - Code quality review
Version History
- 1.0.0 (2026-01-19): Initial release adapted from obra/superpowers