debugging-mastery
SKILL.md
Debugging Mastery
Systematic debugging methodology. No guessing—scientific method applied to code.
The Debugging Mindset
Debugging is not about fixing. It's about understanding.
Wrong approach:
Bug reported → Try random fixes → Hope it works → Ship
Right approach:
Bug reported → Reproduce → Isolate → Understand → Fix → Verify → Prevent
Senior engineers spend 80% of debugging time understanding the problem and 20% fixing it. Juniors invert this ratio and end up spending more time overall.
The Scientific Method for Debugging
┌─────────────────────────────────────────────────────────────────┐
│ THE DEBUGGING LOOP │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. OBSERVE │
│ └── What exactly is happening? (not what you think) │
│ │
│ 2. HYPOTHESIZE │
│ └── What could cause this? (list multiple hypotheses) │
│ │
│ 3. PREDICT │
│ └── If hypothesis X is true, what else should be true? │
│ │
│ 4. TEST │
│ └── Design experiment to confirm/refute hypothesis │
│ │
│ 5. CONCLUDE │
│ └── Was hypothesis correct? If not, return to step 2 │
│ │
└─────────────────────────────────────────────────────────────────┘
Phase 1: Reproduce
If you can't reproduce it, you can't fix it.
Reproduction Checklist
- [ ] Can I reproduce locally?
- [ ] Can I reproduce consistently (not flaky)?
- [ ] What's the minimal reproduction case?
- [ ] What are the exact steps?
- [ ] What's the environment? (OS, versions, config)
- [ ] What's the input data?
- [ ] What's the expected vs actual behavior?
Reproduction Strategies
| Situation | Strategy |
|---|---|
| Can reproduce locally | Ideal—proceed to isolate |
| Only in production | Add logging, capture state, replicate data |
| Intermittent/flaky | Increase frequency: loops, load, timing changes |
| User-reported | Get exact steps, screenshots, network logs |
| Race condition | Add delays, increase concurrency, stress test |
The Minimal Reproduction
Strip away everything unnecessary:
Full system: Minimal repro:
┌────────────┐ ┌────────────┐
│ Frontend │ │ │
│ + API │ ───► │ Single │
│ + Auth │ │ Function │
│ + Cache │ │ + Input │
│ + Database │ │ │
└────────────┘ └────────────┘
Why minimal?
- Fewer variables = faster isolation
- Shareable with others
- Becomes regression test
- Often reveals the bug itself
Phase 2: Isolate
Binary search the problem space.
Binary Search Debugging
Don't: Check every line sequentially (O(n))
Do: Binary search the problem space (O(log n))
┌─────────────────────────────────────────────┐
│ Entire codebase │
└─────────────────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ First half │ │ Second half │
│ (works?) │ │ (works?) │
└─────────────────┘ └─────────────────┘
│
┌────┴────┐
▼ ▼
┌───────┐ ┌───────┐
│ 1/4 │ │ 2/4 │ ← Bug is here
└───────┘ └───────┘
Isolation Techniques
| Technique | How | When |
|---|---|---|
| Git bisect | Binary search commits | Bug worked before, broken now |
| Comment out | Disable code sections | Isolate problematic component |
| Hardcode | Replace variables with constants | Isolate data vs logic |
| Mock | Replace dependencies | Isolate integration vs logic |
| Simplify | Remove features until bug disappears | Find minimum trigger |
Git Bisect Example
git bisect start
git bisect bad # Current commit is broken
git bisect good abc123 # This commit worked
# Git checks out middle commit
# Test and mark:
git bisect good # or git bisect bad
# Repeat until found
git bisect reset
Phase 3: Understand
Don't fix what you don't understand.
Read the Error Message
90% of developers don't actually read error messages.
Error: Cannot read property 'map' of undefined
at UserList (UserList.js:15:23)
at renderWithHooks (react-dom.js:1234)
at mountIndeterminateComponent (react-dom.js:5678)
This tells you:
1. WHAT: Trying to call .map() on undefined
2. WHERE: UserList.js, line 15, column 23
3. CONTEXT: During React render
Don't guess. Read.
Understanding Questions
Before fixing, answer:
- What is the expected behavior?
- What is the actual behavior?
- What changed? (code, data, environment, dependencies)
- Where does the incorrect state originate?
- Why does the code behave this way? (not "why is it broken")
Trace the Data Flow
Input → Transform → Transform → Transform → Output
↓ ↓ ↓ ↓ ↓
✓ ✓ ✗ ? ?
│
Bug is here
(first incorrect state)
Find the first point where data becomes incorrect.
Phase 4: Debug Tooling
Debugger Over console.log
// Junior approach: Spray and pray
console.log('here 1');
console.log('data:', data);
console.log('here 2');
console.log('wtf');
// Senior approach: Strategic breakpoints
debugger; // Pause execution
// Inspect call stack
// Inspect variables
// Step through code
// Watch expressions
When to Use Each Tool
| Tool | Use For |
|---|---|
| Debugger | Understanding control flow, inspecting state |
| Logging | Production issues, async flows, distributed systems |
| Profiler | Performance issues, memory leaks |
| Network tab | API issues, request/response inspection |
| Memory snapshot | Memory leaks, object retention |
| Flame graph | CPU profiling, hot paths |
Strategic Logging
// Bad: Noise
console.log('processing');
console.log(data);
// Good: Structured, contextual, actionable
logger.debug('Processing order', {
orderId: order.id,
itemCount: order.items.length,
totalAmount: order.total,
customerId: order.customerId,
timestamp: Date.now()
});
Phase 5: Common Bug Patterns
The Usual Suspects
| Bug Pattern | Symptom | First Check |
|---|---|---|
| Null/undefined | "Cannot read property X" | Check if value exists before use |
| Off-by-one | Wrong count, missing item | Array bounds, loop conditions |
| Race condition | Intermittent failures | Async timing, shared state |
| State mutation | Unpredictable behavior | Who else modifies this state? |
| Cache staleness | Old data showing | Cache invalidation, TTLs |
| Type coercion | Weird comparisons | === vs ==, string vs number |
| Floating point | Math doesn't add up | Use integers for money, compare with epsilon |
| Timezone | Wrong time/date | Store UTC, convert on display |
| Encoding | Garbled text, URL issues | UTF-8, URL encoding |
| Memory leak | Growing memory, OOM | Event listeners, closures, caches |
Null Reference Checklist
// When you see: Cannot read property 'X' of undefined
// Check the chain:
user.profile.settings.theme
// ↓
user // defined?
user.profile // defined?
user.profile.settings // defined?
// Use optional chaining:
user?.profile?.settings?.theme
// Or guard clauses:
if (!user?.profile?.settings) {
return defaultTheme;
}
Race Condition Checklist
// Symptoms:
// - Works sometimes, fails sometimes
// - Works locally, fails in production
// - Works slow, fails fast
// Check for:
// 1. Shared mutable state
// 2. Missing await
// 3. Out-of-order execution
// 4. Stale closures
// Example: Stale closure in React
useEffect(() => {
const interval = setInterval(() => {
setCount(count + 1); // Bug: captures stale count
}, 1000);
return () => clearInterval(interval);
}, []); // Missing dependency
// Fix:
setCount(prev => prev + 1); // Use functional update
Phase 6: Production Debugging
Production is Different
Local: Production:
- Full access - Limited access
- debugger works - No debugger
- See all state - Only logs/metrics
- Can change code - Can't redeploy quickly
- One user - Millions of users
- Clean data - Messy data
Production Debugging Toolkit
| Tool | Purpose |
|---|---|
| Structured logs | Search and filter events |
| Distributed tracing | Follow request across services |
| Error tracking | Aggregate and group errors (Sentry, etc.) |
| Metrics/dashboards | Spot anomalies, correlate timing |
| Feature flags | Disable suspected code |
| Canary deploys | Limit blast radius |
The Production Debugging Flow
1. ASSESS IMPACT
- How many users affected?
- Is it getting worse?
- Can we mitigate without fixing?
2. GATHER EVIDENCE
- Logs around the time of errors
- Metrics showing anomalies
- User reports with context
- Recent deployments
3. FORM HYPOTHESES
- What changed recently?
- What's different about affected users?
- What does the error tell us?
4. SAFE INVESTIGATION
- Read-only queries
- Feature flags to disable
- Increase logging temporarily
- Never debug by deploying fixes blindly
5. FIX AND VERIFY
- Fix in staging first
- Canary deployment
- Monitor metrics
- Confirm user reports stop
Phase 7: Distributed Systems Debugging
Following a Request
┌────────┐ ┌─────────┐ ┌────────┐ ┌────────┐
│ Client │──►│ Gateway │──►│ API │──►│ DB │
└────────┘ └─────────┘ └────────┘ └────────┘
│ │ │ │
└─────────────┴─────────────┴────────────┘
Request ID: abc-123
(trace this through all logs)
Correlation IDs
// Generate at entry point
const requestId = generateRequestId();
// Pass through all calls
await orderService.create(order, { requestId });
await paymentService.charge(amount, { requestId });
await notificationService.send(email, { requestId });
// Include in all logs
logger.info('Processing order', { requestId, orderId });
// Now you can grep:
// grep "abc-123" *.log
Distributed Debugging Checklist
- [ ] Is the request reaching the service?
- [ ] Is the request well-formed?
- [ ] Are downstream services responding?
- [ ] Are timeouts configured correctly?
- [ ] Are retries happening? Too many?
- [ ] Is there network partition?
- [ ] Is there resource exhaustion?
- [ ] Are there version mismatches?
Phase 8: Rubber Duck Debugging
Explain the bug out loud.
"Okay duck, so the user clicks submit, then we call the API,
then we... wait, we don't await the response before redirecting.
Never mind duck, I found it."
This works because:
- Forces you to articulate assumptions
- Slows down your thinking
- Often reveals gaps in understanding
- No actual duck required (but recommended)
Phase 9: When You're Stuck
The Stuck Checklist
1. [ ] Have I actually read the error message?
2. [ ] Am I debugging the right thing?
3. [ ] Have I reproduced it reliably?
4. [ ] Have I isolated the problem?
5. [ ] Have I checked recent changes?
6. [ ] Have I searched for the error?
7. [ ] Have I taken a break?
8. [ ] Have I asked someone else to look?
Fresh Eyes Techniques
- Take a break - Walk away for 15 minutes
- Explain to someone - Rubber duck or real person
- Question assumptions - What are you certain of? Are you sure?
- Start over - Fresh approach, ignore previous attempts
- Sleep on it - Seriously, it works
Phase 10: Post-Debugging
The Fix Checklist
Before committing the fix:
- [ ] Does it actually fix the bug?
- [ ] Does it fix the root cause (not just symptom)?
- [ ] Does it break anything else?
- [ ] Is there a regression test?
- [ ] Is the fix the simplest solution?
- [ ] Would future maintainers understand it?
Write a Regression Test
// The bug you just fixed should never come back
describe('UserList', () => {
it('handles empty user array without crashing', () => {
// This was the bug: .map() on undefined
const { container } = render(<UserList users={[]} />);
expect(container).toBeInTheDocument();
});
});
Document for Future
// If the bug was subtle, leave a comment
// Bug fix: users can be undefined during initial load
// See: JIRA-1234, postmortem doc
const userList = users ?? [];
Quick Reference
The 5 Whys
Problem: Website is down
Why? → Server crashed
Why? → Out of memory
Why? → Memory leak
Why? → Event listeners not cleaned up
Why? → Missing cleanup in useEffect
Root cause: Missing useEffect cleanup
Debugging Commands Cheat Sheet
# Git
git bisect start/good/bad/reset # Binary search commits
git log --oneline -20 # Recent commits
git diff HEAD~5 # Changes in last 5 commits
git blame <file> # Who changed what
# Logs
tail -f /var/log/app.log # Follow logs
grep -r "error" ./logs # Search logs
journalctl -u myservice -f # System logs
# Network
curl -v https://api.example.com # Verbose request
nc -zv host port # Check connectivity
tcpdump -i any port 8080 # Packet capture
# Process
ps aux | grep node # Find processes
lsof -i :3000 # What's on port 3000
top -o mem # Memory usage
Rules Reference
| Rule | File |
|---|---|
| Reproduce Before Fix | rules/reproduce-before-fix.md |
| Binary Search First | rules/binary-search-first.md |
| Read Error Messages | rules/read-error-messages.md |
| Check Recent Changes | rules/check-recent-changes.md |
| One Change at a Time | rules/one-change-at-a-time.md |
| Verify the Fix | rules/verify-the-fix.md |
| Write Regression Test | rules/write-regression-test.md |
| Document Root Cause | rules/document-root-cause.md |
| Never Debug Blind | rules/never-debug-blind.md |
| Question Assumptions | rules/question-assumptions.md |
References
| Topic | File |
|---|---|
| Production Debugging | references/production-debugging.md |
| Performance Debugging | references/performance-debugging.md |
| Memory Leak Detection | references/memory-leak-detection.md |
| Distributed Tracing | references/distributed-tracing.md |
| Common Bug Patterns | references/common-bug-patterns.md |
| Debugging Tools Guide | references/debugging-tools.md |
Weekly Installs
2
Repository
tawanorg/skillsFirst Seen
Mar 2, 2026
Security Audits
Installed on
opencode2
gemini-cli2
claude-code2
github-copilot2
codex2
kimi-cli2