debugging
Mode: Cognitive/Prompt-Driven — No standalone utility script; use via agent context.
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
Iron Laws
- NEVER propose or implement a fix before completing Phase 1 root cause investigation — a fix without root cause is a guess that will fail or create new bugs.
- ALWAYS reproduce the bug reliably before debugging — if you can't reproduce it consistently, you're not debugging the real issue.
- NEVER make more than one change at a time when testing a hypothesis — multiple simultaneous changes make it impossible to determine which change fixed the problem.
- ALWAYS stop and question the architecture after 3 failed fix attempts — if each fix reveals a new problem, the issue is architectural, not symptomatic.
- NEVER skip creating a failing test case before implementing the fix — without a test, you cannot verify the fix worked or that it won't regress.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
When to Use: debugging vs smart-debug
| Scenario | Use debugging | Use smart-debug |
|---|---|---|
| Simple, locally reproducible bug | Yes | Overkill |
| Root cause area already known | Yes | Optional |
| Static analysis / code review bug | Yes | No |
| Runtime / production issue | Start here | Preferred |
| Intermittent / hard-to-reproduce | Escalate | Yes |
| Needs hypothesis ranking gate | No | Yes (blocking) |
| Needs instrumentation + log analysis | No | Yes |
| Observability-driven (traces, APM) | No | Yes |
Rule of thumb: Start with debugging for straightforward bugs. Escalate to smart-debug when you need hypothesis ranking, structured instrumentation, or the bug is intermittent/production-only.
See also: .claude/skills/smart-debug/SKILL.md
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
- Don't skip past errors or warnings
- They often contain the exact solution
- Read stack traces completely
- Note line numbers, file paths, error codes
2. Reproduce Consistently
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible - gather more data, don't guess
3. Check Recent Changes
- What changed that could cause this?
- Git diff, recent commits
- New dependencies, config changes
- Environmental differences
4. Gather Evidence in Multi-Component Systems
WHEN the system has multiple components (CI → build → signing, API → service → database):
BEFORE proposing fixes, add diagnostic instrumentation.
For EACH component boundary:
- Log what data enters the component
- Log what data exits the component
- Verify environment/config propagation
- Check state at each layer

Run once to gather evidence showing WHERE it breaks, THEN analyze the evidence to identify the failing component, THEN investigate that specific component.

Example (multi-layer system):

```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Report SET/UNSET without echoing the secret value itself
[ -n "${IDENTITY:-}" ] && echo "IDENTITY: SET" || echo "IDENTITY: UNSET"

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer fails (e.g. secrets → workflow OK, workflow → build FAIL).

For distributed/microservice systems, prefer OpenTelemetry traces:

```bash
# Query traces by component (preferred over manual echo/env logging)
pnpm trace:query --component <service-name> --event <event-name> --since <ISO-8601> --limit 200

# When the trace ID is already known
pnpm trace:query --trace-id <traceId> --compact --since <ISO-8601> --limit 200
```

Fragmented traces (each service has its own root span, trace IDs don't match across boundaries) = broken context propagation. Fix `traceparent`/`tracestate` header forwarding before investigating business logic.

Instrumentation Gate (before hypothesis generation): If runtime behavior remains unclear after static analysis, add targeted log statements at key decision nodes before generating hypotheses. Use session-scoped log files (`.claude/context/tmp/debug-{sessionId}.log`) to capture runtime state. Human-in-the-loop: ask the user to reproduce the bug after instrumentation is added, before analyzing the results. Only proceed to Phase 2 once runtime evidence is collected.
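A minimal sketch of such session-scoped instrumentation, assuming a Node.js/TypeScript codebase; the `debugLog` helper and its call sites are illustrative, not an existing API:

```ts
// debug-instrumentation.ts (temporary; remove during Phase 4 Cleanup)
import { appendFileSync, mkdirSync } from "node:fs";
import { dirname } from "node:path";

const SESSION_ID = process.env.DEBUG_SESSION_ID ?? "manual";
const LOG_PATH = `.claude/context/tmp/debug-${SESSION_ID}.log`;

// Append a timestamped entry capturing runtime state at a key decision node.
export function debugLog(point: string, state: unknown): void {
  mkdirSync(dirname(LOG_PATH), { recursive: true });
  appendFileSync(
    LOG_PATH,
    `${new Date().toISOString()} [${point}] ${JSON.stringify(state)}\n`
  );
}

// Hypothetical call sites at component boundaries:
// debugLog("order-service:received", { orderId, hasPayload: payload != null });
// debugLog("order-service:validated", { orderId, valid });
```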
5. Trace Data Flow
WHEN error is deep in call stack:
See `root-cause-tracing.md` in this directory for the complete backward tracing technique. Quick version:
- Where does bad value originate?
- What called this with bad value?
- Keep tracing up until you find the source
- Fix at source, not at symptom
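For example, a hedged backward-tracing sketch (TypeScript; the functions and the field-name bug are hypothetical):

```ts
// Symptom: TypeError deep in the call stack (amount is undefined).
function formatInvoice(amount: number): string {
  return amount.toFixed(2); // crash site, but NOT where the bug lives
}

// One caller up: who passed the bad value?
function buildInvoice(order: { total?: number }): string {
  return formatInvoice(order.total!); // forwards a possibly-missing value
}

// Keep tracing upward: where does `total` originate?
function parseOrder(raw: string): { total?: number } {
  const data = JSON.parse(raw);
  return { total: data.totals }; // root cause: the field is `total`, not `totals`
}

// Fix at the source (parseOrder), not at the symptom (formatInvoice).
```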
Phase 2: Pattern Analysis
Find the pattern before fixing:
1. Find Working Examples
- Locate similar working code in same codebase
- What works that's similar to what's broken?
2. Compare Against References
- If implementing pattern, read reference implementation COMPLETELY
- Don't skim - read every line
- Understand the pattern fully before applying
3. Identify Differences
- What's different between working and broken?
- List every difference, however small
- Don't assume "that can't matter"
4. Understand Dependencies
- What other components does this need?
- What settings, config, environment?
- What assumptions does it make?
Phase 3: Hypothesis and Testing
Scientific method:
1. Form Single Hypothesis
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
2. Test Minimally
- Make the SMALLEST possible change to test hypothesis
- One variable at a time
- Don't fix multiple things at once
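For illustration, a hedged sketch of a minimal single-hypothesis probe (TypeScript; the `cache` module and the TTL-units hypothesis are hypothetical):

```ts
// Hypothesis: "Stale reads happen because the cache layer interprets TTL as
// milliseconds while callers pass seconds." Probe that one variable only.
import { cache } from "./cache"; // hypothetical module under suspicion

async function probeTtlUnits(): Promise<void> {
  await cache.set("probe-key", "value", 2); // caller intent: 2 seconds
  await new Promise((resolve) => setTimeout(resolve, 10)); // wait 10 ms
  const hit = await cache.get("probe-key");
  console.log(
    hit === undefined
      ? "Entry gone after 10 ms: TTL read as milliseconds (hypothesis confirmed)"
      : "Entry still present: hypothesis rejected, form a new one"
  );
}

probeTtlUnits().catch(console.error);
```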
3. Verify Before Continuing
- Did it work? Yes → Phase 4
- Didn't work? Form NEW hypothesis
- DON'T add more fixes on top
4. When You Don't Know
- Say "I don't understand X"
- Don't pretend to know
- Ask for help
- Research more
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create Failing Test Case
- Simplest possible reproduction
- Automated test if possible
- One-off test script if no framework
- MUST have before fixing
- Use the `tdd` skill for writing proper failing tests
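A minimal failing-test sketch, assuming a Vitest-style runner; `parseOrder` continues the hypothetical example from Phase 1:

```ts
// parseOrder.test.ts: reproduces the bug BEFORE the fix is written.
import { describe, it, expect } from "vitest";
import { parseOrder } from "./parseOrder"; // hypothetical module from the trace above

describe("parseOrder", () => {
  it("reads the order total from the `total` field", () => {
    const raw = JSON.stringify({ id: "A-1", total: 19.99 });
    // Fails today (total comes back undefined); passes once the root cause is fixed.
    expect(parseOrder(raw).total).toBe(19.99);
  });
});
```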
2. Implement Single Fix
- Address the root cause identified
- ONE change at a time
- No "while I'm here" improvements
- No bundled refactoring
3. Verify Fix
- Test passes now?
- No other tests broken?
- Issue actually resolved?
4. Cleanup
- Remove all instrumentation added for this debug session (log statements, temporary diagnostics)
- Verify cleanup: grep for the session debug ID or instrumentation markers to confirm no debug artifacts remain in production code
- Example: `rg "debug-{sessionId}" --type-add 'src:*.{js,ts,cjs,mjs}' -tsrc .`
5. If Fix Doesn't Work
- STOP
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new information
- If >= 3: STOP and question the architecture (step 6 below)
- DON'T attempt Fix #4 without architectural discussion
6. If 3+ Fixes Failed: Question Architecture
Pattern indicating architectural problem:
- Each fix reveals new shared state/coupling/problem in different place
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
STOP and question fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it through sheer inertia"?
- Should we refactor architecture vs. continue fixing symptoms?
Discuss with your human partner before attempting more fixes
This is NOT a failed hypothesis - this is a wrong architecture.
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4, Step 6)
Your Human Partner's Signals You're Doing It Wrong
Watch for these redirections:
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working
When you see these: STOP. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms does not equal understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
When Process Reveals "No Root Cause"
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
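If investigation genuinely lands in that remaining 5%, a hedged sketch of appropriate handling (TypeScript; retry count, backoff, and logger are illustrative):

```ts
// Bounded retry with evidence logging for a confirmed-external failure.
async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Keep enough context for the future investigation mentioned above.
      console.warn(`attempt ${attempt}/${attempts} failed`, err);
      await new Promise((resolve) => setTimeout(resolve, 250 * attempt));
    }
  }
  throw lastError;
}
```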
Supporting Techniques
These techniques are part of systematic debugging and available in this directory:
- `root-cause-tracing.md` - Trace bugs backward through the call stack to find the original trigger
- `defense-in-depth.md` - Add validation at multiple layers after finding the root cause
- `condition-based-waiting.md` - Replace arbitrary timeouts with condition polling
- find-polluter - For test pollution bisection (flaky tests due to shared state): run `.claude/tools/analysis/find-polluter/find-polluter.sh` (or `find-polluter.ps1` on Windows) from the project root to isolate which test pollutes the suite.
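As an illustration of the condition-based-waiting technique listed above, a minimal TypeScript sketch (timeout and interval values are illustrative):

```ts
// Poll for the real condition instead of sleeping for an arbitrary duration.
async function waitFor(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs} ms`);
}

// Hypothetical usage: await waitFor(() => server.isReady());
```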
Related skills:
- tdd - For creating failing test case (Phase 4, Step 1)
- verification-before-completion - Verify fix worked before claiming success
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common
AI-Assisted Debugging & Modern Observability (2025+)
OpenTelemetry: The New Stack Trace
For distributed systems, OpenTelemetry traces replace manual echo/env evidence gathering. A trace shows the complete request journey across service boundaries via span IDs and trace IDs (W3C Trace Context standard: traceparent/tracestate headers).
Evidence hierarchy for distributed failures (prefer in order):
1. Distributed traces (OpenTelemetry spans, correlated trace IDs)
2. Structured logs with correlation IDs
3. Metrics with timestamps
4. Manual instrumentation (Phase 1 Step 4 bash examples)
Common symptom — fragmented traces: Each service shows its own root span, trace IDs don't match across boundaries. This means context propagation is broken — fix header forwarding before investigating business logic.
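A hedged sketch of what correct propagation looks like at one service boundary (TypeScript; a real OpenTelemetry SDK instruments HTTP clients automatically, so this manual version exists only to make the mechanism visible):

```ts
// Forward W3C Trace Context headers across a service boundary so every hop
// shares one trace ID. A real SDK starts a child span and sends its own span
// ID as the parent; this sketch only shows header continuity.
async function callDownstream(incoming: Headers, url: string): Promise<Response> {
  const headers = new Headers({ "content-type": "application/json" });
  const traceparent = incoming.get("traceparent"); // "00-<32-hex trace-id>-<16-hex span-id>-01"
  const tracestate = incoming.get("tracestate");
  if (traceparent) headers.set("traceparent", traceparent);
  if (tracestate) headers.set("tracestate", tracestate);
  return fetch(url, { headers });
}
```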
AI-Assisted Root Cause Analysis
LLM-based debugging agents (2025 pattern) augment Phase 1 by reading production traces and correlating with codebase context to suggest minimal reproduction cases.
Use AI assistance for:
- High-complexity distributed failures with multi-service blast radius
- On-call incidents requiring rapid root cause identification
- Converting production traces into deterministic test reproducers
Do NOT skip Phase 1 when using AI assistance. AI suggestions are hypotheses — apply Phase 3 (hypothesis testing) before implementing any AI-suggested fix. AI cannot replace systematic investigation; it accelerates evidence gathering.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| "Quick fix for now, investigate later" | The quick fix becomes permanent; the root cause resurfaces as a different symptom | Always complete Phase 1 before touching production code |
| Making multiple changes at once | Can't determine which change fixed or broke the system; creates regressions | One change per hypothesis test; verify before the next change |
| Proposing AI-suggested fixes without testing | AI suggestions are hypotheses, not facts; applying them blindly skips Phase 3 | Treat AI suggestions as hypotheses to test, not answers to implement |
| Attempting a 4th fix after 3 failures | N+1 fix attempts on a broken approach compound the problem | After 3 failed fixes, escalate to architecture review |
| Skipping the failing test before the fix | You can't verify the fix worked, and regressions are invisible | Create the failing test first; it proves root cause and verifies fix |
Memory Protocol (MANDATORY)
Before starting:
Read .claude/context/memory/learnings.md
After completing:
- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`
ASSUME INTERRUPTION: If it's not in memory, it didn't happen.