root-cause-analysis
Root Cause Analysis
Symptom → hypothesis formation → evidence gathering → elimination → root cause → verified fix.
<when_to_use>
- Diagnosing system failures or unexpected behavior
- Investigating incidents or outages
- Finding the actual cause vs surface symptoms
- Preventing recurrence through understanding
- Any situation where "why did this happen?" needs answering
NOT for: known issues with documented fixes, simple configuration errors, guessing without evidence
</when_to_use>
<discovery_phase>
Core Questions
| Question | Why it matters |
|---|---|
| What's the symptom? | Exact manifestation of the problem |
| When did it start? | First occurrence, patterns in timing |
| Can you reproduce it? | Consistently, intermittently, specific conditions |
| What changed recently? | Deployments, config, dependencies, environment |
| What have you tried? | Previous fix attempts, their results |
| What are the constraints? | Time budget, what can't be modified |
Confidence Thresholds
| Level | State | Action |
|---|---|---|
| 0-2 | Symptom unclear or can't reproduce | Keep gathering info |
| 3 | Good context, some gaps | Can start hypothesis phase |
| 4+ | Clear picture | Proceed to investigation |
At level 3+, transition to hypothesis formation. Below level 3, keep gathering context.
</discovery_phase>
<hypothesis_formation>
Quality Criteria
| Good Hypothesis | Weak Hypothesis |
|---|---|
| Testable | Too broad ("something's wrong") |
| Falsifiable | Untestable |
| Specific | Contradicts evidence |
| Plausible | Assumes conclusion |
Multiple Working Hypotheses
Generate 2-4 competing theories:
- List each hypothesis with supporting/contradicting evidence
- Rank by likelihood (evidence support, parsimony, testability)
- Design tests to differentiate between them
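The ranking step above can be sketched in a few lines. This is only an illustration: the `Hypothesis` class, the `net_evidence` score, and the sample incident are hypothetical, and the score is a crude stand-in for likelihood (it counts evidence items and uses test duration as a tie-breaker; parsimony is not modeled).

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    supporting: list = field(default_factory=list)
    contradicting: list = field(default_factory=list)
    test_minutes: int = 15  # assumed estimate of how long the differentiating test takes

    def net_evidence(self) -> int:
        # Each supporting observation counts +1, each contradicting one counts -1.
        return len(self.supporting) - len(self.contradicting)


def rank(hypotheses):
    # Strongest net evidence first; among equals, the cheaper test goes first.
    return sorted(hypotheses, key=lambda h: (-h.net_evidence(), h.test_minutes))


# Hypothetical incident: intermittent request timeouts.
candidates = [
    Hypothesis("Bad deploy", ["started after release"], ["rollback did not help"], 5),
    Hypothesis("Connection pool exhausted",
               ["timeouts only under load", "pool at max in metrics"], [], 10),
    Hypothesis("DNS flapping", ["failures are intermittent"], [], 20),
]

ranked = rank(candidates)
for h in ranked:
    print(f"{h.net_evidence():+d}  {h.statement}")
```

Keeping the contradicting evidence next to each theory makes it harder to quietly ignore it when a favorite hypothesis emerges.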
</hypothesis_formation>
<evidence_gathering>
Observation Collection
| Category | What to Gather |
|---|---|
| Error manifestation | Exact symptoms, messages, states |
| Reproduction steps | Minimal sequence triggering issue |
| System state | Logs, variables, config at failure time |
| Environment | Versions, platform, dependencies |
| Timing | When started, frequency, patterns |
Breadcrumb Analysis
Trace backwards from symptom:
- Last known good state — what was working?
- First observable failure — when did it break?
- Changes between — what's different?
- Root trigger — first thing that went wrong
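As a rough sketch of the backward trace above, the function below takes a timeline of events and returns the last known good check, the first observed failure, and the changes that landed between them. The `Event` shape, the `kind` values, and the sample timeline are assumptions made for illustration, not a prescribed log format.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: str    # zero-padded "HH:MM" so that string order matches time order
    kind: str    # assumed vocabulary: "check_ok", "check_fail", or "change"
    detail: str


def trace_back(events):
    ordered = sorted(events, key=lambda e: e.time)
    first_fail = next((e for e in ordered if e.kind == "check_fail"), None)
    if first_fail is None:
        return None  # nothing has failed yet, so there is nothing to trace

    # Last successful check strictly before the first failure.
    last_good = None
    for e in ordered:
        if e.time >= first_fail.time:
            break
        if e.kind == "check_ok":
            last_good = e

    # Changes inside the (last good, first failure) window are the root-trigger candidates.
    start = last_good.time if last_good else ""
    changes = [e for e in ordered
               if e.kind == "change" and start < e.time < first_fail.time]
    return last_good, first_fail, changes


timeline = [
    Event("09:00", "check_ok", "health check passed"),
    Event("09:10", "change", "deploy v2"),
    Event("09:20", "check_ok", "health check passed"),
    Event("09:30", "change", "config flag enabled"),
    Event("09:40", "check_fail", "health check failed"),
    Event("09:50", "check_fail", "health check failed"),
]

last_good, first_fail, suspects = trace_back(timeline)
print(last_good.time, first_fail.time, [c.detail for c in suspects])
```

Note that the 09:10 deploy falls outside the window because a check passed after it, which is the point of anchoring on the last known good state.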
</evidence_gathering>
<hypothesis_testing>
Test Design
For each hypothesis:
- Prediction — if true, what should we observe?
- Test method — how to verify?
- Expected result — what confirms/refutes?
- Time budget — when to move on?
Testing Priorities
| Priority | Strategy |
|---|---|
| First | Quick, non-destructive, local tests |
| Second | Most likely causes, common failures |
| Third | Edge cases, rare failures |
Execution Loop
Baseline → Single variable change → Observe → Document → Iterate
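One way to picture the loop above: record a baseline observation, then vary exactly one setting per trial and log what happens. The `observe` callback, the configuration keys, and the failure rule in the example are hypothetical; in practice `observe` would run the real reproduction steps.

```python
def run_single_variable_tests(baseline, candidates, observe):
    """Change one variable at a time against a fixed baseline and log every outcome."""
    log = []
    baseline_result = observe(baseline)
    log.append(("baseline", None, baseline_result))

    for name, value in candidates.items():
        trial = {**baseline, name: value}  # copy of the baseline with exactly one change
        log.append((name, value, observe(trial)))

    # Variables whose change alone altered the outcome relative to the baseline.
    influential = [name for name, _, result in log[1:] if result != baseline_result]
    return influential, log


# Hypothetical system under test: it fails whenever the pool is too small.
def observe(config):
    return "fail" if config["pool_size"] < 10 else "ok"


baseline = {"pool_size": 5, "cache": "on", "tls": "1.2"}
candidates = {"pool_size": 50, "cache": "off", "tls": "1.3"}

influential, log = run_single_variable_tests(baseline, candidates, observe)
for entry in log:
    print(entry)
print("changed the outcome:", influential)
```

Because every trial starts from the same baseline, a changed outcome can be attributed to the one variable that differs.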
</hypothesis_testing>
<elimination_methodology>
Three core techniques:
| Technique | When to Use |
|---|---|
| Binary Search | Large problem space, ordered changes |
| Variable Isolation | Multiple variables, need causation |
| Process of Elimination | Finite set of possible causes |
See elimination-techniques.md for detailed methods.
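For the binary search row, a minimal sketch is shown below. It assumes the changes are ordered, that every change before the culprit tests good and every change from the culprit onward tests bad, and that the newest change is already known to be bad. The `is_bad` callback and the change names are hypothetical placeholders for a real reproduction check.

```python
def first_bad(changes, is_bad):
    """Return the index of the first bad change in an ordered list.

    Precondition (assumed): is_bad is False for every change before the
    culprit and True from the culprit onward, and the last change is bad.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return lo


changes = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
probes = []


def is_bad(change):
    probes.append(change)              # record each test so the cost is visible
    return changes.index(change) >= 5  # hypothetical: the bug arrived with c6


culprit = changes[first_bad(changes, is_bad)]
print(culprit, "found after", len(probes), "tests")
```

Eight candidates are narrowed down in three tests; the same halving idea is what makes the technique attractive for long, ordered change histories.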
</elimination_methodology>
<time_boxing>
| Phase | Duration | Exit Condition |
|---|---|---|
| Discovery | 5-10 min | Questions answered, can reproduce |
| Hypothesis | 10-15 min | 2-4 testable theories ranked |
| Testing | 15-30 min per hypothesis | Confirmed or ruled out |
| Fix | Variable | Root cause addressed |
| Verification | 10-15 min | Fix confirmed, prevention documented |
If stuck beyond 2x estimate → step back, seek fresh perspective, or escalate.
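The 2x rule above is easy to check mechanically. The sketch below assumes the upper bound of each range in the table is the estimate; the `BUDGETS` name and that choice are illustrative, and the Fix phase is left out because its duration is variable.

```python
from datetime import timedelta

# Assumed estimates: the upper bound of each range in the table.
BUDGETS = {
    "discovery": timedelta(minutes=10),
    "hypothesis": timedelta(minutes=15),
    "testing": timedelta(minutes=30),       # per hypothesis
    "verification": timedelta(minutes=15),
}


def should_step_back(phase, elapsed):
    """True once time spent in a phase exceeds twice its estimate."""
    return elapsed > 2 * BUDGETS[phase]


print(should_step_back("discovery", timedelta(minutes=25)))
print(should_step_back("testing", timedelta(minutes=45)))
```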
</time_boxing>
<audit_trail>
Log every step:
[TIME] PHASE: Action → Result
[10:15] DISCOVERY: Gathered error logs → Found NullPointerException
[10:22] HYPOTHESIS: User object not initialized
[10:28] TEST: Added null check logging → Confirmed user is null
Benefits: Prevents revisiting same ground, enables handoff, catches circular investigation.
See documentation-templates.md for full templates.
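A small helper can produce entries in the log format above and flag repeated steps, which is one way to catch a circular investigation. The `AuditTrail` class and its repeat check (same phase and same action text) are assumptions for illustration, not part of the referenced templates.

```python
class AuditTrail:
    def __init__(self):
        self.lines = []
        self._seen = set()

    def record(self, time, phase, action, result=None):
        """Append a '[TIME] PHASE: Action → Result' line; return True if the step is a repeat."""
        key = (phase.upper(), action.strip().lower())
        repeated = key in self._seen
        self._seen.add(key)

        line = f"[{time}] {phase.upper()}: {action}"
        if result is not None:
            line += f" → {result}"
        self.lines.append(line)
        return repeated


trail = AuditTrail()
trail.record("10:15", "discovery", "Gathered error logs", "Found NullPointerException")
trail.record("10:22", "hypothesis", "User object not initialized")
trail.record("10:28", "test", "Added null check logging", "Confirmed user is null")
again = trail.record("10:40", "test", "Added null check logging", "Confirmed user is null")

print("\n".join(trail.lines))
print("repeated step:", again)
```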
</audit_trail>
<common_pitfalls>
Watch for these patterns:
| Trap | Counter |
|---|---|
| "I already looked at that" | Re-examine with fresh evidence |
| "That can't be the issue" | Test anyway, let evidence decide |
| "We need to fix this quickly" | Methodical investigation is usually faster than repeated guesswork |
| Confirmation bias | Actively seek disconfirming evidence |
| Correlation = causation | Test direct causal mechanism |
See pitfalls.md for detailed resistance patterns and recovery.
</common_pitfalls>
<confidence_calibration>
| Level | Indicators |
|---|---|
| High | Consistent reproduction, clear cause-effect, multiple confirmations, fix verified |
| Moderate | Reproduces mostly, strong correlation, single confirmation |
| Low | Inconsistent reproduction, unclear correlation, unverified hypothesis |
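The indicators in the table are qualitative, but they can be approximated with a simple rule for consistent reporting. The numeric cut-offs below (95% and 50% reproduction rates, two or more confirmations) are assumptions chosen for illustration and should be adjusted to the system at hand.

```python
def calibrate(reproduction_rate, confirmations, fix_verified):
    """Map investigation indicators to a confidence level.

    reproduction_rate: fraction of attempts that reproduce the issue (0.0 to 1.0)
    confirmations: number of independent tests confirming the cause
    fix_verified: whether the fix was shown to remove the symptom
    """
    if reproduction_rate >= 0.95 and confirmations >= 2 and fix_verified:
        return "high"
    if reproduction_rate >= 0.5 and confirmations >= 1:
        return "moderate"
    return "low"


print(calibrate(1.0, 3, True))
print(calibrate(0.7, 1, False))
print(calibrate(0.2, 0, False))
```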
</confidence_calibration>
ALWAYS:
- Gather sufficient context before hypothesizing
- Form multiple competing hypotheses
- Test systematically, one variable at a time
- Document investigation trail
- Verify fix actually addresses root cause
- Document for future prevention
NEVER:
- Jump to solutions without diagnosis
- Trust single hypothesis without testing alternatives
- Apply fixes without understanding cause
- Skip verification of fix
- Repeat same failed investigation steps
- Hide uncertainty about root cause
Deep-dive documentation:
- elimination-techniques.md — binary search, variable isolation, process of elimination
- pitfalls.md — cognitive biases and resistance patterns
- documentation-templates.md — investigation logs and RCA reports
Related skills:
- debugging-and-diagnosis — code-specific debugging (loads this skill)
- codebase-analysis — uses this skill for code investigation
- report-findings — presenting investigation results