root-cause-analysis
Root Cause Analysis
Symptom → hypothesis formation → evidence gathering → elimination → root cause → verified fix.
<when_to_use>
- Diagnosing system failures or unexpected behavior
- Investigating incidents or outages
- Finding the actual cause vs surface symptoms
- Preventing recurrence through understanding
- Any situation where "why did this happen?" needs answering
NOT for: known issues with documented fixes, simple configuration errors, guessing without evidence
</when_to_use>
<discovery_phase>
Core Questions
| Question | Why it matters |
|---|---|
| What's the symptom? | Exact manifestation of the problem |
| When did it start? | First occurrence, patterns in timing |
| Can you reproduce it? | Consistently, intermittently, specific conditions |
| What changed recently? | Deployments, config, dependencies, environment |
| What have you tried? | Previous fix attempts, their results |
| What are the constraints? | Time budget, what can't be modified |
Confidence Thresholds
| Level | State | Action |
|---|---|---|
| 0-2 | Symptom unclear or can't reproduce | Keep gathering info |
| 3 | Good context, some gaps | Can start hypothesis phase |
| 4+ | Clear picture | Proceed to investigation |
At level 3+, transition to hypothesis formation. Below level 3, keep gathering context.
</discovery_phase>
<hypothesis_formation>
Quality Criteria
| Good Hypothesis | Weak Hypothesis |
|---|---|
| Testable | Too broad ("something's wrong") |
| Falsifiable | Untestable |
| Specific | Contradicts evidence |
| Plausible | Assumes conclusion |
Multiple Working Hypotheses
Generate 2-4 competing theories (a tracking sketch follows this list):
- List each hypothesis with supporting/contradicting evidence
- Rank by likelihood (evidence support, parsimony, testability)
- Design tests to differentiate between them
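Competing hypotheses are easier to rank when tracked as data rather than in your head. A minimal Python sketch; the hypotheses, evidence strings, and scoring rule below are illustrative, not part of this skill:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str                        # specific, falsifiable claim
    supporting: list = field(default_factory=list)
    contradicting: list = field(default_factory=list)
    test: str = ""                        # how it will be confirmed or refuted

    def score(self) -> int:
        # Crude ranking signal: evidence for minus evidence against
        return len(self.supporting) - len(self.contradicting)

hypotheses = [
    Hypothesis("User object never initialized on the guest-checkout path",
               supporting=["NullPointerException in checkout logs"],
               test="log the user object at method entry"),
    Hypothesis("Session store evicts the user under load",
               contradicting=["failure also reproduces with a single user"],
               test="reproduce at idle traffic"),
]

for h in sorted(hypotheses, key=lambda h: h.score(), reverse=True):
    print(f"[{h.score():+d}] {h.statement} → test: {h.test}")
```

A hypothesis with more contradicting than supporting evidence drops to the bottom, which keeps the ranking honest as new evidence arrives.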
</hypothesis_formation>
<evidence_gathering>
Observation Collection
| Category | What to Gather |
|---|---|
| Error manifestation | Exact symptoms, messages, states |
| Reproduction steps | Minimal sequence triggering issue |
| System state | Logs, variables, config at failure time |
| Environment | Versions, platform, dependencies |
| Timing | When started, frequency, patterns |
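System state and environment details are cheapest to capture at the moment of failure rather than reconstructed later. A minimal sketch using only the standard library (the output filename is arbitrary):

```python
import datetime
import json
import platform
import sys

# Capture environment and timing facts once, at failure time
snapshot = {
    "captured_at": datetime.datetime.now().isoformat(),
    "python": sys.version,
    "platform": platform.platform(),
    "argv": sys.argv,
}
with open("failure-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```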
Breadcrumb Analysis
Trace backwards from symptom (see the sketch after this list):
- Last known good state — what was working?
- First observable failure — when did it break?
- Changes between — what's different?
- Root trigger — first thing that went wrong
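When the system is under version control, the "changes between" step reduces to listing commits between the two points. A sketch assuming the good and bad states are reachable as git refs (the ref names are made up):

```python
import subprocess

good = "deploy-2024-03-01"   # hypothetical ref for the last known good state
bad = "deploy-2024-03-02"    # hypothetical ref for the first observed failure

# Every commit in this range is a candidate change to examine or bisect
log = subprocess.run(
    ["git", "log", "--oneline", f"{good}..{bad}"],
    capture_output=True, text=True, check=True,
)
print(log.stdout)
```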
</evidence_gathering>
<hypothesis_testing>
Test Design
For each hypothesis:
- Prediction — if true, what should we observe?
- Test method — how to verify?
- Expected result — what confirms/refutes?
- Time budget — when to move on?
Testing Priorities
| Priority | Strategy |
|---|---|
| First | Quick, non-destructive, local tests |
| Second | Most likely causes, common failures |
| Third | Edge cases, rare failures |
Execution Loop
Baseline → Single variable change → Observe → Document → Iterate
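A sketch of that loop, assuming a hypothetical ./run_repro.sh script that exits non-zero when the failure reproduces; the environment variables are placeholders for whatever single factor is being varied:

```python
import os
import subprocess

def observe(overrides: dict) -> bool:
    """Run the reproduction once; True means the failure was observed."""
    result = subprocess.run(["./run_repro.sh"], env={**os.environ, **overrides})
    return result.returncode != 0

# Baseline: confirm the failure reproduces with nothing changed
assert observe({}), "cannot investigate a failure that does not reproduce"

# Single variable change per run; the variable names are placeholders
for name, value in [("CACHE_ENABLED", "0"), ("POOL_SIZE", "1"), ("RETRIES", "0")]:
    failed = observe({name: value})                                          # Observe
    print(f"{name}={value}: {'still fails' if failed else 'failure gone'}")  # Document
    # Iterate: a change that removes the failure points at the causal variable
```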
</hypothesis_testing>
<elimination_methodology>
Three core techniques:
| Technique | When to Use |
|---|---|
| Binary Search | Large problem space, ordered changes |
| Variable Isolation | Multiple variables, need causation |
| Process of Elimination | Finite set of possible causes |
See elimination-techniques.md for detailed methods.
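As an illustration of the first technique, a binary search over an ordered list of changes (commits, deploys, config revisions), assuming a hypothetical is_broken predicate that replays the reproduction at a given point. It finds the first failing change in O(log n) tests:

```python
def first_bad_change(changes, is_broken):
    """changes is ordered oldest → newest; is_broken(change) replays the repro there."""
    lo, hi = 0, len(changes) - 1
    if not is_broken(changes[hi]):
        raise ValueError("symptom does not reproduce at the newest change")
    if is_broken(changes[lo]):
        return changes[lo]            # broken from the start of the range
    while lo + 1 < hi:                # invariant: changes[lo] good, changes[hi] broken
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid                  # failure already present: look earlier
        else:
            lo = mid                  # still good: look later
    return changes[hi]                # first change where the failure appears
```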
</elimination_methodology>
<time_boxing>
| Phase | Duration | Exit Condition |
|---|---|---|
| Discovery | 5-10 min | Questions answered, can reproduce |
| Hypothesis | 10-15 min | 2-4 testable theories ranked |
| Testing | 15-30 min per hypothesis | Confirmed or ruled out |
| Fix | Variable | Root cause addressed |
| Verification | 10-15 min | Fix confirmed, prevention documented |
If stuck beyond 2x estimate → step back, seek fresh perspective, or escalate.
</time_boxing>
<audit_trail>
Log every step in the format [TIME] PHASE: Action → Result:
- [10:15] DISCOVERY: Gathered error logs → Found NullPointerException
- [10:22] HYPOTHESIS: User object not initialized
- [10:28] TEST: Added null check logging → Confirmed user is null
Benefits: Prevents revisiting same ground, enables handoff, catches circular investigation.
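A minimal helper that emits the format above and appends it to a file so the trail survives a handoff (the file name and phase strings are arbitrary):

```python
from datetime import datetime

def log_step(phase: str, action: str, result: str,
             path: str = "investigation.log") -> None:
    line = f"[{datetime.now():%H:%M}] {phase.upper()}: {action} → {result}"
    print(line)
    with open(path, "a") as f:
        f.write(line + "\n")

log_step("test", "Added null check logging", "Confirmed user is null")
```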
See documentation-templates.md for full templates.
</audit_trail>
<common_pitfalls>
Watch for these patterns:
| Trap | Counter |
|---|---|
| "I already looked at that" | Re-examine with fresh evidence |
| "That can't be the issue" | Test anyway, let evidence decide |
| "We need to fix this quickly" | Methodical investigation is faster |
| Confirmation bias | Actively seek disconfirming evidence |
| Mistaking correlation for causation | Test the direct causal mechanism |
See pitfalls.md for detailed resistance patterns and recovery.
</common_pitfalls>
<confidence_calibration>
| Level | Indicators |
|---|---|
| High | Consistent reproduction, clear cause-effect, multiple confirmations, fix verified |
| Moderate | Reproduces most of the time, strong correlation, single confirmation |
| Low | Inconsistent reproduction, unclear correlation, unverified hypothesis |
</confidence_calibration>
ALWAYS:
- Gather sufficient context before hypothesizing
- Form multiple competing hypotheses
- Test systematically, one variable at a time
- Document investigation trail
- Verify fix actually addresses root cause
- Document for future prevention
NEVER:
- Jump to solutions without diagnosis
- Trust single hypothesis without testing alternatives
- Apply fixes without understanding cause
- Skip verification of fix
- Repeat same failed investigation steps
- Hide uncertainty about root cause
Deep-dive documentation:
- elimination-techniques.md — binary search, variable isolation, process of elimination
- pitfalls.md — cognitive biases and resistance patterns
- documentation-templates.md — investigation logs and RCA reports
Related skills:
- debugging-and-diagnosis — code-specific debugging (loads this skill)
- codebase-analysis — uses for code investigation
- report-findings — presenting investigation results