5-Whys Root Cause Analysis
5-Whys Root Cause Analysis
A systematic technique for drilling down through symptoms to uncover the true root cause of a problem by repeatedly asking "Why?" until the fundamental issue is revealed.
When to Use This Skill
- Bug investigation where the obvious fix didn't work
- Production incidents requiring post-mortem analysis
- Performance problems with unclear origins
- Recurring issues that keep coming back after "fixes"
- System failures requiring prevention, not just recovery
- Any situation where treating symptoms isn't enough
Core Process
Phase 1: Define the Problem Clearly
State the problem as a specific, observable fact:
Good problem statements:
- "The API response time increased from 50ms to 500ms"
- "Users are seeing 500 errors on the checkout page"
- "The nightly job failed at 3:00 AM"
Poor problem statements:
- "The system is slow" (too vague)
- "Something is broken" (not specific)
- "Users are unhappy" (symptom, not problem)
Problem Statement Template:
What: [Specific observable behavior]
When: [Time/conditions when it occurs]
Where: [Component/system affected]
Impact: [Measurable consequence]
Phase 2: Ask "Why?" Iteratively
For each answer, ask "Why does that happen?" until reaching an actionable root cause:
The 5-Whys Chain:
Problem: [Statement]
↓
Why 1: [First-level cause]
↓
Why 2: [Deeper cause]
↓
Why 3: [Even deeper]
↓
Why 4: [Approaching root]
↓
Why 5: [Root cause - actionable]
Quality Checks for Each "Why":
- Is this answer factual and verifiable?
- Does this explain the previous level?
- Is there evidence supporting this?
- Could there be multiple causes at this level?
Phase 3: Identify the Root Cause
A true root cause has these characteristics:
| Characteristic | Test |
|---|---|
| Actionable | Can we do something about it? |
| Preventable | Would fixing this prevent recurrence? |
| Fundamental | Asking "why" again yields nothing actionable |
| Verifiable | Can we prove this is the cause? |
Stop Conditions:
- Reached a process/policy that can be changed
- Found a missing control or check
- Identified a knowledge/training gap
- Discovered a design flaw
- Hit a resource constraint decision
Phase 4: Validate the Chain
Work backwards through the chain:
If [Root Cause] is fixed
→ Then [Why 4] wouldn't happen
→ Then [Why 3] wouldn't happen
→ Then [Why 2] wouldn't happen
→ Then [Why 1] wouldn't happen
→ Then [Problem] wouldn't occur
If the chain breaks at any point, revisit that level.
Phase 5: Define Countermeasures
For the root cause, define:
- Immediate fix - Stop the bleeding
- Preventive measure - Ensure it never happens again
- Detection mechanism - Catch it early if prevention fails
Output Format
## 5-Whys Analysis: [Problem Title]
### Problem Statement
**What:** [Specific behavior]
**When:** [Time/conditions]
**Where:** [Component]
**Impact:** [Consequence]
### Why Chain
| Level | Question | Answer | Evidence |
|-------|----------|--------|----------|
| Why 1 | Why did [problem] occur? | [Answer] | [Evidence] |
| Why 2 | Why did [Why 1 answer]? | [Answer] | [Evidence] |
| Why 3 | Why did [Why 2 answer]? | [Answer] | [Evidence] |
| Why 4 | Why did [Why 3 answer]? | [Answer] | [Evidence] |
| Why 5 | Why did [Why 4 answer]? | [Answer] | [Evidence] |
### Root Cause
**Identified cause:** [Statement]
**Type:** [Process/Design/Knowledge/Resource]
**Confidence:** [High/Medium/Low]
### Validation
- [Root cause fixed] → [Why 4 prevented] ✓
- [Why 4 prevented] → [Why 3 prevented] ✓
- ... chain validates ...
### Countermeasures
| Type | Action | Owner | Timeline |
|------|--------|-------|----------|
| Immediate | [Action] | [Who] | [When] |
| Preventive | [Action] | [Who] | [When] |
| Detection | [Action] | [Who] | [When] |
Common Pitfalls
Pitfall 1: Stopping Too Early
Symptom: Root cause is still a symptom
Problem: Server crashed
Why 1: Out of memory
→ "Fix: Add more memory" ❌ (Treating symptom)
Continue:
Why 2: Memory leak in service X
Why 3: Connection pool not releasing connections
Why 4: Exception handler not closing connections
Why 5: No finally block in database code
→ Fix: Add proper resource cleanup ✓
Pitfall 2: Blame Instead of Cause
Wrong: "Why? → Developer made a mistake" Right: "Why? → No code review caught the issue" Even better: "Why? → No automated test for this case"
Rule: Focus on process and systems, not individuals.
Pitfall 3: Single Thread When Multiple Causes Exist
Sometimes problems have multiple contributing factors:
Problem: Deployment failed
↓
Why 1: Database migration timed out
├─→ Branch A: Why did migration take so long?
│ └─→ Table lock held too long
│ └─→ Long-running query
│ └─→ Missing index
│
└─→ Branch B: Why is timeout so short?
└─→ Default timeout used
└─→ No deployment-specific config
Pitfall 4: Unverified Assumptions
Each "why" should be supported by evidence:
| Level | Answer | Evidence Required |
|---|---|---|
| Why 1 | "Service crashed" | Logs showing crash |
| Why 2 | "OOM killed" | dmesg/system logs |
| Why 3 | "Memory leak" | Heap dump analysis |
| Why 4 | "Unclosed streams" | Code inspection |
| Why 5 | "Missing finally" | Git blame |
Integration with Other Tools
| Tool | When to Combine |
|---|---|
| First Principles | When questioning if the problem definition itself is right |
| Hypothesis Testing | When evidence for a "why" is uncertain |
| Pre-mortem | After fixing, to prevent similar issues |
| Trade-off Analysis | When choosing between countermeasures |
Boundaries
Will:
- Systematically trace problems to root causes
- Ensure each level is evidence-based
- Identify actionable countermeasures
- Handle multi-branch cause trees
Will Not:
- Stop at blame ("human error")
- Accept vague answers without evidence
- Guarantee exactly 5 levels (might be 3, might be 7)
- Replace detailed debugging when code inspection is needed
Quick Reference
The 5-Whys Checklist:
- Problem stated specifically and measurably
- Each "why" is factual, not assumed
- Evidence supports each level
- Root cause is actionable and preventable
- Chain validates when traced backwards
- Countermeasures address root cause, not symptoms
- Process/system focus, not blame
Additional Resources
Reference Files
references/toyota-origins.md- History and principles from Toyota Production Systemreferences/software-patterns.md- Common root cause patterns in software
Example Files
examples/production-incident.md- Complete analysis of a production outageexamples/performance-regression.md- Tracing a performance degradation