thinking-five-whys-plus
Five Whys Plus
Overview
The Five Whys technique from Toyota Production System is powerful but often misapplied. This enhanced version adds explicit guards against common failures: premature stopping, single-cause bias, blame-oriented thinking, and confirmation bias. It transforms a simple technique into a rigorous root cause methodology.
Core Principle: Keep asking "why" until you reach actionable root causes, but guard against the technique's known failure modes.
When to Use
- Incident post-mortems
- Bug investigations
- Process failures
- Customer complaints
- Recurring problems
- Any situation where you need root cause, not just proximate cause
Decision flow:
Problem occurred?
→ Is the cause obvious and verified? → yes → Fix directly
→ Need to find root cause? → yes → APPLY FIVE WHYS PLUS
→ Is this a complex multi-factor problem? → yes → Consider Kepner-Tregoe PA
Standard Five Whys Failure Modes
| Failure Mode | Description | Guard |
|---|---|---|
| Premature stopping | Accepting first plausible cause | Minimum depth + actionability test |
| Single-cause bias | Assuming one root cause | Branch on "what else?" |
| Blame orientation | Stopping at human error | "Why was error possible?" |
| Confirmation bias | Finding expected cause | Devil's advocate review |
| Circular reasoning | Why loops back on itself | Detect and break cycles |
| Speculation depth | Going beyond evidence | Evidence requirement |
The Five Whys Plus Process
Step 1: State the Problem Precisely
Bad: "The system was slow" Good: "API response times exceeded 2 seconds for 30% of requests between 14:00-14:45 UTC on January 15"
Problem Statement:
- What happened: [Specific observable symptom]
- When: [Time range]
- Where: [Affected systems/users]
- Extent: [Scope and severity]
- Impact: [Business/user impact]
Step 2: Apply "Why" with Evidence Requirement
For each "why," require evidence:
Why #1: Why did [problem] occur?
Answer: [Hypothesis]
Evidence: [Data, logs, metrics that support this]
Confidence: [High/Medium/Low]
Evidence types:
- Logs showing the event
- Metrics correlating with timeline
- Code showing the behavior
- Configuration proving the state
- Testimony from multiple sources
Step 3: Branch on "What Else?"
After each "why," explicitly ask "what else could cause this?"
Why #1: Why did API response times spike?
Primary answer: Database queries were slow
Evidence: DB query times increased from 50ms to 1.5s
What else could cause this?
- [ ] Network latency (checked: normal)
- [ ] Application code changes (checked: none deployed)
- [ ] Memory pressure (checked: normal)
- [ ] External API dependencies (checked: normal)
→ Proceeding with database queries as verified cause
Step 4: Apply "Why Was This Possible?" for Human Error
Never stop at "human error" or "someone made a mistake."
BAD chain:
Why did the outage occur? → Config was wrong
Why was config wrong? → Engineer made a typo
→ STOP (blames human)
GOOD chain:
Why did the outage occur? → Config was wrong
Why was config wrong? → Engineer made a typo
Why was a typo possible? → No validation on config changes
Why was there no validation? → Config system doesn't support schemas
Why doesn't it support schemas? → Tech debt, never prioritized
→ ROOT CAUSE: Config validation infrastructure gap
Step 5: Check Stopping Criteria
Only stop when ALL are true:
| Criterion | Question | ✓ |
|---|---|---|
| Actionable | Can we take concrete action on this cause? | |
| Controllable | Is this within our control to fix? | |
| Fundamental | Would fixing this prevent recurrence? | |
| Evidenced | Do we have evidence, not just speculation? | |
| Not-blame | Is this a system issue, not just "someone messed up"? |
Step 6: Verify with Counter-Analysis
Before finalizing, apply devil's advocate:
Proposed root cause: [X]
Counter-analysis:
1. What evidence contradicts this conclusion?
2. What other explanation fits the evidence?
3. Would someone with a different perspective agree?
4. If we fix X, are we confident the problem won't recur?
5. Are we finding what we expected to find? (confirmation bias check)
Enhanced Template
# Five Whys Plus Analysis
## Problem Statement
- **What:** [Specific symptom]
- **When:** [Time range]
- **Where:** [Affected scope]
- **Impact:** [Severity and consequences]
## Why Chain
### Why #1: Why did [problem] occur?
**Answer:**
**Evidence:**
**Confidence:** High / Medium / Low
**What else considered:**
**Ruled out because:**
### Why #2: Why did [answer #1] occur?
**Answer:**
**Evidence:**
**Confidence:**
**What else considered:**
**Ruled out because:**
### Why #3: Why did [answer #2] occur?
**Answer:**
**Evidence:**
**Confidence:**
**What else considered:**
**Ruled out because:**
[Continue as needed...]
## Stopping Criteria Check
- [ ] Actionable: We can take concrete action
- [ ] Controllable: Within our control
- [ ] Fundamental: Prevents recurrence
- [ ] Evidenced: Supported by data
- [ ] System-focused: Not blaming individuals
## Counter-Analysis
**Contradicting evidence:**
**Alternative explanations:**
**Confirmation bias check:**
**Confidence in conclusion:**
## Root Causes Identified
1. [Primary root cause]
2. [Contributing factor if applicable]
## Recommended Actions
| Action | Addresses | Owner | Timeline |
|--------|-----------|-------|----------|
| | | | |
## Verification Plan
How will we know the fix worked?
Example: Production Outage
# Five Whys Plus: Payment Service Outage
## Problem Statement
- What: Payment service returned 500 errors
- When: 2024-01-15 14:00-14:45 UTC
- Where: Production, US-East region
- Impact: 2,400 failed transactions, ~$180K revenue impact
## Why Chain
### Why #1: Why did payment service return 500 errors?
**Answer:** Database connection pool exhausted
**Evidence:** Connection pool metrics showed 100/100 in use, logs show "connection wait timeout"
**Confidence:** High
**What else considered:**
- Application bugs (no recent deploys)
- Memory issues (heap normal)
- Network problems (latency normal)
### Why #2: Why was connection pool exhausted?
**Answer:** Queries taking 10x longer than normal
**Evidence:** P99 query time went from 50ms to 500ms at 14:00
**Confidence:** High
**What else considered:**
- Connection leak (connection count stable before incident)
- Sudden traffic spike (traffic was normal)
### Why #3: Why were queries taking 10x longer?
**Answer:** Missing index on payment_status table
**Evidence:** EXPLAIN shows sequential scan on 10M row table
**Confidence:** High
**What else considered:**
- Lock contention (no blocking locks)
- DB resource exhaustion (CPU/memory normal)
### Why #4: Why was the index missing?
**Answer:** Migration to add index was rolled back 2 weeks ago
**Evidence:** Deployment logs show rollback on 2024-01-01
**Confidence:** High
### Why #5: Why was the migration rolled back?
**Answer:** Migration timed out during deploy window
**Evidence:** Deploy log shows "migration timeout after 30 minutes"
### Why #6: Why did migration timeout?
**Answer:** Table too large for online migration in current window
**Evidence:** Table has 10M rows, online migration takes ~2 hours
**Confidence:** High
### Why #7 (System-level): Why wasn't this caught before impact?
**Answer:** No alerting on query performance degradation
**Evidence:** No alerts fired until connection pool exhausted
## Stopping Criteria Check
- [x] Actionable: Can add index, fix alerting
- [x] Controllable: Within our control
- [x] Fundamental: Index prevents query issue, alerting prevents impact
- [x] Evidenced: All steps have supporting data
- [x] System-focused: Process and tooling issues, not blame
## Root Causes Identified
1. **Primary:** Index migration process doesn't handle large tables
2. **Contributing:** No alerting on query latency before connection exhaustion
## Recommended Actions
| Action | Addresses | Owner | Timeline |
|--------|-----------|-------|----------|
| Implement online index creation tool | Root cause 1 | Platform | 2 weeks |
| Add query latency alerting | Root cause 2 | SRE | 1 week |
| Create index during maintenance window | Immediate fix | DBA | Tonight |
Common Patterns to Catch
The Blame Stop
BAD: "Why did it fail?" → "Engineer didn't test properly" → STOP
BETTER: → "Why was it possible to deploy without proper testing?"
→ "Why doesn't the pipeline enforce testing?"
→ System/process root cause
The Premature Technical Stop
BAD: "Why was it slow?" → "Query was inefficient" → STOP
BETTER: → "Why was an inefficient query in production?"
→ "Why didn't code review catch it?"
→ "Why don't we have query performance testing?"
The Circular Why
DETECT: "Why A?" → "Because B" → "Why B?" → "Because A"
BREAK: Introduce external evidence or third factor
The Speculation Dive
DETECT: Answers become increasingly speculative without evidence
BREAK: "What evidence do we have for this?"
If none, mark as hypothesis and seek evidence
Verification Checklist
- Problem stated with specific details (what, when, where, extent)
- Each "why" has supporting evidence
- "What else?" asked at each branch point
- Didn't stop at human error—asked "why was error possible?"
- Stopping criteria all satisfied
- Counter-analysis performed
- Root cause is actionable and controllable
- Actions address root cause, not just symptoms
Key Questions
- "What evidence supports this answer?"
- "What else could explain this?"
- "Why was this mistake/error/failure possible?"
- "If we stop here, will the problem actually be prevented?"
- "Are we finding what we expected, or what the evidence shows?"
- "Would someone outside our team reach the same conclusion?"
Ohno's Wisdom (Extended)
Taiichi Ohno said: "By asking 'why' five times and answering each time, we can get to the real cause of the problem."
The extension: Five is not magic. The real guidance is:
- Keep asking until you reach something actionable
- But don't speculate past your evidence
- And never stop at human blame
The technique is simple. Applying it well requires discipline.