five-whys-root-cause
# 5 Whys & Root Cause Analysis
Core principle: Every visible problem is a symptom. The root cause is the systemic condition that makes the symptom inevitable. Fix the symptom and it returns. Fix the root cause and the symptom disappears — along with related ones you haven't noticed yet.
## The 5 Whys Method
Start with the problem statement. Ask "why did this happen?" Keep asking why until you reach a cause you can actually fix — one that is actionable, systemic, and doesn't have another why behind it.
### The Process

    Problem: [State clearly and specifically]
    Why 1: [Immediate cause]
    Why 2: [Cause of the cause]
    Why 3: [Deeper cause]
    Why 4: [Systemic cause]
    Why 5: [Root cause: systemic and actionable]

Stop when you reach a cause that is:
- Actionable (you can actually change it)
- Systemic (fixing it prevents recurrence)
- Not just another symptom
Don't stop when:
- The answer is "human error" (that's a symptom — why did human error occur?)
- The answer is "bad luck" (luck is not a root cause)
- The answer blames a person (people operate within systems — blame the system)
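
If you build tooling around this process, the chain is just a list of question/answer pairs plus the two stop criteria. A minimal sketch, assuming Python; every name here is illustrative rather than a published API:

```python
from dataclasses import dataclass, field

# Answers that are symptoms in disguise: never accept them as terminal.
BANNED_ROOTS = ("human error", "bad luck")

@dataclass
class Why:
    question: str              # e.g. "Why did the deployment fail?"
    answer: str                # e.g. "Tests didn't catch the bug"
    actionable: bool = False   # can we actually change this?
    systemic: bool = False     # does fixing it prevent recurrence?

@dataclass
class WhyChain:
    problem: str
    whys: list[Why] = field(default_factory=list)

    def at_root_cause(self) -> bool:
        """The chain terminates when the last answer passes both stop
        criteria and is not a banned pseudo-cause, not merely when
        five questions have been asked."""
        if not self.whys:
            return False
        last = self.whys[-1]
        if any(banned in last.answer.lower() for banned in BANNED_ROOTS):
            return False
        return last.actionable and last.systemic
```

Note that the depth check is deliberately absent: five is a heuristic, and the chain ends on quality of the answer, not on count.
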
## Common Root Cause Categories
| Category | Examples |
|---|---|
| Process gap | No process existed, process was unclear, process wasn't followed |
| Knowledge gap | Team didn't know X, information wasn't shared, no documentation |
| Tooling gap | Wrong tool, missing tool, tool not configured correctly |
| Incentive misalignment | Rewards and desired behavior point in different directions |
| Feedback loop missing | No signal that something was going wrong |
| Assumption failure | A belief about the system was wrong |
| Capacity constraint | Not enough time, people, or resources to do it right |
| Communication failure | Decision made without relevant parties knowing |
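
If you record analyses in tooling, the category column maps naturally onto an enumeration, matching the Category field requested under Output Format below. A sketch with an illustrative name:

```python
from enum import Enum

class RootCauseCategory(Enum):
    PROCESS_GAP = "process gap"
    KNOWLEDGE_GAP = "knowledge gap"
    TOOLING_GAP = "tooling gap"
    INCENTIVE_MISALIGNMENT = "incentive misalignment"
    MISSING_FEEDBACK_LOOP = "feedback loop missing"
    ASSUMPTION_FAILURE = "assumption failure"
    CAPACITY_CONSTRAINT = "capacity constraint"
    COMMUNICATION_FAILURE = "communication failure"
```
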
## Multi-Branch Analysis

The 5 Whys often reveals multiple root causes, not just one. Branch whenever a "why" has more than one answer:

    Problem: Deployment failed
      Why? → Tests didn't catch the bug
        Branch A: Why? → Test coverage was insufficient
          Why? → No policy requiring coverage thresholds
          Root cause A: No coverage gate in CI pipeline
        Branch B: Why? → Test environment differed from production
          Why? → Config was never checked for parity with production
          Root cause B: No config parity validation step

Each branch may have a different fix.
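
Structurally, a branched analysis is a tree whose leaves are the candidate root causes. A minimal sketch of the example above, assuming Python (the class and field names are illustrative, not any library's API):

```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    answer: str
    children: list["WhyNode"] = field(default_factory=list)

    def root_causes(self) -> list[str]:
        """Leaves of the tree are the candidate root causes;
        each branch can terminate in a different one."""
        if not self.children:
            return [self.answer]
        return [c for child in self.children for c in child.root_causes()]

# The deployment example above, as a tree:
tree = WhyNode("Deployment failed", [
    WhyNode("Tests didn't catch the bug", [
        WhyNode("Test coverage was insufficient",
                [WhyNode("No coverage gate in CI pipeline")]),
        WhyNode("Test environment differed from production",
                [WhyNode("No config parity validation step")]),
    ]),
])
print(tree.root_causes())
# -> ['No coverage gate in CI pipeline', 'No config parity validation step']
```
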
## Output Format

### 🔍 Problem Statement
Restate the problem specifically and observably:
- What happened?
- When? How often?
- What's the impact?
- What did not happen (that should have)?
### 🪢 The Why Chain(s)
Present the full chain(s) from symptom to root cause. Use branching if multiple causes exist.
### 🎯 Root Cause(s) Identified
For each root cause:
- Statement: Clear, systemic description
- Category: Process / Knowledge / Tooling / Incentive / Feedback / Assumption / Capacity / Communication
- Recurrence risk: How likely is this to cause the same or similar problem again without a fix?
### 🚫 Rejected Explanations
List causes that were considered but rejected, and why:
- "Human error" → Why did it occur? What enabled it?
- "One-off event" → Why was the system vulnerable to it?
- Personal blame → Replaced with systemic explanation
### 🛠️ Corrective Actions
For each root cause, a specific countermeasure:
- Action: What exactly changes?
- Owner: Who is responsible?
- Verification: How do we know the fix worked?
- Timeline: When?
### 📊 Recurrence Prevention
- What monitoring/alerting detects this root cause activating again?
- What review process catches this category of problem in the future?
- Is this root cause related to any other known issues? (often one root cause explains multiple symptoms)
## Quality Checks for Root Causes
Before accepting a root cause, verify:
- ✅ Reversibility test: If we fixed this, would the problem go away?
- ✅ Recurrence test: If this root cause is present, does the problem reliably occur?
- ✅ Actionability test: Can we actually change this?
- ✅ Systemic test: Does this explain the pattern, not just the one incident?
- ❌ Blame test: If the answer names a person as the root cause, keep asking why
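
These tests can double as a mechanical gate in an incident-review tool. A sketch, assuming yes/no verdicts are captured during the analysis (names are illustrative):

```python
# The four positive tests, encoded as a reusable checklist.
ROOT_CAUSE_TESTS = {
    "reversibility": "If we fixed this, would the problem go away?",
    "recurrence": "When this cause is present, does the problem reliably occur?",
    "actionability": "Can we actually change this?",
    "systemic": "Does this explain the pattern, not just the one incident?",
}

def unmet_tests(verdicts: dict[str, bool]) -> list[str]:
    """Return the questions a candidate root cause still fails;
    accept the candidate only when this list is empty."""
    return [q for name, q in ROOT_CAUSE_TESTS.items() if not verdicts.get(name)]

# Example: a candidate that is actionable but doesn't explain the pattern.
print(unmet_tests({"reversibility": True, "recurrence": True,
                   "actionability": True, "systemic": False}))
# -> ['Does this explain the pattern, not just the one incident?']
```
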
## Common Traps to Avoid
- Stopping too early: "The developer made a mistake" is not a root cause
- Single-cause bias: Most failures have 2–4 contributing root causes
- Hindsight framing: "They should have known" — why didn't the system make it knowable?
- Fixing the last step: Catching the bug in prod is not the fix; preventing the condition that allowed it is
- Symptom substitution: Fixing one symptom without the root cause → a different symptom appears
## Example Walk-Through
Problem: The agent pipeline produced incorrect output on 30% of runs last week.
| Level | Question | Answer |
|---|---|---|
| Why 1 | Why incorrect output? | Agent used stale context from prior step |
| Why 2 | Why stale context? | Context was not cleared between runs |
| Why 3 | Why not cleared? | No reset mechanism in handoff protocol |
| Why 4 | Why no reset mechanism? | Handoff spec was never formalized |
| Why 5 | Why never formalized? | No process requires handoff specs before agent goes to production |
Root cause: No policy requiring formalized handoff specs before production deployment.
Fix: Add a handoff-spec requirement to the agent deployment checklist.
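
Note the shape of the fix: a checklist item that can be enforced mechanically rather than remembered. A hypothetical sketch of such a gate; the file name, layout, and CLI shape are all assumptions for illustration:

```python
from pathlib import Path
import sys

def handoff_spec_present(agent_dir: str) -> bool:
    """Hypothetical deployment-checklist gate: block deployment unless
    the agent ships a formalized handoff spec."""
    spec = Path(agent_dir) / "handoff_spec.md"  # assumed location
    if not spec.is_file():
        print(f"blocked: {spec} is missing; formalize the handoff protocol first")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if handoff_spec_present(sys.argv[1]) else 1)
```
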