# RCA

## Purpose
Build a branch-aware causal tree from apparent problem to defensible root causes. Classify each leaf by confidence. Produce concrete actions per cause.
## Interaction Contract
- Ask one question at a time during interview phases.
- Keep the user informed when evidence changes branch direction.
- Confirm assumptions before promoting hypotheses to conclusions.
- If the user defers a question ("ask me later"), record the deferral, continue with the next question, and revisit before closing the interview phase.
- If the user contributes their own Q&A pairs, incorporate them into the tree exactly as you would an answer to your own question. Do not re-ask what the user has already answered.
## Core Rules
- Interview first. No exceptions. Ask user questions to establish the baseline narrative before any codebase inspection. Do NOT open files, search code, or read logs until at least the five interview questions in Workflow §1 are answered.
- One question at a time. Do not comment on, interpret, or editorialize answers — record and ask the next question.
- Questions follow the incident narrative (what happened, why, what changed). Codebase observations can corroborate but must not hijack interview flow.
- Go wide and deep. Never stop at the first plausible cause.
- At every node, ask at least: Why did this happen? and Why wasn't it prevented or detected? This applies to symptoms and causes alike — a contributing cause can itself raise a prevention or detection question that becomes its own branch.
- Separate facts from assumptions. Never present a hypothesis as confirmed without evidence.
- Never assume a cause from partial answers — ask instead.
- Do not finish until every possible/actual root cause has at least one action and every open question has been resolved or explicitly deferred by the user.
## Workflow

### 1. Interview
Ask one question at a time. Cover at minimum:
- What is the apparent problem?
- What failed, who was affected, and how severe was it?
- When did it start — is it ongoing or resolved?
- What evidence exists (logs, metrics, alerts, error messages, timelines)?
- What changed recently (deploys, config, data, dependencies)?
Do not read any files, logs, or code until all five questions above are answered. Never let codebase observations drive the questions.
### 2. Restate
Summarize the apparent problem in one sentence. Confirm with user if ambiguous.
### 3. Build Causal Tree
- Treat each symptom as a node.
- Ask "why?" recursively, adding child causes.
- At any node — symptom or cause — ask both: Why did this happen? and Why wasn't it prevented or detected? Questions are not limited to the top level; a contributing cause can itself raise a question that becomes its own branch.
- Stop when no deeper controllable cause exists or evidence is insufficient.
- For any Process branch, ask: has a similar incident occurred in adjacent systems, features, or teams? A second confirmed instance upgrades a "possible" systemic root cause to "actual (systemic)".
### 4. Label Each Leaf

| Label | Meaning |
|---|---|
| actual root cause | Evidence-backed |
| possible root cause | Plausible, unverified |
| unknown | Insufficient evidence |
Distinguish cause types where useful: proximate, contributing, systemic.
### 5. Produce Document
See Deliverable section.
## Branch Lenses
Apply to expand weak branches:
| Lens | Examples |
|---|---|
| Technical | Code defects, architecture, dependencies, config, infra, networking |
| Data | Bad inputs, schema drift, migrations, stale/incorrect data |
| Process | Change management, rollout, testing, review, incident response, automation |
| Detection | Monitoring gaps, alert tuning, observability blind spots |
| Human/Org | Ownership ambiguity, handoff failures, staffing/load, training |
| External | Third-party outages, vendor/API behavior, environmental constraints |
If a branch is weak, collect evidence or downgrade confidence.
## Evidence and Confidence
For each node record:
| Field | Values |
|---|---|
| Evidence source | logs, metrics, timeline, code diff, interview |
| Confidence | high / medium / low |
| Status | confirmed / hypothesis / unknown |
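As an illustration only (this skill prescribes no code), the per-node record above could be sketched as a small data structure; the class and method names here are hypothetical, chosen to mirror the tables in this document:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node in the causal tree: a symptom, cause, or question branch."""
    node_id: str          # e.g. "P0", "S1", "C1", "R1", "G1"
    statement: str        # what this node asserts
    evidence_source: str  # logs, metrics, timeline, code diff, interview
    confidence: str       # high / medium / low
    status: str           # confirmed / hypothesis / unknown
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaf_label(self) -> str:
        """Map a leaf's status to the label scheme from step 4."""
        return {
            "confirmed": "actual root cause",
            "hypothesis": "possible root cause",
            "unknown": "unknown",
        }[self.status]
```

A leaf's label follows directly from its status, which keeps the step 4 table and the Evidence and Confidence table consistent with each other.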
## Deliverable
One Markdown document with these sections:
- `# Root Cause Analysis: <problem>`
- `## Problem Statement`
- `## Impact and Scope`
- `## Timeline` (if known)
- `## Analysis Tree` — Mermaid causal tree (example below)
- `## Node Details` — node ID, statement, evidence, confidence, status
- `## Identified Root Causes`
- `## Recommended Actions` — one row per cause: ID, action, type, expected effect, priority (P0–P3), owner, target date, verification metric
- `## Open Questions` — only questions that are genuinely unresolvable or that the user explicitly said they cannot answer; never a holding area for questions not yet asked
```mermaid
flowchart TD
    P0["P0: Apparent problem"]
    P0 --> S1["S1: Symptom A"]
    P0 --> S2["S2: Symptom B"]
    S1 --> C1["C1: Proximate cause"]
    C1 --> C2["C2: Contributing cause"]
    C2 --> R1["R1: Actual root cause"]
    C2 --> R2["R2: Possible root cause"]
    R2 --> R3["R3: Actual root cause"]
    S1 --> G1["G1: Why not detected sooner?"]
    G1 --> R4["R4: Actual root cause"]
    C1 --> G2["G2: Why wasn't this prevented?"]
    G2 --> R5["R5: Actual root cause"]
    S2 --> C3["C3: Proximate cause"]
    C3 --> R6["R6: Actual root cause"]
```
Action types: containment · mitigation · corrective · preventive
## Stop Condition
Stop only when all three conditions are met:
- Every root-cause branch is labeled (actual root cause, possible root cause, or unknown).
- Every branch has at least one action.
- Every open question has been resolved or the user has explicitly said they cannot answer it.
Do not stop while questions remain unanswered — continue the interview. The Open Questions section is only for questions that are genuinely unresolvable (missing evidence, dependencies outside your control) or that the user has explicitly deferred; it is not a place to park questions you have not yet asked.
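For illustration, the three stop conditions can be checked mechanically over a tree of labeled nodes; the dict shape and function names below are hypothetical sketches, not part of the skill:

```python
# Hypothetical shapes:
#   node:     {"label": str | None, "actions": [str], "children": [node]}
#   question: {"resolved": bool, "deferred": bool}

VALID_LABELS = {"actual root cause", "possible root cause", "unknown"}

def leaves(node):
    """Yield every leaf of the causal tree, depth-first."""
    if not node["children"]:
        yield node
    else:
        for child in node["children"]:
            yield from leaves(child)

def may_stop(tree, open_questions):
    """True only when all three stop conditions hold."""
    all_leaves = list(leaves(tree))
    every_leaf_labeled = all(leaf["label"] in VALID_LABELS for leaf in all_leaves)
    every_leaf_actioned = all(leaf["actions"] for leaf in all_leaves)
    questions_closed = all(q["resolved"] or q["deferred"] for q in open_questions)
    return every_leaf_labeled and every_leaf_actioned and questions_closed
```

Any unlabeled leaf, action-free leaf, or open non-deferred question keeps the analysis running.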
## Continuing After the Deliverable
If the user opens a new line of inquiry after the document is produced, treat it as a new interview round:
- Ask follow-up questions one at a time as before.
- Add new nodes to the existing tree; revise labels if evidence changes.
- Extend the Recommended Actions table with any new causes.
- Re-emit the full updated document when the new round is complete.
The deliverable is a living document, not a final gate.