Root Cause Analysis
Systematic approaches for identifying the true source of problems, not just symptoms.
RCA Methods Overview
| Method | Best For | Complexity | Time |
|---|---|---|---|
| 5 Whys | Simple, linear problems | Low | 15-30 min |
| Fishbone | Multi-factor problems | Medium | 30-60 min |
| Fault Tree | Critical systems, safety | High | 1-4 hours |
| Timeline Analysis | Incident investigation | Medium | 30-90 min |
5 Whys Method
Iteratively ask "why" to drill down from symptom to root cause.
Process
Problem Statement: [Clear description of the issue]
│
▼
Why #1: [First level cause]
│
▼
Why #2: [Deeper cause]
│
▼
Why #3: [Even deeper]
│
▼
Why #4: [Getting to root]
│
▼
Why #5: [Root cause identified]
│
▼
Action: [Fix that addresses root cause]
Example: Production Outage
**Problem:** Website was down for 2 hours
**Why 1:** Why was the website down?
→ The application server ran out of memory and crashed.
**Why 2:** Why did the server run out of memory?
→ A memory leak in the image processing service accumulated over time.
**Why 3:** Why was there a memory leak?
→ The service wasn't releasing image buffers after processing.
**Why 4:** Why weren't buffers being released?
→ The cleanup code had a bug introduced in last week's release.
**Why 5:** Why wasn't the bug caught before release?
→ We don't have automated memory leak detection in our test suite.
**Root Cause:** Missing automated memory leak testing
**Action:** Add memory profiling to CI pipeline, add cleanup tests
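The chain above can also be recorded as data so it is easy to store alongside an incident ticket. The following is a minimal sketch, not a prescribed format: the `FiveWhys` class and its `ask` and `render` methods are hypothetical names introduced for illustration, and treating the final answer as the root cause is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """One causal chain from a problem statement down to a root cause."""

    problem: str
    steps: list = field(default_factory=list)  # list of (question, answer) pairs
    action: str = ""

    def ask(self, question: str, answer: str) -> None:
        """Record one "why" and its evidence-based answer."""
        self.steps.append((question, answer))

    @property
    def root_cause(self) -> str:
        # Assumption: the answer to the final "why" is treated as the root cause.
        return self.steps[-1][1] if self.steps else ""

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for number, (question, answer) in enumerate(self.steps, start=1):
            lines.append(f"Why {number}: {question}")
            lines.append(f"  -> {answer}")
        lines.append(f"Root cause: {self.root_cause}")
        lines.append(f"Action: {self.action}")
        return "\n".join(lines)


outage = FiveWhys(problem="Website was down for 2 hours")
outage.ask("Why was the website down?",
           "The application server ran out of memory and crashed.")
outage.ask("Why did the server run out of memory?",
           "A memory leak in the image processing service accumulated over time.")
outage.ask("Why was there a memory leak?",
           "The service wasn't releasing image buffers after processing.")
outage.ask("Why weren't buffers being released?",
           "The cleanup code had a bug introduced in last week's release.")
outage.ask("Why wasn't the bug caught before release?",
           "We don't have automated memory leak detection in our test suite.")
outage.action = "Add memory profiling to CI pipeline, add cleanup tests"

print(outage.render())
```

Keeping each question next to its answer makes it harder to skip a step, and the rendered text can be pasted into a postmortem as-is.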
5 Whys Best Practices
| Do | Don't |
|---|---|
| Base answers on evidence | Guess or assume |
| Stay focused on one causal chain | Branch too early |
| Keep asking until actionable | Stop at symptoms |
| Involve people closest to issue | Assign blame |
| Document your reasoning | Skip steps |
When 5 Whys Falls Short
- Multiple contributing factors (use Fishbone)
- Complex system interactions (use Fault Tree)
- Organizational/process issues (need broader analysis)
Fishbone Diagram (Ishikawa)
Visualize multiple potential causes organized by category.
Standard Categories (6 M's)
                ┌─────────────┐
    Methods ────┤             │
                │             │
   Machines ────┤             │
                │             ├──── PROBLEM
  Materials ────┤             │
                │             │
Measurement ────┤             │
                │             │
Environment ────┤             │
                │             │
     People ────┤             │
                └─────────────┘
Software-Specific Categories
                   ┌─────────────┐
          Code ────┤             │
                   │             │
Infrastructure ────┤             │
                   │             ├──── BUG/INCIDENT
  Dependencies ────┤             │
                   │             │
 Configuration ────┤             │
                   │             │
       Process ────┤             │
                   │             │
        People ────┤             │
                   └─────────────┘
Fishbone Example: API Latency Spike
                            ┌─────────────────┐
                            │                 │
Code ───────────────────────┤                 │
  │                         │                 │
  ├─ N+1 query issue        │                 │
  ├─ Missing index          │  API LATENCY    │
  └─ Sync blocking call     │     SPIKE       │
                            │                 │
Infrastructure ─────────────┤                 │
  │                         │                 │
  ├─ DB connection pool     │                 │
  ├─ Network saturation     │                 │
  └─ Insufficient RAM       │                 │
                            │                 │
Dependencies ───────────────┤                 │
  │                         │                 │
  ├─ External API slow      │                 │
  ├─ Redis timeout          │                 │
  └─ CDN cache miss         │                 │
                            └─────────────────┘
Fishbone Process
- Define the problem clearly (the fish head)
- Identify major categories (the bones)
- Brainstorm causes for each category
- Analyze relationships between causes
- Prioritize most likely root causes
- Verify with data/testing
- Take action on confirmed causes
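The brainstorm and prioritize steps can be sketched with a plain dictionary. This is an illustrative sketch only: the categories and causes come from the API latency example, while the `evidence` scores and the `prioritize` helper are hypothetical and stand in for whatever data your team actually collects.

```python
# Category -> candidate causes, taken from the API latency example.
fishbone = {
    "Code": ["N+1 query issue", "Missing index", "Sync blocking call"],
    "Infrastructure": ["DB connection pool", "Network saturation", "Insufficient RAM"],
    "Dependencies": ["External API slow", "Redis timeout", "CDN cache miss"],
}

# Hypothetical evidence scores: 0 = no supporting data, 3 = strong supporting data.
evidence = {
    "N+1 query issue": 3,
    "DB connection pool": 2,
    "Redis timeout": 1,
}


def prioritize(diagram, scores):
    """Return (score, category, cause) tuples, strongest evidence first."""
    ranked = [
        (scores.get(cause, 0), category, cause)
        for category, causes in diagram.items()
        for cause in causes
    ]
    # sorted() is stable, so causes with equal scores keep their diagram order.
    return sorted(ranked, key=lambda item: item[0], reverse=True)


for score, category, cause in prioritize(fishbone, evidence):
    print(f"{score}  {category:<15} {cause}")
```

Causes with a score of 0 stay in the list on purpose: they are unverified, not ruled out, and still need the "verify with data/testing" step.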
Fault Tree Analysis (FTA)
Top-down, deductive analysis for critical systems.
FTA Symbols
┌─────┐
│ TOP │ Top Event (the failure being analyzed)
└──┬──┘
│
┌──┴──┐
│ AND │ All inputs must occur for output
└─────┘
┌──┴──┐
│ OR │ Any input causes output
└─────┘
┌─────┐
│ ○ │ Basic Event (root cause)
└─────┘
┌─────┐
│ ◇ │ Undeveloped Event (needs more analysis)
└─────┘
FTA Example: Authentication Failure
                    ┌────────────────────┐
                    │    USER CANNOT     │
                    │    AUTHENTICATE    │
                    └─────────┬──────────┘
                              │
                          ┌───┴───┐
                          │  OR   │
                          └───┬───┘
           ┌──────────────────┼──────────────────┐
           │                  │                  │
    ┌──────┴──────┐    ┌──────┴──────┐    ┌──────┴──────┐
    │   Invalid   │    │    Auth     │    │   Account   │
    │ Credentials │    │   Service   │    │   Locked    │
    │             │    │    Down     │    │             │
    └──────┬──────┘    └──────┬──────┘    └─────────────┘
           │                  │
       ┌───┴───┐          ┌───┴───┐
       │  OR   │          │  OR   │
       └───┬───┘          └───┬───┘
    ┌──────┼──────┐    ┌──────┼──────┐
    │      │      │    │      │      │
    ○      ○      ○    ○      ○      ◇
  Wrong Expired Token DB    Redis External
Password Token Invalid Down Down   Auth
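A fault tree like the one above can be evaluated mechanically once it is written down as nested gates. The sketch below is one possible encoding, not a standard API: the `Event` class, the `leaf` helper, and the `"AND"` / `"OR"` / `"BASIC"` gate names are assumptions made for illustration, and the undeveloped "External auth" event is treated as a plain leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A node in a fault tree: a basic event (leaf) or an AND/OR gate."""

    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC"
    children: list = field(default_factory=list)

    def occurs(self, active: set) -> bool:
        """Return True if this event occurs given the active basic events."""
        if self.gate == "BASIC":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        # AND: all inputs must occur. OR: any input causes the output.
        return all(results) if self.gate == "AND" else any(results)


def leaf(name: str) -> Event:
    return Event(name)


tree = Event("User cannot authenticate", "OR", [
    Event("Invalid credentials", "OR", [
        leaf("Wrong password"), leaf("Expired token"), leaf("Token invalid"),
    ]),
    Event("Auth service down", "OR", [
        leaf("DB down"), leaf("Redis down"), leaf("External auth"),
    ]),
    leaf("Account locked"),
])

print(tree.occurs({"Redis down"}))  # a single basic event reaches the top through OR gates
print(tree.occurs(set()))           # no basic events, so the top event does not occur
```

Because every gate in this example is an OR, each basic event on its own is enough to cause the top event. Swapping a gate to `"AND"` models redundancy, where the failure needs all inputs at once.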
When to Use FTA
- Safety-critical systems
- Complex failure modes
- Need to identify all paths to failure
- Regulatory compliance requirements
- Post-incident analysis for serious outages
Timeline Analysis
Reconstruct the sequence of events to identify causation.
Timeline Template
## Incident Timeline: [Incident Name]
### Summary
- **Incident Start:** [Timestamp]
- **Incident Detected:** [Timestamp]
- **Incident Resolved:** [Timestamp]
- **Total Duration:** [X hours Y minutes]
- **Time to Detect:** [X minutes]
- **Time to Resolve:** [X hours Y minutes]
### Detailed Timeline
| Time (UTC) | Event | Source | Actor |
|------------|-------|--------|-------|
| 14:00 | Deployment started | CI/CD | automated |
| 14:05 | Deployment completed | CI/CD | automated |
| 14:15 | Error rate increased 10x | Monitoring | - |
| 14:22 | Alert fired | PagerDuty | - |
| 14:25 | On-call acknowledged | PagerDuty | @alice |
| 14:30 | Root cause identified | Investigation | @alice |
| 14:35 | Rollback initiated | Manual | @alice |
| 14:40 | Services recovered | Monitoring | - |
| 14:45 | Incident resolved | Manual | @alice |
### Analysis
**Contributing Factors:**
1. [Factor 1]
2. [Factor 2]
**What Went Well:**
1. [Positive observation]
**What Could Improve:**
1. [Improvement area]
### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| | | | |
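The duration fields in the summary can be derived from the timeline instead of being filled in by hand. This is a small sketch under stated assumptions: the incident start is taken as 14:15 (error rate increase), detection as 14:22 (alert fired), and resolution as 14:45; time to resolve is measured from detection; all events fall on the same day. The `incident_metrics` helper is hypothetical.

```python
from datetime import datetime, timedelta


def parse(hhmm: str) -> datetime:
    # Only the time of day is parsed; all events are assumed to share one date.
    return datetime.strptime(hhmm, "%H:%M")


def incident_metrics(start: str, detected: str, resolved: str) -> dict:
    """Compute the summary durations from three HH:MM timestamps (UTC)."""
    s, d, r = parse(start), parse(detected), parse(resolved)
    return {
        "time_to_detect": d - s,
        "time_to_resolve": r - d,  # assumption: measured from detection
        "total_duration": r - s,
    }


metrics = incident_metrics(start="14:15", detected="14:22", resolved="14:45")
for name, value in metrics.items():
    print(f"{name}: {int(value.total_seconds() // 60)} min")
```

If your team measures time to resolve from the incident start rather than from detection, change the one subtraction and note the definition in the template so trends stay comparable.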
Debugging Decision Tree
         Problem Reported
                 │
                 ▼
       Can you reproduce it?
        │                 │
       Yes               No
        │                 │
        ▼                 ▼
   Isolate the       Gather more
   conditions        information
        │                 │
        ▼                 ▼
 Recent changes?     Check logs,
        │            monitoring
       Yes                │
        │                 │
        ▼                 ▼
  Review diffs       Correlation
    & deploys         analysis
        │                 │
        └────────┬────────┘
                 │
                 ▼
          Form hypothesis
                 │
                 ▼
          Test hypothesis
                 │
        ┌────────┴────────┐
        │                 │
    Confirmed         Rejected
        │                 │
        ▼                 ▼
     Fix and       Next hypothesis
     verify
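The bottom half of the tree (form hypothesis, test, then fix or move on) is a simple loop. The sketch below is illustrative only: `first_confirmed` is a hypothetical helper, and the lambda checks are stand-ins for real tests such as reverting a change in staging or querying metrics.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_confirmed(
    hypotheses: Iterable[Tuple[str, Callable[[], bool]]],
) -> Optional[str]:
    """Test hypotheses in order and return the first one that is confirmed."""
    for description, check in hypotheses:
        if check():
            return description  # Confirmed: fix and verify
        # Rejected: fall through to the next hypothesis
    return None  # Nothing confirmed: gather more information


# Hypothetical candidates, ordered from most to least likely.
candidates = [
    ("Config change in last deploy", lambda: False),
    ("Connection pool exhausted", lambda: True),
    ("Upstream dependency timeout", lambda: True),
]

print(first_confirmed(candidates))
```

Ordering the candidates by likelihood and by how cheap they are to test keeps the loop short; later hypotheses are never evaluated once one is confirmed.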
RCA Documentation Template
## Root Cause Analysis: [Issue Title]
### Issue Summary
**Reported:** [Date]
**Severity:** P0 / P1 / P2 / P3
**Impact:** [Description of impact]
### Problem Statement
[Clear, specific description of what went wrong]
### Investigation
#### Timeline
[Key events in sequence]
#### Analysis Method Used
[ ] 5 Whys
[ ] Fishbone
[ ] Fault Tree
[ ] Timeline Analysis
#### Findings
[Detailed analysis results]
### Root Cause(s)
1. **Primary:** [Main root cause]
2. **Contributing:** [Secondary factors]
### Immediate Fix
[What was done to resolve the immediate issue]
### Preventive Actions
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| | | | |
### Lessons Learned
1. [Key takeaway]
2. [Process improvement]
### Appendix
- [Links to logs, graphs, related tickets]
Best Practices
- Blameless postmortems: Focus on systems, not individuals
- Automated correlation: Use AI to correlate signals across systems
- Proactive RCA: Analyze near-misses, not just incidents
- Knowledge sharing: Document and share RCA findings
- Metrics-driven: Track time-to-detect, time-to-resolve trends
Related Skills
- observability-monitoring - Gathering data for RCA
- errors - Error pattern analysis
- resilience-patterns - Preventing future incidents
Version: 1.0.0 (January )
Repository: yonatangross/orchestkit