skills/yonatangross/orchestkit/root-cause-analysis

root-cause-analysis

SKILL.md

Root Cause Analysis

Systematic approaches for identifying the true source of problems, not just symptoms.

RCA Methods Overview

Method Best For Complexity Time
5 Whys Simple, linear problems Low 15-30 min
Fishbone Multi-factor problems Medium 30-60 min
Fault Tree Critical systems, safety High 1-4 hours
Timeline Analysis Incident investigation Medium 30-90 min

5 Whys Method

Iteratively ask "why" to drill down from symptom to root cause.

Process

Problem Statement: [Clear description of the issue]
Why #1: [First level cause]
Why #2: [Deeper cause]
Why #3: [Even deeper]
Why #4: [Getting to root]
Why #5: [Root cause identified]
Action: [Fix that addresses root cause]

Example: Production Outage

**Problem:** Website was down for 2 hours

**Why 1:** Why was the website down?
→ The application server ran out of memory and crashed.

**Why 2:** Why did the server run out of memory?
→ A memory leak in the image processing service accumulated over time.

**Why 3:** Why was there a memory leak?
→ The service wasn't releasing image buffers after processing.

**Why 4:** Why weren't buffers being released?
→ The cleanup code had a bug introduced in last week's release.

**Why 5:** Why wasn't the bug caught before release?
→ We don't have automated memory leak detection in our test suite.

**Root Cause:** Missing automated memory leak testing
**Action:** Add memory profiling to CI pipeline, add cleanup tests

5 Whys Best Practices

Do Don't
Base answers on evidence Guess or assume
Stay focused on one causal chain Branch too early
Keep asking until actionable Stop at symptoms
Involve people closest to issue Assign blame
Document your reasoning Skip steps

When 5 Whys Falls Short

  • Multiple contributing factors (use Fishbone)
  • Complex system interactions (use Fault Tree)
  • Organizational/process issues (need broader analysis)

Fishbone Diagram (Ishikawa)

Visualize multiple potential causes organized by category.

Standard Categories (6 M's)

                    ┌─────────────┐
        Methods ────┤             │
                    │             │
      Machines ─────┤             │
                    │             ├──── PROBLEM
     Materials ─────┤             │
                    │             │
    Measurement ────┤             │
                    │             │
    Environment ────┤             │
                    │             │
       People ──────┤             │
                    └─────────────┘

Software-Specific Categories

                    ┌─────────────┐
          Code ─────┤             │
                    │             │
 Infrastructure ────┤             │
                    │             ├──── BUG/INCIDENT
   Dependencies ────┤             │
                    │             │
   Configuration ───┤             │
                    │             │
        Process ────┤             │
                    │             │
        People ─────┤             │
                    └─────────────┘

Fishbone Example: API Latency Spike

                              ┌─────────────────┐
                              │                 │
        Code ─────────────────┤                 │
         │                    │                 │
         ├─ N+1 query issue   │                 │
         ├─ Missing index     │   API LATENCY   │
         └─ Sync blocking call│      SPIKE      │
                              │                 │
  Infrastructure ─────────────┤                 │
         │                    │                 │
         ├─ DB connection pool│                 │
         ├─ Network saturation│                 │
         └─ Insufficient RAM  │                 │
                              │                 │
  Dependencies ───────────────┤                 │
         │                    │                 │
         ├─ External API slow │                 │
         ├─ Redis timeout     │                 │
         └─ CDN cache miss    │                 │
                              └─────────────────┘

Fishbone Process

  1. Define the problem clearly (the fish head)
  2. Identify major categories (the bones)
  3. Brainstorm causes for each category
  4. Analyze relationships between causes
  5. Prioritize most likely root causes
  6. Verify with data/testing
  7. Take action on confirmed causes

Fault Tree Analysis (FTA)

Top-down, deductive analysis for critical systems.

FTA Symbols

┌─────┐
│ TOP │  Top Event (the failure being analyzed)
└──┬──┘
┌──┴──┐
│ AND │  All inputs must occur for output
└─────┘

┌──┴──┐
│ OR  │  Any input causes output
└─────┘

┌─────┐
│  ○  │  Basic Event (root cause)
└─────┘

┌─────┐
│  ◇  │  Undeveloped Event (needs more analysis)
└─────┘

FTA Example: Authentication Failure

                    ┌────────────────────┐
                    │   USER CANNOT      │
                    │   AUTHENTICATE     │
                    └─────────┬──────────┘
                          ┌───┴───┐
                          │  OR   │
                          └───┬───┘
           ┌──────────────────┼──────────────────┐
           │                  │                  │
    ┌──────┴──────┐    ┌──────┴──────┐    ┌──────┴──────┐
    │  Invalid    │    │   Auth      │    │  Account    │
    │  Credentials│    │   Service   │    │  Locked     │
    │             │    │   Down      │    │             │
    └──────┬──────┘    └──────┬──────┘    └─────────────┘
           │                  │
       ┌───┴───┐          ┌───┴───┐
       │  OR   │          │  OR   │
       └───┬───┘          └───┬───┘
    ┌──────┼──────┐    ┌──────┼──────┐
    │      │      │    │      │      │
   ○       ○      ○    ○      ○      ◇
 Wrong   Expired Token DB   Redis  External
Password  Token  Invalid Down  Down   Auth

When to Use FTA

  • Safety-critical systems
  • Complex failure modes
  • Need to identify all paths to failure
  • Regulatory compliance requirements
  • Post-incident analysis for serious outages

Timeline Analysis

Reconstruct sequence of events to identify causation.

Timeline Template

## Incident Timeline: [Incident Name]

### Summary
- **Incident Start:** [Timestamp]
- **Incident Detected:** [Timestamp]
- **Incident Resolved:** [Timestamp]
- **Total Duration:** [X hours Y minutes]
- **Time to Detect:** [X minutes]
- **Time to Resolve:** [X hours Y minutes]

### Detailed Timeline

| Time (UTC) | Event | Source | Actor |
|------------|-------|--------|-------|
| 14:00 | Deployment started | CI/CD | automated |
| 14:05 | Deployment completed | CI/CD | automated |
| 14:15 | Error rate increased 10x | Monitoring | - |
| 14:22 | Alert fired | PagerDuty | - |
| 14:25 | On-call acknowledged | PagerDuty | @alice |
| 14:30 | Root cause identified | Investigation | @alice |
| 14:35 | Rollback initiated | Manual | @alice |
| 14:40 | Services recovered | Monitoring | - |
| 14:45 | Incident resolved | Manual | @alice |

### Analysis

**Contributing Factors:**
1. [Factor 1]
2. [Factor 2]

**What Went Well:**
1. [Positive observation]

**What Could Improve:**
1. [Improvement area]

### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| | | | |

Debugging Decision Tree

                    Problem Reported
               Can you reproduce it?
                    │           │
                   Yes          No
                    │           │
                    ▼           ▼
            Isolate the      Gather more
            conditions       information
                    │           │
                    ▼           ▼
            Recent changes?  Check logs,
                    │        monitoring
                   Yes          │
                    │           │
                    ▼           ▼
            Review diffs    Correlation
            & deploys       analysis
                    │           │
                    └─────┬─────┘
                   Form hypothesis
                    Test hypothesis
                    ┌─────┴─────┐
                    │           │
               Confirmed     Rejected
                    │           │
                    ▼           ▼
               Fix and      Next hypothesis
               verify

RCA Documentation Template

## Root Cause Analysis: [Issue Title]

### Issue Summary
**Reported:** [Date]
**Severity:** P0 / P1 / P2 / P3
**Impact:** [Description of impact]

### Problem Statement
[Clear, specific description of what went wrong]

### Investigation

#### Timeline
[Key events in sequence]

#### Analysis Method Used
[ ] 5 Whys
[ ] Fishbone
[ ] Fault Tree
[ ] Timeline Analysis

#### Findings
[Detailed analysis results]

### Root Cause(s)
1. **Primary:** [Main root cause]
2. **Contributing:** [Secondary factors]

### Immediate Fix
[What was done to resolve the immediate issue]

### Preventive Actions
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| | | | |

### Lessons Learned
1. [Key takeaway]
2. [Process improvement]

### Appendix
- [Links to logs, graphs, related tickets]

Best Practices

  • Blameless postmortems: Focus on systems, not individuals
  • Automated correlation: Use AI to correlate signals across systems
  • Proactive RCA: Analyze near-misses, not just incidents
  • Knowledge sharing: Document and share RCA findings
  • Metrics-driven: Track time-to-detect, time-to-resolve trends

Related Skills

  • observability-monitoring - Gathering data for RCA
  • errors - Error pattern analysis
  • resilience-patterns - Preventing future incidents

References

Version: 1.0.0 (January )

Weekly Installs
8
GitHub Stars
95
First Seen
Feb 2, 2026
Installed on
claude-code6
github-copilot5
gemini-cli5
opencode5
antigravity5
codex4