# Incident Report Skill
When generating an incident report, follow this structured process. The goal is to produce a blameless postmortem that helps the team learn from what happened and prevents recurrence — not to assign blame.
**IMPORTANT:** Always save the output as a markdown file in the `project-decisions/` directory at the project root. Create the directory if it doesn't exist.
**PRINCIPLE:** This is a BLAMELESS postmortem. Focus on systems, processes, and conditions — never on individuals. Replace "Person X did wrong" with "The system allowed X to happen without safeguards."
## 0. Output Setup
```bash
# Create project-decisions directory if it doesn't exist
mkdir -p project-decisions
# File will be saved as:
#   project-decisions/YYYY-MM-DD-incident-[kebab-case-topic].md
# Example: project-decisions/2026-02-19-incident-payment-processing-outage.md
```
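Deriving the filename can be scripted; a minimal sketch (the `TOPIC` value is a hypothetical example — substitute the real incident topic):

```bash
# Build the kebab-case output path from an incident topic
TOPIC="Payment Processing Outage"
# Lowercase, collapse non-alphanumeric runs to '-', trim edge dashes
SLUG=$(printf '%s' "$TOPIC" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$//')
FILE="project-decisions/$(date +%Y-%m-%d)-incident-${SLUG}.md"
echo "$FILE"
```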
## 1. Incident Discovery — Gather the Facts
### Recent Deployments
```bash
# Recent deployments / merges to main
git log --oneline --merges --since="7 days ago" main 2>/dev/null | head -20
git log --oneline --since="7 days ago" main 2>/dev/null | head -30
# Tags / releases in the last week
git tag --sort=-creatordate | head -10
# What was deployed most recently?
git log --oneline -1 main
git log --format="%H %ai %s" -1 main
# Who deployed and when?
git log --format="%ai — %an — %s" --since="3 days ago" main | head -20
# Diff between the last two deployments
LATEST=$(git log --oneline -1 main | cut -d' ' -f1)
PREVIOUS=$(git log --oneline -2 main | tail -1 | cut -d' ' -f1)
git diff --stat "$PREVIOUS" "$LATEST" 2>/dev/null
git diff --name-only "$PREVIOUS" "$LATEST" 2>/dev/null
```
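If releases are tagged (an assumption — adjust if deployments are tracked differently), the last two tags can be compared directly. The `|| true` guards keep the step non-fatal in repos with no tags:

```bash
# Diff the two most recent release tags (assumes deployments are tagged)
NEW=$(git describe --tags --abbrev=0 2>/dev/null || true)
OLD=$(git describe --tags --abbrev=0 "$NEW"^ 2>/dev/null || true)
git diff --stat "$OLD" "$NEW" 2>/dev/null || true
git diff --name-only "$OLD" "$NEW" 2>/dev/null || true
```

Note that `git describe --tags --abbrev=0` finds the nearest tag reachable from `HEAD`, which matches the latest release only if `main` is checked out at or past it.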
### Recent Code Changes
```bash
# Files changed in the last 3 days
git log --name-only --since="3 days ago" --format="" main 2>/dev/null | sort | uniq -c | sort -rn | head -20
# Most changed files recently (hot spots)
git log --name-only --since="7 days ago" --format="" main 2>/dev/null | sort | uniq -c | sort -rn | head -10
# Changes to critical areas (auth, payments, database)
git log --oneline --since="7 days ago" -- src/auth/ src/payment/ src/db/ 2>/dev/null | head -10
# Changes to configuration
git log --oneline --since="7 days ago" -- "*.config.*" "*.env*" "docker-compose*" "*.yaml" "*.yml" 2>/dev/null | head -10
# Changes to infrastructure
git log --oneline --since="7 days ago" -- Dockerfile docker-compose* k8s/ terraform/ .github/workflows/ 2>/dev/null | head -10
# Changes to dependencies
git log --oneline --since="7 days ago" -- package.json package-lock.json yarn.lock requirements.txt Gemfile.lock 2>/dev/null | head -10
```
### Error Patterns in Code
```bash
# Check for error handling in recently changed files
# (pipe into while-read so filenames survive word splitting)
git diff --name-only "$PREVIOUS" "$LATEST" 2>/dev/null | while read -r file; do
  echo "=== $file ==="
  grep -n "catch\|except\|rescue\|error\|throw\|panic\|fatal" "$file" 2>/dev/null | head -5
done
# Check for recent TODO/FIXME/HACK that might be relevant
grep -rn "TODO\|FIXME\|HACK\|XXX\|WORKAROUND\|TEMPORARY" --include="*.ts" --include="*.js" --include="*.py" src/ 2>/dev/null | grep -i "[incident-keyword]"
# Check for missing error handling in the affected area
grep -rn "[incident-area-keyword]" --include="*.ts" --include="*.js" --include="*.py" src/ 2>/dev/null | head -20
# Check for known issues in the area
grep -rn "KNOWN.*ISSUE\|BUG\|REGRESSION" --include="*.ts" --include="*.js" --include="*.py" --include="*.md" . 2>/dev/null | head -10
```
### Environment & Configuration
```bash
# Check environment configuration
cat .env.example 2>/dev/null | head -30
# Check for recent config changes
git log --oneline --since="7 days ago" -- "*.config.*" "*.env*" config/ 2>/dev/null
# Check Docker/infrastructure config
cat docker-compose.yml 2>/dev/null | head -50
cat Dockerfile 2>/dev/null | head -30
# Check for health checks
grep -rn "health\|readiness\|liveness\|heartbeat" --include="*.ts" --include="*.js" --include="*.py" --include="*.yaml" --include="*.yml" . 2>/dev/null | head -10
# Check monitoring configuration
grep -rn "sentry\|datadog\|newrelic\|prometheus\|grafana\|pagerduty\|opsgenie" --include="*.ts" --include="*.js" --include="*.py" --include="*.yaml" --include="*.json" . 2>/dev/null | head -10
```
### Database State
```bash
# Recent migrations
ls -la src/db/migrations/ db/migrations/ migrations/ prisma/migrations/ 2>/dev/null | tail -10
# Recent migration changes
git log --oneline --since="7 days ago" -- "**/migrations/**" "prisma/" 2>/dev/null | head -10
# Check for schema changes
git diff --name-only "$PREVIOUS" "$LATEST" 2>/dev/null | grep -i "migration\|schema\|prisma"
```
## 2. Incident Classification
### Severity Levels
| Level | Name | Definition | Examples |
|---|---|---|---|
| SEV-1 | Critical | Complete service outage affecting all users, data loss, or security breach | Site down, database corruption, credential leak |
| SEV-2 | Major | Significant degradation affecting many users, core functionality broken | Payment processing failing, auth broken for 50%+ users |
| SEV-3 | Minor | Partial degradation affecting some users, workaround available | Search not working, slow page loads, one API endpoint failing |
| SEV-4 | Low | Minimal impact, cosmetic issues, edge case bugs in production | UI glitch, typo, non-critical feature broken for small user segment |
### Incident Categories
| Category | Description |
|---|---|
| Availability | Service down or unreachable |
| Performance | Service slow or degraded |
| Data | Data loss, corruption, or inconsistency |
| Security | Unauthorized access, data breach, vulnerability exploited |
| Functionality | Feature broken or behaving incorrectly |
| Integration | Third-party service failure or miscommunication |
| Infrastructure | Server, network, DNS, or cloud provider issue |
| Deployment | Bad deploy, failed rollback, configuration error |
| Capacity | Resource exhaustion (disk, memory, connections, rate limits) |
| Dependency | Upstream service failure cascading to our system |
## 3. Timeline Construction
Build a precise timeline of events:
## Timeline
All times in [timezone — e.g., UTC]
| Time | Event | Source |
|------|-------|--------|
| YYYY-MM-DD HH:MM | [Normal state — last known good] | [monitoring/logs] |
| YYYY-MM-DD HH:MM | [Triggering event — deployment, config change, traffic spike] | [deploy log/git] |
| YYYY-MM-DD HH:MM | [First symptom — error rate increase, latency spike] | [monitoring] |
| YYYY-MM-DD HH:MM | [Detection — alert fired, user report, team noticed] | [PagerDuty/Slack] |
| YYYY-MM-DD HH:MM | [Response started — on-call engaged, investigation began] | [Slack/team] |
| YYYY-MM-DD HH:MM | [Diagnosis — root cause identified or hypothesized] | [team] |
| YYYY-MM-DD HH:MM | [Mitigation action taken — rollback, fix, restart, config change] | [deploy log/git] |
| YYYY-MM-DD HH:MM | [Partial recovery — some users restored] | [monitoring] |
| YYYY-MM-DD HH:MM | [Full recovery — service fully restored] | [monitoring] |
| YYYY-MM-DD HH:MM | [Verification — confirmed stable, monitoring clean] | [monitoring] |
| YYYY-MM-DD HH:MM | [Incident closed] | [team] |
### Key Metrics
| Metric | Value |
|--------|-------|
| **Time to detect (TTD)** | X minutes (from trigger to detection) |
| **Time to respond** | X minutes (from detection to first responder engaged) |
| **Time to mitigate (TTM)** | X minutes (from response to mitigation) |
| **Time to resolve (TTR)** | X minutes (from trigger to full recovery) |
| **Total duration** | X hours X minutes |
| **User-facing impact duration** | X hours X minutes |
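These metrics can be computed directly from the timeline timestamps. A sketch using GNU `date` (the timestamps are hypothetical examples; BSD/macOS `date` would need `-j -f` instead of `-d`):

```bash
# Compute incident metrics from timeline timestamps (GNU date, all times UTC)
TRIGGER="2026-02-19 14:02"
DETECTED="2026-02-19 14:11"
MITIGATED="2026-02-19 15:05"
RESOLVED="2026-02-19 15:30"
epoch() { date -u -d "$1" +%s; }                       # timestamp -> Unix seconds
mins() { echo $(( ($(epoch "$2") - $(epoch "$1")) / 60 )); }  # whole minutes between two timestamps
echo "Time to detect:   $(mins "$TRIGGER" "$DETECTED") minutes"
echo "Time to mitigate: $(mins "$DETECTED" "$MITIGATED") minutes"
echo "Total duration:   $(mins "$TRIGGER" "$RESOLVED") minutes"
```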
## 4. Root Cause Analysis
### The 5 Whys
Drill down from the symptom to the root cause:
1. WHY did [the incident happen]?
   → Because [immediate cause]
2. WHY did [immediate cause] happen?
   → Because [deeper cause]
3. WHY did [deeper cause] happen?
   → Because [even deeper cause]
4. WHY did [even deeper cause] happen?
   → Because [systemic cause]
5. WHY did [systemic cause] exist?
   → Because [root cause — process, system, or cultural gap]
### Contributing Factors
Incidents rarely have a single cause. Identify all contributing factors:
| Factor | Type | Contribution |
|--------|------|-------------|
| [Factor 1 — e.g., untested migration] | Technical | Primary cause |
| [Factor 2 — e.g., no staging test] | Process | Allowed primary cause to reach production |
| [Factor 3 — e.g., no rollback plan] | Process | Extended the incident duration |
| [Factor 4 — e.g., missing monitoring] | Observability | Delayed detection |
| [Factor 5 — e.g., no runbook] | Documentation | Slowed response |
### Cause Categories
Classify the root cause:
| Category | Examples |
|---|---|
| Code defect | Bug in logic, missing error handling, race condition |
| Configuration error | Wrong env var, bad config value, missing secret |
| Deployment issue | Bad deploy process, missing migration, dependency conflict |
| Infrastructure | Server failure, network issue, resource exhaustion |
| Dependency failure | Third-party API down, upstream service degraded |
| Data issue | Bad data, missing data, data migration failure |
| Capacity | Traffic spike, resource limits hit, connection pool exhausted |
| Process gap | Missing review, no testing, no monitoring, no runbook |
| Communication | Team not notified, unclear ownership, missing context |
| Design flaw | Architectural limitation, missing circuit breaker, no retry logic |
### Fault Tree (for Complex Incidents)
```
[INCIDENT: Service Outage]
├── [Database timeout]
│   ├── [Long-running query]
│   │   └── [Missing index]
│   └── [Connection pool full]
│       └── [Connection leak in new code]
├── [App crash]
└── [Monitoring didn't alert]
    ├── [Alert threshold too high]
    └── [Not configured for this metric]
```
## 5. Impact Assessment
### User Impact
| Metric | Value |
|--------|-------|
| **Users affected** | X (Y% of total) |
| **Requests failed** | X (Y% error rate) |
| **Revenue impact** | $X (estimated lost transactions) |
| **Data affected** | X records (lost / corrupted / delayed) |
| **SLA impact** | X minutes of downtime against Y% SLA target |
| **Support tickets** | X tickets generated |
| **Public visibility** | [status page updated / social media mentions / press] |
### System Impact
| System | Impact | Duration |
|--------|--------|----------|
| [API] | 503 errors on all endpoints | X minutes |
| [Frontend] | Error page displayed | X minutes |
| [Mobile app] | Requests timing out | X minutes |
| [Background jobs] | Queue backlog of X jobs | X minutes |
| [Partner API] | Webhook delivery failed for X events | X minutes |
| [Database] | Connection pool exhausted | X minutes |
| [Cache] | Cache invalidated, cold start | X minutes |
### Business Impact
| Area | Impact | Quantified |
|------|--------|-----------|
| **Revenue** | Lost transactions during outage | $X |
| **SLA** | SLA credit owed to customers | $X |
| **Reputation** | Trust impact, social media | [Low/Medium/High] |
| **Compliance** | Regulatory reporting required? | [Yes/No] |
| **Internal** | Engineering time spent on incident | X person-hours |
| **Opportunity** | Delayed feature work due to incident | X days |
## 6. Response Evaluation
### What Went Well
✅ Things that worked during the incident:
- [e.g., Alert fired within 2 minutes of first error]
- [e.g., On-call responded within 5 minutes]
- [e.g., Rollback was clean and fast]
- [e.g., Status page was updated promptly]
- [e.g., Communication in Slack was clear and organized]
- [e.g., Customer support had talking points ready quickly]
### What Didn't Go Well
❌ Things that didn't work or slowed us down:
- [e.g., Took 30 minutes to identify which deploy caused the issue]
- [e.g., No runbook existed for this failure mode]
- [e.g., Logs were insufficient to diagnose the problem]
- [e.g., Rollback required manual database intervention]
- [e.g., On-call didn't have access to production logs]
- [e.g., Status page wasn't updated for 45 minutes]
### Where We Got Lucky
🍀 Things that could have made it worse:
- [e.g., Happened during low-traffic hours — peak would have been 10x worse]
- [e.g., Only affected one region — could have been global]
- [e.g., No data was permanently lost — could have been unrecoverable]
- [e.g., A team member happened to be online who knew this system]
## 7. Action Items
### Remediation Actions
Categorize action items by urgency and type:
#### Immediate (0-48 hours) — Prevent Recurrence of THIS Incident
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 1 | [Fix the specific bug/config that caused the incident] | Fix | [Name] | [Date] | ⬜ TODO |
| 2 | [Add monitoring for the specific failure mode] | Detection | [Name] | [Date] | ⬜ TODO |
| 3 | [Write runbook for this incident type] | Documentation | [Name] | [Date] | ⬜ TODO |
#### Short-Term (1-2 weeks) — Harden Against Similar Incidents
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 4 | [Add integration test for the failure scenario] | Prevention | [Name] | [Date] | ⬜ TODO |
| 5 | [Improve logging in affected area] | Detection | [Name] | [Date] | ⬜ TODO |
| 6 | [Add circuit breaker for external dependency] | Resilience | [Name] | [Date] | ⬜ TODO |
| 7 | [Update deployment checklist] | Process | [Name] | [Date] | ⬜ TODO |
#### Medium-Term (1-3 months) — Systemic Improvements
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 8 | [Implement automated canary deployments] | Prevention | [Name] | [Date] | ⬜ TODO |
| 9 | [Add load testing to CI/CD pipeline] | Prevention | [Name] | [Date] | ⬜ TODO |
| 10 | [Redesign the system to eliminate single point of failure] | Architecture | [Name] | [Date] | ⬜ TODO |
| 11 | [Implement chaos engineering practices] | Resilience | [Name] | [Date] | ⬜ TODO |
### Action Item Categories
| Type | Purpose | Examples |
|---|---|---|
| Fix | Directly fix the cause | Bug fix, config correction, data repair |
| Detection | Catch it faster next time | Monitoring, alerting, logging |
| Prevention | Stop it from happening | Tests, validation, guardrails, automation |
| Resilience | Reduce impact when it does happen | Circuit breakers, fallbacks, graceful degradation |
| Process | Improve how we work | Checklists, review gates, runbooks |
| Documentation | Capture knowledge | Runbooks, architecture docs, decision records |
| Architecture | Structural improvements | Eliminate SPOF, add redundancy, decouple |
## 8. Lessons Learned
## Key Takeaways
1. **[Lesson 1]**
   - What happened: [brief description]
   - What we learned: [insight]
   - How we'll apply it: [specific action]
2. **[Lesson 2]**
   - What happened: [brief description]
   - What we learned: [insight]
   - How we'll apply it: [specific action]
3. **[Lesson 3]**
   - What happened: [brief description]
   - What we learned: [insight]
   - How we'll apply it: [specific action]
## 9. Prevention Framework
### Detection Improvements
What should have caught this earlier:
| Gap | Current State | Target State | Action Item |
|-----|---------------|-------------|-------------|
| [Monitoring gap] | No alert for X | Alert when X > threshold | Add dashboard + alert |
| [Logging gap] | No logs for Y operation | Structured logs with context | Add logging to service |
| [Testing gap] | No test for Z scenario | Integration test covering Z | Write test |
### Process Improvements
What process changes would prevent this:
| Gap | Current Process | Improved Process | Action Item |
|-----|----------------|-----------------|-------------|
| [Review gap] | No security review on config changes | Config changes require review | Update PR checklist |
| [Deploy gap] | Direct deploy to production | Staged rollout with canary | Implement canary deploys |
| [Runbook gap] | No runbook for this failure | Documented response procedure | Write runbook |
### Architectural Improvements
What system design changes would help:
| Weakness | Current Design | Improved Design | Effort | Priority |
|----------|---------------|-----------------|--------|----------|
| [SPOF] | Single database | Read replicas + failover | Large | High |
| [No fallback] | Hard dependency on API | Circuit breaker + cache fallback | Medium | High |
| [No isolation] | Shared connection pool | Per-service connection pools | Medium | Medium |
## 10. Incident Patterns
### Check for Recurring Patterns
```bash
# Check for previous incidents in the same area
ls project-decisions/ 2>/dev/null | grep "incident"
# Check for previous related issues in git
git log --oneline --all --grep="fix\|hotfix\|revert\|rollback" --since="6 months ago" | head -20
# Check for reverted changes (an indicator of past incidents)
git log --oneline --all --grep="revert\|Revert" --since="6 months ago" | head -10
# Check for similar patterns (replace the placeholder with a real keyword)
git log --oneline --all --grep="[incident-keyword]" --since="6 months ago" | head -10
```
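Past reports can also be cross-referenced against the paths implicated in the current incident; a small sketch (the `AFFECTED` value is a hypothetical placeholder):

```bash
# List past incident reports that mention an affected path
# ("src/payment" is a placeholder — substitute the real path)
AFFECTED="src/payment"
# -l prints only matching filenames; || true keeps this non-fatal when no reports exist yet
grep -l "$AFFECTED" project-decisions/*incident*.md 2>/dev/null || true
```

Any hits are candidates for the Related Incidents appendix and for the recurring-incident check below.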
### Recurring Incident Check
Is this a recurring incident?
| Question | Answer |
|----------|--------|
| Has this exact issue happened before? | [Yes — when / No] |
| Have similar issues happened in this area? | [Yes — describe / No] |
| Were previous action items completed? | [Yes / Partially / No] |
| Is this a variant of a known weakness? | [Yes — which / No] |
| Is there a systemic pattern? | [Yes — describe / No] |
If recurring, escalate the priority of systemic fixes.
## Output Document Template
Save to `project-decisions/YYYY-MM-DD-incident-[topic].md`:
# Incident Report: [Title]
**Incident ID:** INC-[number]
**Date:** YYYY-MM-DD
**Severity:** [SEV-1 / SEV-2 / SEV-3 / SEV-4]
**Category:** [Availability / Performance / Data / Security / etc.]
**Status:** [Investigating / Mitigated / Resolved / Postmortem Complete]
**Duration:** X hours Y minutes
**Author:** [Name]
**Reviewers:** [Names]
---
## Executive Summary
[2-3 sentences: what happened, what was the impact, what was the root cause, what are we doing about it]
---
## Impact
| Metric | Value |
|--------|-------|
| **Duration** | X hours Y minutes |
| **Users affected** | X (Y% of total) |
| **Requests failed** | X |
| **Revenue impact** | $X |
| **SLA impact** | X minutes against Y% target |
---
## Timeline
All times in UTC.
| Time | Event |
|------|-------|
| HH:MM | [Event 1] |
| HH:MM | [Event 2] |
| HH:MM | **⚠️ Incident begins** |
| HH:MM | [Detection] |
| HH:MM | [Response] |
| HH:MM | [Diagnosis] |
| HH:MM | [Mitigation] |
| HH:MM | **✅ Incident resolved** |
**Time to detect:** X minutes
**Time to respond:** X minutes
**Time to mitigate:** X minutes
**Total duration:** X hours Y minutes
---
## Root Cause
### Summary
[1-2 sentences describing the root cause]
### 5 Whys
1. Why did [symptom]? → [cause 1]
2. Why did [cause 1]? → [cause 2]
3. Why did [cause 2]? → [cause 3]
4. Why did [cause 3]? → [cause 4]
5. Why did [cause 4]? → **[root cause]**
### Contributing Factors
| Factor | Type | Contribution |
|--------|------|-------------|
| [Factor 1] | [Technical/Process/etc.] | [Primary/Contributing] |
| [Factor 2] | [Technical/Process/etc.] | [Contributing] |
### Technical Details
[Detailed technical explanation of what went wrong, with code references if applicable]
[Relevant code snippet or configuration that caused/contributed to the issue]
---
## Response Evaluation
### What Went Well ✅
- [Thing 1]
- [Thing 2]
- [Thing 3]
### What Didn't Go Well ❌
- [Thing 1]
- [Thing 2]
- [Thing 3]
### Where We Got Lucky 🍀
- [Thing 1]
- [Thing 2]
---
## Action Items
### Immediate (0-48 hours)
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 1 | [Action] | [Fix/Detection/Prevention] | [Name] | [Date] | ⬜ |
| 2 | [Action] | [Fix/Detection/Prevention] | [Name] | [Date] | ⬜ |
### Short-Term (1-2 weeks)
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 3 | [Action] | [Prevention/Resilience] | [Name] | [Date] | ⬜ |
| 4 | [Action] | [Process/Documentation] | [Name] | [Date] | ⬜ |
### Medium-Term (1-3 months)
| # | Action | Type | Owner | Deadline | Status |
|---|--------|------|-------|----------|--------|
| 5 | [Action] | [Architecture/Prevention] | [Name] | [Date] | ⬜ |
| 6 | [Action] | [Architecture/Prevention] | [Name] | [Date] | ⬜ |
---
## Lessons Learned
1. **[Lesson 1]**: [What we learned and how we'll apply it]
2. **[Lesson 2]**: [What we learned and how we'll apply it]
3. **[Lesson 3]**: [What we learned and how we'll apply it]
---
## Recurring Incident Check
| Question | Answer |
|----------|--------|
| Has this happened before? | [Yes/No] |
| Were previous actions completed? | [Yes/No/N/A] |
| Is there a systemic pattern? | [Yes/No] |
---
## Appendix
### A. Related Commits
[List of relevant commits with hashes and messages]
### B. Monitoring Screenshots
[Links to dashboards, error rate graphs, latency charts]
### C. Communication Log
[Key Slack messages, status page updates, customer communications]
### D. Related Incidents
[Links to previous related incident reports]
---
## Decision Log
| Date | Event | By |
|------|-------|----|
| YYYY-MM-DD | Incident detected | [Name] |
| YYYY-MM-DD | Incident mitigated | [Name] |
| YYYY-MM-DD | Incident resolved | [Name] |
| YYYY-MM-DD | Postmortem drafted | [Name] |
| YYYY-MM-DD | Postmortem reviewed | [Name] |
| YYYY-MM-DD | Action items assigned | [Name] |
| YYYY-MM-DD | All action items completed | [Name] |
After saving, update the project-decisions index:
```bash
# Update README.md index (printf instead of echo so the blank line is emitted portably)
{
  printf '# Project Decisions\n\n'
  printf '| Date | Decision | Type | Status |\n'
  printf '|------|----------|------|--------|\n'
  for f in project-decisions/2*.md; do
    date=$(basename "$f" | cut -d'-' -f1-3)
    title=$(head -1 "$f" | sed 's/^# //; s/^Incident Report: //; s/^Scope Check: //; s/^Impact Analysis: //; s/^Tech Decision: //')
    case "$f" in
      *incident*) type="Incident Report" ;;
      *scope*)    type="Scope Check" ;;
      *impact*)   type="Impact Analysis" ;;
      *)          type="Tech Decision" ;;
    esac
    # The asterisks in "**Status:**" must be escaped for grep's basic regex
    status=$(grep -m1 '^\*\*Status:\*\*' "$f" | sed 's/^\*\*Status:\*\* *//')
    printf '| %s | [%s](./%s) | %s | %s |\n' "$date" "$title" "$(basename "$f")" "$type" "$status"
  done
} > project-decisions/README.md
```
## Adaptation Rules
- **Always save to file** — every incident report gets persisted in `project-decisions/`
- **Blameless** — never name individuals as the cause; focus on systems and processes
- **Be precise on the timeline** — minute-level accuracy matters for incidents
- **Scan git history** — check recent deploys, reverts, and code changes
- **Quantify impact** — "500 users affected for 45 minutes," not "some users had issues"
- **5 Whys minimum** — keep asking why until you reach a systemic cause
- **Include what went well** — incidents reveal strengths too
- **Action items must be specific** — "improve monitoring" is not actionable; "add an alert for error rate > 5% on /api/payments" is
- **Every action needs an owner** — unowned actions don't get done
- **Check for recurrence** — if this happened before, escalate systemic fixes
- **Include lucky breaks** — knowing what could have made it worse helps prioritize prevention
- **Scale to severity** — a SEV-4 gets one page; a SEV-1 gets the full treatment
## Summary
End every incident report with:
- **One-line summary** — what happened, in one sentence
- **Severity** — SEV level
- **Duration** — total incident time
- **Root cause** — one sentence
- **Action item count** — X immediate, Y short-term, Z medium-term
- **Recurring** — is this a pattern?
- **File saved** — confirm the document location