incident-response
# Incident Response Skill
In a crisis, follow the process. Panic is the enemy of resolution.
## Severity Levels
| Level | Definition | Response Time | Escalation |
|---|---|---|---|
| P1 | Service down, all users affected | Immediate (drop everything) | Notify leadership within 15 min |
| P2 | Major feature broken, no workaround | < 1 hour | Notify team lead |
| P3 | Feature degraded, workaround exists | < 4 hours | Normal channels |
| P4 | Minor issue, cosmetic or low-impact | < 24 hours | Next sprint |
When unsure: Triage UP (P2 → P1), not down. Downgrade after investigation.
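If these thresholds need to live in tooling (paging rules, SLA timers), the table translates directly into data. A minimal sketch; `SeverityPolicy`, `SEVERITY`, and `triage_up` are illustrative names, not part of any existing system:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response_time: Optional[timedelta]  # None means "immediate, drop everything"
    escalation: str

# The severity table above, encoded for tooling.
SEVERITY = {
    "P1": SeverityPolicy("Service down, all users affected", None, "Notify leadership within 15 min"),
    "P2": SeverityPolicy("Major feature broken, no workaround", timedelta(hours=1), "Notify team lead"),
    "P3": SeverityPolicy("Feature degraded, workaround exists", timedelta(hours=4), "Normal channels"),
    "P4": SeverityPolicy("Minor issue, cosmetic or low-impact", timedelta(hours=24), "Next sprint"),
}

def triage_up(level: str) -> str:
    """When unsure, bump one level toward P1; downgrade after investigation."""
    order = ["P1", "P2", "P3", "P4"]
    return order[max(order.index(level) - 1, 0)]
```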
## Response Phases
### 1. Detect — Know something is wrong
| Detection Source | Reliability | Action |
|---|---|---|
| Automated monitoring/alerts | High | Trust the data, verify scope |
| User reports (1-2) | Medium | Reproduce, check if isolated |
| User reports (many) | High | Treat as confirmed |
| Internal discovery | Medium | Check if it's already in production |
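Where detection feeds automation, the reliability column can gate whether a signal is auto-confirmed or needs a human to verify first. A rough sketch; the source names and the "many reports" threshold are assumptions, not a standard:

```python
def treat_as_confirmed(source: str, report_count: int = 1) -> bool:
    """Per the table above: monitoring and many user reports are high-signal;
    one or two reports, or internal discovery, need verification first."""
    if source == "monitoring":
        return True  # trust the data, then verify scope
    if source == "user_report":
        return report_count >= 3  # "many" is deliberately fuzzy; threshold is illustrative
    return False
```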
### 2. Triage — Assess impact and urgency
Answer these in order (each takes < 1 minute):
- Who is affected? All users / subset / internal only
- How many? Percentage or count
- What can't they do? Core function or edge case
- Is there a workaround? If yes, communicate it immediately
- What changed recently? Last deploy, config change, dependency update
- Severity level? Assign P1-P4 based on the answers above (a rough mapping is sketched below)
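The last question can be approximated from the earlier answers. A rough mapping; `assign_severity` is hypothetical and no substitute for judgment:

```python
def assign_severity(all_users_affected: bool,
                    core_function_blocked: bool,
                    workaround_exists: bool) -> str:
    """Map the triage answers above to a P-level. Thresholds are illustrative;
    when in doubt, triage up."""
    if all_users_affected and core_function_blocked:
        return "P1"
    if core_function_blocked and not workaround_exists:
        return "P2"
    if workaround_exists:
        return "P3"
    return "P4"

# Example: one region's checkout flow is broken with no workaround.
# assign_severity(all_users_affected=False, core_function_blocked=True,
#                 workaround_exists=False)  -> "P2"
```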
### 3. Resolve — Fix it (safely)
| Situation | Best Action | Why |
|---|---|---|
| Recent deploy caused it | Rollback first | Fastest path to recovery |
| Config change caused it | Revert config | No code deploy needed |
| Root cause is clear, fix is small | Hotfix | Fastest safe option when rollback isn't possible |
| Root cause unclear | Enable debug logging + check recent changes | Don't guess-and-deploy |
| Third-party dependency is down | Activate fallback or communicate ETA | You can't fix their service |
**Golden rule**: Restore service first, then investigate. Don't debug in production while users wait.
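The table can be read as a precedence order: external outage and recent changes first, hotfix only when the cause is clear, investigation otherwise. A sketch of that ordering; the function name and flags are illustrative:

```python
def resolution_plan(third_party_outage: bool,
                    recent_deploy: bool,
                    recent_config_change: bool,
                    root_cause_known: bool) -> str:
    """Pick the fastest safe path back to a working service, per the table above."""
    if third_party_outage:
        return "activate fallback, or communicate the vendor's ETA"
    if recent_deploy:
        return "roll back the deploy"
    if recent_config_change:
        return "revert the config change"
    if root_cause_known:
        return "ship a small hotfix"
    return "enable debug logging and review recent changes; don't guess-and-deploy"
```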
### 4. Review — Learn from it (post-mortem)
Do this within 48 hours while memory is fresh.
#### Post-Mortem Template
# Incident: [Brief Title]
**Date**: YYYY-MM-DD
**Severity**: P1/P2/P3/P4
**Duration**: X hours Y minutes
**Impact**: [Who was affected, what they couldn't do]
## Summary
One paragraph: what happened, impact, resolution.
## Timeline
| Time (UTC) | Event |
| ---------- | ----- |
| HH:MM | First alert / user report |
| HH:MM | Triage: determined P-level |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed / service restored |
| HH:MM | Confirmed resolved |
## Root Cause
[5 Whys analysis — see root-cause-analysis skill]
## What Went Well
- [Detection was fast because...]
- [Rollback was clean because...]
## What Went Wrong
- [Alert was noisy / missed because...]
- [Took X minutes to find the right person because...]
## Action Items
| Action | Owner | Due Date | Status |
| ------ | ----- | -------- | ------ |
| Add monitoring for X | @name | YYYY-MM-DD | Open |
| Write runbook for Y | @name | YYYY-MM-DD | Open |
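A small helper can stamp the header fields and compute the duration so every post-mortem starts out consistent. `new_postmortem` is a hypothetical convenience, not part of the process:

```python
from datetime import datetime, timezone

def new_postmortem(title: str, severity: str,
                   started: datetime, resolved: datetime) -> str:
    """Render the header block of the post-mortem template above."""
    minutes_total = int((resolved - started).total_seconds()) // 60
    hours, minutes = divmod(minutes_total, 60)
    return (
        f"# Incident: {title}\n"
        f"**Date**: {started.astimezone(timezone.utc).date().isoformat()}\n"
        f"**Severity**: {severity}\n"
        f"**Duration**: {hours} hours {minutes} minutes\n"
    )
```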
## Communication Templates
### To Users (Status Page / Email)
[Service] is experiencing [issues/downtime]. We are investigating and will update every [30 min / 1 hour]. Current workaround: [describe if applicable].
### To Leadership
Incident P[X]: [Service] [is down / degraded] since [time]. Impact: [N users / $X revenue]. ETA for resolution: [time / investigating]. Next update: [time].
### Resolution Announcement
Resolved: [Service] has been restored as of [time]. Root cause: [one sentence]. Full post-mortem to follow within 48 hours.
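If a bot or script posts these updates, the templates are plain string formatting. A sketch for the leadership message; the function name and arguments are placeholders:

```python
def leadership_update(severity: str, service: str, state: str, since: str,
                      impact: str, eta: str, next_update: str) -> str:
    """Fill the leadership template above; every argument is free text."""
    return (f"Incident {severity}: {service} {state} since {since}. "
            f"Impact: {impact}. ETA for resolution: {eta}. "
            f"Next update: {next_update}.")

# Example:
# leadership_update("P1", "Checkout", "is down", "14:05 UTC",
#                   "all users", "investigating", "14:35 UTC")
```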
## On-Call Handoff Checklist
- Active incidents (status + next steps)
- Recent deploys (last 24h, any risky changes)
- Known issues (monitoring gaps, flaky tests)
- Pending alerts (expected noise vs real signal)
- Escalation contacts (who to call at 3am)
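Teams that track handoffs in tooling sometimes formalize the checklist as a record. One possible shape; all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    """One on-call handoff; each field mirrors a checklist item above."""
    active_incidents: list = field(default_factory=list)     # status + next steps
    recent_deploys: list = field(default_factory=list)       # last 24h, risky changes flagged
    known_issues: list = field(default_factory=list)         # monitoring gaps, flaky tests
    pending_alerts: list = field(default_factory=list)       # expected noise vs real signal
    escalation_contacts: dict = field(default_factory=dict)  # role -> who to call at 3am
```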
## Synapses
See synapses.json for connections.