incident-response
Incident Response
When to Use
Activate this skill when:
- Production service is down or returning errors to users
- Error rate has spiked beyond normal thresholds
- Performance has degraded significantly (latency increase, timeouts)
- An alert has fired from the monitoring system
- Users are reporting issues that indicate a systemic problem
- A failed deployment needs investigation and remediation
- Conducting a post-mortem or root cause analysis after an incident
Output: Write runbooks to docs/runbooks/<service>-runbook.md and post-mortems to postmortem-YYYY-MM-DD.md.
Do NOT use this skill for:
- Setting up monitoring or alerting rules (use monitoring-setup)
- Performing routine deployments (use deployment-pipeline)
- Docker image or infrastructure issues (use docker-best-practices)
- Feature development or code changes (use python-backend-expert or react-frontend-expert)
Instructions
Severity Classification
Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Impact | Examples | Response Time | Update Cadence |
|---|---|---|---|---|
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |
Escalation rules:
- SEV1: Page on-call engineer + engineering manager immediately
- SEV2: Page on-call engineer, notify engineering manager
- SEV3: Notify on-call engineer via Slack
- SEV4: Create ticket, address during normal working hours
See references/escalation-contacts.md for the contact matrix.
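For SEV1/SEV2, the page itself can be triggered from the command line. A minimal sketch, assuming PagerDuty Events API v2 with a routing key stored in $PAGERDUTY_ROUTING_KEY (swap in your own paging provider if different):
```bash
# Trigger a page via the PagerDuty Events API v2 (routing key and summary text are assumptions).
curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d "{
    \"routing_key\": \"${PAGERDUTY_ROUTING_KEY}\",
    \"event_action\": \"trigger\",
    \"payload\": {
      \"summary\": \"SEV1: backend returning 5xx to all users\",
      \"source\": \"incident-response\",
      \"severity\": \"critical\"
    }
  }"
```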
5-Minute Triage Workflow
When an incident is detected, follow this triage workflow within the first 5 minutes.
┌─────────────────────────────────────────────────────────┐
│ MINUTE 0-1: Acknowledge and Classify │
│ • Acknowledge the alert or report │
│ • Assign severity (SEV1-SEV4) │
│ • Designate incident commander │
├─────────────────────────────────────────────────────────┤
│ MINUTE 1-2: Assess Scope │
│ • Check health endpoints for all services │
│ • Check error rate and latency dashboards │
│ • Determine: which services are affected? │
├─────────────────────────────────────────────────────────┤
│ MINUTE 2-3: Identify Recent Changes │
│ • Check: was there a recent deployment? │
│ • Check: any infrastructure changes? │
│ • Check: any external dependency issues? │
├─────────────────────────────────────────────────────────┤
│ MINUTE 3-4: Initial Communication │
│ • Post in #incidents channel │
│ • Update status page if SEV1/SEV2 │
│ • Page additional responders if needed │
├─────────────────────────────────────────────────────────┤
│ MINUTE 4-5: Begin Investigation or Mitigate │
│ • If recent deploy: consider immediate rollback │
│ • If not deploy-related: begin diagnostic commands │
│ • Start incident timeline log │
└─────────────────────────────────────────────────────────┘
Quick health check command:
./skills/incident-response/scripts/health-check-all-services.sh \
--output-dir ./incident-triage/
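If the script is not available on the machine you are triaging from, here is a minimal sketch of the same sweep, assuming the container names and endpoints used elsewhere in this document:
```bash
# Quick manual health sweep: container state plus HTTP readiness probes.
# Container names, ports, and endpoints are assumptions -- adjust to your stack.
mkdir -p ./incident-triage
{
  echo "== $(date -u +%FT%TZ) =="
  docker ps --format 'table {{.Names}}\t{{.Status}}'
  for url in http://localhost:8000/health/ready http://localhost:8000/health/live; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    echo "$url -> ${code:-000}"
  done
} | tee -a ./incident-triage/health-check.log
```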
Incident Commander Role
The incident commander (IC) coordinates the response. They do NOT investigate directly.
IC responsibilities:
- Coordinate -- Assign tasks to responders, prevent duplicate work
- Communicate -- Post regular updates to stakeholders
- Decide -- Make go/no-go decisions on rollback, escalation, communication
- Track -- Maintain the incident timeline
- Close -- Declare the incident resolved and schedule the post-mortem
IC communication template (initial):
INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]
IC communication template (update):
INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
Investigation Steps
Follow these diagnostic steps based on the type of issue.
Application Errors (FastAPI)
# 1. Check application logs for errors
./skills/incident-response/scripts/fetch-logs.sh \
--service backend \
--since "15 minutes ago" \
--output-dir ./incident-logs/
# 2. Check error rate from logs
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"
# 3. Check the readiness endpoint (reports dependency and connection status)
curl -s http://localhost:8000/health/ready | jq .
# 4. Check if the issue is in a specific endpoint
docker logs app-backend --since 15m 2>&1 | \
grep "ERROR" | \
grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn
# 5. Check Python process status
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
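The raw error count in step 2 is more meaningful as a rate. A sketch, assuming one structured log line per request with the same "path" field used in step 4:
```bash
# Approximate error rate over the last 15 minutes.
# Assumes JSON access logs with a "path" field per request -- adjust patterns to your log format.
total=$(docker logs app-backend --since 15m 2>&1 | grep -c '"path":')
errors=$(docker logs app-backend --since 15m 2>&1 | grep -c "ERROR")
awk -v e="$errors" -v t="$total" 'BEGIN { if (t > 0) printf "error rate: %.1f%% (%d/%d)\n", 100 * e / t, e, t; else print "no requests logged" }'
```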
Database Issues (PostgreSQL)
# 1. Check database connectivity
docker exec app-db pg_isready -U postgres
# 2. Check active connections (connection pool exhaustion?)
docker exec app-db psql -U postgres -d app_prod -c "
SELECT count(*), state FROM pg_stat_activity
GROUP BY state ORDER BY count DESC;
"
# 3. Check for long-running queries (locks, deadlocks?)
docker exec app-db psql -U postgres -d app_prod -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
AND state != 'idle'
ORDER BY duration DESC;
"
# 4. Check for lock contention
docker exec app-db psql -U postgres -d app_prod -c "
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"
# 5. Check disk space
docker exec app-db df -h /var/lib/postgresql/data
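If the lock-contention query in step 4 surfaces a blocking PID, try a graceful cancel before terminating the connection outright (terminating rolls back its transaction and drops the connection):
```bash
# Replace <pid> with the blocking_pid reported by the lock-contention query above.
# pg_cancel_backend cancels the running query; pg_terminate_backend kills the whole connection.
docker exec app-db psql -U postgres -d app_prod -c "SELECT pg_cancel_backend(<pid>);"
# Only if the cancel does not release the lock:
docker exec app-db psql -U postgres -d app_prod -c "SELECT pg_terminate_backend(<pid>);"
```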
Redis Issues
# 1. Check Redis connectivity
docker exec app-redis redis-cli ping
# 2. Check memory usage
docker exec app-redis redis-cli info memory | grep used_memory_human
# 3. Check connected clients
docker exec app-redis redis-cli info clients | grep connected_clients
# 4. Check slow log
docker exec app-redis redis-cli slowlog get 10
# 5. Check keyspace
docker exec app-redis redis-cli info keyspace
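If memory usage looks high, also check the configured limit and eviction policy; with maxmemory-policy set to noeviction, a full Redis starts rejecting writes instead of evicting keys:
```bash
# 6. Check the memory limit and eviction policy
docker exec app-redis redis-cli config get maxmemory
docker exec app-redis redis-cli config get maxmemory-policy
# 7. Check hit rate and evictions
docker exec app-redis redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"
```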
Network and Infrastructure
# 1. Check DNS resolution
nslookup api.example.com
# 2. Check SSL certificate expiry
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# 3. Check container resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# 4. Check disk space on host
df -h /
# 5. Check if dependent services are reachable
curl -sf https://external-api.example.com/health || echo "External API unreachable"
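For latency-related incidents, curl's timing variables show where the time goes (DNS lookup, TCP/TLS connect, or server processing):
```bash
# 6. Break down request latency: DNS, TCP connect, TLS handshake, time-to-first-byte, total
curl -o /dev/null -s -w 'dns:%{time_namelookup}s connect:%{time_connect}s tls:%{time_appconnect}s ttfb:%{time_starttransfer}s total:%{time_total}s\n' \
  https://api.example.com/health
```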
Remediation Actions
Immediate Mitigations (apply within minutes)
| Issue | Mitigation | Command |
|---|---|---|
| Bad deployment | Rollback | ./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/ |
| Connection pool exhausted | Restart backend | docker restart app-backend |
| Long-running query | Kill query | SELECT pg_terminate_backend(<pid>); |
| Memory leak | Restart service | docker restart app-backend |
| Redis full | Flush non-critical keys | redis-cli --scan --pattern "cache:*" \| xargs redis-cli del |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | docker system prune -f |
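The Redis mitigation in the table assumes direct redis-cli access. Inside the containerized stack, a batched, non-blocking variant is safer for large keyspaces; the cache:* pattern is an assumption about your key naming:
```bash
# Delete cache keys in batches; UNLINK (Redis 4+) frees memory asynchronously instead of blocking.
docker exec app-redis sh -c 'redis-cli --scan --pattern "cache:*" | xargs -r -n 100 redis-cli unlink'
```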
Longer-Term Fixes (apply after stabilization)
- Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
- Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
- Add tests -- Write regression tests for the failure scenario
- Update runbooks -- Document the new failure mode and remediation steps
Communication Protocol
Internal Communication
Channels:
- #incidents -- Active incident coordination (SEV1/SEV2)
- #incidents-low -- SEV3/SEV4 tracking
- #engineering -- Post-incident summaries
Rules:
- All communication happens in the designated incident channel
- Use threads for investigation details, keep main channel for status updates
- IC posts updates at the defined cadence (see severity table)
- Tag relevant people explicitly, do not assume they are watching
- Timestamp all significant findings and actions
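Scheduled updates can also be posted from the command line. A sketch, assuming a Slack incoming-webhook URL stored in $SLACK_INCIDENT_WEBHOOK:
```bash
# Post an IC status update to the incident channel (webhook URL and message text are placeholders).
curl -s -X POST "$SLACK_INCIDENT_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d '{"text": "INCIDENT UPDATE: [Title] | SEV2 | Status: Mitigating | Next update: 15:30 UTC"}'
```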
External Communication (SEV1/SEV2)
Status page update template:
[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].
[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].
[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.
Post-Mortem / RCA Framework
Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.
See references/post-mortem-template.md for the full template.
Post-mortem principles:
- Blameless -- Focus on systems and processes, not individuals
- Thorough -- Identify all contributing factors, not just the trigger
- Actionable -- Every finding must produce a concrete action item with an owner
- Timely -- Conduct within 48 hours while details are fresh
- Shared -- Publish to the entire engineering team
Post-mortem structure:
- Summary -- What happened, when, and what was the impact
- Timeline -- Minute-by-minute account of detection, investigation, mitigation
- Root cause -- The fundamental reason the incident occurred
- Contributing factors -- Other conditions that made the incident worse
- What went well -- Effective parts of the response
- What could be improved -- Gaps in detection, response, or tooling
- Action items -- Specific tasks with owners and due dates
Five Whys technique for root cause analysis:
Why did users see 500 errors?
-> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
-> Because database connections timed out.
Why did database connections time out?
-> Because the connection pool was exhausted.
Why was the connection pool exhausted?
-> Because a new endpoint opened connections without releasing them.
Why were connections not released?
-> Because the endpoint was missing the async context manager for sessions.
Root cause: Missing async context manager for database sessions in new endpoint.
Generate a structured incident report:
python skills/incident-response/scripts/generate-incident-report.py \
--title "Database connection pool exhaustion" \
--severity SEV2 \
--start-time "2024-01-15T14:30:00Z" \
--end-time "2024-01-15T15:15:00Z" \
--output-dir ./post-mortems/
Incident Response Scripts
| Script | Purpose | Usage |
|---|---|---|
| scripts/fetch-logs.sh | Fetch recent logs from services | ./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/ |
| scripts/health-check-all-services.sh | Check health of all services | ./scripts/health-check-all-services.sh --output-dir ./health/ |
| scripts/generate-incident-report.py | Generate structured incident report | python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/ |
Quick Reference: Common Incident Patterns
| Pattern | Symptom | Likely Cause | First Action |
|---|---|---|---|
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check docker ps, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check docker stats, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |
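To confirm the "error spike after deploy" pattern, compare when the backend container last started with when errors began:
```bash
# When did the backend container last (re)start?
docker inspect --format '{{.State.StartedAt}}' app-backend
# When did errors begin? First ERROR line in the last hour.
docker logs app-backend --since 1h 2>&1 | grep "ERROR" | head -n 1
```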
Output Files
Runbooks: Write to docs/runbooks/<service>-runbook.md:
# Runbook: [Service Name]
## Service Overview
- Purpose, dependencies, critical paths
## Common Issues
### Issue 1: [Description]
- **Symptoms:** [What you see]
- **Diagnosis:** [Commands to run]
- **Resolution:** [Steps to fix]
## Escalation
- On-call: #ops-oncall
- Service owner: @team-name
Post-mortems: Write to postmortem-YYYY-MM-DD.md:
# Post-Mortem: [Incident Title]
## Summary
- **Date:** YYYY-MM-DD
- **Severity:** SEV1-4
- **Duration:** X hours
- **Impact:** [Users/revenue affected]
## Timeline
- HH:MM - [Event]
## Root Cause
[Technical explanation]
## Action Items
- [ ] [Preventive measure] - Owner: @name - Due: YYYY-MM-DD