# Incident Response

## Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken, or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |

## Escalation Rules

- If a P2 is not resolved within 4 hours, escalate to P1.
- If a P1 is not resolved within 2 hours, escalate to P0.
- Any incident involving a data breach or security compromise is automatically P0.
- When in doubt, over-classify. It is better to downgrade than to under-respond.
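
The time-based rules above can be wired into on-call tooling. As a minimal sketch (assuming GNU `date`; the time format and function name are illustrative), the auto-escalation deadline can be computed like this:

```shell
#!/usr/bin/env bash
# Sketch: compute the auto-escalation deadline for an incident, following the
# rules above (P2 -> P1 after 4 hours, P1 -> P0 after 2 hours).
# Assumes GNU date; timestamps are UTC in ISO 8601.
escalation_deadline() {
  local sev="$1" declared="$2" window_secs
  case "$sev" in
    P1) window_secs=$((2 * 3600)) ;;  # P1 escalates to P0 after 2 hours
    P2) window_secs=$((4 * 3600)) ;;  # P2 escalates to P1 after 4 hours
    *)  echo "no auto-escalation"; return 0 ;;
  esac
  local declared_epoch
  declared_epoch=$(date -u -d "$declared" +%s)
  date -u -d "@$((declared_epoch + window_secs))" +"%Y-%m-%dT%H:%M:%SZ"
}

escalation_deadline P1 "2024-03-15T14:00:00Z"   # prints 2024-03-15T16:00:00Z
```

A real integration would fire a reminder in the incident channel when the deadline passes without resolution.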

## Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

### Step 1: Assess Impact

Answer these questions within the first 5 minutes:

- Who is affected? (All users, a subset, internal only)
- What is broken? (A specific feature, the entire service, data integrity)
- When did it start? (Check monitoring for the onset time)
- Is it getting worse? (Error rate trending up, stable, or recovering)

### Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

### Step 3: Assemble the Response Team

| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |

### Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates the response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to the IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents the timeline, actions taken, and decisions in real time. |

### Step 5: Open an Incident Channel

Create a dedicated communication channel (Slack channel or incident bridge call):

- Name it clearly: #incident-2024-03-15-payments-down
- Pin the initial assessment and severity level.
- Keep all discussion and decisions in this channel.
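
A consistent channel name can be generated mechanically. A small sketch (the topic string is illustrative; adapt to your chat tool's naming limits):

```shell
# Sketch: derive a channel name from today's UTC date and a topic slug.
topic="Payments Down"
slug=$(printf '%s' "$topic" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
channel="#incident-$(date -u +%F)-$slug"
echo "$channel"   # e.g. #incident-2024-03-15-payments-down
```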

## Communication Templates

### Initial Alert

```
INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.
```

### Status Update

```
INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.
```

### Resolution Notification

```
INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.
```

### Customer-Facing Communication

```
[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.
```

## Investigation Methodology

Follow this structured approach rather than randomly checking things. Work from symptoms toward the root cause.

### Step 1: Check Dashboards

Start with the service overview dashboard. Look for:

- Error rate spikes (when exactly did they start?)
- Latency increases (gradual degradation or a sudden jump?)
- Traffic anomalies (an unexpected spike or drop?)
- Resource utilization (CPU, memory, disk, connections)

### Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

```bash
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
```

Questions to answer:

- Was anything deployed in the last 2 hours?
- Was there a configuration change (feature flags, environment variables)?
- Was there an infrastructure change (scaling, migration, certificate renewal)?

### Step 3: Examine Logs

Search logs filtered to the incident timeframe:

```bash
# Example queries (adapt to your log aggregation platform)
# ELK/Kibana:  level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog:     service:order-service status:error
# CloudWatch:  filter @message like /ERROR/ | sort @timestamp desc
```

Look for:

- Error messages with stack traces.
- Repeated error patterns (the same error thousands of times).
- New error types that were not present before the incident.
- Correlation between errors and the timeline.
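
Once a log slice is downloaded, repeated error signatures can be ranked with standard tools. A self-contained sketch (the file path and messages below are fabricated for illustration):

```shell
# Sketch: rank repeated error signatures in a downloaded log slice.
# The file path and log lines are illustrative.
cat > /tmp/incident-slice.log <<'EOF'
ERROR NullPointerException in PaymentValidator.validate
ERROR NullPointerException in PaymentValidator.validate
ERROR Timeout calling payments-gateway
INFO  request completed
EOF

# Most frequent error lines first; a dominant pattern points at the culprit.
grep '^ERROR' /tmp/incident-slice.log | sort | uniq -c | sort -rn
```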

### Step 4: Hypothesize and Test

Based on the data gathered, form a hypothesis and test it:

| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare the error timeline with the deploy timestamp. Roll back and observe. |
| Database is overloaded | Check the connection pool, slow query log, lock contention. |
| External dependency is down | Check the dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check the request rate, compare to the normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify the SSL handshake. |
| Memory leak | Check the memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
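
For the traffic-spike hypothesis, a quick back-of-the-envelope check against baseline is often enough to confirm or rule it out. A sketch (both numbers are invented; in practice, pull them from monitoring):

```shell
# Sketch: compare the current request rate against a known baseline.
# Both values are illustrative placeholders.
baseline_rps=1200
current_rps=3400
deviation_pct=$(( (current_rps - baseline_rps) * 100 / baseline_rps ))
echo "request rate is ${deviation_pct}% above baseline"
```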

### Step 5: Verify the Fix

After applying a fix, confirm that:

- The error rate is returning to baseline.
- Latency is returning to normal.
- No new error patterns are appearing.
- The affected functionality has been manually verified.
- Monitoring has continued for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring the incident resolved.
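
The monitoring wait can be automated as a polling loop. A rough sketch, where `current_error_rate` is a hypothetical helper that would query your monitoring API (stubbed here so the sketch is self-contained):

```shell
# Sketch: poll until the error rate is back under threshold.
# `current_error_rate` is hypothetical; stubbed for illustration.
current_error_rate() { echo "0.004"; }

threshold="0.01"
resolved=""
for attempt in 1 2 3; do
  rate=$(current_error_rate)
  if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 < t + 0) }'; then
    resolved="yes"
    echo "error rate $rate is below $threshold"
    break
  fi
  sleep 60   # re-check each minute in a real loop
done
```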

## Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:

| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to the previous known-good version via the deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to already be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if the bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush the application cache (Redis FLUSHDB, CDN purge) | Temporary increase in origin load after the flush |
| Circuit breaker | Failing dependency cascading | Activate the circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable a maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote a replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to a backup region or provider | Propagation delay (use a low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during the restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through the deployment pipeline | Bypasses normal review; must be reviewed post-incident |

## Rollback Procedure

```bash
# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of the previous version
# (Adapt to your deployment pipeline. Example: Kubernetes)
kubectl rollout undo deployment/order-service

# Verify the rollback is deployed
kubectl rollout status deployment/order-service

# Monitor the error rate and confirm it is dropping
```

## Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.

### Post-Incident Review Template

```markdown
# Post-Incident Review: [Incident Title]

**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
```

## Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?", not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |

## Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

### Runbook Structure

Every runbook follows this structure:

````markdown
# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
```
Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

### 2. Apply Mitigation
```bash
# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service
```

### 3. Verify Resolution
Expected: Error rate drops below 0.01 within 5 minutes.

### 4. Escalation
If unresolved: escalate to [team], contact [channel/phone], provide [context].
````


### Runbook Best Practices

| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

## Incident Response Checklist

Quick reference during an active incident:

- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.