# Incident Management
Implement effective incident management processes including severity definitions, escalation matrices, war room procedures, and blameless post-mortem templates.
## When to Use
- Establishing incident management processes for production systems
- Defining severity levels and escalation procedures
- Running war rooms and coordinating incident response
- Conducting blameless post-incident reviews
- Building on-call schedules and notification workflows
- Meeting compliance requirements for incident response (SOC 2, HIPAA, PCI DSS)
## Severity Levels

```yaml
severity_definitions:
  SEV1_critical:
    impact: "Complete service outage or data breach affecting all/most customers"
    examples:
      - Production site completely down
      - Data breach confirmed or suspected
      - Complete loss of a critical business function
      - Security incident with active exploitation
    response_time: "Immediate (within 5 minutes)"
    update_frequency: "Every 15-30 minutes"
    who_is_paged: "On-call engineer, engineering manager, incident commander, executive on-call"
    communication: "Status page update, customer email, executive notification"
    resolution_target: "< 1 hour to mitigate"

  SEV2_major:
    impact: "Major feature broken or severe degradation affecting many customers"
    examples:
      - Key feature completely non-functional
      - Significant performance degradation (>5x latency)
      - Data processing pipeline completely stalled
      - Partial outage affecting a region or segment
    response_time: "Within 15 minutes"
    update_frequency: "Every 30-60 minutes"
    who_is_paged: "On-call engineer, engineering manager"
    communication: "Status page update if customer-facing"
    resolution_target: "< 4 hours to mitigate"

  SEV3_moderate:
    impact: "Minor feature impaired or degradation affecting some customers"
    examples:
      - Non-critical feature broken
      - Moderate performance degradation
      - Elevated error rate (below threshold for SEV2)
      - Single-customer impact on non-critical function
    response_time: "Within 1 hour during business hours"
    update_frequency: "Every 2-4 hours"
    who_is_paged: "On-call engineer"
    communication: "Internal only unless customer inquires"
    resolution_target: "< 1 business day"

  SEV4_low:
    impact: "Cosmetic issue, minor inconvenience, or non-customer-facing problem"
    examples:
      - UI cosmetic bug
      - Non-critical monitoring gap
      - Internal tool degradation
      - Documentation inaccuracy in production
    response_time: "Next business day"
    update_frequency: "As needed"
    who_is_paged: "None (ticket created)"
    communication: "None"
    resolution_target: "Within sprint planning cycle"
```
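Because the criteria above are concrete, they can be encoded so the on-call engineer gets a consistent starting point when declaring an incident. Below is a minimal sketch; the field names and the 25% customer-impact threshold are illustrative assumptions, not part of this skill, and the incident commander can always override the suggestion.

```python
from dataclasses import dataclass


@dataclass
class ImpactAssessment:
    # Illustrative fields; adapt to whatever signals your monitoring provides.
    full_outage: bool = False
    suspected_breach: bool = False
    major_feature_down: bool = False
    latency_multiplier: float = 1.0
    customers_affected_pct: float = 0.0  # assumed threshold below is illustrative
    customer_facing: bool = True


def suggest_severity(a: ImpactAssessment) -> str:
    """Suggest a starting severity based on the definitions above."""
    if a.full_outage or a.suspected_breach:
        return "SEV1"
    if a.major_feature_down or a.latency_multiplier >= 5 or a.customers_affected_pct >= 25:
        return "SEV2"
    if a.customer_facing:
        return "SEV3"
    return "SEV4"


print(suggest_severity(ImpactAssessment(major_feature_down=True)))  # SEV2
```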
## Escalation Matrix

```yaml
escalation_matrix:
  tier_1_on_call_engineer:
    reached_via: "PagerDuty / OpsGenie alert"
    responsibilities:
      - Acknowledge alert within 5 minutes
      - Assess severity and impact
      - Begin troubleshooting
      - Escalate to Tier 2 if unable to resolve within 30 minutes (SEV1/2)
    escalation_trigger: "Cannot resolve, needs additional expertise, or severity upgrade"

  tier_2_team_lead_or_sme:
    reached_via: "PagerDuty escalation or direct page"
    responsibilities:
      - Provide subject matter expertise
      - Assist with diagnosis and resolution
      - Coordinate with other teams if cross-service issue
      - Escalate to Tier 3 if broader coordination needed
    escalation_trigger: "Multi-service issue, needs executive decision, or customer-facing SEV1"

  tier_3_engineering_management:
    reached_via: "PagerDuty escalation or direct call"
    responsibilities:
      - Assign incident commander (if not already assigned)
      - Allocate additional resources
      - Make business decisions (feature disable, rollback, etc.)
      - Coordinate external communication
    escalation_trigger: "Business impact decision, extended outage, or PR/legal concern"

  tier_4_executive:
    reached_via: "Direct phone call"
    responsibilities:
      - Authorize extraordinary measures
      - Manage board/investor communication
      - Approve public statements
      - Engage external resources (vendors, consultants)
    escalation_trigger: "Major breach, extended SEV1, regulatory or legal implication"

time_based_escalation:
  sev1:
    "15 min no ack": "Re-page on-call + backup on-call"
    "30 min unresolved": "Page team lead"
    "1 hour unresolved": "Page engineering manager + executive on-call"
    "2 hours unresolved": "All-hands engineering involvement"
  sev2:
    "30 min no ack": "Re-page on-call + backup on-call"
    "1 hour unresolved": "Page team lead"
    "4 hours unresolved": "Page engineering manager"
```
## War Room Procedures

```yaml
war_room:
  activation: "Automatically for SEV1, on-demand for SEV2"

  setup:
    communication_channel:
      primary: "Dedicated Slack channel (#incident-YYYY-MM-DD-brief-name)"
      voice: "Zoom/Google Meet bridge (persistent link)"
      backup: "Phone conference bridge"
    channel_rules:
      - "Only incident-related communication in the channel"
      - "Use threads for side discussions"
      - "Prefix messages with role (IC:, COMMS:, ENG:)"

  roles:
    incident_commander:
      responsibilities:
        - Own the incident from declaration to resolution
        - Coordinate all response activities
        - Make decisions on response actions
        - Assign tasks to responders
        - Determine when the incident is resolved
        - Schedule post-mortem
      selection: "On-call IC roster, or senior engineer who declares the incident"
    communications_lead:
      responsibilities:
        - Draft and publish status page updates
        - Coordinate customer notifications
        - Handle internal stakeholder updates
        - Manage executive communication
        - Document timeline in real time
      selection: "Designated from on-call comms roster or engineering manager"
    technical_lead:
      responsibilities:
        - Lead technical diagnosis and troubleshooting
        - Coordinate technical responders
        - Recommend mitigation and resolution actions
        - Verify fix effectiveness
      selection: "Senior engineer with relevant system expertise"
    scribe:
      responsibilities:
        - Document all actions, decisions, and findings
        - Maintain real-time timeline
        - Record who did what and when
        - Capture screenshots and log excerpts
      selection: "Any available team member (can be rotated)"

  workflow:
    1_declare:
      - "IC declares incident with severity level"
      - "War room channel and bridge created"
      - "Roles assigned"
      - "First status update posted"
    2_assess:
      - "Determine scope and customer impact"
      - "Identify affected systems and services"
      - "Establish working hypothesis"
    3_mitigate:
      - "Focus on restoring service first, root cause second"
      - "IC approves all changes to production"
      - "Changes documented in real time"
      - "Rollback if mitigation makes things worse"
    4_resolve:
      - "Confirm service restored to normal"
      - "Verify monitoring shows healthy metrics"
      - "IC declares incident resolved"
      - "Final status page update"
    5_follow_up:
      - "Schedule post-mortem within 48 hours"
      - "Assign action items from immediate findings"
      - "Send internal summary"
```
## On-Call Configuration

```yaml
on_call_schedule:
  rotation_structure:
    primary:
      rotation: "Weekly"
      handoff: "Monday 10:00 AM local time"
      team_size: "Minimum 5 engineers in rotation"
    secondary:
      rotation: "Weekly (offset from primary)"
      activation: "If primary does not acknowledge within 10 minutes"

  expectations:
    response_time: "Acknowledge alert within 5 minutes"
    availability: "Reachable by phone and laptop within 15 minutes"
    handoff: "Document any ongoing issues during handoff"
    compensation: "Per company on-call compensation policy"

  health:
    max_consecutive_weeks: 2
    minimum_gap_between_rotations: "2 weeks"
    post_incident_rest: "If engaged for 4+ hours overnight, late start next day"
    burnout_monitoring: "Track pages per person per week, rebalance if needed"

pagerduty_configuration:
  escalation_policy:
    - level_1:
        target: "Primary on-call"
        timeout: "5 minutes"
    - level_2:
        target: "Secondary on-call"
        timeout: "10 minutes"
    - level_3:
        target: "Engineering manager"
        timeout: "15 minutes"
  notification_rules:
    high_urgency:
      - "Push notification immediately"
      - "Phone call after 1 minute"
      - "SMS after 2 minutes"
    low_urgency:
      - "Push notification"
      - "Email after 5 minutes"
```
## Post-Mortem Template

```markdown
# Post-Incident Review: [Incident Title]

**Date:** YYYY-MM-DD
**Severity:** SEV[1-4]
**Duration:** [Start time] to [End time] ([X hours Y minutes])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** Draft / In Review / Final

## Executive Summary

[2-3 sentence summary of what happened, the impact, and the resolution]

## Impact

- **Customer impact:** [Number/percentage of customers affected, what they experienced]
- **Duration of impact:** [How long customers were affected]
- **Revenue impact:** [Estimated financial impact, if applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA breaches]

## Timeline (all times UTC)

| Time | Event |
|------|-------|
| HH:MM | [First anomaly detected by monitoring] |
| HH:MM | [Alert fired / customer report received] |
| HH:MM | [On-call engineer acknowledged] |
| HH:MM | [Incident declared at SEV level] |
| HH:MM | [War room established] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Service restored] |
| HH:MM | [Incident resolved] |

## Root Cause

[Detailed technical explanation of what caused the incident]

## Detection

- **How was the incident detected?** [Monitoring alert / customer report / manual observation]
- **Time to detect:** [Time from first anomaly to detection]
- **Could we have detected sooner?** [Yes/No, with explanation]

## Response

- **What went well:**
  - [List things that worked effectively during response]
  - [E.g., "Runbook for database failover was accurate and followed successfully"]
  - [E.g., "Communication to customers was timely and clear"]
- **What could be improved:**
  - [List things that slowed or hindered response]
  - [E.g., "Took 20 minutes to identify the correct service owner"]
  - [E.g., "Monitoring did not alert on the specific failure mode"]

## Contributing Factors

[List all factors that contributed to the incident occurring or being worse than it could have been. This is not about blame - it is about understanding the system.]

1. [Factor 1: e.g., "Configuration change was not tested in staging"]
2. [Factor 2: e.g., "Alert threshold was too high to catch gradual degradation"]
3. [Factor 3: e.g., "No circuit breaker between Service A and Service B"]

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Preventive action] | [Name] | P1 | YYYY-MM-DD | Open |
| 2 | [Detection improvement] | [Name] | P2 | YYYY-MM-DD | Open |
| 3 | [Process improvement] | [Name] | P2 | YYYY-MM-DD | Open |
| 4 | [Runbook update] | [Name] | P3 | YYYY-MM-DD | Open |

## Lessons Learned

[Key takeaways that should be shared broadly]

## Appendix

- [Link to monitoring dashboards during incident]
- [Link to relevant log queries]
- [Link to war room channel archive]
```
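Pre-filling the header block from incident metadata removes one small source of delay in getting the review drafted. A small sketch that stamps the template above is shown below; the function and its parameters are assumptions for illustration, and times are treated as UTC.

```python
from datetime import datetime


def postmortem_skeleton(title: str, severity: int, started: datetime,
                        resolved: datetime, commander: str) -> str:
    """Return the pre-filled header block of the post-incident review."""
    total_minutes = int((resolved - started).total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return "\n".join([
        f"# Post-Incident Review: {title}",
        f"**Date:** {resolved:%Y-%m-%d}",
        f"**Severity:** SEV{severity}",
        f"**Duration:** {started:%H:%M} to {resolved:%H:%M} ({hours} hours {minutes} minutes)",
        f"**Incident Commander:** {commander}",
        "**Author:** [Name]",
        "**Status:** Draft",
    ])
```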
## Post-Mortem Process

```yaml
post_mortem_process:
  scheduling:
    sev1: "Within 48 hours of resolution"
    sev2: "Within 1 week of resolution"
    sev3: "Within 2 weeks (optional, based on learning potential)"
    sev4: "Not required"

  meeting_format:
    duration: "60-90 minutes"
    attendees:
      required: "IC, technical lead, scribe, involved engineers"
      optional: "Engineering manager, product manager, affected team leads"
    agenda:
      - "5 min: Review timeline and facts"
      - "15 min: Walk through root cause and contributing factors"
      - "15 min: Discuss what went well"
      - "15 min: Discuss what could be improved"
      - "15 min: Define and assign action items"
      - "5 min: Identify lessons learned and sharing plan"

  principles:
    - "Blameless: Focus on systems and processes, not individuals"
    - "Factual: Base discussion on data, logs, and observations"
    - "Forward-looking: Prioritize preventive actions over assigning fault"
    - "Complete: Address detection, response, and prevention"
    - "Actionable: Every finding should produce a tracked action item"

  action_item_tracking:
    - "All action items entered into issue tracker (Jira, GitHub Issues)"
    - "Priority assigned based on risk reduction potential"
    - "Owner assigned with due date"
    - "Reviewed in team standups and sprint planning"
    - "Tracked to completion"
    - "Monthly review of open post-mortem action items"
```
## Incident Metrics

```yaml
incident_metrics:
  mttr:
    name: "Mean Time to Resolve"
    definition: "Average time from incident detection to resolution"
    target: "SEV1: <1h, SEV2: <4h"
    trending: "Track monthly, aim for improvement"
  mttd:
    name: "Mean Time to Detect"
    definition: "Average time from incident start to detection"
    target: "< 5 minutes for SEV1/2"
    trending: "Monitors effectiveness of alerting"
  mtta:
    name: "Mean Time to Acknowledge"
    definition: "Average time from alert to engineer acknowledgment"
    target: "< 5 minutes"
    trending: "Monitors on-call responsiveness"
  incident_frequency:
    name: "Incidents per week/month by severity"
    target: "Trending downward"
    trending: "Monitors system reliability improvement"
  action_item_completion:
    name: "Post-mortem action item completion rate"
    target: "> 90% completed on time"
    trending: "Monitors follow-through on improvements"
  recurring_incidents:
    name: "Percentage of incidents with same root cause as a previous incident"
    target: "< 10%"
    trending: "Monitors effectiveness of preventive actions"
```
## Incident Management Checklist

```yaml
incident_management_checklist:
  process_setup:
    - "[ ] Severity levels defined with clear criteria"
    - "[ ] Escalation matrix documented"
    - "[ ] On-call schedule established and staffed"
    - "[ ] War room procedures documented"
    - "[ ] Post-mortem template created"
    - "[ ] Communication templates prepared (status page, email)"
    - "[ ] Incident management tool configured (PagerDuty, OpsGenie)"

  per_incident:
    - "[ ] Incident declared with severity level"
    - "[ ] War room established (SEV1/2)"
    - "[ ] Roles assigned (IC, comms, technical lead, scribe)"
    - "[ ] Timeline maintained in real time"
    - "[ ] Status page updated (customer-facing impact)"
    - "[ ] Stakeholders notified per communication plan"
    - "[ ] Resolution verified with monitoring"
    - "[ ] Post-mortem scheduled"
    - "[ ] Post-mortem conducted and published"
    - "[ ] Action items tracked to completion"

  compliance:
    - "[ ] All SEV1/2 incidents have post-mortems"
    - "[ ] Incident log maintained for audit evidence"
    - "[ ] Metrics reported monthly"
    - "[ ] On-call health monitored (pages per person)"
    - "[ ] Annual incident response training conducted"
    - "[ ] Annual incident response plan test completed"
```
## Best Practices
- Define severity levels with concrete examples so there is no ambiguity during an active incident
- Implement time-based escalation: if the on-call does not acknowledge, automatically escalate
- Focus on mitigation first, root cause second: restore service before investigating why it failed
- Run blameless post-mortems: the goal is to improve systems, not to assign fault to individuals
- Track post-mortem action items to completion: an unfinished action item means the same incident can recur
- Monitor incident metrics (MTTR, MTTD, frequency) as leading indicators of system reliability
- Protect on-call health: track page volume per person and redistribute if someone is overburdened
- Separate the incident commander role from the technical lead role in SEV1/2 incidents
- Practice incident response regularly with game days or chaos engineering exercises
- Archive incident records and post-mortems for compliance evidence and organizational learning