Incident Management

Provides end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

When to Use This Skill

Apply this skill when:

  • Setting up incident response processes for a team
  • Designing on-call rotations and escalation policies
  • Creating runbooks for common failure scenarios
  • Conducting blameless post-mortems after incidents
  • Implementing incident communication protocols (internal and external)
  • Choosing incident management tooling and platforms
  • Improving MTTR and incident frequency metrics

Core Principles

Incident Management Philosophy

Declare Early and Often: Do not wait for certainty. Declaring early enables coordination and prevents a delayed response; severity can always be downgraded later.

Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable the feature, fail over). Debug and fix the root cause after stability is restored.

Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

Clear Command Structure: Assign an Incident Commander (IC) to own coordination. The IC delegates tasks but does not do hands-on debugging.

Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.

Severity Classification

Standard severity levels with response times:

SEV0 (P0) - Critical Outage:

  • Impact: Complete service outage, critical data loss, payment processing down
  • Response: Page immediately 24/7, all hands on deck, executive notification
  • Example: API completely down, entire customer base affected

SEV1 (P1) - Major Degradation:

  • Impact: Major functionality degraded, significant customer subset affected
  • Response: Page during business hours, escalate off-hours, IC assigned
  • Example: 15% error rate, critical feature unavailable

SEV2 (P2) - Minor Issues:

  • Impact: Minor functionality impaired, edge case bug, small user subset
  • Response: Email/Slack alert, next business day response
  • Example: UI glitch, non-critical feature slow

SEV3 (P3) - Low Impact:

  • Impact: Cosmetic issues, no customer functionality affected
  • Response: Ticket queue, planned sprint
  • Example: Visual inconsistency, documentation error

For detailed severity decision framework and interactive classifier, see references/severity-classification.md.
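
The severity table above can be captured as a small lookup structure so alerting and routing code stays consistent with the definitions. This is a minimal sketch; the field names and values are illustrative, not a real platform API.

```python
# Hypothetical mapping of severity levels to response policy, following the
# SEV0-SEV3 definitions above. Field names are illustrative.
SEVERITY_POLICY = {
    "SEV0": {"page": "immediately_24x7", "notify_exec": True,  "response": "all hands on deck"},
    "SEV1": {"page": "business_hours",   "notify_exec": False, "response": "IC assigned"},
    "SEV2": {"page": None,               "notify_exec": False, "response": "next business day"},
    "SEV3": {"page": None,               "notify_exec": False, "response": "ticket queue"},
}

def response_policy(severity: str) -> dict:
    """Return the response policy for a severity level; reject unknown levels."""
    try:
        return SEVERITY_POLICY[severity]
    except KeyError:
        raise ValueError(f"Unknown severity: {severity!r}")
```

Keeping the table in one place means the on-call tooling and documentation cannot drift apart.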

Incident Roles

Incident Commander (IC):

  • Owns overall incident response and coordination
  • Makes strategic decisions (rollback vs. debug, when to escalate)
  • Delegates tasks to responders (does NOT do hands-on debugging)
  • Declares incident resolved when stability confirmed

Communications Lead:

  • Posts status updates to internal and external channels
  • Coordinates with stakeholders (executives, product, support)
  • Drafts post-incident customer communication
  • Cadence: Every 15-30 minutes for SEV0/SEV1

Subject Matter Experts (SMEs):

  • Hands-on debugging and mitigation
  • Execute runbooks and implement fixes
  • Provide technical context to IC

Scribe:

  • Documents timeline, actions, decisions in real-time
  • Records incident notes for post-mortem reconstruction

Assign roles based on severity:

  • SEV2/SEV3: Single responder
  • SEV1: IC + SME(s)
  • SEV0: IC + Communications Lead + SME(s) + Scribe
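
The severity-based staffing above can be expressed as a simple helper, useful when automating incident creation. A minimal sketch with illustrative role names:

```python
def roles_for(severity: str) -> list[str]:
    """Roles to staff per the severity-based assignment above (illustrative)."""
    if severity == "SEV0":
        return ["Incident Commander", "Communications Lead", "SME", "Scribe"]
    if severity == "SEV1":
        return ["Incident Commander", "SME"]
    if severity in ("SEV2", "SEV3"):
        return ["Responder"]
    raise ValueError(f"Unknown severity: {severity!r}")
```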

For detailed role responsibilities, see references/incident-roles.md.

On-Call Management

Rotation Patterns

Primary + Secondary:

  • Primary: First responder
  • Secondary: Backup if primary doesn't ack within 5 minutes
  • Rotation length: 1 week (optimal balance)

Follow-the-Sun (24/7):

  • Team A: US hours, Team B: Europe hours, Team C: Asia hours
  • Benefit: No night shifts, improved work-life balance
  • Requires: Multiple global teams

Tiered Escalation:

  • Tier 1: Junior on-call (common issues, runbook-driven)
  • Tier 2: Senior on-call (complex troubleshooting)
  • Tier 3: Team lead/architect (critical decisions)

Best Practices

  • Rotation length: 1 week per rotation
  • Handoff ceremony: 30-minute call to discuss active issues
  • Compensation: On-call stipend + time off after major incidents
  • Tooling: PagerDuty, Opsgenie, or incident.io
  • Limits: Max 2-3 pages per night; escalate if exceeded

Incident Response Workflow

Standard incident lifecycle:

Detection → Triage → Declaration → Investigation →
Mitigation → Resolution → Monitoring → Closure →
Post-Mortem (within 48 hours)

Key Decision Points

When to Declare: When in doubt, declare (can always downgrade severity)

When to Escalate:

  • No progress after 30 minutes
  • Severity increases (SEV2 → SEV1)
  • Specialized expertise needed

When to Close:

  • Issue resolved and stable for 30+ minutes
  • Monitoring shows all metrics at baseline
  • No customer-reported issues
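
The closure criteria above can be checked mechanically before an IC declares the incident resolved. A minimal sketch; the 30-minute threshold comes from the list above, the inputs are illustrative:

```python
def ready_to_close(stable_minutes: float,
                   metrics_at_baseline: bool,
                   open_customer_reports: int) -> bool:
    """Apply the three closure criteria: stability, baseline metrics, no reports."""
    return (stable_minutes >= 30
            and metrics_at_baseline
            and open_customer_reports == 0)
```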

For complete workflow details, see references/incident-workflow.md.

Communication Protocols

Internal Communication

Incident Slack Channel:

  • Format: #incident-YYYY-MM-DD-topic-description
  • Pin: Severity, IC name, status update template, runbook links

War Room: Video call for SEV0/SEV1 requiring real-time voice coordination

Status Update Cadence:

  • SEV0: Every 15 minutes
  • SEV1: Every 30 minutes
  • SEV2: Every 1-2 hours or at major milestones

External Communication

Status Page:

  • Tools: Statuspage.io, Instatus, custom
  • Stages: Investigating → Identified → Monitoring → Resolved
  • Transparency: Acknowledge issue publicly, provide ETAs when possible
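
The four status-page stages form a one-way lifecycle, which is worth enforcing so an incident never moves backwards in public view. A minimal sketch; allowing Investigating to jump straight to Resolved (e.g. a false alarm) is an assumption, not something every tool permits:

```python
# Allowed forward transitions for the Investigating → Identified → Monitoring
# → Resolved lifecycle described above.
ALLOWED_TRANSITIONS = {
    "Investigating": {"Identified", "Monitoring", "Resolved"},
    "Identified":    {"Monitoring", "Resolved"},
    "Monitoring":    {"Resolved"},
    "Resolved":      set(),
}

def advance_status(current: str, new: str) -> str:
    """Move the public status forward, rejecting backwards transitions."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move status from {current} to {new}")
    return new
```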

Customer Email:

  • When: SEV0/SEV1 affecting customers
  • Timing: Within 1 hour (acknowledge), post-resolution (full details)
  • Tone: Apologetic, transparent, action-oriented

Regulatory Notifications:

  • Data Breach: GDPR requires notification within 72 hours
  • Financial Services: Immediate notification to regulators
  • Healthcare: HIPAA breach notification rules

For communication templates, see examples/communication-templates.md.

Runbooks and Playbooks

Runbook Structure

Every runbook should include:

  1. Trigger: Alert conditions that activate this runbook
  2. Severity: Expected severity level
  3. Prerequisites: System state requirements
  4. Steps: Numbered, executable commands (copy-pasteable)
  5. Verification: How to confirm fix worked
  6. Rollback: How to undo if steps fail
  7. Owner: Team/person responsible
  8. Last Updated: Date of last revision
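
The eight required sections above are easy to lint for in CI, so incomplete runbooks never reach the on-call engineer. A minimal sketch using a naive substring check; a real linter would parse headings:

```python
REQUIRED_SECTIONS = [
    "Trigger", "Severity", "Prerequisites", "Steps",
    "Verification", "Rollback", "Owner", "Last Updated",
]

def missing_sections(runbook_text: str) -> list[str]:
    """Return required runbook sections absent from the document text."""
    return [section for section in REQUIRED_SECTIONS
            if section not in runbook_text]
```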

Best Practices

  • Executable: Commands copy-pasteable, not just descriptions
  • Tested: Run during disaster recovery drills
  • Versioned: Track changes in Git
  • Linked: Reference from alert definitions
  • Automated: Convert manual steps to scripts over time

For runbook templates, see examples/runbooks/ directory.

Blameless Post-Mortems

Blameless Culture Tenets

Assume Good Intentions: Everyone made the best decision with information available.

Focus on Systems: Investigate how processes failed, not who failed.

Psychological Safety: Create environment where honesty is rewarded.

Learning Opportunity: Incidents are gifts of organizational knowledge.

Post-Mortem Process

1. Schedule Review (Within 48 Hours): While memory is fresh

2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document

3. Meeting Facilitation:

  • Timeline walkthrough
  • 5 Whys Analysis to identify systemic root causes
  • What Went Well / What Went Wrong
  • Define action items with owners and due dates

4. Post-Mortem Document:

  • Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
  • Distribution: Engineering, product, support, leadership
  • Storage: Archive in searchable knowledge base

5. Follow-Up: Track action items in sprint planning

For detailed facilitation guide and template, see references/blameless-postmortems.md and examples/postmortem-template.md.

Alert Design Principles

Actionable Alerts Only:

  • Every alert requires human action
  • Include graphs, runbook links, recent changes
  • Deduplicate related alerts
  • Route to appropriate team based on service ownership

Preventing Alert Fatigue:

  • Audit alerts quarterly: Remove non-actionable alerts
  • Increase thresholds for noisy metrics
  • Use anomaly detection instead of static thresholds
  • Limit: Max 2-3 pages per night
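
Deduplication, mentioned under both sections above, collapses alerts that share a grouping key into a single page with a count. A minimal sketch; grouping by (service, name) is an assumption, and real tools let you configure the key:

```python
def deduplicate(alerts: list[dict]) -> list[dict]:
    """Collapse alerts sharing a (service, name) key into one alert with a count."""
    groups: dict[tuple, dict] = {}
    for alert in alerts:
        key = (alert["service"], alert["name"])
        if key in groups:
            groups[key]["count"] += 1  # same incident, just bump the counter
        else:
            groups[key] = {**alert, "count": 1}
    return list(groups.values())
```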

Tool Selection

Incident Management Platforms

PagerDuty:

  • Best for: Established enterprises, complex escalation policies
  • Cost: $19-41/user/month
  • When: Team size 10+, budget $500+/month

Opsgenie:

  • Best for: Atlassian ecosystem users, flexible routing
  • Cost: $9-29/user/month
  • When: Using Atlassian products, budget $200-500/month

incident.io:

  • Best for: Modern teams, AI-powered response, Slack-native
  • When: Team size 5-50, Slack-centric culture

For detailed tool comparison, see references/tool-comparison.md.

Status Page Solutions

  • Statuspage.io: Most trusted, easy setup ($29-399/month)
  • Instatus: Budget-friendly, modern design ($19-99/month)

Metrics and Continuous Improvement

Key Incident Metrics

MTTA (Mean Time To Acknowledge):

  • Target: < 5 minutes for SEV1
  • Improvement: Better on-call coverage

MTTR (Mean Time To Recovery):

  • Target: < 1 hour for SEV1
  • Improvement: Runbooks, automation

MTBF (Mean Time Between Failures):

  • Target: > 30 days for critical services
  • Improvement: Root cause fixes

Incident Frequency:

  • Track: SEV0, SEV1, SEV2 counts per month
  • Target: Downward trend

Action Item Completion Rate:

  • Target: > 90%
  • Improvement: Sprint integration, ownership clarity
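
MTTA and MTTR are straightforward to compute from incident timestamps. A minimal sketch, assuming hypothetical field names (alerted_at, acknowledged_at, recovered_at) on each incident record:

```python
from datetime import datetime, timedelta

def _mean_minutes(deltas: list[timedelta]) -> float:
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

def mtta(incidents: list[dict]) -> float:
    """Mean minutes from alert to acknowledgement."""
    return _mean_minutes([i["acknowledged_at"] - i["alerted_at"] for i in incidents])

def mttr(incidents: list[dict]) -> float:
    """Mean minutes from alert to recovery."""
    return _mean_minutes([i["recovered_at"] - i["alerted_at"] for i in incidents])
```

Trending these per month (alongside incident counts by severity) gives the downward-trend targets above something concrete to measure.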

Continuous Improvement Loop

Incident → Post-Mortem → Action Items → Prevention
   ↑                                          ↓
   └──────────── Fewer Incidents ─────────────┘

Decision Frameworks

Severity Classification Decision Tree

Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO  → Is major functionality degraded?
          ├─ YES → Is there a workaround?
          │        ├─ YES → SEV1
          │        └─ NO  → SEV0
          └─ NO  → Are customers impacted?
                   ├─ YES → SEV2
                   └─ NO  → SEV3

Use interactive classifier: python scripts/classify-severity.py
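
The decision tree above translates directly into a pure function, which is roughly what the interactive classifier asks. A minimal sketch mirroring the tree's four questions:

```python
def classify_severity(prod_down_or_data_risk: bool,
                      major_degradation: bool,
                      workaround_exists: bool,
                      customers_impacted: bool) -> str:
    """Mirror the severity decision tree above."""
    if prod_down_or_data_risk:
        return "SEV0"
    if major_degradation:
        # Degradation with no workaround is treated as a full outage.
        return "SEV1" if workaround_exists else "SEV0"
    return "SEV2" if customers_impacted else "SEV3"
```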

Escalation Matrix

For detailed escalation guidance, see references/escalation-matrix.md.

Mitigation vs. Root Cause

Prioritize Mitigation When:

  • Active customer impact ongoing
  • Quick fix available (rollback, disable feature)

Prioritize Root Cause When:

  • Customer impact already mitigated
  • Fix requires careful analysis

Default: Mitigation first (99% of cases)

Anti-Patterns to Avoid

  • Delayed Declaration: Waiting for certainty before declaring incident
  • Skipping Post-Mortems: "Small" incidents still provide learning
  • Blame Culture: Punishing individuals prevents systemic learning
  • Ignoring Action Items: Post-mortems without follow-through waste time
  • No Clear IC: Multiple people leading creates confusion
  • Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
  • Hands-On IC: IC should delegate debugging, not do it themselves

Implementation Checklist

Phase 1: Foundation (Week 1)

  • Define severity levels (SEV0-SEV3)
  • Choose incident management platform
  • Set up basic on-call rotation
  • Create incident Slack channel template

Phase 2: Processes (Weeks 2-3)

  • Create first 5 runbooks for common incidents
  • Set up status page
  • Train team on incident response
  • Conduct tabletop exercise

Phase 3: Culture (Weeks 4+)

  • Conduct first blameless post-mortem
  • Establish post-mortem cadence
  • Implement MTTA/MTTR dashboards
  • Track action items in sprint planning

Phase 4: Optimization (Months 3-6)

  • Automate incident declaration
  • Implement runbook automation
  • Monthly disaster recovery drills
  • Quarterly incident trend reviews

Integration with Other Skills

Observability: Monitoring alerts trigger incidents → Use incident-management for response

Disaster Recovery: DR provides recovery procedures → Incident-management provides operational response

Security Incident Response: Similar process with added compliance/forensics

Infrastructure-as-Code: IaC enables fast recovery via automated rebuild

Performance Engineering: Performance incidents trigger response → Performance team investigates post-mitigation

Examples and Templates

Runbook Templates:

  • examples/runbooks/database-failover.md
  • examples/runbooks/cache-invalidation.md
  • examples/runbooks/ddos-mitigation.md

Post-Mortem Template:

  • examples/postmortem-template.md - Complete blameless post-mortem structure

Communication Templates:

  • examples/communication-templates.md - Status updates, customer emails

On-Call Handoff:

  • examples/oncall-handoff-template.md - Weekly handoff format

Integration Scripts:

  • examples/integrations/pagerduty-slack.py
  • examples/integrations/statuspage-auto-update.py
  • examples/integrations/postmortem-generator.py

Scripts

Interactive Severity Classifier:

python scripts/classify-severity.py

Asks questions to determine appropriate severity level based on impact and urgency.

Further Reading

Books:

  • Google SRE Book: "Postmortem Culture" (Chapter 15)
  • "The Phoenix Project" by Gene Kim
  • "Site Reliability Engineering" (Full book)

Online Resources:

  • Atlassian: "How to Run a Blameless Postmortem"
  • PagerDuty: "Incident Response Guide"
  • Google SRE: "Postmortem Culture: Learning from Failure"

Standards:

  • Incident Command System (ICS) - FEMA standard adapted for tech
  • ITIL Incident Management - Traditional IT service management