skills/borghei/claude-skills/incident-commander

incident-commander

Installation
SKILL.md

Incident Commander

The agent classifies incident severity, reconstructs timelines from heterogeneous event sources, and generates structured post-incident reviews with root cause analysis and action items.


Quick Start

# Classify an incident (JSON or stdin)
echo '{"description": "Database connections timing out", "affected_users": "80%", "business_impact": "high"}' \
  | python scripts/incident_classifier.py --format text

# Multi-dimensional severity scoring
python scripts/severity_classifier.py incident.json --format markdown

# Reconstruct timeline with phase detection and gap analysis
python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

# Build structured timeline with MTTD/MTTR metrics
python scripts/incident_timeline_builder.py incident_data.json --format markdown

# Generate Post-Incident Review
python scripts/pir_generator.py --incident incident.json --rca-method fishbone --action-items --format markdown

# Generate postmortem with benchmark comparisons
python scripts/postmortem_generator.py incident_data.json --format markdown

Tools Overview

Tool Input Output
incident_classifier.py Incident description JSON Severity level, response teams, communication templates
severity_classifier.py Incident data with impact/signals Multi-dimensional score across 5 weighted dimensions
timeline_reconstructor.py Timestamped events array Chronological timeline with phases and gap analysis
incident_timeline_builder.py Incident + events JSON Timeline with MTTD/MTTR, phase distribution, comms templates
pir_generator.py Incident data + optional timeline PIR document with RCA (5 Whys, Fishbone, Timeline, Bow Tie)
postmortem_generator.py Incident + resolution + action items Postmortem with benchmarks, factor analysis, coverage gaps

Workflow 1: Incident Response (Detection to Resolution)

Step 1 -- Classify severity.

python scripts/severity_classifier.py incident.json --format json

The agent scores across five dimensions: revenue impact (25%), user scope (25%), data/security risk (20%), service criticality (15%), blast radius (15%).

Severity Definition Response Time Comms Cadence
SEV-1 Complete outage, data loss, security breach 15 min Every 15 min
SEV-2 Partial degradation, >25% users affected 30 min Every 30 min
SEV-3 Single feature affected, workaround available 2 hours At milestones
SEV-4 Cosmetic, dev/test only, no user impact Next business day Standard cycle

Validation checkpoint: Severity classification includes confidence score and recommended escalation path.

Step 2 -- Establish command.

The Incident Commander:

  • Assigns within 5 min (SEV-1) or 30 min (SEV-2)
  • Creates war room and incident tracking ticket
  • Sends initial notification using generated template
  • Coordinates between technical teams and stakeholders
  • Shields responders from external distractions

Step 3 -- Investigate and mitigate.

The agent generates targeted investigation commands based on the affected service:

kubectl get pods -n production -l app=<service>
kubectl logs -l app=<service> --tail=100
helm history <service> -n production

Decision framework for SEV-1/SEV-2:

  • Bias toward action over analysis
  • Prefer rollbacks to risky fixes under pressure
  • Document every decision for later review
  • Consult SMEs but do not block on them

Step 4 -- Communicate.

The agent generates three communication templates per severity:

  1. Internal notification -- technical details, response team, war room link
  2. Executive summary -- business impact, ETA, leadership actions required
  3. Customer communication -- impact scope, what is being done, next update time

Validation checkpoint: All stakeholders notified within committed timeframes.


Workflow 2: Post-Incident Review

Step 1 -- Reconstruct the timeline.

python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

The agent accepts events from logs, alerts, Slack messages, and deployment systems. Each event needs a timestamp and description. Optional fields: source, type, actor, severity.

Supported phases: detection, declaration, escalation, investigation, mitigation, communication, resolution.

Step 2 -- Perform root cause analysis.

python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method five_whys --action-items

Available RCA methods:

Method Best For
five_whys Linear causal chains, quick analysis
fishbone Multi-category analysis (People, Process, Technology, Environment)
timeline Identifying missed decision points and delays
bow_tie Barriers analysis, prevention and mitigation controls

Step 3 -- Generate action items.

The agent categorizes action items as: immediate_fix, process_improvement, monitoring_alerting, documentation, training, architectural, tooling.

Each action item includes: title, owner, priority, deadline, success criteria, and dependencies.

Step 4 -- Validate postmortem quality.

python scripts/postmortem_generator.py incident_data.json --format json

The agent checks:

  • Every contributing factor has at least one action item (coverage gap detection)
  • Action items have quality scores (0-100) based on specificity
  • MTTD/MTTR benchmarked against industry standards
  • Missing actions suggested for uncovered themes

Validation checkpoint: Zero coverage gaps. All P0 action items have owners and deadlines within 48 hours.


Workflow 3: Escalation Management

Technical escalation path:

Level Role SEV-1 Trigger SEV-2 Trigger
L1 On-call engineer Immediate 15 min
L2 Senior engineer / Team lead 30 min 1 hour
L3 Engineering Manager / Staff 45 min 2 hours
L4 Director / CTO 1 hour 4 hours

Business escalation:

Severity Duration Escalate To
SEV-1 Immediate VP Engineering
SEV-1 30 min CTO + Customer Success VP
SEV-1 1 hour CEO + Full Executive Team
SEV-2 2 hours VP Engineering
SEV-2 4 hours CTO

Anti-Patterns

  1. Individual blame in postmortems -- focus on system failures. "Why did the process allow this?" not "Why did Alice do this?"
  2. Skipping PIR for SEV-2 -- every SEV-1 and SEV-2 gets a postmortem within 3 business days.
  3. Action items without owners -- every item needs a specific person and deadline.
  4. Deploying fixes under pressure without validation -- validate fixes before declaring resolution; plan for secondary failures.
  5. Communication gaps -- provide updates even when there is no new information.

Troubleshooting

Problem Cause Solution
Classifier assigns SEV1 to minor issues Description keywords trigger high severity without impact data Provide affected_users percentage and business_impact fields
Timeline shows "No valid events found" Timestamps in unsupported format or missing timestamp key Use ISO-8601, YYYY-MM-DD HH:MM:SS, or Unix epoch
PIR produces shallow 5 Whys Incident data lacks detail Enrich input with affected_services, customer_impact; supply timeline via --timeline
Postmortem marks all action items invalid Missing required fields Each action item needs title, owner, priority, deadline
Severity score seems too low Flat description without structured impact data Provide full schema with impact, signals, context keys

References

Guide Path
Incident Response Framework references/incident-response-framework.md
Severity Matrix references/incident_severity_matrix.md
Communication Templates references/communication_templates.md
RCA Frameworks Guide references/rca_frameworks_guide.md
SLA Management references/sla-management-guide.md

Integration Points

Skill Integration
senior-devops Monitoring alerts feed timeline; runbook templates inform playbooks
senior-secops Security incidents auto-escalate to SEV-1; breach indicators trigger SecOps response
release-orchestrator Deployment events feed timeline; rollback data informs release gates
senior-architect Architectural root causes escalate to architecture review
code-reviewer PIR action items route to code review workflows

Last Updated: April 2026 Version: 1.1.0

Weekly Installs
83
GitHub Stars
103
First Seen
3 days ago