incident-response
Incident Response
When production breaks, speed and clarity matter more than perfection. Mitigate first, root-cause later.
Context
Incident response is the skill of managing production failures under pressure. It combines technical diagnosis with communication discipline. The goal is not to find the perfect fix immediately -- it's to stop the bleeding, then investigate properly.
In a lifecycle-aware system, incident response must preserve the release boundary that just shipped. Do not widen scope into redesign during the incident. For brownfield systems, prefer mitigations that fail closed, preserve coexistence, and protect data integrity even if the temporary user experience becomes narrower.
Inputs
- ci-cd-pipeline -- produced by the preceding skill in the lifecycle
- architecture-doc -- produced by the preceding skill in the lifecycle
- service-alerts -- produced by the preceding skill in the lifecycle
Process
Step 1: Confirm the Incident and Current Boundary
Capture the minimum facts needed to respond:
- which user-facing behavior is failing
- when the issue started relative to the last deploy
- whether the incident involves a known unsupported flow, coexistence seam, or rollout boundary
- which mitigation levers exist right now (rollback, route disable, traffic shaping, read-only mode, failover)
Automated alerting should catch most incidents before users report them. Key signals:
- Error rate spike (5xx responses > threshold)
- Latency increase (p95 > SLO)
- Resource exhaustion (CPU, memory, disk, connections)
- Business metric anomaly (orders dropping, signups stopping)
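A minimal sketch of how these signals might be evaluated against a metrics snapshot; the field names, thresholds, and SLO value below are illustrative assumptions, not values defined by this skill:

```python
from dataclasses import dataclass

@dataclass
class MetricsSnapshot:
    # Illustrative fields; real values come from your monitoring stack.
    error_rate_5xx: float     # fraction of requests returning 5xx
    latency_p95_ms: float     # 95th percentile request latency
    cpu_utilization: float    # 0.0 - 1.0
    orders_per_minute: float  # example business metric

def detect_incident_signals(m: MetricsSnapshot) -> list[str]:
    """Return the alert signals that fired for this snapshot (thresholds are examples)."""
    signals = []
    if m.error_rate_5xx > 0.05:     # error rate spike above 5%
        signals.append("error-rate-spike")
    if m.latency_p95_ms > 800:      # p95 above an assumed 800 ms SLO
        signals.append("latency-slo-breach")
    if m.cpu_utilization > 0.90:    # resource exhaustion
        signals.append("resource-exhaustion")
    if m.orders_per_minute < 1.0:   # business metric anomaly
        signals.append("business-metric-anomaly")
    return signals
```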
Step 2: Triage -- Classify Severity
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, data loss, security breach | Immediate, all hands | Database corruption, complete outage |
| SEV2 | Major feature broken, significant user impact | < 30 min | Payment processing failing |
| SEV3 | Minor feature broken, workaround exists | < 4 hours | Search results incorrect |
| SEV4 | Cosmetic, no user impact | Next business day | UI misalignment |
Pick a severity based on current user impact, not on how scary the suspected root cause sounds.
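To keep triage anchored on impact, the table can be expressed as a small lookup plus an impact-based classifier. This is an illustrative sketch; the thresholds and names are assumptions, not part of the skill:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "Service down, data loss, security breach"
    SEV2 = "Major feature broken, significant user impact"
    SEV3 = "Minor feature broken, workaround exists"
    SEV4 = "Cosmetic, no user impact"

# Target first-response windows in minutes (0 = immediate, all hands).
RESPONSE_TARGET_MINUTES = {
    Severity.SEV1: 0,
    Severity.SEV2: 30,
    Severity.SEV3: 4 * 60,
    Severity.SEV4: 24 * 60,  # next business day, approximated
}

def classify(users_affected_pct: float, data_at_risk: bool, workaround_exists: bool) -> Severity:
    """Classify on current user impact, not on how scary the suspected root cause sounds."""
    if data_at_risk or users_affected_pct >= 90:
        return Severity.SEV1
    if users_affected_pct >= 20 and not workaround_exists:
        return Severity.SEV2
    if users_affected_pct > 0:
        return Severity.SEV3
    return Severity.SEV4
```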
Step 3: Assemble Response Team and Communication Cadence
- Incident Commander: Owns coordination, not debugging
- Technical Responder(s): Debug and fix
- Communications Lead: Updates stakeholders and users
Define:
- response channel and decision owner
- internal update cadence
- external/customer update trigger if needed
Step 4: Mitigate First, Prefer Fail-Closed Containment
Stop the bleeding. Acceptable mitigations:
- Rollback to last known good deployment
- Disable feature flag for broken feature
- Scale up to handle load
- Switch to backup/failover system
- Apply rate limiting to protect remaining capacity
- Temporarily reject the unsafe or unsupported path explicitly instead of guessing at partial support
- Shift the affected flow to a safe fallback or read-only coexistence path
Mitigation does NOT need to fix the root cause. It needs to reduce impact.
For brownfield incidents, favor mitigations that preserve coexistence and data integrity over "keep everything available at any cost."
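One concrete form of fail-closed containment is a kill switch that rejects the unsafe path explicitly instead of guessing at partial support. A minimal sketch, assuming a hypothetical in-memory flag store and handler name; a real system would use its feature-flag service:

```python
# Hypothetical in-memory flag store; in practice this is your feature-flag service.
DISABLED_PATHS: set[str] = set()

def disable_path(path: str) -> None:
    """Mitigation lever: turn off a broken or unsafe path without a code deploy."""
    DISABLED_PATHS.add(path)

def handle_bulk_export(tenant_id: str, payload: dict) -> dict:
    # Fail closed: reject the unsupported flow explicitly rather than attempting
    # partial support that could corrupt data or break coexistence rules.
    if "bulk-export" in DISABLED_PATHS:
        return {
            "status": 503,
            "error": "Bulk export is temporarily unavailable during an active incident.",
        }
    return run_bulk_export(tenant_id, payload)

def run_bulk_export(tenant_id: str, payload: dict) -> dict:
    ...  # the real implementation lives in the service
```

The user experience becomes narrower while the flag is on, but data integrity and the release boundary stay protected.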
Step 5: Investigate with Evidence, Not Guesses
Once mitigated, investigate root cause with less time pressure:
- Gather evidence (logs, metrics, traces from the incident window)
- Capture deploy ID, config deltas, and rollback decisions
- Build timeline (what happened, in what order)
- Identify root cause (not just the symptom)
- Compare observed behavior against the reviewed contract or release boundary
- Hand off code-level root-cause work to systematic-debugging when the next step is a real code fix rather than an operational mitigation
- Implement the proper fix only after the root-cause path is evidenced
- Deploy fix with extra monitoring
Do not invent precision under pressure. If sync semantics, tenancy rules, or rollout intent remain uncertain, record them as open incident questions and keep the system in the safer mode.
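Recording evidence alongside each timeline entry keeps the investigation honest. The sketch below shows one possible shape for that record; the field names are assumptions for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    timestamp: datetime   # when the event happened, in UTC
    description: str      # what happened, stated as observed fact
    evidence: str         # log query, dashboard link, deploy ID, etc.

@dataclass
class IncidentTimeline:
    incident_id: str
    entries: list[TimelineEntry] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)  # uncertainties kept explicit

    def add(self, description: str, evidence: str, when: datetime | None = None) -> None:
        self.entries.append(
            TimelineEntry(when or datetime.now(timezone.utc), description, evidence)
        )
        self.entries.sort(key=lambda e: e.timestamp)
```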
Step 6: Verify Recovery and Handoff Cleanly
Before declaring the incident resolved:
- confirm alerts and customer-visible symptoms have cleared
- confirm rollback, fail-closed behavior, or fallback path is still active as intended
- document any temporary guardrails that must remain until a permanent fix ships
- hand off the follow-up work to the right downstream skills (runbooks, retrospective, tech-debt-management)
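As a lightweight guard against premature resolution, this checklist can be encoded so an incident cannot be closed until every item is explicitly confirmed; a sketch with assumed names only:

```python
RESOLUTION_CHECKLIST = [
    "alerts cleared and customer-visible symptoms gone",
    "rollback, fail-closed behavior, or fallback path verified as intended",
    "temporary guardrails documented with removal conditions",
    "follow-up work handed off (runbooks, retrospective, tech-debt-management)",
]

def can_resolve(confirmed_items: set[str]) -> bool:
    """Only declare the incident resolved when every checklist item is confirmed."""
    return all(item in confirmed_items for item in RESOLUTION_CHECKLIST)
```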
Step 7: Postmortem (within 48 hours)
Blameless postmortem -- focus on systems, not people:
- Timeline of events
- Root cause analysis (5 Whys)
- What went well in the response
- What went poorly
- Action items with owners and deadlines
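Because unowned action items tend to evaporate, one option is to store them in a structured record and validate it before the postmortem is accepted. A small sketch with assumed field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # a named person or team, never "everyone"
    due: date

def postmortem_is_complete(root_cause: str, action_items: list[ActionItem]) -> bool:
    """A postmortem without a root cause or without owned, dated action items is not done."""
    return bool(root_cause.strip()) and bool(action_items) and all(
        item.owner.strip() and item.due is not None for item in action_items
    )
```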
Outputs
- incident-playbook -- produced by this skill
- incident-timeline -- produced by this skill
- postmortem-report -- produced by this skill
Quality Gate
- Severity and current user impact are explicit
- Containment path and rollback or fail-closed decision are explicit
- Incident timeline and evidence sources are captured
- Stakeholder communication cadence and owner are explicit
- Post-incident follow-up actions and owners are defined
Anti-Patterns
- Blame culture -- "Who broke it?" kills incident reporting. Focus on "what broke and why."
- Hero culture -- One person always fixes everything. This is a single point of failure.
- Postmortem without action items -- A postmortem that identifies problems but assigns no fixes will see the same incident again.
- Skipping postmortem for "small" incidents -- Small incidents reveal systemic issues. Review them.
- Debug-first incident handling -- Spending 45 minutes proving root cause while user impact continues. Contain first.
- Unsafe availability bias -- Keeping a risky path live instead of failing closed at the reviewed release boundary.
Related Skills
- ci-cd -- provides rollout, rollback, and release-boundary context
- monitoring-observability -- provides the alerting that triggers response
- systematic-debugging -- takes over after containment when the next move is a code-level fix
- runbooks -- provides step-by-step response procedures
- retrospective -- broader process improvement from incident patterns
Distribution
- Public install surface: skills/.curated
- Canonical authoring source: skills/07-operations/incident-response/SKILL.md
- This package is exported for npx skills add/update compatibility.
- Packaging stability: beta
- Capability readiness: beta