incident-response
Incident Response
When production breaks, speed and clarity matter more than perfection. Mitigate first, root-cause later.
Context
Incident response is the skill of managing production failures under pressure. It combines technical diagnosis with communication discipline. The goal is not to find the perfect fix immediately -- it's to stop the bleeding, then investigate properly.
In a lifecycle-aware system, incident response must preserve the release boundary that just shipped. Do not widen scope into redesign during the incident. For brownfield systems, prefer mitigations that fail closed, preserve coexistence, and protect data integrity even if the temporary user experience becomes narrower.
Inputs
- ci-cd-pipeline -- produced by the preceding skill in the lifecycle
- architecture-doc -- produced by the preceding skill in the lifecycle
- service-alerts -- produced by the preceding skill in the lifecycle
Process
Step 1: Confirm the Incident and Current Boundary
Capture the minimum facts needed to respond:
- which user-facing behavior is failing
- when the issue started relative to the last deploy
- whether the incident involves a known unsupported flow, coexistence seam, or rollout boundary
- which mitigation levers exist right now (rollback, route disable, traffic shaping, read-only mode, failover)
Automated alerting should catch most incidents before users report them. Key signals:
- Error rate spike (5xx responses > threshold)
- Latency increase (p95 > SLO)
- Resource exhaustion (CPU, memory, disk, connections)
- Business metric anomaly (orders dropping, signups stopping)
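A minimal sketch of how these signals might be evaluated against a metrics snapshot; the field names, thresholds, and SLO value below are illustrative assumptions, not values defined by this skill:

```python
from dataclasses import dataclass

@dataclass
class MetricsSnapshot:
    # Illustrative fields; real values come from your monitoring stack.
    error_rate_5xx: float     # fraction of requests returning 5xx
    latency_p95_ms: float     # 95th percentile request latency
    cpu_utilization: float    # 0.0 - 1.0
    orders_per_minute: float  # example business metric

def detect_incident_signals(m: MetricsSnapshot) -> list[str]:
    """Return the alert signals that fired for this snapshot (thresholds are examples)."""
    signals = []
    if m.error_rate_5xx > 0.05:     # error rate spike above 5%
        signals.append("error-rate-spike")
    if m.latency_p95_ms > 800:      # p95 above an assumed 800 ms SLO
        signals.append("latency-slo-breach")
    if m.cpu_utilization > 0.90:    # resource exhaustion
        signals.append("resource-exhaustion")
    if m.orders_per_minute < 1.0:   # business metric anomaly
        signals.append("business-metric-anomaly")
    return signals
```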
Step 2: Triage -- Classify Severity
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, data loss, security breach | Immediate, all hands | Database corruption, complete outage |
| SEV2 | Major feature broken, significant user impact | < 30 min | Payment processing failing |
| SEV3 | Minor feature broken, workaround exists | < 4 hours | Search results incorrect |
| SEV4 | Cosmetic, no user impact | Next business day | UI misalignment |
Pick a severity based on current user impact, not on how scary the suspected root cause sounds.
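To keep triage anchored on impact, the table can be expressed as a small lookup plus an impact-based classifier. This is an illustrative sketch; the thresholds and names are assumptions, not part of the skill:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "Service down, data loss, security breach"
    SEV2 = "Major feature broken, significant user impact"
    SEV3 = "Minor feature broken, workaround exists"
    SEV4 = "Cosmetic, no user impact"

# Target first-response windows in minutes (0 = immediate, all hands).
RESPONSE_TARGET_MINUTES = {
    Severity.SEV1: 0,
    Severity.SEV2: 30,
    Severity.SEV3: 4 * 60,
    Severity.SEV4: 24 * 60,  # next business day, approximated
}

def classify(users_affected_pct: float, data_at_risk: bool, workaround_exists: bool) -> Severity:
    """Classify on current user impact, not on how scary the suspected root cause sounds."""
    if data_at_risk or users_affected_pct >= 90:
        return Severity.SEV1
    if users_affected_pct >= 20 and not workaround_exists:
        return Severity.SEV2
    if users_affected_pct > 0:
        return Severity.SEV3
    return Severity.SEV4
```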
Step 3: Assemble Response Team and Communication Cadence
- Incident Commander: Owns coordination, not debugging
- Technical Responder(s): Debug and fix
- Communications Lead: Updates stakeholders and users
Define:
- response channel and decision owner
- internal update cadence
- external/customer update trigger if needed
Step 4: Mitigate First, Prefer Fail-Closed Containment
Stop the bleeding. Acceptable mitigations:
- Rollback to last known good deployment
- Disable feature flag for broken feature
- Scale up to handle load
- Switch to backup/failover system
- Apply rate limiting to protect remaining capacity
- Temporarily reject the unsafe or unsupported path explicitly instead of guessing at partial support
- Shift the affected flow to a safe fallback or read-only coexistence path
Mitigation does NOT need to fix the root cause. It needs to reduce impact.
For brownfield incidents, favor mitigations that preserve coexistence and data integrity over "keep everything available at any cost."
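One concrete form of fail-closed containment is a kill switch that rejects the unsafe path explicitly instead of guessing at partial support. A minimal sketch, assuming a hypothetical in-memory flag store and handler name; a real system would use its feature-flag service:

```python
# Hypothetical in-memory flag store; in practice this is your feature-flag service.
DISABLED_PATHS: set[str] = set()

def disable_path(path: str) -> None:
    """Mitigation lever: turn off a broken or unsafe path without a code deploy."""
    DISABLED_PATHS.add(path)

def handle_bulk_export(tenant_id: str, payload: dict) -> dict:
    # Fail closed: reject the unsupported flow explicitly rather than attempting
    # partial support that could corrupt data or break coexistence rules.
    if "bulk-export" in DISABLED_PATHS:
        return {
            "status": 503,
            "error": "Bulk export is temporarily unavailable during an active incident.",
        }
    return run_bulk_export(tenant_id, payload)

def run_bulk_export(tenant_id: str, payload: dict) -> dict:
    ...  # the real implementation lives in the service
```

The user experience becomes narrower while the flag is on, but data integrity and the release boundary stay protected.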
Step 5: Investigate with Evidence, Not Guesses
Once mitigated, investigate root cause with less time pressure:
- Gather evidence (logs, metrics, traces from the incident window)
- Capture deploy ID, config deltas, and rollback decisions
- Build timeline (what happened, in what order)
- Identify root cause (not just the symptom)
- Compare observed behavior against the reviewed contract or release boundary
- Hand off code-level root-cause work to systematic-debugging when the next step is a real code fix rather than an operational mitigation
- Implement the proper fix only after the root-cause path is evidenced
- Deploy fix with extra monitoring
Do not invent precision under pressure. If sync semantics, tenancy rules, or rollout intent remain uncertain, record them as open incident questions and keep the system in the safer mode.
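Recording evidence alongside each timeline entry keeps the investigation honest. The sketch below shows one possible shape for that record; the field names are assumptions for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    timestamp: datetime   # when the event happened, in UTC
    description: str      # what happened, stated as observed fact
    evidence: str         # log query, dashboard link, deploy ID, etc.

@dataclass
class IncidentTimeline:
    incident_id: str
    entries: list[TimelineEntry] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)  # uncertainties kept explicit

    def add(self, description: str, evidence: str, when: datetime | None = None) -> None:
        self.entries.append(
            TimelineEntry(when or datetime.now(timezone.utc), description, evidence)
        )
        self.entries.sort(key=lambda e: e.timestamp)
```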
Step 6: Verify Recovery and Handoff Cleanly
Before declaring the incident resolved:
- confirm alerts and customer-visible symptoms have cleared
- confirm rollback, fail-closed behavior, or fallback path is still active as intended
- document any temporary guardrails that must remain until a permanent fix ships
- hand off the follow-up work to the right downstream skills (runbooks, retrospective, tech-debt-management)
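As a lightweight guard against premature resolution, this checklist can be encoded so an incident cannot be closed until every item is explicitly confirmed; a sketch with assumed names only:

```python
RESOLUTION_CHECKLIST = [
    "alerts cleared and customer-visible symptoms gone",
    "rollback, fail-closed behavior, or fallback path verified as intended",
    "temporary guardrails documented with removal conditions",
    "follow-up work handed off (runbooks, retrospective, tech-debt-management)",
]

def can_resolve(confirmed_items: set[str]) -> bool:
    """Only declare the incident resolved when every checklist item is confirmed."""
    return all(item in confirmed_items for item in RESOLUTION_CHECKLIST)
```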
Step 7: Postmortem (within 48 hours)
Blameless postmortem -- focus on systems, not people:
- Timeline of events
- Root cause analysis (5 Whys)
- What went well in the response
- What went poorly
- Action items with owners and deadlines
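Because unowned action items tend to evaporate, one option is to store them in a structured record and validate it before the postmortem is accepted. A small sketch with assumed field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # a named person or team, never "everyone"
    due: date

def postmortem_is_complete(root_cause: str, action_items: list[ActionItem]) -> bool:
    """A postmortem without a root cause or without owned, dated action items is not done."""
    return bool(root_cause.strip()) and bool(action_items) and all(
        item.owner.strip() and item.due is not None for item in action_items
    )
```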
Outputs
- incident-playbook -- produced by this skill
- incident-timeline -- produced by this skill
- postmortem-report -- produced by this skill
Quality Gate
- Severity and current user impact are explicit
- Containment path and rollback or fail-closed decision are explicit
- Incident timeline and evidence sources are captured
- Stakeholder communication cadence and owner are explicit
- Post-incident follow-up actions and owners are defined
Anti-Patterns
- Blame culture -- "Who broke it?" kills incident reporting. Focus on "what broke and why."
- Hero culture -- One person always fixes everything. This is a single point of failure.
- Postmortem without action items -- A postmortem that identifies problems but assigns no fixes will see the same incident again.
- Skipping postmortem for "small" incidents -- Small incidents reveal systemic issues. Review them.
- Debug-first incident handling -- Spending 45 minutes proving root cause while user impact continues. Contain first.
- Unsafe availability bias -- Keeping a risky path live instead of failing closed at the reviewed release boundary.
Related Skills
- ci-cd -- provides rollout, rollback, and release-boundary context
- monitoring-observability -- provides the alerting that triggers response
- systematic-debugging -- takes over after containment when the next move is a code-level fix
- runbooks -- provides step-by-step response procedures
- retrospective -- broader process improvement from incident patterns
Distribution
- Public install surface: skills/.curated
- Canonical authoring source: skills/07-operations/incident-response/SKILL.md
- This package is exported for npx skills add/update compatibility.
- Packaging stability: beta
- Capability readiness: beta