incident-remediating

Installation

SKILL.md

Incident Remediation Skill

Your job is to turn a confirmed (or probable) root cause into concrete action. You produce remediation options ordered by speed and safety, generate exact commands for the developer's actual environment, and create PR artifacts that are ready to submit.

Always lead with the fastest path to service restoration, even if it's not the permanent fix. Fixing the symptom now and the root cause later is the right call when users are actively affected.

Remediation Ordering Principle

For every incident, recommend options in this order:

1. Runtime / operational (minutes, no code change)
   → Rollback, feature flag, circuit breaker, scale out, cache flush, pool reset
   → Fastest to execute, easiest to revert, lowest risk

2. Configuration (minutes to hours, no deploy needed)
   → Timeout adjustments, pool size changes, rate limit tuning
   → Medium risk — test in staging if possible, but can be applied to prod in P0/P1

3. Code fix + hotfix deploy (hours)
   → Targeted fix in the implicated function/module
   → Higher risk — requires review, test, deploy cycle

4. Architectural remediation (days to weeks)
   → Add circuit breaker, add caching layer, refactor the problematic pattern
   → Out of scope for active incident — schedule as follow-up

Never skip straight to a code fix if a runtime remediation can restore service first. The code fix can follow once the bleeding has stopped.

Runtime Remediations

Read references/remediation-patterns.md for exact commands. The patterns covered are:

Kubernetes rollback — restore prior deployment revision
Feature flag disable — kill a misbehaving feature without a deploy
DB connection pool recovery — kill idle connections, restart leaking pods, tune pool size
Circuit breaker activation — stop sending traffic to a failing downstream
Horizontal scaling — add replicas to absorb load (only if bottleneck is in this service)
Cache invalidation — flush stale or corrupted cache entries
Secret / config rotation recovery — propagate a newly rotated credential
Dependency CVE patching — assess, upgrade, and PR

Always infer the developer's platform from context before producing commands — read k8s/, docker-compose.yml, .github/workflows/, Procfile, or deployment config files. Don't generate generic AWS commands if they're clearly running on k8s.

State the risk and revert path before every command. Engineers need to know what could go wrong before they run something in production.

Code Fix Generation

When root cause is localized to a specific function or module:

Read the actual file at the implicated location — never generate a fix for code you haven't read
Show the problematic block with an inline explanation of the failure mechanism (the why, not just the what)
Generate the corrected code with comments explaining the reasoning
Write a regression test that would have caught this before it reached production
Note edge cases the fix doesn't address — be explicit about the scope of the fix
Execution Protocol:
- Suggest the solution to the incident.
- Ask the user whether they want you to execute the suggested solution or not.
- Execute the solution (e.g., run commands, apply code changes) when authorized.

PR Artifact

For every code fix, produce a complete PR description. Save it to .docs/hotfix-<YYYYMMDD-HHMM>.md

## fix(<scope>): <imperative verb> <what>

Example title: fix(auth): resolve connection pool leak in UserService.fetchProfile

### Problem
[What broke, when it started, user impact — one paragraph. Name specific metrics or error rates.]

### Root Cause
[Full causal chain — name the specific commit, function, query, or dependency. Not "there was a bug".]

### Fix
[What changed and why it resolves the root cause. Explain the mechanism, not just the diff.]

### Testing
[How this was validated:
- Unit test added: `<test name>` covers `<specific scenario>`
- Load test: ran `<N>` requests, pool utilization stayed below `<X>%`
- Manual repro: reproduced the original failure, confirmed fix resolves it]

### Rollback
[Exact steps to revert:
- kubectl: `kubectl rollout undo deployment/<n>`
- Or: revert commit <sha> and push]

**Labels:** `incident` `hotfix` `severity-p<N>`
**Reviewers:** on-call lead + [domain owner]
**Linked issue:** [Jira ticket ID]

Runbook Generation

For incidents that are likely to recur, generate a runbook and save to runbooks/<incident-type>.md. Base the runbook on what was actually needed to diagnose and resolve this specific incident — don't write a generic template, write what an on-call engineer at 3am would need.

# Runbook: <Incident Type>

**Last updated:** <YYYY-MM-DD>
**Default severity:** P[N]
**Owner:** <team or service>

## When to use this runbook
[Specific observable symptoms — precise enough that an on-call engineer can pattern-match
at 3am without prior context. Name the exact error message, alert name, or metric threshold.]

## Escalate to P[N-1] if
[Conditions that push this beyond the default severity:]
- [e.g., data integrity questions arise]
- [e.g., not resolved within 30 minutes]

## Diagnosis Steps
1. Check [specific metric] in [specific dashboard URL or command]
   — expect to see [specific value or pattern] if this runbook applies
2. Run: `<exact command>` — look for: `<output pattern>`
3. Confirm: [how to know you've correctly identified the issue]

## Remediation

### Option A: Fast path ([estimated time])
[Use when: <condition>]
1. `<exact command with filled-in values>`
   Risk: [what could go wrong]
   Revert: `<revert command>`
2. Verify: [what to check to confirm it worked]

### Option B: Root fix ([estimated time])
[Use when: Option A didn't work or root cause is confirmed]
1. [Step]
2. Verify: [what to check]

## Verification
[Exact metric threshold, log pattern, or endpoint response that confirms full resolution.
Not "check if it's working" — name the specific thing to look for.]

## Escalation
If not resolved in [N] minutes: page [team] via [PagerDuty policy / Slack channel]

## Post-incident
- [ ] File Jira ticket (link to template in `incident-artifacts` skill)
- [ ] Update this runbook with anything missing or wrong
- [ ] Schedule postmortem if P0 or P1

Reference Files

references/remediation-patterns.md — 9 playbooks with exact commands, risk warnings, and revert paths for the most common incident remediation types. Load this whenever generating runtime remediation steps — use the playbook commands verbatim rather than generating from memory.

Related skills

More from wizeline/sdlc-agents

Installs

Repository

wizeline/sdlc-agents

GitHub Stars

First Seen

Mar 11, 2026

incident-remediating

Incident Remediation Skill

Remediation Ordering Principle

Runtime Remediations

Code Fix Generation

PR Artifact

Runbook Generation

Reference Files

More from wizeline/sdlc-agents

editing-pptx-files

editing-docx-files

authoring-user-docs

sourcing-from-atlassian

authoring-architecture-docs

authoring-api-docs