incident-documenting
Incident Artifacts Skill
Your job is to produce the written artifacts that turn an incident from a chaotic event into a documented, trackable, learnable organizational record. These artifacts serve four audiences: the team resolving the incident right now (escalation brief), the engineering org tracking it (Jira ticket), the broader organization learning from it (postmortem), and stakeholders who need asynchronous status updates (incident summary).
Every artifact produced by this skill is saved to disk — nothing stays only in chat.
Artifact 1: Escalation Brief
When to produce: Any time a P0 or P1 incident needs to be handed to another team, a manager needs to be informed, or the on-call engineer needs to page someone. Generate this before root cause is confirmed if severity warrants it.
Save to: .docs/escalation-<YYYYMMDD-HHMM>.md
ESCALATION BRIEF
─────────────────
Incident : [Short title — specific enough to distinguish from other incidents]
Severity : P[N] / SEV[N]
Time : [HH:MM UTC — when incident started]
Duration : [How long this has been active]
What's broken:
[One paragraph: observable symptom, affected users/services, business impact]
What's been tried:
[Bullet list of actions taken and their outcomes — be specific, not "we looked at things"]
- Checked DB connection pool → utilization at 98%, not the cause
- Rolled back deploy v2.4.1 → no change in error rate
- Restarted UserService pods → temporary relief, error rate returned after ~10min
Current hypothesis:
[Best current theory about root cause — or "unknown" if no hypothesis yet]
What we need:
[Specific ask from the escalation target — not "help", but "we need X to do Y"]
- DBA access to run `pg_locks` queries on prod DB
- AWS support to check for RDS-level connection limits
Escalate to: [on-call lead | platform team | security | DBA | vendor support]
Contact via: [PagerDuty policy name | Slack channel | phone]
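The save path above follows a `<kind>-<YYYYMMDD-HHMM>.md` pattern. A minimal sketch of how that file name could be built and written is shown here; the helper names `artifact_path` and `save_artifact` are hypothetical, and the choice to format the timestamp in UTC is an assumption carried over from the UTC rule for timelines.

```python
from datetime import datetime, timezone
from pathlib import Path


def artifact_path(kind: str, started_at: datetime, base_dir: str = ".docs") -> Path:
    """Build a path such as .docs/escalation-20240115-1432.md.

    The timestamp is converted to UTC first (an assumption; the skill only
    specifies the <YYYYMMDD-HHMM> shape).
    """
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    stamp = started_at.astimezone(timezone.utc).strftime("%Y%m%d-%H%M")
    return Path(base_dir) / f"{kind}-{stamp}.md"


def save_artifact(kind: str, body: str, started_at: datetime, base_dir: str = ".docs") -> Path:
    """Write the artifact to disk, creating the directory if it is missing."""
    path = artifact_path(kind, started_at, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```

Rejecting naive datetimes is a deliberate choice in this sketch: a timestamp without a timezone cannot be converted to UTC reliably, and a wrong file name is harder to notice than an early error.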
Artifact 2: Jira Ticket
When to produce: After any incident (P0–P3) once the symptom is understood well enough to describe it clearly. For P0/P1, create during or immediately after resolution. For P2/P3, create when triaged.
Save to: .docs/jira-<YYYYMMDD-HHMM>.md
**Issue Type:** Bug / Incident
**Priority:** Blocker / Critical / Major / Minor
**Title:** [Specific and searchable — avoid "prod is down"
Good: "UserService connection pool leak causes checkout 503s after ~2h uptime"
Bad: "Production incident 2024-01-15"]
**Components:** [Service name(s)]
**Labels:** incident, severity-p<N>[, postmortem-needed][, security][, data-integrity]
**Fix Version:** [hotfix tag or next release]
**Linked PRs:** [if known]
---
## Summary
[2–3 sentences: what broke, when, who was affected, business impact.
No jargon that a PM couldn't parse.]
## Root Cause
[The full causal chain — be specific. Name the commit, function, query, or dependency.
"A bug in authentication" is not a root cause. "Connection leak in UserService.fetchProfile()
introduced in commit a3f92b1 (deploy v2.4.1) caused pool exhaustion after ~2 hours of uptime"
is a root cause.]
## Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | Incident started (first error / alert fired) |
| HH:MM | On-call engineer paged / developer noticed |
| HH:MM | [Key diagnostic step] |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed / rollback executed |
| HH:MM | Error rate returned to baseline |
| HH:MM | Incident resolved |
## Resolution
[What was done to restore service — the actual commands or changes, not just "we fixed it".]
## Follow-up Actions
- [ ] Add regression test: [specific scenario that was missing coverage]
- [ ] Add monitoring alert: [specific signal that would have caught this earlier]
- [ ] Audit [related code/service] for the same class of issue
- [ ] Update runbook: [path to runbook]
- [ ] Schedule postmortem (required for P0/P1)
## Prevention
[What architectural or process change prevents this class of incident from recurring.
Be specific — "add more tests" is not prevention. "Add connection pool utilization alert
at 80% threshold with PagerDuty integration" is prevention.]
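The Labels field in the ticket template combines two fixed labels with optional ones. One way this could be derived is sketched below; the function name `jira_labels` and its parameters are hypothetical, and the rule for `postmortem-needed` is inferred from the statements that a postmortem is required for P0/P1 and for incidents with significant user or revenue impact.

```python
def jira_labels(
    severity: int,
    significant_impact: bool = False,
    security: bool = False,
    data_integrity: bool = False,
) -> list[str]:
    """Return labels in the order used by the ticket template."""
    if severity not in range(4):
        raise ValueError("severity must be 0-3 (P0-P3)")
    labels = ["incident", f"severity-p{severity}"]
    # Assumption: P0/P1, or any incident with significant impact, needs a postmortem.
    if severity <= 1 or significant_impact:
        labels.append("postmortem-needed")
    if security:
        labels.append("security")
    if data_integrity:
        labels.append("data-integrity")
    return labels
```

For example, `jira_labels(1)` yields `["incident", "severity-p1", "postmortem-needed"]`.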
Artifact 3: Postmortem Draft
When to produce: After any P0 or P1 incident, or any incident where significant user or revenue impact occurred. Generate the draft immediately after resolution — details fade fast.
Save to: docs/postmortem-<YYYYMMDD>-<slug>.md
The goal of a postmortem is blameless learning — understanding the systemic conditions that allowed the incident to occur, not assigning fault to individuals.
# Postmortem: <Incident Title>
**Date:** YYYY-MM-DD
**Severity:** P[N]
**Duration:** [N hours N minutes — from first alert to resolution]
**Author(s):** [Names]
**Status:** Draft / In Review / Final
**Reviewed by:** [Names — fill in after review meeting]
---
## Impact
[Quantified user and business impact:
- N users affected for N hours
- Revenue impact: estimated $N (or "unknown, under investigation")
- SLA: [met / breached by N minutes]
- Error budget: consumed N% of monthly budget]
## Timeline
[Same timeline as Jira ticket — copy/expand with more detail here]
## Root Cause
[Full causal chain — more detailed than the Jira version. Explain the technical mechanism
in enough depth that an engineer unfamiliar with the system understands it.]
## Contributing Factors
[Systemic conditions that allowed this to happen — these are the things to fix:
- No alert existed for connection pool utilization above 80%
- The deployment checklist didn't include DB migration review
- Source map files weren't deployed alongside minified JS, slowing stack trace resolution
These are not excuses — they're the actual targets for prevention work.]
## What Went Well
[Actions that limited the impact or accelerated resolution:
- On-call engineer was paged within 2 minutes of the alert firing
- Feature flag was in place, enabling partial mitigation without a rollback
- Runbook existed for this failure class and was accurate]
## What Went Poorly
[Honest assessment — blameless, but specific:
- 45 minutes elapsed before root cause was identified because DB metrics weren't in the runbook
- Rollback procedure required manual steps not documented anywhere
- Escalation path was unclear — three engineers each thought another was the primary owner]
## Action Items
| Action | Owner | Due | Priority |
|---|---|---|---|
| Add connection pool alert at 80% | [Name] | YYYY-MM-DD | P1 |
| Update deployment checklist | [Name] | YYYY-MM-DD | P2 |
| Add regression test for fetchProfile | [Name] | YYYY-MM-DD | P2 |
| Update runbook with DB metric steps | [Name] | YYYY-MM-DD | P2 |
## Lessons Learned
[2–3 transferable insights for the broader engineering org — things other teams might apply.]
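The Duration and error-budget fields in the postmortem header are simple arithmetic, sketched here under stated assumptions. The function names are hypothetical, and the 99.9% SLO and 30-day window are illustrative defaults, not values from this skill. The budget calculation also treats the whole incident as full downtime, which overstates the cost of a partial outage.

```python
from datetime import datetime, timezone


def format_duration(first_alert: datetime, resolved: datetime) -> str:
    """Return 'N hours N minutes', measured from first alert to resolution."""
    total_minutes = int((resolved - first_alert).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("resolved precedes first_alert")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def error_budget_consumed(
    downtime_minutes: float, slo: float = 0.999, window_days: int = 30
) -> float:
    """Percent of the window's error budget used by the given downtime.

    Assumed defaults: a 99.9% availability SLO over a 30-day window,
    which allows 43.2 minutes of downtime.
    """
    budget_minutes = window_days * 24 * 60 * (1 - slo)
    return 100 * downtime_minutes / budget_minutes


start = datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 16, 47, tzinfo=timezone.utc)
print(format_duration(start, end))  # 2 hours 15 minutes
```

Both timestamps should be timezone-aware and in UTC, matching the timeline table; Python raises a `TypeError` if an aware and a naive datetime are subtracted.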
Artifact 4: Incident Summary (async communication)
When to produce: When the developer needs to communicate incident status to stakeholders asynchronously — Slack, email, status page.
Save to: .docs/summary-<YYYYMMDD-HHMM>.md
**[RESOLVED] <Service Name> <Symptom> — <Date>**
**Status:** Resolved at HH:MM UTC
**Duration:** N hours N minutes
**Impact:** [Users/services affected, what was broken]
**What happened:**
[2–3 sentences: root cause in plain language]
**What we did:**
[Brief list of resolution steps]
**Current state:**
[All systems normal / monitoring for recurrence / follow-up work in progress]
**Next steps:**
- Jira ticket: [link]
- Postmortem scheduled: [date]
Output Discipline
- Save every artifact to disk immediately — use file write tools
- Check for existing `docs/`, `runbooks/` directories before creating new ones
- Fill every field, even with "unknown"; blank fields in incident documentation are worse than approximate information
- Timelines must be in UTC with explicit timezone — "around 3pm" is not acceptable in incident docs
- Postmortem action items must have an owner and a due date — unowned action items don't get done
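Two of the rules above lend themselves to a mechanical check before an artifact is saved: action items need an owner and a due date, and no field may be left blank. The following is a rough sketch only; `ActionItem`, `action_item_problems`, and `blank_fields` are hypothetical names, and treating the template placeholders `[Name]` and `YYYY-MM-DD` as "not filled in" is an assumption.

```python
import re
from dataclasses import dataclass

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # a real date, not the YYYY-MM-DD placeholder


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str
    priority: str


def action_item_problems(items: list[ActionItem]) -> list[str]:
    """List action items that lack an owner or a due date."""
    problems = []
    for item in items:
        owner = item.owner.strip()
        if not owner or owner == "[Name]":
            problems.append(f"no owner: {item.action}")
        if not ISO_DATE.match(item.due.strip()):
            problems.append(f"no due date: {item.action}")
    return problems


def blank_fields(fields: dict[str, str]) -> list[str]:
    """Names of fields left empty; a value of 'unknown' counts as filled."""
    return [name for name, value in fields.items() if not value.strip()]
```

A caller could refuse to write the postmortem while `action_item_problems` returns anything, and replace each name returned by `blank_fields` with "unknown" rather than leaving it empty.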