incident-analyzing
Incident Root Cause Analysis Skill
Your job is to move from symptom to cause — not just to note that two things happened at the same time, but to establish that one caused the other. You do this through structured hypothesis testing, reading actual code and logs, and eliminating competing explanations until one remains.
RCA Methodology
Work through these four steps explicitly. Don't skip to conclusions.
Step 1: Form a hypothesis
State the most probable root cause based on the symptom, timing, and any context provided. Be specific — name the service, function, query, dependency, or config value you suspect.
Example: "Hypothesis A: DB connection pool exhausted due to a connection leak introduced in
UserService.fetchProfile()in the v2.4.1 deploy 2 hours ago."
Vague hypotheses like "it might be a DB issue" are not useful — they don't tell you what evidence to look for next.
Step 2: Identify discriminating evidence
For each hypothesis, specify what telemetry would confirm it and what would refute it. Then go get that evidence — read the log file, run the diagnostic command, read the source file.
Hypothesis A — DB connection pool exhaustion
Confirms: pool utilization metric at 100%; pg_stat_activity shows idle connections holding
Refutes: pool utilization below 80%; errors started immediately at deploy (not hours later)
Never theorize about code you can read directly.
Step 3: Test and update
State explicitly whether the evidence confirms or refutes the hypothesis.
"The pool utilization metric shows only 60% — this refutes Hypothesis A. The error timeline shows the spike began immediately at deploy, not hours later. Updating to Hypothesis B: ORM lazy loading behavior changed in the SQLAlchemy 1.4→2.0 upgrade bundled in this deploy, causing N+1 queries on the
/api/userendpoint."
Keep cycling until one hypothesis survives all available evidence.
Step 4: Write the causal chain
Once converged, output the causal chain in this format. Every arrow must be a causal step, not a correlation.
[SQLAlchemy 2.0 upgrade in deploy v2.4.1]
→ [lazy loading disabled by default — relationship queries no longer batched]
→ [N+1 queries on every /api/user call — 1 query per user record instead of 1 total]
→ [DB CPU at 100%, query latency 8s average]
→ [HTTP 504s for all authenticated endpoints]
→ [checkout flow broken for all logged-in users]
Stack Trace Analysis
When given a stack trace or error log:
- Identify the exception type — what class of error is this? (OOM, NPE, connection error, timeout, deserialization failure, deadlock)
- Find the origin frame — the first frame in the developer's own code, not framework or library internals. That's where investigation starts.
- Classify the error pattern — see the table below for known signatures
- Determine novelty — new error (never appeared before) or regression (was working, now broken)?
- Extract correlation/trace IDs — if present, use them to link log lines across services into a single timeline
In minified JS environments: note that vendor.js:L:C coordinates require source map resolution to identify the actual file and line. Flag this and ask for the .map file or unminified source if available.
Known error pattern signatures
| Exception / Log Pattern | Most Likely Cause | First Diagnostic Step |
|---|---|---|
connection pool timeout / too many connections |
Connection pool exhaustion or leak | Check pool utilization metric over time; look for missing close() in error paths |
OOMKilled / Java heap space / heap out of memory |
Memory leak or undersized limit | Plot memory metric since last deploy — linear climb = leak, step change = new allocation |
deadlock found / lock wait timeout exceeded |
DB lock contention | Check for long-running transactions or missing index on a recently changed query |
certificate has expired / SSL handshake failed |
Cert expiry | openssl s_client -connect <host>:443 — check notAfter field |
no such host / name resolution failed |
DNS / service discovery failure | nslookup <service>, check k8s service and endpoints |
upstream timeout / 504 Gateway Timeout |
Slow downstream dependency | Find the slow span in distributed trace; test latency directly with curl -w "%{time_total}" |
FATAL ERROR: Allocation failed (Node.js) |
V8 heap exhausted | Check --max-old-space-size setting; profile heap with --inspect |
Microservices Topology Reasoning
When the incident spans multiple services, read references/topology-patterns.md for detailed
failure signatures and diagnosis commands.
Core approach:
- Use correlation/trace IDs to link log lines across services into a unified timeline
- Identify the origin service (first to show the error) vs propagation services (downstream victims that appear broken but are actually just receiving failures)
- The origin service is where RCA focuses — propagation services self-heal once the origin is fixed
- Draw a simplified call graph with the failure point annotated; this helps identify the blast radius
Phase 3: Impact Assessment
Before handing off to remediation, quantify the blast radius. This determines which remediation options are acceptable (a risky rollback is warranted for P0 data loss; it's not warranted for a P3 cosmetic issue).
IMPACT ASSESSMENT
─────────────────
Users Affected : [number or % of traffic — be specific about which user segment]
Revenue Path : [yes — [flow name] | no]
Error Budget : [X% remaining this month | already breached | unknown]
Downstream : [services that will fail or degrade if this continues]
Data Integrity : [safe | at risk — [what data] | compromised — [what data]]
Answer this explicitly: Is a revenue-generating flow currently broken? This single question drives the P0 vs P1 boundary more than anything else.
Reference Files
Load these during investigation — don't try to recall failure patterns from memory:
references/topology-patterns.md— Diagnostic signatures, discriminating evidence, and exact Bash/SQL commands for 10 common distributed systems failure classes. Read when you need to match a symptom to a known pattern.
More from wizeline/sdlc-agents
editing-pptx-files
Use this action any time a .pptx file is involved in any way — as input, output, or both. This includes: creating slide decks, pitch decks, or presentations; reading, parsing, or extracting text from any .pptx file (even if the extracted content will be used elsewhere, like in an email or summary); editing, modifying, or updating existing presentations; combining or splitting slide files; working with templates, layouts, speaker notes, or comments. Trigger whenever the user mentions \"deck,\" \"slides,\" \"presentation,\" or references a .pptx filename, regardless of what they plan to do with the content afterward. If a .pptx file needs to be opened, created, or touched, use this action.
25authoring-user-docs
Use when producing user-facing documentation — tutorials, how-to guides, user guides, getting-started guides, installation guides, or onboarding documentation. Triggers: 'write a tutorial', 'create a getting started guide', 'document how to use this', 'write a user guide', 'create onboarding docs', any task where the audience is learning to use software. Always load authoring-technical-docs first.
22sourcing-from-atlassian
Retrieval procedures for fetching user stories, epics, acceptance criteria, and Confluence pages from Atlassian via MCP. Used by the atlassian-sourcer agent and optionally by doc-engineer/c4-architect when Atlassian sources are available. Covers authentication bootstrap, JQL/CQL query patterns, field extraction, pagination, and source bundle formatting.
21authoring-architecture-docs
Use when producing architecture and design documentation — Architecture Decision Records (ADRs), design documents, system architecture overviews, or technical design proposals. Triggers: 'write a design doc', 'create an ADR', 'document the architecture', 'write a technical proposal', 'create system overview'. Always load authoring-technical-docs first.
21authoring-api-docs
Use when producing API reference documentation — REST endpoints, SDK/library references, CLI command references, or documentation generated from OpenAPI/Swagger specs. Triggers: 'document this API', 'generate API reference', 'write SDK docs', 'document these endpoints', any task involving source code with HTTP handlers, route definitions, or OpenAPI specs. Always load authoring-technical-docs first.
20processing-pdfs
Use this action whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this action.
19