incident-triaging

Installation

SKILL.md

Incident Triage Skill

Your job is to take the first, often incomplete report of something broken and rapidly produce two things: a severity classification and the minimum viable context needed to begin root cause analysis. Speed matters here — every minute of ambiguity at the triage stage costs time downstream.

This skill covers all SDLC stages: local dev, CI/CD, staging/QA, and production.

Phase 0: Intake

When an incident is first reported, extract what you can immediately from the raw input — don't ask questions you can already answer from what's been provided.

Minimum context to establish before handing off:

Field	What to look for
Stage	Keywords: local, CI, staging, prod, pipeline, review app
Symptom	The observable signal: error message, metric name, user complaint, alert title
When	Onset time, or correlation with a deploy/config change
Scope	Users affected, % of traffic, which services, which endpoints
System context	Language, framework, infra (k8s, ECS, Lambda), key services involved

If the developer pastes a stack trace, error log, or raw error message — begin analysis immediately and ask for missing context in parallel. Never make them wait while you ask for things you don't yet need.

If you have enough to classify severity, do it — you can refine later.

Phase 1: Triage & Severity Classification

Classify every incident using this matrix. Severity determines urgency, escalation path, and which remediation options are on the table. Getting this wrong in either direction is costly: over-triaging burns the team; under-triaging delays the response.

P0 — Critical / SEV1
  Conditions (any one is sufficient):
  - Complete outage of a customer-facing system
  - Data loss or corruption actively in progress
  - Active security breach or unauthorized access
  - SLA breach imminent in <15 minutes
  Response: Immediate — page everyone now, all hands, no async

P1 — High / SEV2
  Conditions (any one is sufficient):
  - Major feature broken for a significant percentage of users
  - Performance degraded >2x baseline latency on a critical path
  - Payment, auth, or billing flows impaired (even partially)
  - Error rate >5% above baseline and climbing
  Response: 30-minute SLA, on-call engineer takes ownership

P2 — Medium / SEV3
  Conditions:
  - Non-critical feature broken, workaround exists
  - Error rate elevated but below SLA threshold
  - Staging environment broken, blocking releases
  - CI pipeline broken, blocking merges
  Response: 4-hour SLA, owned in current sprint

P3 — Low / SEV4
  Conditions:
  - Cosmetic issue, no functional impact
  - Single flaky test in CI (not a pattern)
  - Developer local environment issue
  - Low-CVSS vulnerability with no active exploit
  Response: Backlog — Jira ticket, no paging

Escalation triggers to watch for during triage

Even before RCA, flag these conditions immediately:

Any mention of data being written to a system that's behaving incorrectly → potential data integrity risk → escalate
Auth or session tokens involved in the error path → security implication, treat as P1 minimum
Third-party payment processor in the error path → revenue impact, treat as P1 minimum
Incident started during a maintenance window or migration → coordinated failure, higher severity

Output: Triage Summary Block

Produce this block for every incident before handing off to RCA. It is the contract between triage and the next phase — fill every field, even if with "unknown."

INCIDENT TRIAGE
───────────────
Severity    : P[0-3]
Stage       : [local | ci | staging | prod]
Symptom     : [one precise sentence — not "it's broken", e.g., "HTTP 503s on /api/checkout for ~30% of users"]
Blast Radius: [e.g., "~2,000 users, checkout flow only" or "all CI builds blocked" or "unknown"]
SLA Risk    : [yes — breach in ~N min | no | unknown]
Started     : [HH:MM UTC or "unknown — no timestamp in report"]
Trigger     : [deployment | config change | traffic spike | dependency outage | unknown]
Next Step   : [immediate escalation | begin RCA | monitor for 10min | backlog]

Handoff

After producing the Triage Block:

P0/P1: If no Jira ticket was provided initially, instruct the Incident Commander to create one. Then pass the full context to incident-analyzing immediately. State: "Handing off to RCA — time is critical."
P2: If no Jira ticket was provided initially, instruct the Incident Commander to create one. Then pass to incident-analyzing with a note that the 4-hour window gives room for methodical investigation.
P3: Pass to incident-documenting to generate a Jira ticket. No RCA needed unless pattern repeats.

If the developer needs to page someone right now (P0/P1), also invoke incident-documenting in parallel to generate an escalation brief they can send immediately.

Related skills

More from wizeline/sdlc-agents

Installs

Repository

wizeline/sdlc-agents

GitHub Stars

First Seen

Mar 11, 2026

Security Audits

SnykFail

incident-triaging

Incident Triage Skill

Phase 0: Intake

Phase 1: Triage & Severity Classification

Escalation triggers to watch for during triage

Output: Triage Summary Block

Handoff

More from wizeline/sdlc-agents

editing-pptx-files

editing-docx-files

authoring-user-docs

sourcing-from-atlassian

authoring-architecture-docs

authoring-api-docs