cx-incident-management

Installation
SKILL.md

Incident Management Skill

Use this skill as the gateway for incident triage, SLO monitoring, and notification verification. It orchestrates the full triage workflow - from detection through resolution - and cross-references cx-alerts for deep alert management and cx-telemetry-querying for root cause investigation.


CLI Commands

Command Subcommands Purpose
cx incidents list, get, acknowledge, resolve, close, assign, unassign, events, aggregations Manage and triage incidents
cx slos list, get, create, update, delete Monitor and manage SLO definitions
cx alerts list, get Check which alerts are firing (see cx-alerts skill for full alert management)
cx notifications connectors list, get Verify notification connector configuration
cx notifications routers list, get Verify notification routing rules
cx notifications presets list, get Check notification preset templates
cx notifications test connector, destination, preset, routing-condition, template-render Test notification delivery

Key flags:

  • cx incidents list supports --status (TRIGGERED, ACKNOWLEDGED, RESOLVED), --severity (CRITICAL, WARNING, INFO), --assignee
  • All commands support -o json for structured output and -p <profile> for profile selection
  • cx slos create/update use --from-file <path> (or - for stdin)

Incident Triage Workflow

Step 1: Check Active Incidents

cx incidents list -o json
cx incidents list --status TRIGGERED -o json
cx incidents list --severity CRITICAL -o json

Get an overview of what's happening. Filter by severity for immediate priorities:

cx incidents list -o json | jq '[.[] | select(.severity == "CRITICAL") | {id, name, status, severity, started_at}]'

Step 2: Get Incident Details

cx incidents get <incident-id> -o json
cx incidents events --incident-id <incident-id> -o json

Review the incident timeline and related events to understand scope and progression.

Step 3: Check Related Alerts

cx alerts list -o json

Find which alerts are currently firing. For deep alert inspection, switch to the cx-alerts skill.

cx alerts list -o json | jq '[.[] | select(.is_active == true) | {id, name, severity, last_triggered}]'

Step 4: Review SLO Status

cx slos list -o json
cx slos get <slo-id> -o json

Check if SLOs are breaching or error budgets are burned:

cx slos list -o json | jq '[.[] | {name, status, remaining_budget_percentage}]'

Step 5: Verify Notifications

cx notifications connectors list -o json
cx notifications routers list -o json
cx notifications presets list -o json

Confirm the right people were notified through the correct channels.

Step 6: Pivot to Root Cause

Switch to the cx-telemetry-querying skill to investigate the underlying cause using logs, traces, and metrics.


Incident Actions

Acknowledge

cx incidents acknowledge <incident-id>
cx incidents acknowledge <id1> <id2> <id3>

Resolve

cx incidents resolve <incident-id>
cx incidents resolve <id1> <id2> <id3>

Assign

cx incidents assign <incident-id> --user-id <user-id>

Close

cx incidents close <incident-id>

SLO Management

Creating SLOs

Template from an existing SLO:

cx slos get <existing-slo-id> -o json > slo-template.json
# Edit slo-template.json with new service/threshold
cx slos create --from-file slo-template.json

Monitoring SLO Health

# All SLOs with their status
cx slos list -o json | jq '[.[] | {name, status, target_percentage, remaining_budget}]'

# SLOs that are breaching
cx slos list -o json | jq '[.[] | select(.status != "OK")]'

Notification Debugging

When notifications aren't reaching the right people:

1. Check Connectors

cx notifications connectors list -o json | jq '[.[] | {id, name, type}]'

Verify the expected channels (Slack, PagerDuty, email) exist and are configured.

2. Check Routers

cx notifications routers list -o json | jq '[.[] | {id, name, entity_type}]'

Verify routing rules map the right alert types to the right connectors.

3. Test Notification Delivery

cx notifications test connector --from-file test-connector.json
cx notifications test destination --from-file test-destination.json
cx notifications test preset --from-file test-preset.json
cx notifications test routing-condition --from-file test-condition.json

Incident Aggregations

Get a high-level view of incident patterns:

cx incidents aggregations -o json

Use this to understand incident frequency, MTTR trends, and severity distribution.


Key Principles

  • Triage before deep-dive - check incidents, alerts, and SLOs before querying telemetry data
  • Check SLO burn rate, not just status - a slowly burning SLO needs attention before it breaches
  • Verify notification chain end-to-end - connector exists → router maps correctly → test delivery works
  • Cross-reference with telemetry - use cx-telemetry-querying skill for root cause after triage
  • Acknowledge promptly - acknowledge incidents to signal ownership and stop re-notifications
  • Use incident events for timeline - cx incidents events shows the full incident lifecycle

Related Skills

  • cx-alerts - deep alert management: creating, updating, and inspecting alert definitions
  • cx-telemetry-querying - root cause investigation using logs, metrics, traces, and RUM
  • cx-observability-setup - configure notification channels and routing for alerts
Related skills
Installs
85
GitHub Stars
98
First Seen
4 days ago