AI SRE Incident Response

Apply SRE rigor to AI systems where incidents include quality regressions, unsafe outputs, and budget explosions.

When to Use This Skill

  • An LLM endpoint begins returning degraded or hallucinated answers
  • Token spend spikes beyond budget thresholds
  • A model provider goes down and traffic must fail over
  • Safety guardrails fire at abnormal rates
  • A new model deployment causes latency or accuracy regression

Prerequisites

  • Prometheus and Alertmanager deployed with scrape targets for AI services
  • Grafana dashboards for golden signals (latency, error rate, cost, quality)
  • On-call rotation configured in PagerDuty, Opsgenie, or equivalent
  • Runbook repository accessible to responders
  • Rollback mechanism for model and prompt versions (GitOps or feature flags)
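
If the scrape targets are not yet in place, a minimal Prometheus scrape configuration might look like the sketch below. The llm-inference job name matches the up{job="llm-inference"} selector used in the alert rules later in this document; the gateway target, namespace, and port are placeholders to adapt to your deployment.

# prometheus.yml (scrape excerpt) -- illustrative; adjust discovery, namespaces, and ports
scrape_configs:
  - job_name: llm-inference
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [ai]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: llm-inference
        action: keep
  - job_name: llm-gateway
    static_configs:
      - targets: ["llm-gateway.ai.svc:9090"]   # placeholder service/port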

AI Incident Classes

  • Availability incident: model/provider unavailable, timeout storm.
  • Quality incident: answer accuracy or tool success drops below SLO.
  • Safety incident: harmful or policy-violating outputs increase.
  • Cost incident: unexpected token or provider spend spike.

Severity Framework

| Severity | Criteria | Response Time | Notification |
|----------|----------|---------------|--------------|
| SEV1 | User-facing outage, compliance risk, data leak | 5 min | Page on-call + incident commander |
| SEV2 | Major degradation in key flows | 15 min | Page on-call |
| SEV3 | Limited impact or internal-only issue | 1 hour | Slack alert |
| SEV4 | Cosmetic or low-priority regression | Next business day | Ticket |
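
The severity labels attached to the alert rules below can be routed to these notification paths in Alertmanager. A minimal routing sketch, with placeholder receiver names standing in for your PagerDuty and Slack integrations (receiver configurations omitted):

# alertmanager.yml (routing excerpt) -- receiver names are placeholders
route:
  receiver: ticket-queue            # sev4 and anything unmatched becomes a ticket
  routes:
    - match:
        severity: sev1
      receiver: pagerduty-oncall    # page on-call + incident commander
    - match:
        severity: sev2
      receiver: pagerduty-oncall    # page on-call
    - match:
        severity: sev3
      receiver: slack-ai-alerts     # Slack alert
receivers:
  - name: pagerduty-oncall
  - name: slack-ai-alerts
  - name: ticket-queue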

Golden Signals for AI Services

  • Request success rate
  • Latency (queue + generation + tool execution)
  • Hallucination/groundedness proxy metrics
  • Cost per minute and per tenant
  • Guardrail violation rate
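
These signals can be precomputed as Prometheus recording rules so dashboards and alerts share one definition. A sketch reusing the metric names from the alert rules below; adapt names and labels to your instrumentation:

# ai-golden-signals-rules.yaml -- recording rules (metric names assumed from this document)
groups:
  - name: ai-golden-signals
    rules:
      - record: ai:request_success_rate:5m
        expr: |
          1 - (
            sum(rate(llm_requests_total{status="error"}[5m])) by (model)
            / sum(rate(llm_requests_total[5m])) by (model)
          )
      - record: ai:cost_dollars_per_minute:5m
        expr: sum(rate(llm_token_cost_dollars[5m])) by (tenant) * 60
      - record: ai:guardrail_violation_rate:10m
        expr: |
          rate(llm_guardrail_violations_total[10m])
          / rate(llm_requests_total[10m])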

Prometheus Alert Rules

# prometheus-ai-alerts.yaml
groups:
  - name: ai-service-alerts
    rules:
      - alert: ModelEndpointDown
        expr: up{job="llm-inference"} == 0
        for: 2m
        labels:
          severity: sev1
        annotations:
          summary: "LLM inference endpoint {{ $labels.instance }} is down"
          runbook_url: "https://runbooks.internal/ai/model-outage"

      - alert: HighHallucinationRate
        expr: |
          rate(llm_hallucination_detected_total[10m])
          / rate(llm_requests_total[10m]) > 0.15
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Hallucination rate above 15% for {{ $labels.model }}"
          runbook_url: "https://runbooks.internal/ai/quality-regression"

      - alert: TokenCostExplosion
        expr: |
          sum(rate(llm_token_cost_dollars[5m])) by (tenant) * 60
          > 0.50
        for: 3m
        labels:
          severity: sev2
        annotations:
          summary: "Token spend exceeds $0.50/min for tenant {{ $labels.tenant }}"
          runbook_url: "https://runbooks.internal/ai/cost-spike"

      - alert: LatencyP95Exceeded
        expr: |
          histogram_quantile(0.95,
            sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 5
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "LLM p95 latency exceeds 5s for {{ $labels.service }}"

      - alert: GuardrailViolationSpike
        expr: |
          rate(llm_guardrail_violations_total[10m])
          / rate(llm_requests_total[10m]) > 0.05
        for: 5m
        labels:
          severity: sev1
        annotations:
          summary: "Guardrail violations above 5% for {{ $labels.model }}"
          runbook_url: "https://runbooks.internal/ai/safety-incident"

      - alert: ModelQualityDrop
        expr: |
          llm_eval_score{metric="groundedness"} < 0.70
        for: 10m
        labels:
          severity: sev2
        annotations:
          summary: "Groundedness score dropped below 0.70 for {{ $labels.model }}"

      - alert: ProviderErrorRateHigh
        expr: |
          rate(llm_provider_errors_total[5m])
          / rate(llm_provider_requests_total[5m]) > 0.10
        for: 3m
        labels:
          severity: sev2
        annotations:
          summary: "Provider {{ $labels.provider }} error rate above 10%"

Response Playbooks

Model Outage Runbook

TRIGGER: ModelEndpointDown fires for > 2 minutes
RESPONDER: On-call AI platform engineer

1. Acknowledge alert in PagerDuty.
2. Check provider status page (e.g., status.openai.com).
3. Verify network connectivity:
     curl -s -o /dev/null -w "%{http_code}" https://api.provider.com/health
4. If provider is down:
     a. Enable fallback model route in gateway config (see the config sketch after this runbook).
     b. kubectl set env deployment/llm-gateway FALLBACK_ENABLED=true
     c. Verify fallback traffic is flowing via Grafana dashboard.
5. If self-hosted model is down:
     a. Check pod status: kubectl get pods -l app=llm-inference -n ai
     b. Check for GPU/OOM errors in logs: kubectl logs -l app=llm-inference -n ai --tail=50
     c. Restart if OOM: kubectl rollout restart deployment/llm-inference -n ai
6. Freeze all deployments:
     kubectl annotate deployment --all deploy-freeze=true -n ai
7. Communicate ETA in #incident-channel.
8. When resolved, unfreeze and run smoke tests.
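
Step 4a assumes the gateway already has a fallback route defined. The shape of that config depends entirely on your gateway; a hypothetical sketch of what the FALLBACK_ENABLED flag might switch between:

# llm-gateway route config (hypothetical -- your gateway's schema will differ)
routes:
  - name: chat-default
    primary:
      provider: primary-provider      # e.g. the hosted API that just went down
      model: primary-model
      timeout_seconds: 30
    fallback:
      provider: secondary-provider    # self-hosted or alternate hosted model
      model: fallback-model
      enabled_by_env: FALLBACK_ENABLED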

Quality Regression Runbook (Hallucination Spike)

TRIGGER: HighHallucinationRate or ModelQualityDrop fires
RESPONDER: On-call AI engineer + ML lead

1. Acknowledge alert. Open incident ticket.
2. Identify scope:
     - Which model version? Check deployment metadata.
     - Which routes/tenants are affected? Filter by labels in Grafana (see the scoping query after this runbook).
3. Check recent changes:
     - Model version promotion in last 24h?
     - Prompt template changes in last 24h?
     - Retrieval index rebuild in last 24h?
4. If recent model change:
     kubectl rollout undo deployment/llm-inference -n ai
5. If recent prompt change:
     git revert <commit> && git push  # triggers GitOps redeploy
6. Increase trace sampling to 100% for affected route:
     kubectl set env deployment/llm-gateway TRACE_SAMPLE_RATE=1.0
7. Run offline eval suite against current production:
     python run_evals.py --target prod --suite quality --compare baseline
8. Confirm metrics return to baseline before closing.
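
A scoping query for step 2, assuming the hallucination and request counters carry model_version and tenant labels (adjust to your label set):

# Hallucination rate broken down by model version and tenant
sum(rate(llm_hallucination_detected_total[10m])) by (model, model_version, tenant)
/ sum(rate(llm_requests_total[10m])) by (model, model_version, tenant)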

Token Cost Explosion Runbook

TRIGGER: TokenCostExplosion fires
RESPONDER: On-call platform engineer

1. Identify top consumers:
     Query: topk(10, sum(rate(llm_token_cost_dollars[15m])) by (tenant, model, route))
2. Check for runaway loops (see the loop-detection query after this runbook):
     - Agent retry storms (exponential token growth per request)
     - Missing max_tokens caps on new routes
     - Cache bypass due to config change
3. Apply immediate caps:
     kubectl patch configmap llm-quotas -n ai --patch '
       data:
         max_tokens_per_request: "4096"
         rpm_limit: "60"
     '
4. Enable semantic cache if disabled:
     kubectl set env deployment/llm-gateway CACHE_ENABLED=true
5. Route traffic to cheaper model tier:
     kubectl set env deployment/llm-gateway DEFAULT_MODEL=gpt-4o-mini
6. Notify affected tenants of temporary limits.
7. Open postmortem with cost attribution analysis.
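
A loop-detection query for step 2, assuming a token counter such as llm_tokens_total exists (adjust the metric name to your instrumentation). A route whose average tokens per request climbs sharply usually indicates agent retry loops or a missing max_tokens cap:

# Average tokens per request by route
sum(rate(llm_tokens_total[5m])) by (route)
/ sum(rate(llm_requests_total[5m])) by (route)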

Escalation Procedures

Level 1 (0-15 min):  On-call AI platform engineer
Level 2 (15-30 min): AI platform team lead + affected product owner
Level 3 (30-60 min): Engineering director + security (if safety incident)
Level 4 (60+ min):   VP Engineering + legal (if compliance/data incident)

Safety incidents always start at Level 2 minimum.
Provider-side incidents: open support ticket immediately at Level 1.

Detection Queries (PromQL)

# Request success rate by model
1 - (
  sum(rate(llm_requests_total{status="error"}[5m])) by (model)
  / sum(rate(llm_requests_total[5m])) by (model)
)

# Cost per successful answer
sum(rate(llm_token_cost_dollars[5m])) by (route)
/ sum(rate(llm_requests_total{status="success"}[5m])) by (route)

# Hallucination rate trend (1h window, 5m steps)
rate(llm_hallucination_detected_total[1h])
/ rate(llm_requests_total[1h])

# Latency breakdown by stage
histogram_quantile(0.95, rate(llm_retrieval_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_generation_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_tool_execution_duration_seconds_bucket[5m]))

# Tenant cost leaderboard
topk(10, sum(rate(llm_token_cost_dollars[1h])) by (tenant))

Postmortem Requirements

  • Timeline with detector and responder timestamps
  • Blast radius by tenant and feature
  • Missed signals and alert tuning actions
  • Concrete hardening tasks with owners and due dates
  • Cost impact (dollars, tokens, affected requests)
  • Customer communication log

Postmortem Template

## Incident Summary
- **Severity**: SEVx
- **Duration**: start_time - end_time (Xh Ym)
- **Detection**: How was it detected? (alert / customer report / manual)
- **Impact**: X tenants, Y requests, $Z cost

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Responder acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |

## Root Cause
[Description]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Tune alert threshold | @engineer | YYYY-MM-DD | Open |
| Add fallback route | @platform | YYYY-MM-DD | Open |

Chaos Engineering for AI Systems

Regularly test incident readiness:

  • Provider failover drill: block the provider API at the network level (NetworkPolicy sketch below), verify fallback activates within SLO.
  • Model rollback drill: deploy known-bad model version, verify automated quality gate catches it.
  • Cost cap drill: simulate runaway token usage, verify quotas trigger before budget threshold.
  • Cache failure drill: disable semantic cache, verify system degrades gracefully.
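
For the provider failover drill, one way to block the primary provider at the network level is an egress NetworkPolicy scoped to the gateway pods. A sketch with a placeholder CIDR standing in for the provider's published IP ranges; adjust namespace, selectors, and ranges, and note that a service mesh or egress gateway gives finer host-level control:

# failover-drill-block-primary.yaml -- illustrative; apply for the drill window, then delete
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: failover-drill-block-primary
  namespace: ai                        # adjust to the gateway's namespace
spec:
  podSelector:
    matchLabels:
      app: llm-gateway
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 203.0.113.0/24         # placeholder: primary provider's IP ranges

Apply the policy, confirm in Grafana that fallback traffic takes over within the SLO, then delete the policy and verify primary traffic recovers.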

Troubleshooting

| Symptom | Check | Fix |
|---------|-------|-----|
| All requests timing out | Provider status page, DNS resolution | Enable fallback provider |
| Gradual quality decline | Recent model/prompt deployments | Roll back to last known good |
| Sudden cost spike | Per-tenant token usage dashboard | Apply emergency token caps |
| Guardrail violations spike | Model version, prompt injection logs | Enable stricter input filtering |
| Intermittent 503 errors | Pod restarts, GPU OOM events | Increase memory limits or reduce batch size |
