# AI SRE Incident Response
Apply SRE rigor to AI systems where incidents include quality regressions, unsafe outputs, and budget explosions.

## When to Use This Skill

- An LLM endpoint begins returning degraded or hallucinated answers
- Token spend spikes beyond budget thresholds
- A model provider goes down and traffic must fail over
- Safety guardrails fire at abnormal rates
- A new model deployment causes latency or accuracy regression

## Prerequisites

- Prometheus and Alertmanager deployed with scrape targets for AI services
- Grafana dashboards for golden signals (latency, error rate, cost, quality)
- On-call rotation configured in PagerDuty, Opsgenie, or equivalent
- Runbook repository accessible to responders
- Rollback mechanism for model and prompt versions (GitOps or feature flags)

## AI Incident Classes

- Availability incident: model/provider unavailable, timeout storm.
- Quality incident: answer accuracy or tool success drops below SLO.
- Safety incident: harmful or policy-violating outputs increase.
- Cost incident: unexpected token or provider spend spike.

## Severity Framework

| Severity | Criteria | Response Time | Notification |
|---|---|---|---|
| SEV1 | User-facing outage, compliance risk, data leak | 5 min | Page on-call + incident commander |
| SEV2 | Major degradation in key flows | 15 min | Page on-call |
| SEV3 | Limited impact or internal-only issue | 1 hour | Slack alert |
| SEV4 | Cosmetic or low-priority regression | Next business day | Ticket |
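
These severity labels can drive notification routing directly in Alertmanager. A minimal routing sketch, assuming receiver names (`pagerduty-oncall`, `slack-alerts`, `ticket-queue`) that would need to match your own receiver definitions:

```yaml
# alertmanager.yml (routing excerpt) -- receiver names are placeholders
route:
  receiver: ticket-queue            # default catch-all (SEV4 -> ticket)
  routes:
    - match:
        severity: sev1
      receiver: pagerduty-oncall    # page on-call + incident commander
      group_wait: 0s                # SEV1: page immediately
    - match:
        severity: sev2
      receiver: pagerduty-oncall
    - match:
        severity: sev3
      receiver: slack-alerts
```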

## Golden Signals for AI Services

- Request success rate
- Latency (queue + generation + tool execution)
- Hallucination/groundedness proxy metrics
- Cost per minute and per tenant
- Guardrail violation rate
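
Dashboards and ad-hoc queries stay consistent if these signals are precomputed as Prometheus recording rules. A sketch built on the metric names used in the alert rules below; the recorded rule names themselves are illustrative:

```yaml
# golden-signals-rules.yaml -- recorded rule names are illustrative
groups:
  - name: ai-golden-signals
    rules:
      - record: llm:request_success_rate:5m
        expr: |
          1 - (
            sum(rate(llm_requests_total{status="error"}[5m])) by (model)
              / sum(rate(llm_requests_total[5m])) by (model)
          )
      - record: llm:cost_dollars_per_minute:5m
        expr: sum(rate(llm_token_cost_dollars[5m])) by (tenant) * 60
      - record: llm:guardrail_violation_rate:10m
        expr: |
          rate(llm_guardrail_violations_total[10m])
            / rate(llm_requests_total[10m])
```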

## Prometheus Alert Rules

```yaml
# prometheus-ai-alerts.yaml
groups:
  - name: ai-service-alerts
    rules:
      - alert: ModelEndpointDown
        expr: up{job="llm-inference"} == 0
        for: 2m
        labels:
          severity: sev1
        annotations:
          summary: "LLM inference endpoint {{ $labels.instance }} is down"
          runbook_url: "https://runbooks.internal/ai/model-outage"

      - alert: HighHallucinationRate
        expr: |
          rate(llm_hallucination_detected_total[10m])
            / rate(llm_requests_total[10m]) > 0.15
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Hallucination rate above 15% for {{ $labels.model }}"
          runbook_url: "https://runbooks.internal/ai/quality-regression"

      - alert: TokenCostExplosion
        # rate() is per-second; multiply by 60 to compare against the per-minute budget
        expr: |
          sum(rate(llm_token_cost_dollars[5m])) by (tenant) * 60 > 0.50
        for: 3m
        labels:
          severity: sev2
        annotations:
          summary: "Token spend exceeds $0.50/min for tenant {{ $labels.tenant }}"
          runbook_url: "https://runbooks.internal/ai/cost-spike"

      - alert: LatencyP95Exceeded
        expr: |
          histogram_quantile(0.95,
            rate(llm_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "LLM p95 latency exceeds 5s for {{ $labels.service }}"

      - alert: GuardrailViolationSpike
        expr: |
          rate(llm_guardrail_violations_total[10m])
            / rate(llm_requests_total[10m]) > 0.05
        for: 5m
        labels:
          severity: sev1
        annotations:
          summary: "Guardrail violations above 5% for {{ $labels.model }}"
          runbook_url: "https://runbooks.internal/ai/safety-incident"

      - alert: ModelQualityDrop
        expr: |
          llm_eval_score{metric="groundedness"} < 0.70
        for: 10m
        labels:
          severity: sev2
        annotations:
          summary: "Groundedness score dropped below 0.70 for {{ $labels.model }}"

      - alert: ProviderErrorRateHigh
        expr: |
          rate(llm_provider_errors_total[5m])
            / rate(llm_provider_requests_total[5m]) > 0.10
        for: 3m
        labels:
          severity: sev2
        annotations:
          summary: "Provider {{ $labels.provider }} error rate above 10%"
```

## Response Playbooks

### Model Outage Runbook

**Trigger:** `ModelEndpointDown` fires for more than 2 minutes.
**Responder:** On-call AI platform engineer.

1. Acknowledge the alert in PagerDuty.
2. Check the provider status page (e.g., status.openai.com).
3. Verify network connectivity: `curl -s -o /dev/null -w "%{http_code}" https://api.provider.com/health`
4. If the provider is down:
   - Enable the fallback model route in the gateway config: `kubectl set env deployment/llm-gateway FALLBACK_ENABLED=true`
   - Verify fallback traffic is flowing via the Grafana dashboard.
5. If a self-hosted model is down:
   - Check pod status: `kubectl get pods -l app=llm-inference -n ai`
   - Check GPU health: `kubectl logs -l app=llm-inference --tail=50`
   - Restart if OOM-killed: `kubectl rollout restart deployment/llm-inference -n ai`
6. Freeze all deployments: `kubectl annotate deployment --all deploy-freeze=true -n ai`
7. Communicate an ETA in #incident-channel.
8. When resolved, unfreeze deployments and run smoke tests (a sketch follows this runbook).
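
A minimal smoke-test sketch for step 8, assuming the gateway is reachable in-cluster at `llm-gateway.ai.svc` and exposes an OpenAI-style completions endpoint; both are assumptions to adapt to your gateway:

```bash
#!/usr/bin/env bash
# smoke-test.sh -- gateway URL and endpoint path are assumptions; adapt to your gateway
set -euo pipefail
GATEWAY="${GATEWAY:-http://llm-gateway.ai.svc:8080}"

# 1. Health endpoint answers 200
code=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY/health")
[ "$code" = "200" ] || { echo "health check failed ($code)"; exit 1; }

# 2. A trivial completion returns within 10 seconds
curl -sf --max-time 10 "$GATEWAY/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"ping"}]}' \
  > /dev/null || { echo "completion smoke test failed"; exit 1; }

echo "smoke tests passed"
```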
### Quality Regression Runbook (Hallucination Spike)

**Trigger:** `HighHallucinationRate` or `ModelQualityDrop` fires.
**Responder:** On-call AI engineer + ML lead.

1. Acknowledge the alert and open an incident ticket.
2. Identify the scope:
   - Which model version? Check the deployment metadata.
   - Which routes/tenants are affected? Filter by labels in Grafana.
3. Check recent changes:
   - Model version promotion in the last 24h?
   - Prompt template changes in the last 24h?
   - Retrieval index rebuild in the last 24h?
4. If a recent model change is implicated: `kubectl rollout undo deployment/llm-inference -n ai`
5. If a recent prompt change is implicated: `git revert <commit> && git push` (triggers a GitOps redeploy)
6. Increase trace sampling to 100% for the affected route: `kubectl set env deployment/llm-gateway TRACE_SAMPLE_RATE=1.0`
7. Run the offline eval suite against current production: `python run_evals.py --target prod --suite quality --compare baseline`
8. Confirm metrics return to baseline before closing (see the comparison query below).
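
One way to check step 8 is to compare the current hallucination rate against the same series 24 hours earlier; a ratio near 1 suggests recovery. This assumes roughly daily traffic patterns:

```promql
# Current vs. day-ago hallucination rate; close the incident when this returns to ~1
(
  rate(llm_hallucination_detected_total[10m])
    / rate(llm_requests_total[10m])
)
/
(
  rate(llm_hallucination_detected_total[10m] offset 1d)
    / rate(llm_requests_total[10m] offset 1d)
)
```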
### Token Cost Explosion Runbook

**Trigger:** `TokenCostExplosion` fires.
**Responder:** On-call platform engineer.

1. Identify the top consumers:
   ```promql
   topk(10, sum(rate(llm_token_cost_dollars[15m])) by (tenant, model, route))
   ```
2. Check for runaway loops:
   - Agent retry storms (exponential token growth per request)
   - Missing `max_tokens` caps on new routes
   - Cache bypass due to a config change
3. Apply immediate caps (see the ConfigMap sketch after this runbook):
   ```bash
   kubectl patch configmap llm-quotas -n ai --patch '
   data:
     max_tokens_per_request: "4096"
     rpm_limit: "60"
   '
   ```
4. Enable the semantic cache if it is disabled: `kubectl set env deployment/llm-gateway CACHE_ENABLED=true`
5. Route traffic to a cheaper model tier: `kubectl set env deployment/llm-gateway DEFAULT_MODEL=gpt-4o-mini`
6. Notify affected tenants of the temporary limits.
7. Open a postmortem with a cost attribution analysis.
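
The patch in step 3 assumes an `llm-quotas` ConfigMap that the gateway watches. A sketch of the full object; the key names mirror the patch above, but the gateway's actual configuration contract is an assumption to verify:

```yaml
# llm-quotas.yaml -- key names mirror the patch above; verify against your gateway
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-quotas
  namespace: ai
data:
  max_tokens_per_request: "4096"
  rpm_limit: "60"
```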

## Escalation Procedures

- **Level 1 (0-15 min):** On-call AI platform engineer
- **Level 2 (15-30 min):** AI platform team lead + affected product owner
- **Level 3 (30-60 min):** Engineering director + security (for safety incidents)
- **Level 4 (60+ min):** VP Engineering + legal (for compliance or data incidents)

Safety incidents always start at Level 2 at minimum. For provider-side incidents, open a support ticket with the provider immediately at Level 1.

## Detection Queries (PromQL)

```promql
# Request success rate by model
1 - (
  sum(rate(llm_requests_total{status="error"}[5m])) by (model)
    / sum(rate(llm_requests_total[5m])) by (model)
)

# Cost per successful answer
sum(rate(llm_token_cost_dollars[5m])) by (route)
  / sum(rate(llm_requests_total{status="success"}[5m])) by (route)

# Hallucination rate trend (1h window, 5m steps)
rate(llm_hallucination_detected_total[1h])
  / rate(llm_requests_total[1h])

# Latency breakdown by stage (p95)
histogram_quantile(0.95, rate(llm_retrieval_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_generation_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(llm_tool_execution_duration_seconds_bucket[5m]))

# Tenant cost leaderboard
topk(10, sum(rate(llm_token_cost_dollars[1h])) by (tenant))
```

## Postmortem Requirements

- Timeline with detector and responder timestamps
- Blast radius by tenant and feature
- Missed signals and alert tuning actions
- Concrete hardening tasks with owners and due dates
- Cost impact (dollars, tokens, affected requests)
- Customer communication log

## Postmortem Template

```markdown
## Incident Summary
- **Severity**: SEVx
- **Duration**: start_time - end_time (Xh Ym)
- **Detection**: How was it detected? (alert / customer report / manual)
- **Impact**: X tenants, Y requests, $Z cost

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired |
| HH:MM | Responder acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |

## Root Cause
[Description]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Tune alert threshold | @engineer | YYYY-MM-DD | Open |
| Add fallback route | @platform | YYYY-MM-DD | Open |
```

## Chaos Engineering for AI Systems

Regularly test incident readiness:

- **Provider failover drill:** block the provider API at the network level (see the NetworkPolicy sketch below) and verify the fallback activates within SLO.
- **Model rollback drill:** deploy a known-bad model version and verify the automated quality gate catches it.
- **Cost cap drill:** simulate runaway token usage and verify quotas trigger before the budget threshold.
- **Cache failure drill:** disable the semantic cache and verify the system degrades gracefully.
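
For the provider failover drill, one way to block the provider at the network level is a temporary NetworkPolicy that limits gateway egress to in-cluster traffic. A sketch, assuming the fallback model is self-hosted in-cluster and that the pod labels and namespace follow the kubectl commands above; if the fallback is another external provider, scope the block to the primary provider's CIDRs with an `ipBlock` rule instead:

```yaml
# drill-block-provider-egress.yaml -- apply to start the drill, delete to end it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: drill-block-provider-egress
  namespace: ai
spec:
  podSelector:
    matchLabels:
      app: llm-gateway
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}   # allow in-cluster traffic (incl. DNS); external egress is dropped
```

Apply the policy, confirm the fallback route activates within the SLO, then delete it and confirm recovery.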

## Troubleshooting

| Symptom | Check | Fix |
|---|---|---|
| All requests timing out | Provider status page, DNS resolution | Enable fallback provider |
| Gradual quality decline | Recent model/prompt deployments | Roll back to last known good |
| Sudden cost spike | Per-tenant token usage dashboard | Apply emergency token caps |
| Guardrail violations spike | Model version, prompt injection logs | Enable stricter input filtering |
| Intermittent 503 errors | Pod restarts, GPU OOM events | Increase memory limits or reduce batch size |

## Related Skills

- `incident-response` - Standard incident process and evidence
- `alerting-oncall` - Paging and escalation policy
- `llm-cost-optimization` - Spend controls and efficiency patterns
- `agent-observability` - Instrument requests, traces, and costs
- `rag-observability-evals` - RAG quality monitoring