AI SRE Incident Response
Apply SRE rigor to AI systems, where incidents include not only outages but also quality regressions, unsafe outputs, and runaway spend.
AI Incident Classes
- Availability incident: model/provider unavailable, timeout storm.
- Quality incident: answer accuracy or tool success drops below SLO.
- Safety incident: harmful or policy-violating outputs increase.
- Cost incident: unexpected token or provider spend spike.
Severity Framework (Example)
- SEV1: user-facing outage, critical compliance risk, or active data leak.
- SEV2: major degradation affecting key flows.
- SEV3: limited impact or internal-only issue.
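The example framework above can be encoded as a small triage helper so that responders assign severity consistently. This is a minimal sketch: the `IncidentFacts` fields and the `classify_severity` function are hypothetical names chosen for illustration, and real criteria would likely need more inputs than four booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class IncidentFacts:
    """Hypothetical triage inputs mirroring the example framework."""
    user_facing_outage: bool = False
    critical_compliance_risk: bool = False
    active_data_leak: bool = False
    key_flow_degraded: bool = False


def classify_severity(facts: IncidentFacts) -> Severity:
    # SEV1: user-facing outage, critical compliance risk, or active data leak.
    if (facts.user_facing_outage
            or facts.critical_compliance_risk
            or facts.active_data_leak):
        return Severity.SEV1
    # SEV2: major degradation affecting key flows.
    if facts.key_flow_degraded:
        return Severity.SEV2
    # SEV3: everything else (limited impact or internal-only).
    return Severity.SEV3
```

Checking the SEV1 conditions first means an incident that matches several criteria always gets the highest applicable severity.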
Golden Signals for AI Services
- Request success rate
- Latency (queue + generation + tool execution)
- Hallucination/groundedness proxy metrics
- Cost per minute and per tenant
- Guardrail violation rate
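As a rough illustration, the signals above could be derived from per-request records over a time window. The `RequestRecord` fields and `summarize_signals` are assumptions made for this sketch, not part of any particular telemetry library. In particular, `grounded` stands in for whatever groundedness proxy a team uses (for example, a label from an offline grader), and p95 is just one possible latency summary.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical per-request telemetry row."""
    tenant: str
    ok: bool                    # request succeeded end to end
    queue_ms: float
    generation_ms: float
    tool_ms: float
    cost_usd: float
    guardrail_violation: bool
    grounded: bool              # assumed label from a groundedness grader


def summarize_signals(records: Sequence[RequestRecord],
                      window_minutes: float) -> dict:
    if not records or window_minutes <= 0:
        raise ValueError("need at least one record and a positive window")
    n = len(records)

    # Latency is the sum of queue, generation, and tool execution time.
    latencies = sorted(r.queue_ms + r.generation_ms + r.tool_ms
                       for r in records)
    p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank percentile

    cost_by_tenant: dict[str, float] = {}
    for r in records:
        cost_by_tenant[r.tenant] = cost_by_tenant.get(r.tenant, 0.0) + r.cost_usd

    return {
        "success_rate": sum(r.ok for r in records) / n,
        "p95_latency_ms": p95,
        "grounded_rate": sum(r.grounded for r in records) / n,
        "cost_per_minute_usd": sum(cost_by_tenant.values()) / window_minutes,
        "cost_by_tenant_usd": cost_by_tenant,
        "guardrail_violation_rate": sum(r.guardrail_violation for r in records) / n,
    }
```

In practice these aggregates would usually come from a metrics backend rather than in-process lists. The sketch only shows which raw fields each signal depends on.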
Response Playbooks
Model Outage
- Freeze deployments.
- Shift traffic to fallback model/provider.
- Enforce stricter rate limits.
- Communicate the mitigation in place and the expected time to recovery.
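The traffic-shift step in this playbook might look like the following ordered fallback chain. This is a sketch under stated assumptions: `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and each provider is modeled as a plain callable rather than a real SDK client.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion lets dashboards show how much traffic is actually being served by the fallback during the outage.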
Quality Regression
- Roll back prompt/model version.
- Disable risky optimization flags.
- Increase sampling for trace review.
- Re-run latest eval baseline.
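After re-running the eval baseline, the results need to be compared against the last known-good scores to confirm the regression and to verify that a rollback fixed it. A minimal sketch follows, assuming metrics are "higher is better" scores in the range 0 to 1. `regressed_metrics` and the `tolerance` default are illustrative choices, not values from this skill.

```python
def regressed_metrics(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`.

    Assumes higher scores are better. A metric missing from `current`
    raises KeyError so that an incomplete eval run is not read as healthy.
    """
    drops = {}
    for name, base_score in baseline.items():
        drop = base_score - current[name]
        if drop > tolerance:
            drops[name] = drop
    return drops
```

An empty result suggests the current prompt/model version is within tolerance of the baseline. A non-empty result names the metrics to inspect during trace review.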
Cost Spike
- Identify top tenants/routes/models.
- Enable cache + cheaper fallback path.
- Apply temporary token caps.
- Open postmortem with prevention actions.
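The first and third steps of this playbook can be sketched as below. The usage-row shape (dicts with `tenant`, `route`, `model`, and `cost_usd` keys) and the function names are assumptions for illustration. Real billing exports will differ.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_spenders(
    usage_rows: Iterable[Mapping],
    key: str,
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Sum cost_usd grouped by `key` ("tenant", "route", or "model")."""
    totals: Counter = Counter()
    for row in usage_rows:
        totals[row[key]] += row["cost_usd"]
    return totals.most_common(n)  # highest spend first


def apply_token_cap(requested_max_tokens: int, cap: int) -> int:
    """Clamp a request's max-token setting to a temporary incident cap."""
    return min(requested_max_tokens, cap)
```

Running `top_spenders` once per dimension (tenant, route, model) usually narrows a spike to a small number of sources before any caps are applied.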
Postmortem Requirements
- Timeline with detector and responder timestamps
- Blast radius by tenant and feature
- Missed signals and alert tuning actions
- Concrete hardening tasks with owners and due dates
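One way to enforce these requirements is a completeness check that runs before a postmortem is marked done. The `Postmortem` and `HardeningTask` structures below are hypothetical. They only mirror the four requirements listed above and are not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class HardeningTask:
    description: str
    owner: str = ""
    due: Optional[date] = None


@dataclass
class Postmortem:
    detected_at: Optional[datetime] = None    # detector timestamp
    responded_at: Optional[datetime] = None   # responder timestamp
    blast_radius: Dict[str, List[str]] = field(default_factory=dict)  # tenant -> features
    missed_signals: List[str] = field(default_factory=list)
    alert_tuning_actions: List[str] = field(default_factory=list)
    hardening_tasks: List[HardeningTask] = field(default_factory=list)


def missing_requirements(pm: Postmortem) -> List[str]:
    """Return the requirements that this postmortem does not yet satisfy."""
    gaps = []
    if pm.detected_at is None or pm.responded_at is None:
        gaps.append("timeline with detector and responder timestamps")
    if not pm.blast_radius:
        gaps.append("blast radius by tenant and feature")
    if not pm.missed_signals and not pm.alert_tuning_actions:
        gaps.append("missed signals and alert tuning actions")
    if not pm.hardening_tasks or any(
        not t.owner or t.due is None for t in pm.hardening_tasks
    ):
        gaps.append("hardening tasks with owners and due dates")
    return gaps
```

An empty list means every requirement has at least some content. It does not judge the quality of that content, which still needs human review.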
Related Skills
- incident-response - Standard incident process and evidence
- alerting-oncall - Paging and escalation policy
- llm-cost-optimization - Spend controls and efficiency patterns