# SRE Runbooks

## Overview

Operational runbook templates and SRE practices for production reliability. Covers SLI/SLO definition, on-call workflows, incident-response runbooks, blameless postmortems, capacity planning, chaos engineering, and toil reduction.

## SLI/SLO Definition Template

```yaml
service: payment-api
owner: payments-team
tier: critical

slis:
  availability:
    description: "Ratio of successful HTTP responses (non-5xx) to total responses"
    query: "sum(rate(http_requests_total{service='payment-api', status!~'5..'}[5m])) / sum(rate(http_requests_total{service='payment-api'}[5m]))"
    good_event: "HTTP response with status < 500"
    valid_event: "All HTTP responses (excluding health checks)"

  latency:
    description: "Ratio of requests served faster than 300ms"
    query: "sum(rate(http_request_duration_seconds_bucket{service='payment-api', le='0.3'}[5m])) / sum(rate(http_request_duration_seconds_count{service='payment-api'}[5m]))"
    threshold: 300ms
    percentile: p99

slos:
  availability:
    target: 99.95%
    window: 30d
    error_budget: 21.6 minutes per 30-day window

  latency:
    target: 99.0%
    window: 30d
    description: "99% of requests complete within 300ms"

error_budget_policy:
  budget_available:  # >50% remaining
    - "Ship features normally"
    - "Conduct chaos experiments"
    - "Allow risky deployments with rollback plan"
  budget_warning:    # 20-50% remaining
    - "Prioritize reliability work alongside features"
    - "Review recent incidents for patterns"
    - "No chaos experiments without team approval"
  budget_critical:   # <20% remaining
    - "Freeze non-critical deployments"
    - "Mandatory postmortem for any new incident"
    - "Dedicated reliability sprint next cycle"
  budget_exhausted:  # 0% remaining
    - "Complete change freeze for this service"
    - "All engineering effort on reliability"
    - "Escalate to engineering leadership"
    - "Resume normal operations only when budget recovers"

## On-Call Handbook

### Rotation Structure

```
Primary:    First responder, handles all pages
Secondary:  Backup if primary doesn't respond in 15 minutes
Escalation: Engineering manager → VP Engineering (for P1 only)
Rotation:   Weekly, handoff every Monday 10:00 AM
```

### On-Call Expectations

| Aspect | Requirement |
|--------|-------------|
| Acknowledge page | Within 5 minutes |
| Begin triage | Within 15 minutes |
| Update status page | Within 20 minutes (P1/P2) |
| Escalate if stuck | After 30 minutes without progress |
| Max incidents/shift | 2 per 12-hour shift (exceeding this indicates a process problem) |
| Compensation | On-call stipend + per-incident pay for off-hours |

### Handoff Template

```markdown
## On-Call Handoff: [date]

### Active Issues
- [Issue]: [status], [next action], [ETA]

### Recent Changes
- [deployment/config change]: [when], [who], [rollback plan]

### Upcoming Risks
- [planned maintenance]: [when], [impact], [owner]
- [known issue]: [workaround], [fix ETA]

### Notes for Next Shift
- [anything unusual observed]
```

### Escalation Matrix

| Severity | Description | Response | Escalation |
|----------|-------------|----------|------------|
| P1 | Service fully down, data loss risk | Page immediately | Eng manager at 15 min, VP at 30 min |
| P2 | Degraded service, partial impact | Page within 15 min | Eng manager at 30 min |
| P3 | Minor impact, workaround available | Ticket, next business day | Team lead at 48 h |
| P4 | Cosmetic, no user impact | Backlog | N/A |

## Infrastructure Runbooks

### High CPU on Nodes

Symptoms: Node CPU >85%, pod throttling, slow responses
Alert: NodeCPUHigh

Diagnosis:
1. Identify top CPU consumers:
   kubectl top pods --all-namespaces --sort-by=cpu | head -20

2. Check if it's one pod or many:
   - One pod: check for CPU-intensive operation, missing limits, infinite loop
   - Many pods: check for traffic spike, external dependency timeout causing retries

3. Check HPA status:
   kubectl get hpa --all-namespaces

Mitigation:
- Immediate: Scale out nodes (Karpenter will auto-provision if configured)
- Short-term: Increase CPU limits for the affected pod; add an HPA if missing (see the sketch below)
- Long-term: Profile the application, optimize hot paths, add caching
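
For the short-term mitigation, a minimal sketch of adding a CPU-based HPA, assuming a Deployment-managed workload; the deployment name, namespace, and targets are illustrative:

```bash
# Hypothetical sketch: scale the affected workload on CPU utilization
# ("payment-api", "payments", and the 70% target are illustrative values).
kubectl autoscale deployment payment-api -n payments \
  --cpu-percent=70 --min=3 --max=20
```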

Verification:
- kubectl top nodes → CPU <70%
- No container CPU throttling (e.g., container_cpu_cfs_throttled_periods_total back to baseline)
- Response latency returns to normal

### Pod CrashLoopBackOff

Symptoms: Pod restarts repeatedly, service degraded
Alert: PodCrashLooping

Diagnosis:
1. Check pod events:
   kubectl describe pod POD -n NAMESPACE | tail -30

2. Check previous container logs:
   kubectl logs POD -n NAMESPACE --previous

3. Common causes:
   - OOMKilled → check: kubectl get pod POD -o jsonpath='{.status.containerStatuses[0].lastState}'
   - Application error → check logs for stack trace
   - Missing config/secret → check: kubectl get pod POD -o yaml | grep -A5 envFrom
   - Failed health probe → check probe config vs actual startup time
   - Permission denied → check securityContext, serviceAccount

Mitigation:
- OOMKilled: Increase memory limits (check actual usage first)
- App error: Fix code, rollback deployment if recent change
- Missing config: Create/fix ConfigMap or Secret
- Probe failure: Increase initialDelaySeconds or add a startupProbe, adjust thresholds (see the sketch below)
- Permission: Fix RBAC, adjust securityContext
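
For the probe-failure case, a minimal sketch that adds a startupProbe so slow starts stop tripping liveness checks; the deployment name, namespace, probe path, and port are illustrative:

```bash
# Hypothetical sketch: allow up to 30 * 10s = 5 minutes of startup time
# before other probes take over (names and endpoint are illustrative).
kubectl patch deployment payment-api -n payments --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/startupProbe",
   "value": {"httpGet": {"path": "/healthz", "port": 8080},
             "failureThreshold": 30, "periodSeconds": 10}}
]'
```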

Verification:
- kubectl get pod POD → Running, 0 restarts in last 10 min
- kubectl logs POD → no error messages
- Service responding to health checks

### Database Connection Pool Exhaustion

Symptoms: "too many connections", application timeouts, 503 errors
Alert: DatabaseConnectionPoolExhausted

Diagnosis:
1. Check active connections:
   SELECT count(*), state FROM pg_stat_activity GROUP BY state;

2. Check for idle-in-transaction:
   SELECT pid, query, state, query_start FROM pg_stat_activity
   WHERE state = 'idle in transaction' AND query_start < NOW() - INTERVAL '5 minutes';

3. Check for long-running queries:
   SELECT pid, query, state, NOW() - query_start AS duration
   FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;

4. Check application pool settings vs database max_connections

Mitigation:
- Immediate: Kill idle-in-transaction connections:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle in transaction' AND query_start < NOW() - INTERVAL '10 minutes';
- Short-term: Increase pool size or max_connections (with RAM check)
- Long-term: Fix connection leaks in the application, add connection timeouts (see the sketch below), use PgBouncer
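
For the connection-timeout mitigation, a minimal server-side sketch, assuming PostgreSQL 9.6+ and superuser access; the host and the 5-minute value are illustrative:

```bash
# Hypothetical sketch: reclaim leaked idle-in-transaction sessions server-side.
psql -h DB_HOST -U postgres -c \
  "ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';"
# ALTER SYSTEM writes postgresql.auto.conf; reload to apply.
psql -h DB_HOST -U postgres -c "SELECT pg_reload_conf();"
```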

Verification:
- Active connections < 80% of max_connections (query sketch below)
- No idle-in-transaction connections older than 1 minute
- Application 503 errors resolved
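
A quick way to check the first verification item; the host is illustrative:

```bash
# Hypothetical sketch: percentage of max_connections currently in use.
psql -h DB_HOST -U postgres -c \
  "SELECT round(count(*) * 100.0 / current_setting('max_connections')::int, 1)
     AS pct_used FROM pg_stat_activity;"
```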

### Certificate Expiration

Symptoms: TLS errors, browser warnings, API connection failures
Alert: CertificateExpiringSoon (fires at 30 days, 7 days, 1 day)

Diagnosis:
1. Check certificate expiry:
   echo | openssl s_client -connect HOST:443 2>/dev/null | openssl x509 -noout -dates

2. Check Kubernetes TLS secrets:
   kubectl get secret TLS_SECRET -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

3. Check cert-manager status (if used):
   kubectl get certificates --all-namespaces
   kubectl describe certificate CERT_NAME -n NAMESPACE

Mitigation:
- cert-manager: Check Certificate resource, Issuer status, fix any errors
- Manual: Renew certificate, update Kubernetes secret, restart ingress controller
- ACM (AWS): Certificates auto-renew — check DNS validation records exist

Prevention:
- Use cert-manager with Let's Encrypt for automatic renewal (see the sketch below)
- Alert at 30 days, 7 days, 1 day before expiry
- Never use self-signed certificates in production
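
A minimal sketch of the cert-manager approach, assuming cert-manager is installed with a working ClusterIssuer; the names, namespace, domain, and issuer are illustrative:

```bash
# Hypothetical sketch: a Certificate resource that cert-manager renews
# automatically before expiry (all names below are illustrative).
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payment-api-tls
  namespace: payments
spec:
  secretName: payment-api-tls
  dnsNames:
    - api.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
EOF
```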

### Node NotReady in Kubernetes

Symptoms: Pods evicted, scheduling failures, workload disruption
Alert: KubernetesNodeNotReady

Diagnosis:
1. Check node status:
   kubectl get nodes
   kubectl describe node NODE_NAME | grep -A10 Conditions

2. Check kubelet:
   ssh NODE → journalctl -u kubelet --since "30 minutes ago" | tail -50

3. Common causes:
   - Disk pressure: df -h (above the kubelet eviction threshold, commonly ~85%, the node reports DiskPressure and evicts pods)
   - Memory pressure: free -m (OOM killer active)
   - Network: can node reach API server? curl -k https://APISERVER:6443/healthz
   - kubelet crash: systemctl status kubelet

Mitigation:
- Disk pressure: Clean up logs, images: crictl rmi --prune
- Memory: Identify OOM-killed processes, restart heavy workloads
- Network: Check security groups, route tables, VPN
- kubelet: Restart kubelet, check certificates

- If the node won't recover, drain it:
  kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
  (Karpenter will replace the node automatically; see the verification sketch below)
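
A minimal way to watch the replacement complete; NODE_NAME as in the drain command above:

```bash
# Wait for the replacement node to reach Ready...
kubectl get nodes -w
# ...and confirm the drained node carries no non-DaemonSet pods.
kubectl get pods -A --field-selector spec.nodeName=NODE_NAME
```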

Verification:
- kubectl get nodes → all Ready
- Pods rescheduled and running on healthy nodes
- No PodDisruptionBudget violations

## Blameless Postmortem Template

```markdown
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD  **Severity**: P1/P2/P3  **Duration**: X hours Y minutes
**Authors**: [names]  **Reviewers**: [names]  **Status**: Draft / Action Items In Progress / Closed

## Executive Summary
[2-3 sentences: what happened, impact, resolution]

## Impact
- **Users affected**: [count or percentage]
- **Duration**: [minutes]
- **Revenue impact**: [estimate or "not applicable"]
- **SLO impact**: [X% of error budget consumed]
- **Data loss**: [yes/no, details if yes]

## Timeline (all times in UTC)

| Time | Event |
|------|-------|
| HH:MM | Triggering event (deployment, config change, traffic spike) |
| HH:MM | First alert fired: [alert name] |
| HH:MM | On-call acknowledged |
| HH:MM | Initial triage: [what was checked] |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied: [what was done] |
| HH:MM | Service partially recovered |
| HH:MM | Service fully recovered |
| HH:MM | All-clear communicated |

## Root Cause Analysis

### 5 Whys
1. **Why** did the service go down? → [proximate cause]
2. **Why** did that happen? → [intermediate cause]
3. **Why** did that happen? → [deeper cause]
4. **Why** did that happen? → [systemic cause]
5. **Why** did that happen? → [ROOT CAUSE]

### Technical Details
[Detailed technical explanation with evidence: logs, metrics, code references]

## Contributing Factors
- [Factor that amplified the impact but isn't the root cause]
- [Factor that delayed detection or response]

## What Went Well
- [Things that worked during the incident]
- [Effective responses or tools]

## What Could Be Improved
- [Detection gaps]
- [Response delays]
- [Communication issues]
- [Tool limitations]

## Action Items

| # | Action | Type | Owner | Priority | Due | Status |
|---|--------|------|-------|----------|-----|--------|
| 1 | [Specific fix for root cause] | Prevent | [name] | P1 | [date] | Open |
| 2 | [Detection improvement] | Detect | [name] | P1 | [date] | Open |
| 3 | [Process improvement] | Process | [name] | P2 | [date] | Open |
| 4 | [Runbook update] | Document | [name] | P2 | [date] | Open |

## Lessons Learned
- [Systemic insight applicable beyond this incident]
- [Pattern to watch for in similar systems]
```

Postmortem rules:

  • Focus on systems, not individuals
  • Every P1/P2 gets a postmortem within 5 business days
  • Action items have owners, priorities, and deadlines
  • Close postmortem only when all P1 action items are complete
  • Share with entire engineering org to spread learnings

## Capacity Planning Template

```markdown
## Capacity Plan: [Service Name]

**Current baseline** (date: YYYY-MM-DD):
- Peak QPS: [value]
- Average CPU utilization at peak: [%]
- Average memory utilization at peak: [%]
- Current replica count: [N]
- Current node count: [N]

**Growth projection**:
- Monthly growth rate: [%]
- Projected peak QPS in 3 months: [value]
- Projected peak QPS in 6 months: [value]
- Seasonal multiplier: [e.g., 3x for Black Friday]

**Scaling limits**:
- Max replicas (HPA): [N]
- Max nodes (Karpenter/ASG): [N]
- Database max connections: [N]
- Known bottleneck: [component] at [threshold]

**Action required by [date]**:
- [ ] Increase [resource] from [current] to [target]
- [ ] Optimize [bottleneck] before reaching [threshold]
- [ ] Load test at [projected peak * 1.5]
```
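
To turn the growth projection into a date, a minimal sketch that computes months of headroom under compounding monthly growth; the QPS figures and 10% rate are illustrative:

```bash
# Hypothetical sketch: months until projected peak QPS hits a known limit.
current=1200 limit=5000 growth=0.10   # illustrative values
awk -v c="$current" -v l="$limit" -v g="$growth" \
  'BEGIN { printf "%.1f months until %d QPS\n", log(l/c)/log(1+g), l }'
```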

## Chaos Engineering Playbook

### Prerequisites

  • SLOs defined and monitored
  • Error budget available (>20%)
  • Runbooks exist for expected failure modes
  • Team has agreed to the experiment scope

### Experiment Template

```markdown
## Chaos Experiment: [Title]

**Hypothesis**: If [failure condition], the system should [expected behavior] within [time].

**Steady state**: [metrics that define "normal" — SLIs within SLO]

**Method**: [What to break]
- Scope: [namespace/service/node]
- Duration: [how long]
- Blast radius: [what's affected]

**Abort conditions**: [when to stop immediately]
- SLI drops below [threshold]
- Error rate exceeds [%]
- Customer reports

**Results**:
- Hypothesis confirmed? [yes/no]
- Unexpected behavior: [description]
- Action items: [list]
```

### Common Experiments

| Experiment | Tool | Tests |
|------------|------|-------|
| Kill random pod | kubectl delete pod | Auto-recovery, PDB enforcement |
| Inject latency | Chaos Mesh / Litmus | Circuit breakers, timeouts, retries |
| Network partition | Chaos Mesh | Graceful degradation, fallbacks |
| Fill disk | dd if=/dev/zero | Alerts, log rotation, eviction |
| DNS failure | Block DNS resolution | Caching, error handling, fallbacks |
| Dependency unavailable | Block egress to service | Circuit breaker, cached responses |
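
As a concrete starting point, a minimal sketch of the first experiment in the table; the namespace and label selector are illustrative, and it should only run within an agreed scope with abort conditions watched:

```bash
# Hypothetical sketch: delete one randomly chosen pod and watch recovery.
victim=$(kubectl get pods -n payments -l app=payment-api -o name | shuf -n 1)
kubectl delete -n payments "$victim"
# Steady state should recover within the hypothesis window:
kubectl get pods -n payments -w
```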

### Game Day Planning

  1. Schedule: 2 hours, business hours, all relevant engineers present
  2. Scope: One service or failure domain per game day
  3. Run: Execute experiments, observe responses, take notes
  4. Debrief: Immediately after — what worked, what didn't, action items
  5. Cadence: Monthly for critical services, quarterly for others

## Toil Reduction Framework

### Identify Toil

Toil is work that is manual, repetitive, and automatable; reactive rather than proactive; scales with service growth; and produces no enduring value.

### Catalog Template

| Task | Frequency | Duration | Automatable? | Priority |
|------|-----------|----------|--------------|----------|
| Restart pods after deploy | Daily | 10 min | Yes | High |
| Rotate certificates | Monthly | 30 min | Yes (cert-manager) | High |
| Clean up old logs | Weekly | 15 min | Yes (logrotate) | Medium |
| Review alerts | Daily | 20 min | Partially (tune alerts) | Medium |
| Manual scaling | Variable | 15 min | Yes (HPA/Karpenter) | High |

### Target

  • SREs spend <50% of their time on toil, >50% on engineering projects
  • Track toil hours per sprint
  • Prioritize automation by frequency * duration * error_risk (see the sketch below)
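
A minimal sketch of that prioritization as a sortable score; the task rows and risk weights (1-5) are illustrative:

```bash
# score = runs/month * minutes * error risk (1-5); highest score first.
cat <<'EOF' | awk -F'|' '{ printf "%6d  %s\n", $2*$3*$4, $1 }' | sort -rn
Restart pods after deploy|30|10|3
Rotate certificates|1|30|5
Manual scaling|8|15|4
EOF
```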