# Runbook Creator
Templates and best practices for creating effective operational runbooks.
## When to Use This Skill
- Creating runbooks for new services
- Documenting incident response procedures
- Writing operational playbooks
- Standardizing on-call documentation
- Automating common procedures
## Runbook Principles
- Actionable: Every step should be executable
- Testable: Verify each step works
- Current: Update when systems change
- Accessible: Available during incidents (not behind VPN-only)
- Linked: Referenced from alerts
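
The "Current" principle can be checked mechanically. The following is a minimal sketch, assuming each runbook carries the `**Last Updated**: YYYY-MM-DD` line from the standard template in this document and that GNU `date` is available. The function name `runbook_is_stale` and the 90-day default are illustrative assumptions, not part of any existing tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a runbook looks stale.
# Assumes a line of the form "**Last Updated**: YYYY-MM-DD" and GNU date.
runbook_is_stale() {
  local file="$1"
  local max_age_days="${2:-90}"   # illustrative default, adjust to your policy
  local updated updated_s now_s

  # Extract the first date that follows the "Last Updated" label.
  updated="$(sed -n 's/.*Last Updated\*\*: *\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$file" | head -n 1)"

  # A missing or placeholder date is treated as stale.
  [ -z "$updated" ] && return 0

  # An unparseable date is also treated as stale.
  updated_s="$(date -d "$updated" +%s)" || return 0
  now_s="$(date +%s)"

  # Stale when the age in seconds exceeds the allowed window.
  [ $(( now_s - updated_s )) -gt $(( max_age_days * 86400 )) ]
}
```

A possible use is a periodic job that loops over the runbook directory and reports every file for which `runbook_is_stale "$file" 180` succeeds.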
## Standard Runbook Template
Copy and customize this template:
````markdown
# [Service Name] - [Issue Type]

## Overview
Brief description of what this runbook addresses.

**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]

## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3

## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]

## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]

## Diagnostic Steps

### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running
```
**Expected Output**: [What you should see]

**If Different**: [What to do]

### Step 2: [Gather Information]
```bash
# Command to run
kubectl logs deployment/my-service -n production --tail=100
```
**Look For**: [What to look for in output]

## Resolution Steps

### Option A: [Quick Fix - e.g., Restart]
**Use when**: [conditions]
```bash
# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production
# Step 2: Verify pods are coming up
kubectl get pods -n production -w
```
**Verification**: [How to confirm fix worked]

### Option B: [Rollback]
**Use when**: [conditions]
```bash
# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production
# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production
```
**Verification**: [How to confirm fix worked]

## Verification
How to confirm the issue is resolved:
- Error rate returned to normal
- Latency within SLO
- No related alerts firing
- User-facing functionality working

## Escalation
If this runbook doesn't resolve the issue:
1. **First**: Contact [Team/Person] via [Slack/Phone]
2. **Then**: Page [Escalation contact]
3. **Finally**: [Further escalation path]

## Related Resources

## Revision History
| Date | Author | Change |
|---|---|---|
| YYYY-MM-DD | Name | Initial version |
````
## Quick Runbook Templates
### Service Restart
````markdown
# [Service] - Restart Procedure

## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change

## Steps
1. **Notify team**
   Post in #incidents: "Restarting [service] due to [reason]"
2. **Restart service**
   ```bash
   kubectl rollout restart deployment/[service] -n [namespace]
   ```
3. **Monitor rollout**
   ```bash
   kubectl rollout status deployment/[service] -n [namespace]
   ```
4. **Verify health**
   ```bash
   kubectl get pods -n [namespace] | grep [service]
   # All pods should be Running, 1/1 Ready
   ```
5. **Check metrics**
   - Error rate: [dashboard link]
   - Latency: [dashboard link]

## Rollback
If restart makes things worse:
```bash
kubectl rollout undo deployment/[service] -n [namespace]
```
````
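
The "Verify health" step relies on reading the pod table by eye. As a minimal sketch, the check can be scripted by filtering the default `kubectl get pods` columns (NAME, READY, STATUS, ...). The function name `unready_pods` is a hypothetical helper, and pods in the `Completed` state (for example from Jobs) would also be reported, which may or may not be what you want.

```bash
# Hypothetical helper: read `kubectl get pods` table output on stdin and print
# the name of every pod that is not Running or not fully ready.
# Exits non-zero when at least one such pod is found.
unready_pods() {
  awk '
    NF == 0 { next }                          # ignore blank lines
    NR == 1 && $1 == "NAME" { next }          # skip the header row
    {
      split($2, ready, "/")                   # READY column, e.g. "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print $1
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

It could be used as `kubectl get pods -n [namespace] | grep [service] | unready_pods`, which prints nothing and exits 0 when every matching pod is healthy.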
### Database Failover
````markdown
# [Database] - Failover Procedure

## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors

## Prerequisites
- Database admin access
- Verify replica is in sync

## Pre-Failover Checks
1. **Check replication status**
   ```sql
   SELECT * FROM pg_stat_replication;
   ```
   Verify: state = 'streaming', lag is minimal
2. **Check replica health**
   ```bash
   pg_isready -h replica-host -p 5432
   ```

## Failover Steps
1. **Stop writes to primary** (if possible)
   ```sql
   ALTER SYSTEM SET default_transaction_read_only = on;
   SELECT pg_reload_conf();
   ```
2. **Promote replica**
   ```bash
   pg_ctl promote -D /var/lib/postgresql/data
   ```
3. **Update connection strings**
   - Update DNS/load balancer to point to new primary
   - Or update application config
4. **Verify applications reconnected**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
   ```

## Post-Failover
- Monitor error rates
- Set up new replica from old primary
- Update documentation
````
### Cache Clear
````markdown
# [Service] - Cache Clear Procedure

## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration

## Impact Assessment
- Cache clear will cause temporary latency spike
- Database load will increase temporarily

## Steps
1. **Notify team**
   Post in #incidents: "Clearing [cache] cache due to [reason]"
2. **Clear cache**
   **Redis - All keys**:
   ```bash
   # FLUSHALL deletes every key in every database on the instance
   redis-cli -h [host] FLUSHALL
   ```
   **Redis - Specific pattern**:
   ```bash
   # Pass the same host to both redis-cli calls so DEL hits the scanned instance
   redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL
   ```
   **Application cache**:
   ```bash
   curl -X POST http://[service]/admin/cache/clear
   ```
3. **Monitor**
   - Watch cache hit rate recover
   - Monitor database load
   - Check latency

## Verification
- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing
````
## Runbook Checklist
Before publishing a runbook, verify:
Runbook Quality Checklist:
- Title clearly describes the issue/procedure
- Symptoms section helps identify when to use
- All commands are copy-pasteable
- Expected output documented for each command
- Verification steps confirm success
- Escalation path is clear
- Links to dashboards work
- Tested by someone other than author
- Linked from relevant alerts
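
Part of this checklist can be automated before review. The sketch below covers only the mechanical part: it checks that a runbook file contains the level-2 sections named in the standard template. The function name `lint_runbook` and the choice of which sections to treat as mandatory are assumptions made for illustration. Items such as "tested by someone other than author" still need a human.

```bash
# Hypothetical linter: report template sections that a runbook file is missing.
# Section names come from the standard template; which ones are mandatory is
# an assumption.
lint_runbook() {
  local file="$1"
  local missing=0
  local section
  for section in "Symptoms" "Prerequisites" "Verification" "Escalation"; do
    # Match a level-2 Markdown heading with exactly this title.
    if ! grep -Eq "^## +${section} *\$" "$file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `lint_runbook path/to/runbook.md` in CI (the path is a placeholder) would fail the build whenever a required section is absent.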
## Automation Integration
### Runbook with Automation Hooks
````markdown
# [Service] - Automated Recovery

## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)

## Manual Steps (if auto-recovery fails)

### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
```

### Manual intervention
[Steps here]
````
### Script-Backed Runbook
````markdown
# [Service] - Diagnostic Script

## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]
```
This script checks:
- Pod status
- Recent logs
- Resource usage
- Dependency health

## Interpreting Results
| Result | Meaning | Action |
|---|---|---|
| `HEALTHY` | All checks pass | No action needed |
| `DEGRADED` | Some issues | Follow specific recommendations |
| `CRITICAL` | Major issues | Escalate immediately |
````
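
The template does not show how the script arrives at its result. One possible sketch of that final step is below, assuming the script counts failed checks in two buckets. The function name `summarize_checks`, the two-bucket split, and the thresholds are all assumptions, not the contents of an actual `diagnose-service.sh`.

```bash
# Hypothetical core of a diagnostic script: map check results to the summary
# values used in the results table. Assumed rule: any failed critical check is
# CRITICAL, any other failure is DEGRADED, otherwise HEALTHY.
summarize_checks() {
  local critical_failures="$1"   # e.g. pods not running, dependency down
  local warning_failures="$2"    # e.g. elevated resource usage
  if [ "$critical_failures" -gt 0 ]; then
    echo "CRITICAL"
  elif [ "$warning_failures" -gt 0 ]; then
    echo "DEGRADED"
  else
    echo "HEALTHY"
  fi
}
```

Printing a single summary word keeps the output easy to read during an incident and easy to parse if the script is later called from automation.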
## Common Runbook Categories
Every service should have runbooks for:
Essential Runbooks:
- Service restart
- Rollback deployment
- Scale up/down
- Clear cache
- Database failover (if applicable)
- Dependency failure response
- High error rate investigation
- High latency investigation
## Additional Resources
- [Example Runbooks](references/example-runbooks.md)
- [Runbook Automation](references/automation.md)