runbook-creation
Installation
SKILL.md
Runbook Creation
Create effective operational runbooks, standard operating procedures, and troubleshooting guides that any on-call engineer can follow under pressure.
Runbook Template — Full Structure
# Runbook: [Service / Process Name]
**Owner:** [Team or individual]
**Last Reviewed:** YYYY-MM-DD
**Version:** X.Y
**Severity if unavailable:** SEV[1-4]
---
## Overview
Brief description of the service, why this runbook exists, and when to
use it.
## Prerequisites
- [ ] Required access / IAM role: [details]
- [ ] Tools installed: [kubectl, aws-cli, psql, etc.]
- [ ] VPN connected to [environment]
- [ ] Communication channel open: [Slack #channel]
## Procedure
### Step 1 — [Action Name]
[Explanation of what this step does and why.]
```bash
# command here
```
**Expected output:** [describe what success looks like]
### Step 2 — [Action Name]
```bash
# command here
```
**Expected output:** [description]
*(Continue with numbered steps...)*
## Verification
How to confirm the procedure succeeded:
- [ ] [Check 1 — e.g., health endpoint returns 200]
- [ ] [Check 2 — e.g., no errors in logs for 5 minutes]
- [ ] [Check 3 — e.g., metrics return to baseline]
## Rollback
If the procedure fails or causes unexpected issues:
### Rollback Step 1
```bash
# rollback command
```
### Rollback Step 2
```bash
# rollback command
```
## Troubleshooting
| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| [symptom 1] | [cause] | [fix] |
| [symptom 2] | [cause] | [fix] |
## Escalation
If unresolved after [X] minutes:
- **Primary:** @[team-lead] — [phone/Slack]
- **Secondary:** @[manager] — [phone/Slack]
## Related Runbooks
- [Link to related runbook 1]
- [Link to related runbook 2]
## Change Log
| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | [Name] | Initial version |
Example Runbook — Database Failover
# Runbook: PostgreSQL Database Failover
**Owner:** Platform / DBA team
**Last Reviewed:** 2025-06-15
**Version:** 2.1
**Severity if unavailable:** SEV1
---
## Overview
Failover the primary PostgreSQL instance to the synchronous replica when
the primary is unreachable or degraded. This runbook covers both planned
(maintenance) and unplanned (emergency) failover.
## Prerequisites
- [ ] DBA or SRE-level access to primary and replica hosts
- [ ] `psql` client installed (v14+)
- [ ] VPN connected to production network
- [ ] Slack channel #db-ops open
- [ ] Confirm replica is in sync: replication lag < 1 MB
## Procedure
### Step 1 — Verify Replica Health
```bash
psql -h replica.db.internal -U dba -d postgres -c \
"SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
```
**Expected output:** `pg_is_in_recovery = t`, LSN advancing.
### Step 2 — Stop Application Writes
```bash
kubectl scale deployment api-server --replicas=0 -n production
kubectl scale deployment worker --replicas=0 -n production
```
**Expected output:** Deployments scaled to 0 pods.
### Step 3 — Confirm Write Quiesce
```bash
psql -h primary.db.internal -U dba -d postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query !~ 'pg_stat';"
```
**Expected output:** Count = 0 (no active queries).
### Step 4 — Promote Replica
```bash
psql -h replica.db.internal -U dba -d postgres -c "SELECT pg_promote();"
```
Wait up to 30 seconds, then confirm:
```bash
psql -h replica.db.internal -U dba -d postgres -c "SELECT pg_is_in_recovery();"
```
**Expected output:** `pg_is_in_recovery = f` (no longer a replica).
### Step 5 — Update DNS
```bash
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "db.internal.example.com",
"Type": "CNAME",
"TTL": 60,
"ResourceRecords": [{"Value": "replica.db.internal"}]
}
}]
}'
```
### Step 6 — Restart Application
```bash
kubectl scale deployment api-server --replicas=6 -n production
kubectl scale deployment worker --replicas=4 -n production
```
## Verification
- [ ] `psql -h db.internal.example.com -c "SELECT 1;"` returns successfully
- [ ] Application logs show successful DB connections (no errors for 5 min)
- [ ] Transaction throughput returns to baseline on Grafana dashboard
- [ ] No replication-lag alerts firing
## Rollback
If the promoted replica has issues, restore from the most recent backup:
```bash
# Restore latest automated snapshot (RDS example)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier prod-db-restored \
--db-snapshot-identifier prod-db-latest-snapshot
```
## Escalation
If unresolved after 15 minutes:
- **Primary:** @dba-lead — +1-555-0101
- **Secondary:** @platform-oncall — +1-555-0102
Automation Scripts for Common Operations
Service Health Check
#!/usr/bin/env bash
# health-check.sh — Check health of critical services
set -euo pipefail
SERVICES=(
"https://api.example.com/healthz"
"https://app.example.com/healthz"
"https://admin.example.com/healthz"
)
EXIT_CODE=0
for url in "${SERVICES[@]}"; do
HTTP_CODE=$(curl -so /dev/null -w '%{http_code}' --max-time 5 "$url" 2>/dev/null || echo "000")
if [ "$HTTP_CODE" -eq 200 ]; then
printf " OK %s\n" "$url"
else
printf " FAIL %s (HTTP %s)\n" "$url" "$HTTP_CODE"
EXIT_CODE=1
fi
done
exit $EXIT_CODE
Log Collection for Incident Investigation
#!/usr/bin/env bash
# collect-logs.sh — Gather logs from multiple sources for incident review
set -euo pipefail
INCIDENT_ID="${1:?Usage: collect-logs.sh <incident-id>}"
OUTDIR="/tmp/incident-${INCIDENT_ID}"
mkdir -p "$OUTDIR"
echo "Collecting logs for incident $INCIDENT_ID..."
# Kubernetes pod logs (last 30 min)
kubectl logs -l app=api-server -n production --since=30m \
> "${OUTDIR}/api-server-pods.log" 2>&1
# CloudWatch Logs (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/production/api \
--start-time "$(date -d '30 minutes ago' +%s)000" \
--output text > "${OUTDIR}/cloudwatch-api.log" 2>&1
# Database slow query log
psql -h db.internal -U dba -d postgres -c \
"SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;" \
> "${OUTDIR}/db-active-queries.log" 2>&1
# System resource snapshot
kubectl top pods -n production > "${OUTDIR}/pod-resources.log" 2>&1
echo "Logs saved to $OUTDIR"
tar czf "${OUTDIR}.tar.gz" -C /tmp "incident-${INCIDENT_ID}"
echo "Archive: ${OUTDIR}.tar.gz"
Certificate Expiry Check
#!/usr/bin/env bash
# cert-check.sh — Warn if TLS certificates expire within 30 days
set -euo pipefail
DOMAINS=(
"api.example.com"
"app.example.com"
"admin.example.com"
)
WARN_DAYS=30
TODAY=$(date +%s)
EXIT_CODE=0
for domain in "${DOMAINS[@]}"; do
EXPIRY=$(echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
| openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || echo 0)
DAYS_LEFT=$(( (EXPIRY_EPOCH - TODAY) / 86400 ))
if [ "$DAYS_LEFT" -lt "$WARN_DAYS" ]; then
printf " WARN %s expires in %d days (%s)\n" "$domain" "$DAYS_LEFT" "$EXPIRY"
EXIT_CODE=1
else
printf " OK %s — %d days remaining\n" "$domain" "$DAYS_LEFT"
fi
done
exit $EXIT_CODE
Disk Space Cleanup
#!/usr/bin/env bash
# disk-cleanup.sh — Free disk space on a host
set -euo pipefail
echo "=== Disk Usage Before ==="
df -h /
# Remove old journal logs (> 7 days)
journalctl --vacuum-time=7d 2>/dev/null || true
# Clean Docker artifacts
docker system prune -f --volumes 2>/dev/null || true
# Remove old log files
find /var/log -name "*.gz" -mtime +7 -delete 2>/dev/null || true
find /tmp -type f -mtime +3 -delete 2>/dev/null || true
echo "=== Disk Usage After ==="
df -h /
Runbook Review Checklist
Use this checklist every time a runbook is created or updated.
content_review:
- [ ] Title clearly identifies the service and operation
- [ ] Overview explains WHEN and WHY to use this runbook
- [ ] Prerequisites list all required access, tools, and setup
- [ ] Every step has a concrete command (no vague instructions)
- [ ] Expected output is documented for each step
- [ ] Verification section confirms success with specific checks
- [ ] Rollback section exists and has been tested
- [ ] Escalation contacts are current (names, phones, Slack handles)
- [ ] Troubleshooting table covers the top 3-5 known failure modes
usability_review:
- [ ] A new team member can follow the runbook without tribal knowledge
- [ ] Steps are numbered and sequential (no branching without clear labels)
- [ ] Commands can be copy-pasted (no placeholder values without explanation)
- [ ] Time estimates included for long-running steps
- [ ] No jargon or acronyms used without definition
maintenance_review:
- [ ] Owner and last-reviewed date are set
- [ ] Version number incremented
- [ ] Change log entry added
- [ ] Related runbooks section is up to date
- [ ] Links to dashboards and docs are valid (not broken)
Runbook Testing Procedures
testing_strategy:
dry_run:
frequency: "Every time a runbook is created or substantially edited"
method: "Walk through each step in a staging environment"
goal: "Verify commands work and output matches documentation"
peer_review:
frequency: "Every edit"
method: "Another engineer follows the runbook in staging without help"
goal: "Confirm the runbook is self-contained and unambiguous"
scheduled_validation:
frequency: "Quarterly"
method: "SRE team picks 5 runbooks at random, executes in staging"
goal: "Catch runbooks that have drifted from production reality"
incident_triggered:
trigger: "Any time a runbook is used in a real incident"
method: "Post-mortem includes runbook accuracy assessment"
goal: "Capture improvements while the experience is fresh"
automation_testing:
method: "CI pipeline validates bash scripts with shellcheck and dry-run"
example: |
# .github/workflows/runbook-lint.yml
name: Lint Runbook Scripts
on: [pull_request]
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: ShellCheck
run: |
find runbooks/ -name "*.sh" -exec shellcheck {} +
Versioning Strategy
versioning:
storage: "Git repository — one directory per service, one file per runbook"
naming: "runbooks/<service>/<operation>.md"
branching: "PRs required for all changes; reviewed by service owner"
version_scheme:
format: "MAJOR.MINOR"
major_bump: "Procedure changes that alter the steps or their order"
minor_bump: "Clarifications, typo fixes, updated contact info"
directory_layout: |
runbooks/
api-server/
deploy.md
rollback.md
scale-up.md
database/
failover.md
backup-restore.md
vacuum-maintenance.md
infrastructure/
dns-update.md
certificate-renewal.md
disk-cleanup.md
review_requirements:
- PR must be approved by the service owner
- CI must pass (shellcheck for scripts, markdown lint)
- Reviewer confirms they can follow the steps independently
retention: "Git history serves as full audit trail — never delete old versions"
Runbook Index Template
Keep a top-level index so engineers can find the right runbook quickly.
# Runbook Index
| Service | Runbook | Severity | Owner | Last Tested |
|---------|---------|----------|-------|-------------|
| API Server | [Deploy](api-server/deploy.md) | — | @platform | 2025-05-01 |
| API Server | [Rollback](api-server/rollback.md) | SEV1 | @platform | 2025-05-01 |
| Database | [Failover](database/failover.md) | SEV1 | @dba | 2025-04-15 |
| Database | [Backup Restore](database/backup-restore.md) | SEV2 | @dba | 2025-04-15 |
| Infra | [DNS Update](infrastructure/dns-update.md) | SEV2 | @sre | 2025-06-01 |
| Infra | [Cert Renewal](infrastructure/certificate-renewal.md) | SEV3 | @sre | 2025-06-01 |
Best Practices
- Write runbooks for the engineer at 3 AM — clear, sequential, copy-pasteable
- Include expected output so the operator knows if a step succeeded
- Always provide a rollback path; every action should be reversible
- Test runbooks in staging before they are needed in production
- Keep runbooks in version control alongside the code they support
- Assign an owner to every runbook; ownerless runbooks rot fast
- After every incident, update the relevant runbook with lessons learned
- Automate repetitive runbook steps into scripts, but keep the runbook as the orchestration guide so operators understand the "why"
Weekly Installs
33
Repository
bagelhole/devop…t-skillsGitHub Stars
18
First Seen
5 days ago
Security Audits