incident-runbook-generator
Incident Runbook Generator
Create actionable runbooks for common incidents.
Runbook Template
# Runbook: Database Connection Pool Exhausted
**Severity:** P1 (Critical)
**Estimated Time to Resolve:** 15-30 minutes
**Owner:** Database Team (On-call)
## Symptoms
- Application errors: "connection pool exhausted"
- Increased API latency (>5s)
- Failed health checks
- CloudWatch alarm: `DatabaseConnectionsHigh`
## Detection
- Alert: DatabaseConnectionPoolExhausted
- Metrics: `active_connections > max_connections * 0.9`
- Logs: "Error: connect ETIMEDOUT"
## Immediate Actions (5 min)
1. **Verify the issue**
```bash
# Check current connections
SELECT count(*) FROM pg_stat_activity;
```
-
Identify long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10; -
Kill blocking queries (if safe)
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes';
Mitigation (10 min)
-
Scale up connection pool (temporary)
# Update RDS parameter group aws rds modify-db-parameter-group \ --db-parameter-group-name prod-params \ --parameters "ParameterName=max_connections,ParameterValue=200" -
Restart application (if needed)
kubectl rollout restart deployment/api -
Monitor recovery
watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity;"'
Root Cause Investigation
Check for:
- Recent deployment (new code with connection leaks)
- Traffic spike (legitimate or DDoS)
- Slow queries holding connections
- Connection pool configuration too small
- Application not releasing connections
Rollback Steps
If caused by deployment:
# Rollback to previous version
kubectl rollout undo deployment/api
# Verify
kubectl rollout status deployment/api
Communication Template
Initial (within 5 min):
🚨 INCIDENT: Database connection pool exhausted
Status: Investigating
Impact: API errors and slowness
ETA: 15-30 min
Next update: 10 min
Update (every 10 min):
UPDATE: Killed long-running queries
Status: Mitigating
Impact: Still degraded, improving
Actions: Scaling connection pool
Next update: 10 min
Resolution:
✅ RESOLVED: Database connections normalized
Duration: 25 minutes
Root cause: Connection leak in v2.3.4
Fix: Rolled back to v2.3.3
Follow-up: Bug fix PR #1234
Postmortem: [link]
Prevention
- Add connection pool metrics to dashboards
- Implement connection timeout (30s)
- Add connection leak detection in tests
- Set up pre-deployment load testing
- Review connection pool sizing
Related Runbooks
- Database High CPU
- Slow Database Queries
- Application OOM
## Output Checklist
- [ ] Symptoms documented
- [ ] Detection criteria
- [ ] Step-by-step actions
- [ ] Owner assigned
- [ ] Rollback procedure
- [ ] Communication templates
- [ ] Prevention measures
ENDFILE
More from monkey1sai/openai-cli
multi-tenant-safety-checker
Ensures tenant isolation at query and policy level using Row Level Security, automated testing, and security audits. Prevents data leakage between tenants. Use for "multi-tenancy", "tenant isolation", "RLS", or "data security".
10modal-drawer-system
Implements accessible modals and drawers with focus trap, ESC to close, scroll lock, portal rendering, and ARIA attributes. Includes sample implementations for common use cases like edit forms, confirmations, and detail views. Use when building "modals", "dialogs", "drawers", "sidebars", or "overlays".
10eslint-prettier-config
Configures ESLint and Prettier for consistent code quality with TypeScript, React, and modern best practices. Use when users request "ESLint setup", "Prettier config", "linting configuration", "code formatting", or "lint rules".
9api-security-hardener
Hardens API security with rate limiting, input validation, authentication, and protection against common attacks. Use when users request "API security", "secure API", "rate limiting", "input validation", or "API protection".
9secure-headers-csp-builder
Implements security headers and Content Security Policy with safe rollout strategy (report-only → enforce), testing, and compatibility checks. Use for "security headers", "CSP", "HTTP headers", or "XSS protection".
9security-incident-playbook-generator
Creates response procedures for security incidents with containment steps, communication templates, and evidence collection. Use for "incident response", "security playbook", "breach response", or "IR plan".
9