deployment-runbook
Deployment Runbook
Overview
This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.
When to Use This Skill
- Planning production deployments
- Executing staged rollouts (canary, blue-green)
- Performing post-deployment health checks
- Rolling back failed deployments
- Troubleshooting deployment issues
- Establishing deployment best practices
- Complementing the devops-automation agent for deployments
Pre-Deployment Checklist
Before any production deployment:
- Code Review: All changes reviewed and approved
- Tests Pass: CI/CD pipeline green
- Unit tests: β
- Integration tests: β
- E2E tests: β
- Database Migrations: Tested in staging
- Backward compatible
- Rollback script prepared
- Configuration: Environment variables verified
- Secrets rotated if needed
- Feature flags configured
- Monitoring: Dashboards and alerts ready
- Error tracking enabled
- Performance monitoring active
- Log aggregation configured
- Communication: Stakeholders notified
- Deployment window announced
- On-call engineer assigned
- Rollback plan documented
- Backups: Recent backup verified
- Database backed up < 1 hour ago
- Backup restoration tested
- Capacity: Resources scaled appropriately
- Auto-scaling configured
- Rate limits reviewed
- CDN cache warmed
Deployment Strategies
1. Blue-Green Deployment
Best for: Zero-downtime deployments, easy rollbacks
Process:
- Deploy to inactive (green) environment
- Run health checks on green
- Switch traffic from blue to green
- Monitor for issues
- Keep blue as instant rollback option
Commands:
# Deploy to green environment
./deploy.sh --env green
# Run health checks
python3 scripts/health_check.py --env green
# Switch traffic (gradual)
./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100
# If issues: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
2. Canary Deployment
Best for: Risk-averse deployments, gradual rollouts
Process:
- Deploy to small subset of servers (5-10%)
- Monitor metrics closely
- Gradually increase percentage
- Roll back if metrics degrade
Monitoring During Canary:
- Error rate < baseline + 1%
- Response time < baseline + 10%
- Success rate > 99.9%
3. Rolling Deployment
Best for: Standard updates, resource-constrained environments
Process:
- Take one instance out of load balancer
- Deploy new version
- Run health checks
- Add back to load balancer
- Repeat for remaining instances
Deployment Workflow
Phase 1: Pre-Deployment (T-30 minutes)
# 1. Verify staging environment
./verify_staging.sh
# 2. Create deployment tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3
# 3. Trigger production build
./build_production.sh --tag v1.2.3
# 4. Backup database
./backup_db.sh --environment production
# 5. Notify team
./notify_slack.sh "π Starting deployment v1.2.3 in 30 minutes"
Phase 2: Deployment (T-0)
# 1. Enable maintenance mode (if needed)
./maintenance_mode.sh --enable
# 2. Run database migrations
./run_migrations.sh --environment production
# 3. Deploy application
./deploy.sh --environment production --version v1.2.3
# 4. Disable maintenance mode
./maintenance_mode.sh --disable
Phase 3: Post-Deployment Health Checks
# Run comprehensive health checks
python3 scripts/health_check.py --environment production
# Expected output:
# β API health endpoint responding
# β Database connectivity OK
# β Cache layer accessible
# β External services reachable
# β Error rate within threshold
# β Response time within SLA
Phase 4: Monitoring (T+30 minutes)
Monitor these metrics:
Application Metrics:
- Error rate: < 0.1%
- Response time (p95): < 200ms
- Request throughput: within expected range
- Success rate: > 99.9%
Infrastructure Metrics:
- CPU utilization: < 70%
- Memory usage: < 80%
- Disk I/O: normal patterns
- Network latency: < 50ms
Business Metrics:
- Conversion rate: no significant drop
- User signups: within expected range
- Transaction volume: normal patterns
Rollback Procedures
When to Rollback
Rollback immediately if:
- Error rate > 1%
- Critical functionality broken
- Data corruption detected
- Security vulnerability introduced
- Performance degradation > 50%
Rollback Methods
Method 1: Traffic Switch (Fastest)
# Blue-green: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
# Verification
python3 scripts/health_check.py --environment production
Method 2: Version Revert
# Deploy previous version
./deploy.sh --environment production --version v1.2.2
# Run health checks
python3 scripts/health_check.py --environment production
Method 3: Database Rollback
# If migrations were applied
./rollback_migration.sh --environment production --steps 1
# Restore from backup (last resort)
./restore_db.sh --backup latest --environment production
Post-Rollback
-
Verify system health
python3 scripts/health_check.py --environment production -
Notify stakeholders
./notify_slack.sh "β οΈ Deployment v1.2.3 rolled back. System stable on v1.2.2" -
Create postmortem
- What went wrong?
- Why didn't we catch it?
- How do we prevent recurrence?
Health Check Script
Use the included health check script:
# Run all checks
python3 scripts/health_check.py --env production
# Run specific check
python3 scripts/health_check.py --env production --check api
# Verbose output
python3 scripts/health_check.py --env production --verbose
See scripts/health_check.py for implementation.
Troubleshooting Guide
Issue: Deployment Hangs
Symptoms:
- Deployment script doesn't complete
- Services not starting
Diagnosis:
# Check service logs
kubectl logs -f deployment/app-name
# Check events
kubectl get events --sort-by='.lastTimestamp'
Resolution:
- Increase timeout values
- Check resource constraints
- Verify image pull secrets
Issue: High Error Rate Post-Deployment
Symptoms:
- Error rate spike
- 500 errors in logs
Diagnosis:
# Check application logs
tail -f /var/log/app/error.log
# Check error distribution
grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
Resolution:
- Check configuration changes
- Verify environment variables
- Review recent code changes
- Consider immediate rollback
Issue: Database Connection Failures
Symptoms:
- "Connection refused" errors
- Timeout errors
Diagnosis:
# Test database connectivity
python3 scripts/test_db_connection.py
# Check connection pool
psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"
Resolution:
- Verify connection strings
- Check firewall rules
- Increase connection pool size
- Verify credentials
Communication Templates
Pre-Deployment Announcement
π **Production Deployment Scheduled**
**Version**: v1.2.3
**Time**: 2024-01-15 14:00 UTC (30 minutes)
**Duration**: ~15 minutes
**Impact**: No expected downtime
**Changes**:
- Feature: New user dashboard
- Fix: Payment processing bug
- Performance: API response time improvements
**Rollback Plan**: Blue-green switch (instant)
**On-Call**: @engineer-name
Deployment Success
β
**Deployment Complete**
**Version**: v1.2.3
**Status**: Successful
**Duration**: 12 minutes
**Health Checks**: All passing β
**Metrics**: Within normal range
**Next Check**: T+30 minutes
Monitoring dashboard: [link]
Deployment Rollback
β οΈ **Deployment Rolled Back**
**Version**: v1.2.3 β v1.2.2 (rollback)
**Reason**: Elevated error rate (2.1%)
**Status**: System stable on v1.2.2
**Action Items**:
- [ ] Root cause analysis
- [ ] Fix identified issue
- [ ] Re-test in staging
- [ ] Schedule re-deployment
Incident report: [link]
Resources
scripts/
- health_check.py: Comprehensive deployment health checks
- test_db_connection.py: Database connectivity verification
references/
- deployment-checklist.md: Detailed pre/post deployment checklist
- monitoring-guide.md: Metrics to monitor during deployments
Best Practices
- Always deploy during low-traffic windows
- Never deploy on Fridays (unless critical hotfix)
- Keep deployments small (< 200 lines changed)
- Monitor for 30+ minutes post-deployment
- Document every rollback with postmortem
- Test rollback procedure in staging first
- Use feature flags for risky changes
- Automate health checks (don't rely on manual verification)
Quick Reference
Emergency Rollback:
./switch_traffic.sh --from green --to blue --percentage 100
Health Check:
python3 scripts/health_check.py --env production
View Logs:
kubectl logs -f deployment/app-name --tail=100
Check Metrics:
curl https://metrics.example.com/api/health