# Disaster Recovery
Implement disaster recovery strategies including RTO/RPO planning, AWS cross-region failover patterns, DR testing procedures, and automated failover scripts.
## When to Use
- Defining RTO and RPO targets for critical systems
- Designing multi-region or multi-cloud disaster recovery architectures
- Implementing automated failover and failback procedures
- Conducting DR tests (tabletop, component, full failover)
- Meeting compliance requirements for contingency planning (SOC 2, HIPAA, FedRAMP, ISO 27001)
## RTO/RPO Planning

```yaml
recovery_metrics:
  RTO:
    definition: "Recovery Time Objective - maximum acceptable downtime"
    measurement: "From incident declaration to service restoration"
    factors:
      - Failover automation maturity
      - Data replication lag
      - DNS propagation time
      - Application warm-up time
      - Verification procedures
  RPO:
    definition: "Recovery Point Objective - maximum acceptable data loss"
    measurement: "Time gap between last good backup and the incident"
    factors:
      - Backup frequency
      - Replication method (sync vs. async)
      - Transaction log shipping interval
      - Cross-region replication lag

service_tier_targets:
  tier_1_critical:
    examples: "Authentication, payment processing, core API"
    rto: "< 15 minutes"
    rpo: "< 1 minute (near-zero)"
    strategy: "Multi-site active-active or warm standby"
    replication: "Synchronous or near-synchronous"
    testing: "Quarterly failover test"
  tier_2_essential:
    examples: "Customer dashboards, reporting, notifications"
    rto: "< 1 hour"
    rpo: "< 15 minutes"
    strategy: "Warm standby or pilot light"
    replication: "Asynchronous with short interval"
    testing: "Semi-annual failover test"
  tier_3_standard:
    examples: "Internal tools, analytics, batch processing"
    rto: "< 4 hours"
    rpo: "< 1 hour"
    strategy: "Pilot light or backup and restore"
    replication: "Periodic snapshots"
    testing: "Annual failover test"
  tier_4_non_essential:
    examples: "Development environments, documentation sites"
    rto: "< 24 hours"
    rpo: "< 24 hours"
    strategy: "Backup and restore"
    replication: "Daily backups"
    testing: "Annual backup restore verification"
```
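During a DR test, measured recovery times can be checked against the tier targets above mechanically. A minimal sketch (the function names are hypothetical helpers; the thresholds come directly from the tier table):

```bash
#!/usr/bin/env bash
# rto-check.sh - compare a measured recovery time against a tier's RTO target.
# Tier names and targets mirror the service_tier_targets table above.

rto_target_seconds() {  # usage: rto_target_seconds <tier_1|tier_2|tier_3|tier_4>
  case "$1" in
    tier_1) echo $((15 * 60)) ;;     # < 15 minutes
    tier_2) echo $((60 * 60)) ;;     # < 1 hour
    tier_3) echo $((4 * 3600)) ;;    # < 4 hours
    tier_4) echo $((24 * 3600)) ;;   # < 24 hours
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

check_rto() {  # usage: check_rto <tier> <measured_seconds>; prints PASS or FAIL
  local target measured="$2"
  target=$(rto_target_seconds "$1") || return 1
  if [ "$measured" -le "$target" ]; then
    echo "PASS: ${measured}s within ${target}s target"
  else
    echo "FAIL: ${measured}s exceeds ${target}s target"
  fi
}
```

For example, `check_rto tier_1 600` passes (10 minutes against a 15-minute target), while `check_rto tier_2 7200` fails (2 hours against a 1-hour target).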
## DR Strategies Comparison

```yaml
strategies:
  backup_and_restore:
    rto: "Hours"
    rpo: "Hours (depends on backup frequency)"
    cost: "$"
    description: "Regular backups stored in DR region. Restore from backup when needed."
    aws_services:
      - "S3 cross-region replication for backups"
      - "RDS automated snapshots copied to DR region"
      - "AMI copies in DR region"
      - "Terraform/CloudFormation for infrastructure rebuild"
    pros: "Lowest cost, simplest to maintain"
    cons: "Longest recovery time, highest data loss potential"
  pilot_light:
    rto: "Minutes to hours"
    rpo: "Minutes"
    cost: "$$"
    description: "Core infrastructure running in DR region (databases replicated). Scale up compute on failover."
    aws_services:
      - "RDS cross-region read replica (always running)"
      - "S3 cross-region replication"
      - "AMIs pre-built in DR region"
      - "Auto Scaling groups at zero/minimal capacity"
    pros: "Fast database recovery, moderate cost"
    cons: "Compute scale-up adds to recovery time"
  warm_standby:
    rto: "Minutes"
    rpo: "Seconds to minutes"
    cost: "$$$"
    description: "Scaled-down but functional environment in DR region. Scale up on failover."
    aws_services:
      - "RDS cross-region read replica"
      - "ECS/EKS running at reduced capacity"
      - "Route53 health checks for automated DNS failover"
      - "Global Accelerator for traffic management"
    pros: "Fast failover, reduced risk"
    cons: "Higher baseline cost for idle resources"
  multi_site_active:
    rto: "Near-zero"
    rpo: "Near-zero"
    cost: "$$$$"
    description: "Active-active across regions. Traffic served from both regions simultaneously."
    aws_services:
      - "DynamoDB Global Tables or Aurora Global Database"
      - "Route53 latency/weighted routing"
      - "CloudFront with multi-origin"
      - "Global Accelerator"
      - "ECS/EKS in both regions"
    pros: "Minimal downtime and data loss"
    cons: "Highest cost, most complex to operate"
```
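The RTO bands in the comparison above can drive an initial strategy recommendation. A sketch of that decision logic (the cut-off values are assumptions for illustration, not AWS guidance; validate against your own targets and budget):

```bash
#!/usr/bin/env bash
# pick_strategy - map a required RTO (in minutes) to the cheapest strategy
# from the comparison above. Cut-offs are illustrative assumptions.
pick_strategy() {  # usage: pick_strategy <required_rto_minutes>
  local rto_minutes="$1"
  if [ "$rto_minutes" -le 5 ]; then
    echo "multi_site_active"     # near-zero RTO
  elif [ "$rto_minutes" -le 30 ]; then
    echo "warm_standby"          # minutes
  elif [ "$rto_minutes" -le 240 ]; then
    echo "pilot_light"           # minutes to hours
  else
    echo "backup_and_restore"    # hours
  fi
}
```

Treat the output as a starting point: a tier 1 service with a strict RPO may still need synchronous replication regardless of which strategy the RTO alone suggests.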
## AWS Cross-Region DR Implementation

```bash
# === Database Replication ===

# Create cross-region RDS read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --db-instance-class db.r6g.large \
  --region us-west-2 \
  --kms-key-id arn:aws:kms:us-west-2:123456789012:alias/rds-dr-key \
  --multi-az \
  --tags Key=Purpose,Value=DR Key=Environment,Value=production

# Create Aurora Global Database for near-zero RPO
aws rds create-global-cluster \
  --global-cluster-identifier prod-global-db \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:prod-aurora-cluster \
  --region us-east-1

# Add secondary region to Aurora Global Database
aws rds create-db-cluster \
  --db-cluster-identifier prod-aurora-dr \
  --global-cluster-identifier prod-global-db \
  --engine aurora-postgresql \
  --region us-west-2 \
  --kms-key-id arn:aws:kms:us-west-2:123456789012:alias/aurora-dr-key

# === Storage Replication ===

# S3 cross-region replication
cat > /tmp/replication-config.json << 'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "ReplicateAll",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Destination": {
        "Bucket": "arn:aws:s3:::prod-data-dr-usw2",
        "StorageClass": "STANDARD",
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:123456789012:alias/s3-dr-key"
        }
      },
      "DeleteMarkerReplication": {"Status": "Enabled"}
    }
  ]
}
EOF

aws s3api put-bucket-replication \
  --bucket prod-data-use1 \
  --replication-configuration file:///tmp/replication-config.json

# === DNS Failover ===

# Route53 health check for primary region
aws route53 create-health-check --caller-reference "prod-health-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true
  }'

# Configure failover routing
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "primary",
          "Failover": "PRIMARY",
          "AliasTarget": {
            "HostedZoneId": "Z1234PRIMARY",
            "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          },
          "HealthCheckId": "health-check-id-primary"
        }
      },
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "secondary",
          "Failover": "SECONDARY",
          "AliasTarget": {
            "HostedZoneId": "Z5678SECONDARY",
            "DNSName": "dr-alb.us-west-2.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'
```
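Once replication is configured, the replica lag should be monitored against the RPO threshold. A sketch of such a check (the instance identifier, region, and 60-second threshold reuse the examples above; the `--live` flag is a hypothetical convention to keep the AWS call separate from the comparison logic):

```bash
#!/usr/bin/env bash
# replication-lag-check.sh - alert when RDS replica lag exceeds the RPO threshold.
# Instance name, region, and threshold are assumptions; adjust for your environment.

RPO_THRESHOLD_SECONDS="${RPO_THRESHOLD_SECONDS:-60}"   # tier 1 RPO: < 1 minute

evaluate_lag() {  # usage: evaluate_lag <lag_seconds>; prints OK or BREACH
  if [ "$1" -gt "$RPO_THRESHOLD_SECONDS" ]; then
    echo "BREACH: replica lag ${1}s exceeds RPO threshold ${RPO_THRESHOLD_SECONDS}s"
  else
    echo "OK: replica lag ${1}s within RPO threshold ${RPO_THRESHOLD_SECONDS}s"
  fi
}

if [ "${1:-}" = "--live" ]; then
  # Fetch the most recent ReplicaLag datapoint from CloudWatch (AWS/RDS namespace).
  # 'date -d' is GNU date syntax; use gdate on macOS.
  LAG=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=prod-db-dr-replica \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Maximum \
    --region us-west-2 \
    --query 'Datapoints | sort_by(@, &Timestamp)[-1].Maximum' \
    --output text)
  evaluate_lag "${LAG%.*}"   # drop the decimal part for integer comparison
fi
```

In production this comparison is usually better expressed as a CloudWatch alarm on `ReplicaLag`, but a script like this is useful for ad-hoc verification during DR tests.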
## Failover Script

```bash
#!/usr/bin/env bash
# dr-failover.sh - Execute disaster recovery failover to DR region
set -euo pipefail

DR_REGION="us-west-2"
PRIMARY_REGION="us-east-1"
SLACK_WEBHOOK="${DR_SLACK_WEBHOOK}"
LOG_FILE="/var/log/dr-failover-$(date +%Y%m%d-%H%M%S).log"

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $1" | tee -a "$LOG_FILE"
}

notify() {
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"DR FAILOVER: $1\"}" > /dev/null
}

log "=== DR Failover Initiated ==="
notify "DR failover initiated to $DR_REGION"

# Step 1: Promote RDS read replica
log "Step 1: Promoting RDS read replica in $DR_REGION"
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region "$DR_REGION"

log "Waiting for RDS promotion to complete..."
aws rds wait db-instance-available \
  --db-instance-identifier prod-db-dr-replica \
  --region "$DR_REGION"
log "RDS promotion complete"
notify "RDS read replica promoted to primary in $DR_REGION"

# Step 2: Scale up application in DR region
log "Step 2: Scaling up application in $DR_REGION"
aws ecs update-service \
  --cluster prod-cluster-dr \
  --service api-service \
  --desired-count 4 \
  --region "$DR_REGION"

log "Waiting for ECS service to stabilize..."
aws ecs wait services-stable \
  --cluster prod-cluster-dr \
  --services api-service \
  --region "$DR_REGION"
log "ECS service scaled up and stable"
notify "Application scaled up in $DR_REGION"

# Step 3: Verify health
log "Step 3: Verifying health in $DR_REGION"
for i in $(seq 1 10); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "https://dr-alb.us-west-2.elb.amazonaws.com/health")
  if [ "$STATUS" = "200" ]; then
    log "Health check passed (attempt $i)"
    break
  fi
  log "Health check failed (attempt $i, status $STATUS), retrying..."
  sleep 10
done

if [ "$STATUS" != "200" ]; then
  log "ERROR: Health check failed after 10 attempts"
  notify "ALERT: DR health check failing - manual intervention required"
  exit 1
fi

# Step 4: Update DNS (if not using automatic Route53 failover)
log "Step 4: DNS failover (Route53 automatic failover should handle this)"
log "Verifying DNS resolution..."
DR_IP=$(dig +short api.example.com)
log "api.example.com resolves to: $DR_IP"

# Step 5: Verify end-to-end
log "Step 5: End-to-end verification"
RESPONSE=$(curl -s "https://api.example.com/health")
log "Health response: $RESPONSE"

log "=== DR Failover Complete ==="
notify "DR failover to $DR_REGION complete. Service restored."

# Generate failover report
cat > "/var/log/dr-failover-report-$(date +%Y%m%d).md" << EOF
# DR Failover Report

- **Date:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
- **Primary Region:** $PRIMARY_REGION
- **DR Region:** $DR_REGION
- **RTO Actual:** Calculate from incident declaration
- **RPO Actual:** Check replication lag at time of incident
- **Status:** Operational in DR region
- **Actions Required:**
  - [ ] Monitor error rates and latency
  - [ ] Plan failback when primary region is restored
  - [ ] Conduct post-incident review
EOF
```
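The report template leaves "RTO Actual" as a manual step. A small helper can compute it from the incident-declaration and service-restoration timestamps (a sketch; assumes GNU `date`, so use `gdate` on macOS):

```bash
#!/usr/bin/env bash
# rto-report.sh - compute actual RTO in minutes from two UTC ISO 8601 timestamps.
# Hypothetical helper for filling in the "RTO Actual" line of the failover report.

elapsed_minutes() {  # usage: elapsed_minutes <declared_at> <restored_at>
  local start end
  start=$(date -u -d "$1" +%s)   # GNU date parses ISO 8601 with trailing Z
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}
```

For example, an incident declared at `2024-01-01T00:00:00Z` and restored at `2024-01-01T00:12:30Z` yields an actual RTO of 12 minutes, well inside a 15-minute tier 1 target.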
## DR Testing Procedures

```yaml
dr_test_types:
  tabletop_exercise:
    frequency: Quarterly
    duration: "1-2 hours"
    participants: "Engineering, SRE, management, communications"
    process:
      - Present a disaster scenario (region outage, data corruption, etc.)
      - Walk through the response step by step
      - Identify gaps in runbooks and communication plans
      - Document action items
    output: "Tabletop exercise report with findings and action items"
  component_failover:
    frequency: Monthly
    duration: "1-4 hours"
    scope: "Individual component failover (database, single service)"
    process:
      - Select component for testing
      - Execute failover procedure from runbook
      - Measure actual RTO and RPO
      - Execute failback procedure
      - Document results
    output: "Component test report with measured RTO/RPO"
  full_failover:
    frequency: Annually
    duration: "4-8 hours (scheduled maintenance window)"
    scope: "Complete regional failover of all tier 1 and tier 2 services"
    process:
      1_preparation:
        - Schedule maintenance window and notify stakeholders
        - Verify DR environment is healthy
        - Brief all participating teams
        - Set up war room communication channel
      2_execute:
        - Simulate primary region failure
        - Execute failover runbooks for all services
        - Record timestamps at each milestone
      3_verify:
        - Run end-to-end test suite against DR environment
        - Verify data consistency
        - Check monitoring and alerting in DR region
        - Confirm external integrations work
      4_failback:
        - Restore primary region
        - Re-establish replication
        - Execute failback to primary
        - Verify data consistency post-failback
      5_report:
        - Document actual RTO and RPO for each service
        - Compare against targets
        - List all issues encountered
        - Create action items for improvements
    output: "Full DR test report with measured vs. target metrics"

dr_test_checklist:
  before_test:
    - [ ] Test plan documented and approved
    - [ ] Maintenance window scheduled and communicated
    - [ ] All DR runbooks reviewed and updated
    - [ ] DR environment health verified
    - [ ] Monitoring configured in DR region
    - [ ] Communication channel established
    - [ ] Rollback plan confirmed
  during_test:
    - [ ] Timestamps recorded for each step
    - [ ] Screenshots captured for evidence
    - [ ] Issues logged in real-time
    - [ ] Data consistency verified
    - [ ] External integrations tested
    - [ ] Health checks passing in DR
  after_test:
    - [ ] Failback completed successfully
    - [ ] Primary region replication re-established
    - [ ] Data consistency verified post-failback
    - [ ] Test report written with metrics
    - [ ] Action items created and assigned
    - [ ] Runbooks updated based on findings
    - [ ] Results presented to management
```
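The "record timestamps at each milestone" step benefits from a consistent format, so measured RTO per phase can be derived afterwards. A minimal sketch (function names and the log path are hypothetical):

```bash
#!/usr/bin/env bash
# dr-test-milestones.sh - record milestone timestamps during a failover test
# and report elapsed time from the first milestone. Log path is an assumption.

MILESTONE_LOG="${MILESTONE_LOG:-/tmp/dr-test-milestones.log}"

record_milestone() {  # usage: record_milestone <name>; appends "epoch name"
  echo "$(date -u +%s) $1" >> "$MILESTONE_LOG"
}

milestone_report() {  # prints "<name> +<seconds>s" relative to the first entry
  local first ts name
  first=$(head -n1 "$MILESTONE_LOG" | cut -d' ' -f1)
  while read -r ts name; do
    echo "$name +$((ts - first))s"
  done < "$MILESTONE_LOG"
}
```

During a test you would call `record_milestone failover_start`, `record_milestone db_promoted`, and so on at each runbook step; `milestone_report` then gives the per-phase timings to copy into the test report.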
## Terraform DR Infrastructure

```hcl
# DR region infrastructure
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

resource "aws_db_instance" "dr_replica" {
  provider            = aws.dr
  identifier          = "prod-db-dr-replica"
  replicate_source_db = aws_db_instance.primary.arn
  instance_class      = "db.r6g.large"
  storage_encrypted   = true
  kms_key_id          = aws_kms_key.dr_rds.arn
  multi_az            = true
  deletion_protection = true
  skip_final_snapshot = false

  tags = {
    Purpose     = "DR"
    Environment = "production"
  }
}

resource "aws_route53_health_check" "primary" {
  fqdn              = "primary-alb.us-east-1.elb.amazonaws.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
  enable_sni        = true

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}
```
## DR Compliance Checklist

```yaml
dr_compliance_checklist:
  planning:
    - [ ] RTO and RPO targets defined per service tier
    - [ ] DR strategy selected based on targets and budget
    - [ ] DR architecture documented with diagrams
    - [ ] Failover and failback runbooks written
    - [ ] Communication plan for DR events documented
    - [ ] DR roles and responsibilities assigned
  implementation:
    - [ ] Cross-region database replication configured
    - [ ] Storage replication configured (S3, EBS snapshots)
    - [ ] DNS failover routing configured
    - [ ] DR region infrastructure provisioned (IaC)
    - [ ] Monitoring and alerting configured in DR region
    - [ ] Secrets and credentials available in DR region
  testing:
    - [ ] Tabletop exercises conducted quarterly
    - [ ] Component failover tests conducted monthly
    - [ ] Full failover test conducted annually
    - [ ] Actual RTO/RPO measured and compared to targets
    - [ ] Test results documented and reviewed
    - [ ] Runbooks updated based on test findings
  operational:
    - [ ] Replication lag monitored with alerting
    - [ ] DR environment health checked regularly
    - [ ] Backup integrity verified monthly
    - [ ] DR runbooks reviewed and updated quarterly
    - [ ] DR test evidence archived for compliance audits
```
## Best Practices
- Define RTO and RPO targets based on business impact analysis, not technical convenience
- Choose the DR strategy that matches your targets and budget: do not over-engineer or under-invest
- Automate failover as much as possible to reduce human error and recovery time
- Test DR procedures regularly at increasing levels of complexity (tabletop, component, full)
- Measure actual RTO and RPO during tests and compare against targets every time
- Include failback procedures in your DR plan: getting back to normal is as important as failing over
- Monitor replication lag continuously and alert when it exceeds RPO thresholds
- Keep DR infrastructure managed by the same IaC as production to prevent configuration drift
- Practice DR in non-emergency conditions so the team is prepared when a real disaster occurs
- Archive DR test results as compliance evidence for SOC 2, HIPAA, and other frameworks