Disaster Recovery

Implement disaster recovery strategies including RTO/RPO planning, AWS cross-region failover patterns, DR testing procedures, and automated failover scripts.

When to Use

  • Defining RTO and RPO targets for critical systems
  • Designing multi-region or multi-cloud disaster recovery architectures
  • Implementing automated failover and failback procedures
  • Conducting DR tests (tabletop, component, full failover)
  • Meeting compliance requirements for contingency planning (SOC 2, HIPAA, FedRAMP, ISO 27001)

RTO/RPO Planning

recovery_metrics:
  RTO:
    definition: "Recovery Time Objective - maximum acceptable downtime"
    measurement: "From incident declaration to service restoration"
    factors:
      - Failover automation maturity
      - Data replication lag
      - DNS propagation time
      - Application warm-up time
      - Verification procedures

  RPO:
    definition: "Recovery Point Objective - maximum acceptable data loss"
    measurement: "Time between the last recoverable data point (backup or replicated write) and the incident"
    factors:
      - Backup frequency
      - Replication method (sync vs. async)
      - Transaction log shipping interval
      - Cross-region replication lag

service_tier_targets:
  tier_1_critical:
    examples: "Authentication, payment processing, core API"
    rto: "< 15 minutes"
    rpo: "< 1 minute (near-zero)"
    strategy: "Multi-site active-active or warm standby"
    replication: "Synchronous or near-synchronous"
    testing: "Quarterly failover test"

  tier_2_essential:
    examples: "Customer dashboards, reporting, notifications"
    rto: "< 1 hour"
    rpo: "< 15 minutes"
    strategy: "Warm standby or pilot light"
    replication: "Asynchronous with short interval"
    testing: "Semi-annual failover test"

  tier_3_standard:
    examples: "Internal tools, analytics, batch processing"
    rto: "< 4 hours"
    rpo: "< 1 hour"
    strategy: "Pilot light or backup and restore"
    replication: "Periodic snapshots"
    testing: "Annual failover test"

  tier_4_non_essential:
    examples: "Development environments, documentation sites"
    rto: "< 24 hours"
    rpo: "< 24 hours"
    strategy: "Backup and restore"
    replication: "Daily backups"
    testing: "Annual backup restore verification"
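
The RTO factors above add up to an end-to-end recovery time, which can then be checked against the tier targets. The following is a minimal Bash sketch of that arithmetic. The function names `rto_budget` and `tier_met` and the sample phase durations are illustrative assumptions, not part of any AWS tooling; the thresholds are the tier targets from the table, converted to seconds.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: names and sample durations are assumptions for illustration.

# Sum the estimated duration (in seconds) of each recovery phase.
rto_budget() {
  local total=0 phase
  for phase in "$@"; do
    total=$(( total + phase ))
  done
  echo "$total"
}

# Return the most demanding tier whose RTO and RPO targets (in seconds)
# are both satisfied by the given values.
tier_met() {
  local rto=$1 rpo=$2
  if   (( rto < 900   && rpo < 60 ));    then echo "tier_1_critical"
  elif (( rto < 3600  && rpo < 900 ));   then echo "tier_2_essential"
  elif (( rto < 14400 && rpo < 3600 ));  then echo "tier_3_standard"
  elif (( rto < 86400 && rpo < 86400 )); then echo "tier_4_non_essential"
  else echo "none"
  fi
}

# Assumed estimates: 300s database promotion, 180s compute scale-up,
# 60s DNS propagation, 60s application warm-up, 120s verification.
estimated_rto=$(rto_budget 300 180 60 60 120)
echo "Estimated RTO: ${estimated_rto}s -> $(tier_met "$estimated_rto" 30)"
```

With these assumed numbers the estimate is 720 seconds, which fits the tier 1 target only if replication lag also stays under one minute.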

DR Strategies Comparison

strategies:
  backup_and_restore:
    rto: "Hours"
    rpo: "Hours (depends on backup frequency)"
    cost: "$"
    description: "Regular backups stored in DR region. Restore from backup when needed."
    aws_services:
      - "S3 cross-region replication for backups"
      - "RDS automated snapshots copied to DR region"
      - "AMI copies in DR region"
      - "Terraform/CloudFormation for infrastructure rebuild"
    pros: "Lowest cost, simplest to maintain"
    cons: "Longest recovery time, highest data loss potential"

  pilot_light:
    rto: "Minutes to hours"
    rpo: "Minutes"
    cost: "$$"
    description: "Core infrastructure running in DR region (databases replicated). Scale up compute on failover."
    aws_services:
      - "RDS cross-region read replica (always running)"
      - "S3 cross-region replication"
      - "AMIs pre-built in DR region"
      - "Auto Scaling groups at zero/minimal capacity"
    pros: "Fast database recovery, moderate cost"
    cons: "Compute scale-up adds to recovery time"

  warm_standby:
    rto: "Minutes"
    rpo: "Seconds to minutes"
    cost: "$$$"
    description: "Scaled-down but functional environment in DR region. Scale up on failover."
    aws_services:
      - "RDS cross-region read replica"
      - "ECS/EKS running at reduced capacity"
      - "Route53 health checks for automated DNS failover"
      - "Global Accelerator for traffic management"
    pros: "Fast failover, reduced risk"
    cons: "Higher baseline cost for idle resources"

  multi_site_active:
    rto: "Near-zero"
    rpo: "Near-zero"
    cost: "$$$$"
    description: "Active-active across regions. Traffic served from both regions simultaneously."
    aws_services:
      - "DynamoDB Global Tables (multi-region writes) or Aurora Global Database (single writer region, readable secondaries)"
      - "Route53 latency/weighted routing"
      - "CloudFront with multi-origin"
      - "Global Accelerator"
      - "ECS/EKS in both regions"
    pros: "Minimal downtime and data loss"
    cons: "Highest cost, most complex to operate"

AWS Cross-Region DR Implementation

# === Database Replication ===

# Create cross-region RDS read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --db-instance-class db.r6g.large \
  --region us-west-2 \
  --kms-key-id arn:aws:kms:us-west-2:123456789012:alias/rds-dr-key \
  --multi-az \
  --tags Key=Purpose,Value=DR Key=Environment,Value=production

# Create Aurora Global Database for near-zero RPO
aws rds create-global-cluster \
  --global-cluster-identifier prod-global-db \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:prod-aurora-cluster \
  --region us-east-1

# Add secondary region to Aurora Global Database
aws rds create-db-cluster \
  --db-cluster-identifier prod-aurora-dr \
  --global-cluster-identifier prod-global-db \
  --engine aurora-postgresql \
  --region us-west-2 \
  --kms-key-id arn:aws:kms:us-west-2:123456789012:alias/aurora-dr-key

# === Storage Replication ===

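# S3 cross-region replication
# Versioning must be enabled on both the source and destination buckets.
# A rule that uses "Filter" also requires "Priority"; replicating SSE-KMS
# objects requires "SourceSelectionCriteria".
cat > /tmp/replication-config.json << 'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "ReplicateAll",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": ""},
      "SourceSelectionCriteria": {
        "SseKmsEncryptedObjects": {"Status": "Enabled"}
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::prod-data-dr-usw2",
        "StorageClass": "STANDARD",
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:123456789012:alias/s3-dr-key"
        }
      },
      "DeleteMarkerReplication": {"Status": "Enabled"}
    }
  ]
}
EOF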

aws s3api put-bucket-replication \
  --bucket prod-data-use1 \
  --replication-configuration file:///tmp/replication-config.json

# === DNS Failover ===

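# Route53 health check for primary region
# Target the primary endpoint directly. Checking api.example.com would follow
# the failover record to the DR region and report the primary as healthy.
aws route53 create-health-check --caller-reference "prod-health-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "primary-alb.us-east-1.elb.amazonaws.com",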
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true
  }'

# Configure failover routing
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "primary",
          "Failover": "PRIMARY",
          "AliasTarget": {
            "HostedZoneId": "Z1234PRIMARY",
            "DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          },
          "HealthCheckId": "health-check-id-primary"
        }
      },
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "secondary",
          "Failover": "SECONDARY",
          "AliasTarget": {
            "HostedZoneId": "Z5678SECONDARY",
            "DNSName": "dr-alb.us-west-2.elb.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'
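
The health check settings above determine how quickly DNS failover can start. As a rough sketch, the time until clients reach the DR region is approximately the request interval multiplied by the failure threshold, plus the time cached DNS answers take to expire. The function name `dns_failover_seconds` is hypothetical, and the 60-second TTL is an assumption to be replaced with the TTL your records actually use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate upper bound on DNS failover time in seconds.
# Arguments: health check request interval, failure threshold, record TTL.
dns_failover_seconds() {
  local interval=$1 threshold=$2 ttl=$3
  echo $(( interval * threshold + ttl ))
}

# RequestInterval=10 and FailureThreshold=3 as configured above; TTL of 60s is assumed.
echo "Approximate DNS failover time: $(dns_failover_seconds 10 3 60)s"
```

This is an estimate only. Health checkers run from several locations and resolvers do not always honor TTLs, so measure the real value during a DR test.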

Failover Script

#!/usr/bin/env bash
# dr-failover.sh - Execute disaster recovery failover to DR region
set -euo pipefail

DR_REGION="us-west-2"
PRIMARY_REGION="us-east-1"
SLACK_WEBHOOK="${DR_SLACK_WEBHOOK:?DR_SLACK_WEBHOOK must be set}"
LOG_FILE="/var/log/dr-failover-$(date +%Y%m%d-%H%M%S).log"

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $1" | tee -a "$LOG_FILE"
}

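notify() {
  # A failed notification must not abort the failover under `set -e`.
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"DR FAILOVER: $1\"}" > /dev/null \
    || log "WARNING: Slack notification failed"
}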

log "=== DR Failover Initiated ==="
notify "DR failover initiated to $DR_REGION"

# Step 1: Promote RDS read replica
log "Step 1: Promoting RDS read replica in $DR_REGION"
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region "$DR_REGION"
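# The instance can still report "available" briefly after the promote call,
# so pause before waiting to keep the waiter from returning early.
sleep 30
log "Waiting for RDS promotion to complete..."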
aws rds wait db-instance-available \
  --db-instance-identifier prod-db-dr-replica \
  --region "$DR_REGION"
log "RDS promotion complete"
notify "RDS read replica promoted to primary in $DR_REGION"

# Step 2: Scale up application in DR region
log "Step 2: Scaling up application in $DR_REGION"
aws ecs update-service \
  --cluster prod-cluster-dr \
  --service api-service \
  --desired-count 4 \
  --region "$DR_REGION"
log "Waiting for ECS service to stabilize..."
aws ecs wait services-stable \
  --cluster prod-cluster-dr \
  --services api-service \
  --region "$DR_REGION"
log "ECS service scaled up and stable"
notify "Application scaled up in $DR_REGION"

# Step 3: Verify health
log "Step 3: Verifying health in $DR_REGION"
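STATUS="000"
for i in $(seq 1 10); do
  # Connect to the DR ALB while requesting api.example.com so the TLS certificate
  # matches (assumes the ALB certificate covers api.example.com). The `|| true`
  # keeps `set -e` from aborting the script when an attempt fails.
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    --connect-to "api.example.com:443:dr-alb.us-west-2.elb.amazonaws.com:443" \
    "https://api.example.com/health" || true)
  if [ "$STATUS" = "200" ]; then
    log "Health check passed (attempt $i)"
    break
  fi
  log "Health check failed (attempt $i, status $STATUS), retrying..."
  sleep 10
done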

if [ "$STATUS" != "200" ]; then
  log "ERROR: Health check failed after 10 attempts"
  notify "ALERT: DR health check failing - manual intervention required"
  exit 1
fi

# Step 4: Confirm DNS failover (Route53 health-check failover is expected to switch the record automatically)
log "Step 4: Confirming DNS failover (handled by Route53 health-check failover)"
log "Verifying DNS resolution..."
DR_IP=$(dig +short api.example.com)
log "api.example.com resolves to: $DR_IP"

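# Step 5: Verify end-to-end
log "Step 5: End-to-end verification"
if ! RESPONSE=$(curl -sf --max-time 10 "https://api.example.com/health"); then
  log "ERROR: End-to-end health check failed"
  notify "ALERT: End-to-end check failing after failover - manual intervention required"
  exit 1
fi
log "Health response: $RESPONSE"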

log "=== DR Failover Complete ==="
notify "DR failover to $DR_REGION complete. Service restored."

# Generate failover report
cat > "/var/log/dr-failover-report-$(date +%Y%m%d).md" << EOF
# DR Failover Report
- **Date:** $(date -u +%Y-%m-%dT%H:%M:%SZ)
- **Primary Region:** $PRIMARY_REGION
- **DR Region:** $DR_REGION
- **RTO Actual:** Calculate from incident declaration
- **RPO Actual:** Check replication lag at time of incident
- **Status:** Operational in DR region
- **Actions Required:**
  - [ ] Monitor error rates and latency
  - [ ] Plan failback when primary region is restored
  - [ ] Conduct post-incident review
EOF
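
The report leaves "RTO Actual" to be calculated by hand. One way to fill it in is to record the incident declaration time as epoch seconds and subtract it from the restoration time. The sketch below assumes that approach; `format_duration`, `actual_rto`, and the `INCIDENT_DECLARED_AT` variable mentioned afterwards are hypothetical names, not part of the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for computing the measured RTO from two epoch timestamps.

# Format a duration in seconds as "1h 02m 05s".
format_duration() {
  local s=$1
  printf '%dh %02dm %02ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

# Actual RTO = restoration time minus incident declaration time.
actual_rto() {
  local declared=$1 restored=$2
  if (( restored < declared )); then
    echo "error: restoration time precedes declaration time" >&2
    return 1
  fi
  format_duration $(( restored - declared ))
}

# Example with fixed sample timestamps 754 seconds apart.
actual_rto 1700000000 1700000754
```

In the failover script this could be called as `actual_rto "$INCIDENT_DECLARED_AT" "$(date +%s)"`, with the declaration time exported by whoever declares the incident.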

DR Testing Procedures

dr_test_types:
  tabletop_exercise:
    frequency: Quarterly
    duration: "1-2 hours"
    participants: "Engineering, SRE, management, communications"
    process:
      - Present a disaster scenario (region outage, data corruption, etc.)
      - Walk through the response step by step
      - Identify gaps in runbooks and communication plans
      - Document action items
    output: "Tabletop exercise report with findings and action items"

  component_failover:
    frequency: Monthly
    duration: "1-4 hours"
    scope: "Individual component failover (database, single service)"
    process:
      - Select component for testing
      - Execute failover procedure from runbook
      - Measure actual RTO and RPO
      - Execute failback procedure
      - Document results
    output: "Component test report with measured RTO/RPO"

  full_failover:
    frequency: Annually
    duration: "4-8 hours (scheduled maintenance window)"
    scope: "Complete regional failover of all tier 1 and tier 2 services"
    process:
      1_preparation:
        - Schedule maintenance window and notify stakeholders
        - Verify DR environment is healthy
        - Brief all participating teams
        - Set up war room communication channel
      2_execute:
        - Simulate primary region failure
        - Execute failover runbooks for all services
        - Record timestamps at each milestone
      3_verify:
        - Run end-to-end test suite against DR environment
        - Verify data consistency
        - Check monitoring and alerting in DR region
        - Confirm external integrations work
      4_failback:
        - Restore primary region
        - Re-establish replication
        - Execute failback to primary
        - Verify data consistency post-failback
      5_report:
        - Document actual RTO and RPO for each service
        - Compare against targets
        - List all issues encountered
        - Create action items for improvements
    output: "Full DR test report with measured vs. target metrics"

dr_test_checklist:
  before_test:
    - [ ] Test plan documented and approved
    - [ ] Maintenance window scheduled and communicated
    - [ ] All DR runbooks reviewed and updated
    - [ ] DR environment health verified
    - [ ] Monitoring configured in DR region
    - [ ] Communication channel established
    - [ ] Rollback plan confirmed

  during_test:
    - [ ] Timestamps recorded for each step
    - [ ] Screenshots captured for evidence
    - [ ] Issues logged in real-time
    - [ ] Data consistency verified
    - [ ] External integrations tested
    - [ ] Health checks passing in DR

  after_test:
    - [ ] Failback completed successfully
    - [ ] Primary region replication re-established
    - [ ] Data consistency verified post-failback
    - [ ] Test report written with metrics
    - [ ] Action items created and assigned
    - [ ] Runbooks updated based on findings
    - [ ] Results presented to management
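
Comparing measured RTO and RPO against targets is easy to script once the test timestamps are collected. The sketch below assumes one line per service with measured and target values in seconds; the function name `evaluate_service`, the input layout, and the sample numbers are all illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report PASS when both measured values are strictly
# below their targets (targets are written as "< N" in the tier table).
evaluate_service() {
  local name=$1 rto=$2 rto_target=$3 rpo=$4 rpo_target=$5
  local result="PASS"
  if (( rto >= rto_target || rpo >= rpo_target )); then
    result="FAIL"
  fi
  printf '%s rto=%ss/%ss rpo=%ss/%ss %s\n' \
    "$name" "$rto" "$rto_target" "$rpo" "$rpo_target" "$result"
}

# Sample input: name, measured RTO, target RTO, measured RPO, target RPO.
while read -r name rto rto_target rpo rpo_target; do
  evaluate_service "$name" "$rto" "$rto_target" "$rpo" "$rpo_target"
done <<'EOF'
core-api 720 900 20 60
reporting 4100 3600 300 900
EOF
```

A FAIL line points at a service whose runbook or replication setup needs an action item in the test report.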

Terraform DR Infrastructure

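# DR region infrastructure
# Assumes aws_db_instance.primary, aws_kms_key.dr_rds, aws_lb.primary, aws_lb.dr,
# and aws_route53_zone.main are defined elsewhere in the configuration.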
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

resource "aws_db_instance" "dr_replica" {
  provider               = aws.dr
  identifier             = "prod-db-dr-replica"
  replicate_source_db    = aws_db_instance.primary.arn
  instance_class         = "db.r6g.large"
  storage_encrypted      = true
  kms_key_id             = aws_kms_key.dr_rds.arn
  multi_az               = true
  deletion_protection    = true
  skip_final_snapshot    = false

  tags = {
    Purpose     = "DR"
    Environment = "production"
  }
}

resource "aws_route53_health_check" "primary" {
  fqdn               = "primary-alb.us-east-1.elb.amazonaws.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/health"
  failure_threshold  = 3
  request_interval   = 10
  enable_sni         = true

  tags = {
    Name = "primary-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}
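
Keeping the DR configuration in Terraform also makes it possible to check that the deployed DR environment still matches the code. `terraform plan -detailed-exitcode` exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The wrapper below is a sketch under that assumption; `plan_status` and `check_drift` are hypothetical names, and a pending change can mean either real drift or configuration that has not been applied yet.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a scheduled DR drift check.

# Translate the exit code of `terraform plan -detailed-exitcode` into a status.
plan_status() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "changes-pending" ;;
    *) echo "plan-error" ;;
  esac
}

# Run a plan in the given directory and print its status.
# Defined only; it needs Terraform and AWS credentials to run.
check_drift() {
  local dir=$1 code=0
  terraform -chdir="$dir" plan -detailed-exitcode -input=false > /dev/null || code=$?
  plan_status "$code"
}

# Example of interpreting an exit code without invoking Terraform.
plan_status 2
```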

DR Compliance Checklist

dr_compliance_checklist:
  planning:
    - [ ] RTO and RPO targets defined per service tier
    - [ ] DR strategy selected based on targets and budget
    - [ ] DR architecture documented with diagrams
    - [ ] Failover and failback runbooks written
    - [ ] Communication plan for DR events documented
    - [ ] DR roles and responsibilities assigned

  implementation:
    - [ ] Cross-region database replication configured
    - [ ] Storage replication configured (S3, EBS snapshots)
    - [ ] DNS failover routing configured
    - [ ] DR region infrastructure provisioned (IaC)
    - [ ] Monitoring and alerting configured in DR region
    - [ ] Secrets and credentials available in DR region

  testing:
    - [ ] Tabletop exercises conducted quarterly
    - [ ] Component failover tests conducted monthly
    - [ ] Full failover test conducted annually
    - [ ] Actual RTO/RPO measured and compared to targets
    - [ ] Test results documented and reviewed
    - [ ] Runbooks updated based on test findings

  operational:
    - [ ] Replication lag monitored with alerting
    - [ ] DR environment health checked regularly
    - [ ] Backup integrity verified monthly
    - [ ] DR runbooks reviewed and updated quarterly
    - [ ] DR test evidence archived for compliance audits
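
If the checklist is kept in a file and items are ticked as `- [x]`, progress can be summarized for audit evidence with a short script. This is a sketch that assumes that file convention; `checklist_progress` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count checked and open items in a checklist file.
checklist_progress() {
  local file=$1 open checked
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  open=$(grep -cF -- '- [ ]' "$file" || true)
  checked=$(grep -ciF -- '- [x]' "$file" || true)
  echo "checked=$checked open=$open"
}

# Example with a small sample file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
planning:
  - [x] RTO and RPO targets defined per service tier
  - [x] DR strategy selected based on targets and budget
  - [ ] DR architecture documented with diagrams
EOF
checklist_progress "$tmp"
rm -f "$tmp"
```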

Best Practices

  • Define RTO and RPO targets based on business impact analysis, not technical convenience
  • Choose the DR strategy that matches your targets and budget: do not over-engineer or under-invest
  • Automate failover as much as possible to reduce human error and recovery time
  • Test DR procedures regularly at increasing levels of complexity (tabletop, component, full)
  • Measure actual RTO and RPO during tests and compare against targets every time
  • Include failback procedures in your DR plan: getting back to normal is as important as failing over
  • Monitor replication lag continuously and alert when it exceeds RPO thresholds
  • Keep DR infrastructure managed by the same IaC as production to prevent configuration drift
  • Practice DR in non-emergency conditions so the team is prepared when a real disaster occurs
  • Archive DR test results as compliance evidence for SOC 2, HIPAA, and other frameworks
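
As an illustration of the replication lag practice above, the check below compares an observed lag against the RPO target. It is a sketch only: `lag_status` is a hypothetical name, the 80 percent warning level is an assumption, and the lag value is expected to come from your monitoring system (for RDS read replicas, the CloudWatch `ReplicaLag` metric reports it in seconds).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify replication lag against an RPO target.
# Arguments: lag in whole seconds, RPO in whole seconds, optional warning percentage.
lag_status() {
  local lag=$1 rpo=$2 warn_pct=${3:-80}
  if (( lag >= rpo )); then
    echo "CRITICAL"
  elif (( lag * 100 >= rpo * warn_pct )); then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Examples against a 60-second RPO target.
lag_status 12 60
lag_status 50 60
lag_status 75 60
```

The three example calls print OK, WARN, and CRITICAL in that order.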