Multi-Region Deployment

Comprehensive guide to deploying applications across multiple geographic regions for availability, performance, and disaster recovery.

When to Use This Skill

  • Designing globally distributed applications
  • Implementing disaster recovery (DR)
  • Reducing latency for global users
  • Meeting data residency requirements
  • Achieving high availability (99.99%+)
  • Planning failover strategies

Multi-Region Fundamentals

Why Multi-Region?

Reasons for Multi-Region:

1. High Availability
   └── Survive region-wide failures
   └── Natural disasters, power outages
   └── Target: 99.99%+ uptime

2. Low Latency
   └── Serve users from nearest region
   └── Reduce round-trip time
   └── Better user experience

3. Data Residency
   └── GDPR, data sovereignty laws
   └── Keep data in specific countries
   └── Compliance requirements

4. Disaster Recovery
   └── Business continuity
   └── RTO/RPO requirements
   └── Regulatory requirements

Trade-offs:
+ Higher availability
+ Lower latency globally
+ Compliance capability
- Higher cost (2x-3x or more)
- Increased complexity
- Data consistency challenges

Deployment Models

Model 1: Active-Passive (DR)
┌───────────────────┐         ┌───────────────────┐
│ PRIMARY (Active)  │         │SECONDARY (Passive)│
│ ┌──────────────┐  │         │ ┌──────────────┐  │
│ │     App      │  │   ──►   │ │     App      │  │
│ │    (Live)    │  │  Sync   │ │  (Standby)   │  │
│ └──────────────┘  │         │ └──────────────┘  │
│ ┌──────────────┐  │         │ ┌──────────────┐  │
│ │      DB      │  │   ──►   │ │      DB      │  │
│ │  (Primary)   │  │  Replic │ │  (Replica)   │  │
│ └──────────────┘  │         │ └──────────────┘  │
└───────────────────┘         └───────────────────┘
     All traffic                  Failover only

Model 2: Active-Active (Load Distributed)
┌─────────────────┐          ┌─────────────────┐
│    REGION A     │          │    REGION B     │
│ ┌─────────────┐ │   ◄──►   │ ┌─────────────┐ │
│ │     App     │ │  Users   │ │     App     │ │
│ │  (Active)   │ │  routed  │ │  (Active)   │ │
│ └─────────────┘ │    by    │ └─────────────┘ │
│ ┌─────────────┐ │ location │ ┌─────────────┐ │
│ │     DB      │ │   ◄──►   │ │     DB      │ │
│ │  (Primary)  │ │  Replic  │ │  (Primary)  │ │
│ └─────────────┘ │ both ways│ └─────────────┘ │
└─────────────────┘          └─────────────────┘
  Serves Region A              Serves Region B

Model 3: Active-Active-Active (Global)
┌──────┐    ┌──────┐    ┌──────┐
│  US  │◄──►│  EU  │◄──►│ APAC │
│Active│    │Active│    │Active│
└──┬───┘    └──┬───┘    └──┬───┘
   │           │           │
   └───────────┼───────────┘
         Global Load Balancer
         routes by location

Region Selection

Selection Criteria

Region Selection Factors:

1. User Location
   □ Where are your users?
   □ Latency requirements per region?
   □ User concentration (80/20 rule)?

2. Compliance Requirements
   □ Data residency laws (GDPR, etc.)
   □ Government regulations
   □ Industry requirements (HIPAA, PCI)

3. Cloud Provider Availability
   □ Not all services in all regions
   □ Service feature parity
   □ Regional pricing differences

4. Network Connectivity
   □ Internet exchange points
   □ Direct connect options
   □ Cross-region latency

5. Disaster Risk
   □ Natural disaster patterns
   □ Political stability
   □ Infrastructure reliability

6. Cost
   □ Compute/storage pricing varies
   □ Data transfer costs (egress)
   □ Support availability

Common Region Pairs

Region Pair Strategy:

Americas:
- Primary: US East (N. Virginia)
- Secondary: US West (Oregon) or US East (Ohio)
- Distance: 2,500-3,000 km
- Latency: ~60ms

Europe:
- Primary: EU West (Ireland)
- Secondary: EU Central (Frankfurt) or EU West (London)
- Distance: ~1,000-1,500 km
- Latency: ~20-30ms

Asia Pacific:
- Primary: Singapore or Tokyo
- Secondary: Sydney or Mumbai
- Distance: 5,000-7,000 km
- Latency: ~100-150ms

Global Triad:
- US East + EU West + Singapore/Tokyo
- Covers most global users
- <100ms to 80%+ of users
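The latency figures above can be sanity-checked from geography: light in fiber covers roughly 200 km per millisecond, so the theoretical minimum RTT is about 1 ms per 100 km, and real paths typically add 1.5-2x routing overhead. A rough sketch (the coordinates are approximate city locations, not official region datacenter positions):

```python
# Rough RTT estimate between regions from great-circle distance.
# Rule of thumb: light in fiber travels ~200,000 km/s; real paths add
# ~1.5-2x overhead for routing, so these numbers are lower bounds only.
from math import radians, sin, cos, asin, sqrt

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s = 200 km per millisecond

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def min_rtt_ms(distance_km):
    """Theoretical minimum round trip; observed RTT is usually 1.5-2x higher."""
    return 2 * distance_km / FIBER_KM_PER_MS

# Approximate coordinates (illustrative only)
us_east = (38.9, -77.0)   # N. Virginia
eu_west = (53.3, -6.3)    # Ireland

d = haversine_km(*us_east, *eu_west)   # ~5,400 km
rtt = min_rtt_ms(d)                    # ~54 ms theoretical minimum
```

Comparing the floor against measured RTTs is a quick way to spot whether a slow region pair is physics or a bad network path.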

Data Replication

Replication Patterns

Pattern 1: Async Replication (Most Common)
Primary ──────► Replica
         lag:
         ms to seconds

+ Lower latency for writes
+ Primary not blocked by replica
- Potential data loss on failover (RPO > 0)
- Replicas can serve stale reads (replication lag)

Pattern 2: Sync Replication
Primary ◄─────► Replica
         both
         confirm

+ No data loss on failover (RPO = 0)
+ Strong consistency
- Higher write latency (every write crosses regions)
- Writes fail if either region is unavailable

Pattern 3: Semi-Sync Replication
Primary ──────► At least 1 Replica (sync)
        └────► Other Replicas (async)

+ Every write durable on at least one replica
+ Balance of latency and durability
- More complex failure handling
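One common way to observe async replication lag is a heartbeat: the primary writes a timestamp on a schedule, and each replica reports "now minus the last heartbeat it has applied". A minimal sketch with dict-like stand-ins for the databases (a real deployment would write to and query the actual primary and replicas):

```python
# Heartbeat-based replication lag check (a common monitoring pattern):
# the primary periodically writes a timestamp, and each replica's lag is
# "now minus the last timestamp it has applied". Storage is simulated here.
import time

class Region:
    def __init__(self):
        self.heartbeat = 0.0  # last replicated heartbeat timestamp

def write_heartbeat(primary, now=None):
    primary.heartbeat = time.time() if now is None else now

def replicate(primary, replica):
    """Apply pending changes; under async replication this runs with a delay."""
    replica.heartbeat = primary.heartbeat

def replication_lag_s(replica, now=None):
    now = time.time() if now is None else now
    return now - replica.heartbeat

primary, replica = Region(), Region()
write_heartbeat(primary, now=100.0)
replicate(primary, replica)
write_heartbeat(primary, now=102.5)          # replica hasn't applied this yet
lag = replication_lag_s(replica, now=102.5)  # -> 2.5 seconds behind
```

Alerting on this lag metric is what turns "RPO > 0" from a design note into an enforced budget.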

Conflict Resolution

Multi-Primary Conflict Resolution:

Scenario: Same record updated in two regions simultaneously

Resolution Strategies:

1. Last Write Wins (LWW)
   └── Timestamp-based
   └── Simple but can lose data
   └── Clock sync important

2. First Write Wins
   └── First committed wins
   └── Later writes rejected or queued
   └── Good for "create once" data

3. Application-Level Resolution
   └── Custom merge logic
   └── Most flexible
   └── Most complex

4. CRDTs (Conflict-free Replicated Data Types)
   └── Mathematically guaranteed convergence
   └── Counters, sets, maps
   └── Good for specific use cases

Best Practice:
- Design to avoid conflicts where possible
- Partition data by region when appropriate
- Use single-primary for conflict-sensitive data
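As a concrete example of strategy 4, a grow-only counter (G-Counter) gives each region its own slot and merges by element-wise max, so concurrent increments in different regions always converge without losing updates. A minimal sketch (region names are illustrative):

```python
# Minimal G-Counter CRDT: each region increments only its own slot; merge
# takes the per-region max. Merge is commutative, associative, and
# idempotent, so replicas converge regardless of sync order.

class GCounter:
    def __init__(self, region):
        self.region = region
        self.counts = {}  # region name -> count

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Element-wise max over all regions seen by either replica."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

# Concurrent increments in two regions, then bidirectional sync:
us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
# Both replicas converge to 5; with last-write-wins one side's
# increments would have been silently discarded.
```

The same structure extends to sets and maps, which is why CRDTs suit counters, presence lists, and shopping-cart-style data but not arbitrary records.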

Failover Strategies

Failover Types

Failover Types:

1. DNS-Based Failover
   ┌─────────────────────────────────────────┐
   │  DNS Health Check                       │
   │  ├── Check primary every 10-30s         │
   │  ├── 3 consecutive failures = unhealthy │
   │  └── Update DNS to point to secondary   │
   └─────────────────────────────────────────┘

   RTO: 60-300 seconds (DNS TTL + propagation)
   Pros: Simple, works with any app
   Cons: Slow failover; clients may cache DNS records beyond the TTL

2. Load Balancer Failover
   ┌─────────────────────────────────────────┐
   │  Global Load Balancer                   │
   │  ├── Continuous health checks           │
   │  ├── Instant routing changes            │
   │  └── No DNS propagation wait            │
   └─────────────────────────────────────────┘

   RTO: 10-60 seconds
   Pros: Fast, reliable
   Cons: Requires a GLB, which can itself become a single point of failure

3. Application-Level Failover
   ┌─────────────────────────────────────────┐
   │  Client/App Aware                       │
   │  ├── Client retries to alternate region │
   │  ├── SDK handles failover               │
   │  └── No infrastructure dependency       │
   └─────────────────────────────────────────┘

   RTO: 1-10 seconds
   Pros: Fastest, most control
   Cons: Requires client changes
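Application-level failover can be as simple as an ordered list of regional endpoints with fall-through on error. A sketch with plain callables standing in for regional endpoints (a real client would add timeouts, backoff, and circuit breaking):

```python
# Application-level failover: try regions in preference order, falling
# through to the next on failure. Endpoints here are callables; a real
# client would issue HTTP requests with per-attempt timeouts.

class AllRegionsFailed(Exception):
    pass

def call_with_failover(regions, request):
    """Try each (name, endpoint) pair in order; return the first success."""
    errors = {}
    for name, endpoint in regions:
        try:
            return endpoint(request)
        except Exception as exc:
            errors[name] = exc  # record and fall through to the next region
    raise AllRegionsFailed(errors)

def down(_request):
    raise ConnectionError("primary region unreachable")

def up(request):
    return f"handled {request} in secondary"

result = call_with_failover([("us-east", down), ("eu-west", up)], "req-1")
# -> "handled req-1 in secondary"
```

This is the pattern behind the 1-10 second RTO above: the client detects failure on its own request path instead of waiting for DNS or a load balancer to notice.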

RTO and RPO

Recovery Objectives:

RTO (Recovery Time Objective):
└── Maximum acceptable downtime
└── Time from failure to recovery
└── Drives failover automation investment

RPO (Recovery Point Objective):
└── Maximum acceptable data loss
└── Age of the newest recoverable copy at failure time
└── Drives replication strategy

Common Targets:
┌──────────────┬──────────┬──────────┬───────────────────┐
│ Tier         │ RTO      │ RPO      │ Strategy          │
├──────────────┼──────────┼──────────┼───────────────────┤
│ Critical     │ <1 min   │ 0        │ Active-active     │
│              │          │          │ Sync replication  │
├──────────────┼──────────┼──────────┼───────────────────┤
│ High         │ <15 min  │ <1 min   │ Active-passive    │
│              │          │          │ Hot standby       │
├──────────────┼──────────┼──────────┼───────────────────┤
│ Medium       │ <4 hours │ <1 hour  │ Warm standby      │
│              │          │          │ Async replication │
├──────────────┼──────────┼──────────┼───────────────────┤
│ Low          │ <24 hours│ <24 hours│ Backup/Restore    │
│              │          │          │ Pilot light       │
└──────────────┴──────────┴──────────┴───────────────────┘

Traffic Routing

Global Load Balancing

GLB Routing Policies:

1. Geolocation Routing
   └── Route by user's geographic location
   └── Europe users → EU region
   └── Fallback for unmapped locations

2. Latency-Based Routing
   └── Route to lowest latency region
   └── Based on real measurements
   └── Adapts to network conditions

3. Weighted Routing
   └── Split traffic by percentage
   └── Good for rollouts, testing
   └── Example: 90% primary, 10% secondary

4. Failover Routing
   └── Primary region until unhealthy
   └── Automatic switch to secondary
   └── Health check driven

Cloud Implementations:
- AWS: Route 53, Global Accelerator
- Azure: Traffic Manager, Front Door
- GCP: Cloud Load Balancing
- Cloudflare: Load Balancing
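The geolocation and weighted policies can be sketched in a few lines; the country-to-region map and region names below are illustrative, not any provider's defaults:

```python
# Sketches of two GLB routing policies from the list above.
# GEO_MAP and region names are hypothetical examples.
import random

GEO_MAP = {"US": "us-east-1", "CA": "us-east-1",
           "DE": "eu-west-1", "FR": "eu-west-1"}
DEFAULT_REGION = "us-east-1"

def geolocation_route(country):
    """Geolocation routing with a fallback for unmapped locations."""
    return GEO_MAP.get(country, DEFAULT_REGION)

def weighted_route(weights, rng=random):
    """Weighted routing, e.g. {'us-east-1': 90, 'us-west-2': 10}."""
    regions = list(weights)
    return rng.choices(regions, weights=[weights[r] for r in regions])[0]

primary = geolocation_route("DE")    # eu-west-1
fallback = geolocation_route("JP")   # unmapped -> us-east-1
canary = weighted_route({"us-east-1": 90, "us-west-2": 10})
```

In practice these policies live in the GLB's configuration (Route 53 records, Traffic Manager profiles, etc.); the sketch just shows the decision each policy makes per request.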

Session Handling

Session Affinity in Multi-Region:

Challenge: User session state across regions

Option 1: Sticky Sessions
└── User stays in same region for session
└── Failover loses session
└── Simple but limited DR

Option 2: Centralized Session Store
└── Session in Redis/database
└── All regions access same store
└── Adds cross-region latency; store is a single point of failure

Option 3: Distributed Session Store
└── Redis Cluster across regions
└── Session replicated
└── Complex but resilient

Option 4: Stateless (JWT/Token)
└── Session in client-side token
└── No server-side state
└── Best for multi-region

Recommendation:
- Prefer stateless where possible
- If stateful, use distributed store
- Design for session loss on failover
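Option 4 can be illustrated with a signed token built from the standard library. This is a simplified JWT-like scheme for illustration only; in practice use a vetted JWT library and include an expiry claim:

```python
# Stateless session sketch: a signed token carries the session, so any
# region can validate it with the shared secret and no session store.
import base64, hashlib, hmac, json

SECRET = b"shared-across-regions"  # in practice, from a secrets manager

def issue_token(claims):
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def verify_token(token):
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different secret
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user": "alice", "region": "eu-west-1"})
claims = verify_token(token)  # validates identically in every region
tampered = verify_token(token[:-1] + ("0" if token[-1] != "0" else "1"))  # -> None
```

Because verification needs only the shared secret, a failover to another region costs users nothing: their tokens remain valid with no session replication at all.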

Database Patterns

Database Deployment Options

Option 1: Single Primary + Read Replicas
┌───────────────┐         ┌───────────────┐
│   US-EAST     │         │   EU-WEST     │
│  ┌─────────┐  │   ───►  │  ┌─────────┐  │
│  │ Primary │  │  Async  │  │ Replica │  │
│  │  (R/W)  │  │  Replic │  │  (Read) │  │
│  └─────────┘  │         │  └─────────┘  │
└───────────────┘         └───────────────┘
- Writes go to primary region
- Reads served locally
- Failover promotes replica
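A sketch of the read/write split in Option 1: writes always go to the primary region's endpoint, reads prefer the local replica. The endpoint names are hypothetical, and a real router would hold connection pools per region:

```python
# Read/write splitting for a single-primary, multi-region database:
# writes route to the primary region; reads prefer the local replica.

class RegionalRouter:
    def __init__(self, primary_region, endpoints):
        self.primary_region = primary_region
        self.endpoints = endpoints  # region -> endpoint

    def endpoint_for(self, operation, local_region):
        if operation == "write":
            return self.endpoints[self.primary_region]
        # Reads: use the local replica; fall back to primary if none exists.
        return self.endpoints.get(local_region,
                                  self.endpoints[self.primary_region])

router = RegionalRouter("us-east-1", {
    "us-east-1": "db.us-east-1.internal",         # hypothetical endpoints
    "eu-west-1": "replica.eu-west-1.internal",
})
write_target = router.endpoint_for("write", "eu-west-1")  # cross-region
read_target = router.endpoint_for("read", "eu-west-1")    # local replica
```

Failover then amounts to updating `primary_region` after promoting a replica, which is why this layout keeps the application logic simple.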

Option 2: Multi-Primary (Active-Active)
┌───────────────┐         ┌───────────────┐
│   US-EAST     │◄───────►│   EU-WEST     │
│  ┌─────────┐  │  Bi-dir │  ┌─────────┐  │
│  │ Primary │  │  Replic │  │ Primary │  │
│  │  (R/W)  │  │         │  │  (R/W)  │  │
│  └─────────┘  │         │  └─────────┘  │
└───────────────┘         └───────────────┘
- Writes accepted in both regions
- Conflict resolution required
- Complex but lowest latency

Option 3: Globally Distributed Database
┌──────────────────────────────────────┐
│  CockroachDB / Spanner / YugabyteDB  │
│  ┌─────┐   ┌─────┐   ┌─────┐         │
│  │ US  │───│ EU  │───│ APAC│         │
│  └─────┘   └─────┘   └─────┘         │
│  Automatic sharding and replication  │
└──────────────────────────────────────┘
- Database handles distribution
- Strong consistency available
- Higher latency for writes

Testing and Validation

Chaos Engineering for Multi-Region

Multi-Region Chaos Tests:

1. Region Failover Test
   □ Fail primary region completely
   □ Measure failover time
   □ Verify data integrity
   □ Test user experience

2. Network Partition Test
   □ Block inter-region communication
   □ Verify split-brain handling
   □ Test conflict resolution

3. Partial Failure Test
   □ Fail subset of services in region
   □ Test degraded operation
   □ Verify monitoring/alerting

4. Data Replication Lag Test
   □ Introduce artificial lag
   □ Test application behavior
   □ Verify consistency expectations

5. Failback Test
   □ Restore failed region
   □ Test data sync
   □ Test traffic redistribution

Schedule:
- Failover tests: Monthly
- Full DR drill: Quarterly
- Chaos experiments: Weekly
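For the failover test above, RTO can be computed from the stream of health-check observations: the gap between the first failed check and the first subsequent success. A sketch over simulated observations (a real drill would poll each region's public endpoint):

```python
# Measuring RTO in a failover drill from health-check observations.
# Each observation is (timestamp_seconds, healthy); the service is simulated.

def measure_rto(health_checks):
    """Return seconds from the first failure to the first subsequent
    success, or None if no complete failure/recovery cycle was observed."""
    failed_at = recovered_at = None
    for ts, healthy in health_checks:
        if not healthy and failed_at is None:
            failed_at = ts                 # outage began
        if healthy and failed_at is not None:
            recovered_at = ts              # standby is now answering
            break
    if failed_at is None or recovered_at is None:
        return None
    return recovered_at - failed_at

# Primary dies at t=10, secondary takes over at t=55 -> RTO of 45s
observations = [(0, True), (10, False), (30, False), (55, True)]
rto = measure_rto(observations)
```

Recording this number each drill, alongside the measured data loss (RPO), turns the tier targets in the table above into something you can verify rather than assume.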

Best Practices

Multi-Region Best Practices:

1. Design for Failure
   □ Assume any region can fail
   □ No single points of failure
   □ Automated failover
   □ Regular testing

2. Data Strategy
   □ Define consistency requirements
   □ Choose appropriate replication
   □ Plan for conflicts
   □ Consider data residency

3. Observability
   □ Cross-region metrics
   □ Distributed tracing
   □ Centralized logging
   □ Region-aware alerting

4. Cost Management
   □ Right-size standby resources
   □ Use reserved capacity wisely
   □ Monitor data transfer costs
   □ Consider traffic patterns

5. Operational Readiness
   □ Runbooks for failover
   □ Regular DR drills
   □ On-call training
   □ Post-incident reviews

Related Skills

  • latency-optimization - Reducing global latency
  • distributed-consensus - Consistency patterns
  • cdn-architecture - Edge caching for multi-region
  • chaos-engineering-fundamentals - Testing resilience