Scalability Playbook

A systematic approach to identifying and resolving scalability bottlenecks.

Bottleneck Analysis

Current System Profile

Traffic: 1,000 req/min
Users: 10,000 active
Data: 100GB database
Response time: p95 = 500ms

Identified Bottlenecks

1. Database Queries

Symptom: Slow page loads (2-3s)
Measurement: Query time p95 = 800ms
Impact: HIGH - affects all reads
Trigger: When p95 >500ms

2. Single Server

Symptom: High CPU (>80%)
Measurement: Load average >4
Impact: MEDIUM - intermittent slowdowns
Trigger: When CPU >70%

3. No Caching

Symptom: Repeated DB queries
Measurement: Cache hit rate = 0%
Impact: MEDIUM - unnecessary load
Trigger: When query volume >10k/min

Scaling Strategies (Ordered)

Level 1: Quick Wins (Days)

1.1 Add Database Indexes

Problem: Slow queries
Solution:

CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

Expected Impact: 80% faster queries
Cost: $0
Effort: 1 day

1.2 Enable Query Caching

Problem: Repeated queries
Solution: Redis cache layer

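// Cache-aside read: try Redis first, fall back to the database.
async function getUser(userId) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);

  const user = await db.users.findById(userId);
  // Cache for 1 hour (3600 s); skip caching when no user was found.
  if (user) await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));
  return user;
}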

Expected Impact: 60% reduction in DB load
Cost: $50/month
Effort: 2 days
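To check whether the cache is paying off, the cache hit rate from the bottleneck list can be tracked next to the cache-aside read. The sketch below is a minimal in-memory stand-in for Redis, not the production cache: `CountingCache`, `getOrLoad`, and the injected `now` clock are hypothetical names introduced only for illustration.

```typescript
// In-memory stand-in for Redis with per-key expiry (hypothetical, illustration only).
class CountingCache<T> {
  hits = 0;
  misses = 0;
  private entries: Record<string, { value: T; expiresAt: number }> = {};
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Cache-aside read: return the cached value, or call `load` and store its result.
  getOrLoad(key: string, load: () => T): T {
    const entry = this.entries[key];
    if (entry !== undefined && entry.expiresAt > this.now()) {
      this.hits++;
      return entry.value;
    }
    this.misses++;
    const value = load();
    this.entries[key] = { value, expiresAt: this.now() + this.ttlMs };
    return value;
  }

  // Fraction of reads served without touching the database.
  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```

Only misses reach the database, so for the reads that go through the cache, the drop in DB queries tracks the hit rate: at an 80% hit rate, one read in five still hits the database.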

Level 2: Horizontal Scaling (Weeks)

2.1 Add Read Replicas

Problem: Read-heavy workload
Solution: Route reads to replicas

Write Load: Primary DB
Read Load: 3x Read Replicas

Expected Impact: 3x read capacity
Cost: $300/month
Effort: 1 week
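One way to route reads to replicas is a small router in the application that sends writes to the primary and rotates reads across the replicas. This is a sketch under assumptions: `DbNode` and `ReadWriteRouter` are hypothetical names, and a real deployment might rely on a driver or proxy feature instead.

```typescript
// Hypothetical connection handle; a real one would wrap a database client.
interface DbNode {
  name: string;
}

class ReadWriteRouter {
  private primary: DbNode;
  private replicas: DbNode[];
  private next = 0;

  constructor(primary: DbNode, replicas: DbNode[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // Writes always go to the primary.
  forWrite(): DbNode {
    return this.primary;
  }

  // Reads rotate across replicas (round-robin); fall back to the primary if none exist.
  forRead(): DbNode {
    if (this.replicas.length === 0) return this.primary;
    const node = this.replicas[this.next];
    this.next = (this.next + 1) % this.replicas.length;
    return node;
  }
}
```

Replicas that replicate asynchronously can lag behind the primary, so reads that must see a write made moments earlier may still need `forWrite()`.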

2.2 Load Balancer + Multiple Servers

Problem: Single point of failure
Solution:

ALB
 ├── Server 1
 ├── Server 2
 └── Server 3

Expected Impact: 3x throughput
Cost: $400/month
Effort: 1 week

Level 3: Architecture Changes (Months)

3.1 CDN for Static Assets

Problem: Slow asset delivery
Solution: CloudFront CDN
Expected Impact: 90% faster asset loads
Cost: $100/month
Effort: 1 week

3.2 Async Processing

Problem: Slow sync operations
Solution: Background job queues

// Before: Sync
await sendEmail(user);
await processPayment(order);
await updateAnalytics(event);
return response; // Waits 5+ seconds

// After: Async
await queue.add("send-email", { userId });
await queue.add("process-payment", { orderId });
await queue.add("update-analytics", { event });
return response; // Returns immediately

Expected Impact: 80% faster responses
Cost: $50/month (SQS)
Effort: 2 weeks
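The before/after snippet leaves out the worker side. The sketch below shows the split between enqueueing on the request path and processing in a worker, using an in-memory queue as a stand-in for SQS. `InMemoryQueue`, `handleCheckout`, and the synchronous handlers are assumptions made to keep the example short; a real queue client would be asynchronous and durable.

```typescript
type Payload = Record<string, unknown>;
type Handler = (payload: Payload) => void;

// In-memory stand-in for a job queue such as SQS (hypothetical, illustration only).
class InMemoryQueue {
  private jobs: { name: string; payload: Payload }[] = [];
  private handlers: Record<string, Handler> = {};

  // Register the worker function for a job name.
  process(name: string, handler: Handler): void {
    this.handlers[name] = handler;
  }

  // Enqueueing only records the job; no work happens on the request path.
  add(name: string, payload: Payload): void {
    this.jobs.push({ name, payload });
  }

  // Number of jobs waiting; this is the "Queue Depth" metric.
  get depth(): number {
    return this.jobs.length;
  }

  // Worker side: run every pending job in FIFO order and return how many ran.
  drain(): number {
    let ran = 0;
    let job = this.jobs.shift();
    while (job !== undefined) {
      const handler = this.handlers[job.name];
      if (handler === undefined) throw new Error(`no handler for ${job.name}`);
      handler(job.payload);
      ran++;
      job = this.jobs.shift();
    }
    return ran;
  }
}

// Request handler: enqueue the slow work and respond without waiting for it.
function handleCheckout(queue: InMemoryQueue, userId: number, orderId: number): string {
  queue.add("send-email", { userId });
  queue.add("process-payment", { orderId });
  return "accepted";
}
```

Because the response no longer reflects the outcome of the queued work, failures in jobs such as payment processing have to be reported to the user some other way, for example through a status endpoint or a notification.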

Level 4: Data Layer Optimization (Months)

4.1 Database Sharding

Problem: Single DB too large
Solution: Shard by user_id

Shard 1: user_id 0-24999
Shard 2: user_id 25000-49999
Shard 3: user_id 50000-74999
Shard 4: user_id 75000-99999

Expected Impact: 4x capacity
Cost: $1,200/month
Effort: 2 months
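The range layout above can be expressed as a lookup that maps a `user_id` to its shard. This is a minimal sketch: the shard names and the `shardFor` function are hypothetical, and the ranges are copied from the list above.

```typescript
interface Shard {
  name: string;
  minUserId: number;
  maxUserId: number; // inclusive
}

// Ranges copied from the shard list; names are placeholders.
const shards: Shard[] = [
  { name: "shard-1", minUserId: 0, maxUserId: 24999 },
  { name: "shard-2", minUserId: 25000, maxUserId: 49999 },
  { name: "shard-3", minUserId: 50000, maxUserId: 74999 },
  { name: "shard-4", minUserId: 75000, maxUserId: 99999 },
];

// Return the shard whose range contains userId, or throw if no range covers it.
function shardFor(userId: number): Shard {
  for (const shard of shards) {
    if (userId >= shard.minUserId && userId <= shard.maxUserId) return shard;
  }
  throw new Error(`no shard covers user_id ${userId}`);
}
```

Range sharding keeps the mapping easy to reason about, but any `user_id` above 99999 needs a new range, and recently created users tend to concentrate on the last shard.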

4.2 Event-Driven Architecture

Problem: Tight coupling, cascading failures
Solution: Message broker (Kafka)

Service A → Kafka ─┬→ Service B
                   └→ Service C

Expected Impact: Better isolation, resilience
Cost: $500/month
Effort: 3 months

Scaling Triggers

| Metric           | Current | Warning | Critical | Action                  |
| ---------------- | ------- | ------- | -------- | ----------------------- |
| CPU              | 40%     | 70%     | 85%      | Add servers             |
| Memory           | 50%     | 75%     | 90%      | Upgrade instances       |
| DB Connections   | 20      | 40      | 50       | Add read replicas       |
| Query Time (p95) | 200ms   | 500ms   | 1000ms   | Add indexes             |
| Queue Depth      | 100     | 1000    | 5000     | Add workers             |
| Error Rate       | 0.1%    | 1%      | 5%       | Investigate immediately |
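The table can be turned into an automated check that classifies each reading and reports the matching action. The sketch below assumes a reading at or above a threshold trips it (the table does not say whether thresholds are inclusive), and the metric keys such as `cpu_percent` are hypothetical names.

```typescript
type Level = "ok" | "warning" | "critical";

interface Trigger {
  metric: string; // hypothetical metric key
  warning: number;
  critical: number;
  action: string;
}

interface Alert {
  metric: string;
  level: Level;
  action: string;
}

// Thresholds and actions copied from the Scaling Triggers table.
const triggers: Trigger[] = [
  { metric: "cpu_percent", warning: 70, critical: 85, action: "Add servers" },
  { metric: "memory_percent", warning: 75, critical: 90, action: "Upgrade instances" },
  { metric: "db_connections", warning: 40, critical: 50, action: "Add read replicas" },
  { metric: "query_p95_ms", warning: 500, critical: 1000, action: "Add indexes" },
  { metric: "queue_depth", warning: 1000, critical: 5000, action: "Add workers" },
  { metric: "error_rate_percent", warning: 1, critical: 5, action: "Investigate immediately" },
];

// Assumption: thresholds are inclusive.
function classify(trigger: Trigger, value: number): Level {
  if (value >= trigger.critical) return "critical";
  if (value >= trigger.warning) return "warning";
  return "ok";
}

// Return one alert per metric that is at warning level or worse; skip missing readings.
function evaluate(readings: Record<string, number>): Alert[] {
  const alerts: Alert[] = [];
  for (const trigger of triggers) {
    const value = readings[trigger.metric];
    if (value === undefined) continue;
    const level = classify(trigger, value);
    if (level !== "ok") {
      alerts.push({ metric: trigger.metric, level: level, action: trigger.action });
    }
  }
  return alerts;
}
```

For example, `evaluate({ cpu_percent: 40, query_p95_ms: 800 })` reports only the query-time warning, since 40% CPU is below its warning threshold.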

Phased Scaling Plan

Phase 1: Current → 10x (0-3 months)

Target: 10,000 req/min, 100K users

Actions:

Add database indexes (Week 1)
Implement Redis caching (Week 2)
Add 3x read replicas (Week 4)
Horizontal scale app servers (Week 6)
CDN for static assets (Week 8)

Cost: $500 → $1,000/month

Phase 2: 10x → 100x (3-12 months)

Target: 100,000 req/min, 1M users

Actions:

Database sharding (Month 4-6)
Multi-region deployment (Month 6-8)
Microservices extraction (Month 8-12)
Event-driven architecture (Month 10-12)

Cost: $1,000 → $10,000/month

Phase 3: 100x → 1000x (12-24 months)

Target: 1M req/min, 10M users

Actions:

Global CDN (Month 13)
Advanced caching (L1/L2) (Month 14-15)
Custom DB solutions (Month 16-18)
Edge computing (Month 18-20)

Cost: $10,000 → $100,000/month

Load Testing Plan

# Current baseline
hey -n 10000 -c 100 https://api.example.com/users

# Target 10x
hey -n 100000 -c 1000 https://api.example.com/users

# Measure:
# - Requests/sec
# - p50, p95, p99 latency
# - Error rate
# - Resource utilization
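If raw per-request latencies are exported from a test run, the measurements listed above can be computed directly. This sketch uses the nearest-rank percentile definition, which is one of several in common use and may differ slightly from what a given load-testing tool reports. `percentile` and `summarize` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

interface LoadTestSummary {
  requestsPerSec: number;
  p50: number;
  p95: number;
  p99: number;
  errorRate: number; // fraction of all requests that failed
}

// latenciesMs: latencies of successful requests; errors: count of failed requests.
function summarize(latenciesMs: number[], errors: number, durationSec: number): LoadTestSummary {
  const total = latenciesMs.length + errors;
  return {
    requestsPerSec: total / durationSec,
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
    errorRate: errors / total,
  };
}
```

Comparing the p95 from the baseline run against the 10x run shows whether latency stays within the trigger thresholds as load grows.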

Cost-Benefit Analysis

| Strategy      | Cost/Month | Expected Impact    | ROI | Priority |
| ------------- | ---------- | ------------------ | --- | -------- |
| DB Indexes    | $0         | 80% faster queries | ∞   | HIGH     |
| Redis Cache   | $50        | 60% less DB load   | 12x | HIGH     |
| Read Replicas | $300       | 3x capacity        | 10x | MEDIUM   |
| Load Balancer | $400       | 3x throughput      | 7x  | MEDIUM   |
| DB Sharding   | $1,200     | 4x capacity        | 3x  | LOW      |

Best Practices

Measure first: Don't optimize blindly
Low-hanging fruit: Start with easy wins
Load test: Validate before production
Monitor continuously: Set up alerts
Plan ahead: Scale before hitting limits
Cost-conscious: ROI-driven decisions
Incremental: Small, safe changes
