Scalability Playbook
Systematic approach to identifying and resolving scalability bottlenecks.
Bottleneck Analysis
Current System Profile
Traffic: 1,000 req/min
Users: 10,000 active
Data: 100GB database
Response time: p95 = 500ms
Identified Bottlenecks
1. Database Queries
- Symptom: Slow page loads (2-3s)
- Measurement: Query time p95 = 800ms
- Impact: HIGH - affects all reads
- Trigger: When query time p95 > 500ms
2. Single Server
- Symptom: High CPU (>80%)
- Measurement: Load average > 4
- Impact: MEDIUM - intermittent slowdowns
- Trigger: When CPU > 70%
3. No Caching
- Symptom: Repeated DB queries
- Measurement: Cache hit rate = 0%
- Impact: MEDIUM - unnecessary load
- Trigger: When query volume > 10k/min
Scaling Strategies (Ordered)
Level 1: Quick Wins (Days)
1.1 Add Database Indexes
- Problem: Slow queries
- Solution:
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
- Expected Impact: 80% faster queries
- Cost: $0
- Effort: 1 day
1.2 Enable Query Caching
- Problem: Repeated queries
- Solution: Redis cache layer
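// Cache-aside read: try Redis first, fall back to the database on a miss.
async function getUser(userId) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);
  const user = await db.users.findById(userId);
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user)); // 1-hour TTL
  return user;
}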
- Expected Impact: 60% reduction in DB load
- Cost: $50/month
- Effort: 2 days
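The bottleneck analysis measures caching by hit rate, so it can help to count hits and misses at the point where the cache is read. The sketch below is one way to do that, not a definitive implementation. `MemoryCache`, `createCachedLoader`, and `hitRate` are hypothetical names; `MemoryCache` is an in-memory stand-in that exposes only `get` and `setex`, so the example runs without a Redis server.

```javascript
// In-memory stand-in for a Redis client (assumption: only get/setex are needed).
class MemoryCache {
  constructor() {
    this.entries = new Map();
  }
  async get(key) {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null;
    return entry.value;
  }
  async setex(key, ttlSeconds, value) {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Wraps a loader function with cache-aside reads and hit/miss counters.
function createCachedLoader(cache, keyPrefix, ttlSeconds, load) {
  const stats = { hits: 0, misses: 0 };

  async function get(id) {
    const key = `${keyPrefix}:${id}`;
    const cached = await cache.get(key);
    if (cached !== null) {
      stats.hits += 1;
      return JSON.parse(cached);
    }
    stats.misses += 1;
    const value = await load(id); // e.g. a database query
    await cache.setex(key, ttlSeconds, JSON.stringify(value));
    return value;
  }

  // Fraction of reads served from the cache, between 0 and 1.
  function hitRate() {
    const total = stats.hits + stats.misses;
    return total === 0 ? 0 : stats.hits / total;
  }

  return { get, hitRate, stats };
}
```

A loader for users could then be created with `createCachedLoader(cache, "user", 3600, (id) => db.users.findById(id))`, with `hitRate()` exported to whatever monitoring is already in place.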
Level 2: Horizontal Scaling (Weeks)
2.1 Add Read Replicas
- Problem: Read-heavy workload
- Solution: Route reads to replicas
Write Load: Primary DB
Read Load: 3x Read Replicas
- Expected Impact: 3x read capacity
- Cost: $300/month
- Effort: 1 week
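Routing reads to replicas needs a decision point in the application. A minimal sketch, assuming the application chooses a connection per query: writes go to the primary and reads rotate across the replicas. `createDbRouter`, `forWrite`, and `forRead` are hypothetical names, and the connection values are placeholders for whatever pool or client object the application uses.

```javascript
// Routes writes to the primary and spreads reads across replicas round-robin.
// The connection arguments are placeholders; any value (pool, client, URL) works.
function createDbRouter(primary, replicas) {
  let next = 0;
  return {
    forWrite() {
      return primary;
    },
    forRead() {
      // Fall back to the primary when no replica is configured.
      if (replicas.length === 0) return primary;
      const replica = replicas[next];
      next = (next + 1) % replicas.length;
      return replica;
    },
  };
}

const router = createDbRouter("primary", ["replica-1", "replica-2", "replica-3"]);
```

Replicas usually lag slightly behind the primary, so a read that must see a write made moments earlier can use `forWrite()` instead of `forRead()`.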
2.2 Load Balancer + Multiple Servers
- Problem: Single point of failure
- Solution:
ALB
├── Server 1
├── Server 2
└── Server 3
- Expected Impact: 3x throughput
- Cost: $400/month
- Effort: 1 week
Level 3: Architecture Changes (Months)
3.1 CDN for Static Assets
- Problem: Slow asset delivery
- Solution: CloudFront CDN
- Expected Impact: 90% faster asset loads
- Cost: $100/month
- Effort: 1 week
3.2 Async Processing
- Problem: Slow sync operations
- Solution: Background job queues
// Before: Sync
await sendEmail(user);
await processPayment(order);
await updateAnalytics(event);
return response; // Waits 5+ seconds
// After: Async
await queue.add("send-email", { userId });
await queue.add("process-payment", { orderId });
await queue.add("update-analytics", { event });
return response; // Returns immediately
- Expected Impact: 80% faster responses
- Cost: $50/month (SQS)
- Effort: 2 weeks
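The snippet above shows only the producer side. The worker side can be sketched as follows, under the assumption that jobs are retried a fixed number of times before being reported as failed. `InMemoryQueue`, `process`, and `drain` are hypothetical names; the class is an in-memory stand-in and is not the SQS API.

```javascript
// Minimal in-memory job queue (assumption: stands in for SQS or a similar service).
class InMemoryQueue {
  constructor() {
    this.jobs = [];
    this.handlers = new Map();
  }

  // Producer side: record the job and return immediately.
  add(name, payload) {
    this.jobs.push({ name, payload, attempts: 0 });
  }

  // Worker side: register the function that handles one job type.
  process(name, handler) {
    this.handlers.set(name, handler);
  }

  // Run queued jobs in order; a failing job is re-queued until maxAttempts is reached.
  async drain(maxAttempts = 3) {
    const failed = [];
    while (this.jobs.length > 0) {
      const job = this.jobs.shift();
      const handler = this.handlers.get(job.name);
      job.attempts += 1;
      try {
        if (!handler) throw new Error(`no handler for ${job.name}`);
        await handler(job.payload);
      } catch (err) {
        if (job.attempts < maxAttempts) this.jobs.push(job);
        else failed.push({ job, error: err.message });
      }
    }
    return failed;
  }
}
```

Because a retried job can run more than once, handlers such as payment processing should be idempotent, and the list returned by `drain` is what an alert or dead-letter store would consume.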
Level 4: Data Layer Optimization (Months)
4.1 Database Sharding
- Problem: Single DB too large
- Solution: Shard by user_id
Shard 1: user_id 0-24999
Shard 2: user_id 25000-49999
Shard 3: user_id 50000-74999
Shard 4: user_id 75000-99999
- Expected Impact: 4x capacity
- Cost: $1,200/month
- Effort: 2 months
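With range-based sharding, every query first has to resolve which shard owns the user. A minimal lookup sketch using the four ranges listed above; the shard names and the `shardFor` function are hypothetical, and a real deployment would map each name to a connection.

```javascript
// Range table copied from the four shards listed above (names are illustrative).
const SHARDS = [
  { name: "shard-1", min: 0, max: 24999 },
  { name: "shard-2", min: 25000, max: 49999 },
  { name: "shard-3", min: 50000, max: 74999 },
  { name: "shard-4", min: 75000, max: 99999 },
];

// Returns the name of the shard whose range contains userId, or throws if none does.
function shardFor(userId) {
  if (!Number.isInteger(userId)) {
    throw new TypeError(`userId must be an integer, got ${userId}`);
  }
  const shard = SHARDS.find((s) => userId >= s.min && userId <= s.max);
  if (!shard) {
    throw new RangeError(`no shard covers user_id ${userId}`);
  }
  return shard.name;
}
```

Throwing on an uncovered id is a deliberate choice in this sketch: the ranges stop at 99999, so growth past that point should fail loudly until the range table is extended.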
4.2 Event-Driven Architecture
- Problem: Tight coupling, cascading failures
- Solution: Message broker (Kafka)
Service A → Kafka → Service B
↘ ↗ Service C
- Expected Impact: Better isolation, resilience
- Cost: $500/month
- Effort: 3 months
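The isolation benefit can be illustrated without a broker. The sketch below is an in-memory publish/subscribe bus in which one failing consumer neither blocks the others nor surfaces an error to the publisher. `EventBus`, `subscribe`, and `publish` are hypothetical names; this is not the Kafka client API, and unlike a real broker it delivers synchronously and stores nothing durably.

```javascript
// In-memory event bus (assumption: a stand-in for a broker, not the Kafka client API).
class EventBus {
  constructor() {
    this.subscribers = new Map(); // topic -> array of { name, handler }
  }

  subscribe(topic, name, handler) {
    const list = this.subscribers.get(topic) || [];
    list.push({ name, handler });
    this.subscribers.set(topic, list);
  }

  // Deliver to every subscriber of the topic. A consumer that throws is recorded
  // in the returned list; the remaining consumers still receive the event.
  publish(topic, event) {
    const failures = [];
    for (const { name, handler } of this.subscribers.get(topic) || []) {
      try {
        handler(event);
      } catch (err) {
        failures.push({ name, error: err.message });
      }
    }
    return failures;
  }
}
```

In this shape Service A only calls `publish`, while Service B and Service C each register their own handler, so adding or breaking a consumer requires no change to the publisher.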
Scaling Triggers
| Metric | Current | Warning | Critical | Action |
| ---------------- | ------- | ------- | -------- | ----------------------- |
| CPU | 40% | 70% | 85% | Add servers |
| Memory | 50% | 75% | 90% | Upgrade instances |
| DB Connections | 20 | 40 | 50 | Add read replicas |
| Query Time (p95) | 200ms | 500ms | 1000ms | Add indexes |
| Queue Depth | 100 | 1000 | 5000 | Add workers |
| Error Rate | 0.1% | 1% | 5% | Investigate immediately |
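The table can be turned into an automated check. A minimal sketch that classifies a metrics snapshot against the warning and critical thresholds above; the metric key names and the `evaluate` function are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption the table does not state.

```javascript
// Warning and critical thresholds copied from the table above.
// Key names are illustrative; units are encoded in the name.
const TRIGGERS = {
  cpuPercent: { warning: 70, critical: 85, action: "Add servers" },
  memoryPercent: { warning: 75, critical: 90, action: "Upgrade instances" },
  dbConnections: { warning: 40, critical: 50, action: "Add read replicas" },
  queryTimeP95Ms: { warning: 500, critical: 1000, action: "Add indexes" },
  queueDepth: { warning: 1000, critical: 5000, action: "Add workers" },
  errorRatePercent: { warning: 1, critical: 5, action: "Investigate immediately" },
};

// Returns one alert per metric that has reached its warning or critical threshold.
function evaluate(metrics) {
  const alerts = [];
  for (const [name, value] of Object.entries(metrics)) {
    const trigger = TRIGGERS[name];
    if (!trigger) continue; // ignore metrics without a defined trigger
    if (value >= trigger.critical) {
      alerts.push({ name, level: "critical", action: trigger.action });
    } else if (value >= trigger.warning) {
      alerts.push({ name, level: "warning", action: trigger.action });
    }
  }
  return alerts;
}
```

Calling `evaluate` with the "Current" column (`cpuPercent: 40`, `memoryPercent: 50`, and so on) returns an empty list, which is a quick sanity check that the thresholds were transcribed correctly.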
Phased Scaling Plan
Phase 1: Current → 10x (0-3 months)
Target: 10,000 req/min, 100K users
Actions:
- Add database indexes (Week 1)
- Implement Redis caching (Week 2)
- Add 3x read replicas (Week 4)
- Horizontal scale app servers (Week 6)
- CDN for static assets (Week 8)
Cost: $500 → $1,000/month
Phase 2: 10x → 100x (3-12 months)
Target: 100,000 req/min, 1M users
Actions:
- Database sharding (Month 4-6)
- Multi-region deployment (Month 6-8)
- Microservices extraction (Month 8-12)
- Event-driven architecture (Month 10-12)
Cost: $1,000 → $10,000/month
Phase 3: 100x → 1000x (12-24 months)
Target: 1M req/min, 10M users
Actions:
- Global CDN (Month 13)
- Advanced caching (L1/L2) (Month 14-15)
- Custom DB solutions (Month 16-18)
- Edge computing (Month 18-20)
Cost: $10,000 → $100,000/month
Load Testing Plan
# Current baseline
hey -n 10000 -c 100 https://api.example.com/users
# Target 10x
hey -n 100000 -c 1000 https://api.example.com/users
# Measure:
# - Requests/sec
# - p50, p95, p99 latency
# - Error rate
# - Resource utilization
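If raw per-request latencies are available from the load test, the percentiles in the list above can be computed directly. A minimal sketch using the nearest-rank method, which is one of several percentile definitions; `percentile` and `summarize` are hypothetical names, and the error rate assumes `latenciesMs` holds one sample per request, failed requests included.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift from p / 100.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Summarizes one load-test run into the metrics listed above.
function summarize(latenciesMs, errorCount) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
    errorRatePercent: (errorCount / latenciesMs.length) * 100,
  };
}
```

Comparing the `summarize` output of the baseline run with the 10x run shows whether p95 stays under the 500ms warning threshold as load grows.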
Cost-Benefit Analysis
| Strategy | Cost/Month | Expected Impact | ROI | Priority |
| ------------- | ---------- | ------------------ | --- | -------- |
| DB Indexes | $0 | 80% faster queries | ∞ | HIGH |
| Redis Cache | $50 | 60% less DB load | 12x | HIGH |
| Read Replicas | $300 | 3x capacity | 10x | MEDIUM |
| Load Balancer | $400 | 3x throughput | 7x | MEDIUM |
| DB Sharding | $1,200 | 4x capacity | 3x | LOW |
Best Practices
- Measure first: Don't optimize blindly
- Low-hanging fruit: Start with easy wins
- Load test: Validate before production
- Monitor continuously: Set up alerts
- Plan ahead: Scale before hitting limits
- Cost-conscious: ROI-driven decisions
- Incremental: Small, safe changes
Output Checklist
- Current system profile
- Bottlenecks identified and measured
- Scaling strategies ordered by effort
- Triggers defined for each action
- Phased plan (1x → 10x → 100x → 1000x)
- Cost estimates per phase
- Load testing plan
- Monitoring dashboard
- Rollback procedures