scalability-playbook
Scalability Playbook
Systematic approach to identifying and resolving scalability bottlenecks.
Bottleneck Analysis
Current System Profile
Traffic: 1,000 req/min
Users: 10,000 active
Data: 100GB database
Response time: p95 = 500ms
Identified Bottlenecks
1. Database Queries
Symptom: Slow page loads (2-3s)
Measurement: Query time p95 = 800ms
Impact: HIGH - affects all reads
Trigger: When p95 > 500ms
2. Single Server
Symptom: High CPU (>80%)
Measurement: Load average > 4
Impact: MEDIUM - intermittent slowdowns
Trigger: When CPU > 70%
3. No Caching
Symptom: Repeated DB queries
Measurement: Cache hit rate = 0%
Impact: MEDIUM - unnecessary load
Trigger: When query volume > 10k/min
Scaling Strategies (Ordered)
Level 1: Quick Wins (Days)
1.1 Add Database Indexes
Problem: Slow queries
Solution:
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
Expected Impact: 80% faster queries
Cost: $0
Effort: 1 day
1.2 Enable Query Caching
Problem: Repeated queries
Solution: Redis cache layer
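// Cache-aside read; assumes `redis` and `db` clients are already configured.
async function getUser(userId) {
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);
  const user = await db.users.findById(userId);
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user)); // TTL: 1 hour
  return user;
}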
Expected Impact: 60% reduction in DB load
Cost: $50/month
Effort: 2 days
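The link between cache hit rate and DB load can be sketched as follows. This is a minimal illustration, not the playbook's implementation: a `Map` stands in for Redis, and `createCachedLoader` and `loadFromDb` are hypothetical names. It assumes every cache miss costs exactly one database query.

```javascript
// Cache-aside with hit/miss counters. A Map stands in for Redis here;
// createCachedLoader and loadFromDb are hypothetical names for illustration.
function createCachedLoader(loadFromDb, ttlMs, now = () => Date.now()) {
  const store = new Map(); // key -> { value, expiresAt }
  const stats = { hits: 0, misses: 0 };

  async function get(key) {
    const entry = store.get(key);
    if (entry && entry.expiresAt > now()) {
      stats.hits += 1;
      return entry.value;
    }
    stats.misses += 1; // assumption: every miss is one database query
    const value = await loadFromDb(key);
    store.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  }

  // Fraction of lookups served without touching the database.
  function hitRate() {
    const total = stats.hits + stats.misses;
    return total === 0 ? 0 : stats.hits / total;
  }

  return { get, hitRate, stats };
}
```

Under that assumption, a hit rate of 0.6 corresponds to the 60% reduction in DB load quoted above, so tracking hit rate is one way to check whether the expected impact is being reached.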
Level 2: Horizontal Scaling (Weeks)
2.1 Add Read Replicas
Problem: Read-heavy workload
Solution: Route reads to replicas
Write Load: Primary DB
Read Load: 3x Read Replicas
Expected Impact: 3x read capacity
Cost: $300/month
Effort: 1 week
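One way to route traffic at the application level is a small wrapper that sends writes to the primary and rotates reads across replicas. The sketch below is illustrative only: `createRouter` and the `{ query }` connection shape are hypothetical, not a specific driver's API.

```javascript
// Routes writes to the primary and spreads reads across replicas round-robin.
// createRouter and the { query } connection shape are hypothetical.
function createRouter(primary, replicas) {
  let next = 0;
  return {
    write(sql, params) {
      return primary.query(sql, params);
    },
    read(sql, params) {
      // Fall back to the primary if no replicas are configured.
      if (replicas.length === 0) return primary.query(sql, params);
      const replica = replicas[next];
      next = (next + 1) % replicas.length;
      return replica.query(sql, params);
    },
  };
}
```

Replicas that replicate asynchronously can lag behind the primary, so a read that must see a write made moments earlier is usually safer going through `write`'s connection (the primary) than through `read`.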
2.2 Load Balancer + Multiple Servers
Problem: Single point of failure
Solution:
ALB
├── Server 1
├── Server 2
└── Server 3
Expected Impact: 3x throughput
Cost: $400/month
Effort: 1 week
Level 3: Architecture Changes (Months)
3.1 CDN for Static Assets
Problem: Slow asset delivery
Solution: CloudFront CDN
Expected Impact: 90% faster asset loads
Cost: $100/month
Effort: 1 week
3.2 Async Processing
Problem: Slow sync operations
Solution: Background job queues
// Before: Sync
await sendEmail(user);
await processPayment(order);
await updateAnalytics(event);
return response; // Waits 5+ seconds
// After: Async
await queue.add("send-email", { userId });
await queue.add("process-payment", { orderId });
await queue.add("update-analytics", { event });
return response; // Returns immediately
Expected Impact: 80% faster responses
Cost: $50/month (SQS)
Effort: 2 weeks
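The snippet above shows only the producer side. A worker has to pick the jobs up and deal with failures, which the following in-memory sketch illustrates. It is not the SQS API: `createQueue`, `add`, `process`, and `drain` are hypothetical names, and the retry limit of 3 is an assumed default.

```javascript
// In-memory job queue sketch; not the SQS API. createQueue, add, process,
// and drain are hypothetical names. Failed jobs are retried up to maxAttempts.
function createQueue(maxAttempts = 3) {
  const jobs = [];
  const handlers = new Map();
  const failed = [];

  return {
    failed,
    // Producer side: enqueue and return immediately.
    add(name, data) {
      jobs.push({ name, data, attempts: 0 });
    },
    // Worker side: register one handler per job name.
    process(name, handler) {
      handlers.set(name, handler);
    },
    // Run queued jobs until the queue is empty.
    async drain() {
      while (jobs.length > 0) {
        const job = jobs.shift();
        job.attempts += 1;
        try {
          const handler = handlers.get(job.name);
          if (!handler) throw new Error(`no handler for ${job.name}`);
          await handler(job.data);
        } catch (err) {
          if (job.attempts < maxAttempts) jobs.push(job); // retry later
          else failed.push({ job, error: err.message }); // give up, keep for inspection
        }
      }
    },
  };
}
```

Because a job may run more than once, handlers such as the one for `"send-email"` are easier to operate when they are idempotent.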
Level 4: Data Layer Optimization (Months)
4.1 Database Sharding
Problem: Single DB too large
Solution: Shard by user_id
Shard 1: user_id 0-24999
Shard 2: user_id 25000-49999
Shard 3: user_id 50000-74999
Shard 4: user_id 75000-99999
Expected Impact: 4x capacity
Cost: $1,200/month
Effort: 2 months
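The range mapping above can be expressed as a lookup function. This is a sketch under the ranges listed; `SHARDS`, `shardFor`, and the shard names are hypothetical.

```javascript
// Range-based shard lookup for the four ranges listed above.
// SHARDS, shardFor, and the shard names are hypothetical.
const SHARDS = [
  { name: "shard-1", min: 0, max: 24999 },
  { name: "shard-2", min: 25000, max: 49999 },
  { name: "shard-3", min: 50000, max: 74999 },
  { name: "shard-4", min: 75000, max: 99999 },
];

function shardFor(userId) {
  if (!Number.isInteger(userId)) {
    throw new TypeError(`user_id must be an integer, got ${userId}`);
  }
  const shard = SHARDS.find((s) => userId >= s.min && userId <= s.max);
  if (!shard) {
    // IDs beyond the last range need a new shard or a different scheme.
    throw new RangeError(`user_id ${userId} is outside all shard ranges`);
  }
  return shard.name;
}
```

These four ranges cover 100,000 IDs in total, so a plan that grows past that many users would need additional ranges or a different partitioning scheme (hash-based, for example) before the new IDs can be placed.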
4.2 Event-Driven Architecture
Problem: Tight coupling, cascading failures
Solution: Message broker (Kafka)
Service A → Kafka → Service B
↘ ↗ Service C
Expected Impact: Better isolation, resilience
Cost: $500/month
Effort: 3 months
Scaling Triggers
| Metric | Current | Warning | Critical | Action |
| ---------------- | ------- | ------- | -------- | ----------------------- |
| CPU | 40% | 70% | 85% | Add servers |
| Memory | 50% | 75% | 90% | Upgrade instances |
| DB Connections | 20 | 40 | 50 | Add read replicas |
| Query Time (p95) | 200ms | 500ms | 1000ms | Add indexes |
| Queue Depth | 100 | 1000 | 5000 | Add workers |
| Error Rate | 0.1% | 1% | 5% | Investigate immediately |
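The table can be turned into an automated check. The sketch below copies the Warning and Critical thresholds from the table; `TRIGGERS`, `evaluateTriggers`, and the metric key names are hypothetical, and treating a value equal to the threshold as a breach is an assumption the table does not settle.

```javascript
// Thresholds copied from the Scaling Triggers table. TRIGGERS, evaluateTriggers,
// and the metric keys are hypothetical names for illustration.
const TRIGGERS = [
  { metric: "cpuPercent", warning: 70, critical: 85, action: "Add servers" },
  { metric: "memoryPercent", warning: 75, critical: 90, action: "Upgrade instances" },
  { metric: "dbConnections", warning: 40, critical: 50, action: "Add read replicas" },
  { metric: "queryTimeP95Ms", warning: 500, critical: 1000, action: "Add indexes" },
  { metric: "queueDepth", warning: 1000, critical: 5000, action: "Add workers" },
  { metric: "errorRatePercent", warning: 1, critical: 5, action: "Investigate immediately" },
];

function evaluateTriggers(metrics) {
  const alerts = [];
  for (const t of TRIGGERS) {
    const value = metrics[t.metric];
    if (value === undefined) continue; // metric not reported
    // Assumption: thresholds are inclusive (value >= threshold is a breach).
    if (value >= t.critical) {
      alerts.push({ metric: t.metric, level: "critical", action: t.action });
    } else if (value >= t.warning) {
      alerts.push({ metric: t.metric, level: "warning", action: t.action });
    }
  }
  return alerts;
}
```

Feeding in the table's Current column (for example `{ cpuPercent: 40, queryTimeP95Ms: 200 }`) yields no alerts, while a metric at or above its Warning value returns the matching action.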
Phased Scaling Plan
Phase 1: Current → 10x (0-3 months)
Target: 10,000 req/min, 100K users
Actions:
- Add database indexes (Week 1)
- Implement Redis caching (Week 2)
- Add 3x read replicas (Week 4)
- Horizontal scale app servers (Week 6)
- CDN for static assets (Week 8)
Cost: $500 → $1,000/month
Phase 2: 10x → 100x (3-12 months)
Target: 100,000 req/min, 1M users
Actions:
- Database sharding (Month 4-6)
- Multi-region deployment (Month 6-8)
- Microservices extraction (Month 8-12)
- Event-driven architecture (Month 10-12)
Cost: $1,000 → $10,000/month
Phase 3: 100x → 1000x (12-24 months)
Target: 1M req/min, 10M users
Actions:
- Global CDN (Month 13)
- Advanced caching (L1/L2) (Month 14-15)
- Custom DB solutions (Month 16-18)
- Edge computing (Month 18-20)
Cost: $10,000 → $100,000/month
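A quick way to sanity-check the phase budgets is to convert them into cost per million requests. The sketch below uses only the targets and costs stated in the plan; it assumes a 30-day month and traffic that holds at the target rate all month, both simplifications. `PHASES` and `costPerMillionRequests` are hypothetical names.

```javascript
// Cost per million requests at each phase, using the plan's own figures.
// Assumes a 30-day month and traffic sustained at the target rate.
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200

const PHASES = [
  { name: "Current", reqPerMin: 1000, costPerMonth: 500 },
  { name: "Phase 1", reqPerMin: 10000, costPerMonth: 1000 },
  { name: "Phase 2", reqPerMin: 100000, costPerMonth: 10000 },
  { name: "Phase 3", reqPerMin: 1000000, costPerMonth: 100000 },
];

function costPerMillionRequests({ reqPerMin, costPerMonth }) {
  const requestsPerMonth = reqPerMin * MINUTES_PER_MONTH;
  return costPerMonth / (requestsPerMonth / 1e6);
}

for (const phase of PHASES) {
  const cost = costPerMillionRequests(phase).toFixed(2);
  console.log(`${phase.name}: $${cost} per million requests`);
}
// Current: $11.57 per million requests
// Phase 1: $2.31 per million requests
// Phase 2: $2.31 per million requests
// Phase 3: $2.31 per million requests
```

Under these assumptions, unit cost falls about 5x in Phase 1 and then stays flat, which means Phases 2 and 3 budget for cost growing in direct proportion to traffic.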
Load Testing Plan
# Current baseline
hey -n 10000 -c 100 https://api.example.com/users
# Target 10x
hey -n 100000 -c 1000 https://api.example.com/users
# Measure:
# - Requests/sec
# - p50, p95, p99 latency
# - Error rate
# - Resource utilization
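For latency samples exported from a load test or from application logs, the p50, p95, and p99 figures listed above can be computed with a nearest-rank percentile. This is a minimal sketch; `percentile` and `summarize` are hypothetical helper names, and other percentile definitions (interpolated, for example) give slightly different numbers on small samples.

```javascript
// Nearest-rank percentile over latency samples in milliseconds.
// percentile and summarize are hypothetical helper names.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input untouched
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(latenciesMs) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}
```

Comparing `summarize(...)` output from the baseline run against the 10x run shows whether tail latency degrades faster than the median as concurrency rises.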
Cost-Benefit Analysis
| Strategy | Cost/Month | Expected Impact | ROI | Priority |
| ------------- | ---------- | ------------------ | --- | -------- |
| DB Indexes | $0 | 80% faster queries | ∞ | HIGH |
| Redis Cache | $50 | 60% less DB load | 12x | HIGH |
| Read Replicas | $300 | 3x capacity | 10x | MEDIUM |
| Load Balancer | $400 | 3x throughput | 7x | MEDIUM |
| DB Sharding | $1,200 | 4x capacity | 3x | LOW |
Best Practices
- Measure first: Don't optimize blindly
- Low-hanging fruit: Start with easy wins
- Load test: Validate before production
- Monitor continuously: Set up alerts
- Plan ahead: Scale before hitting limits
- Cost-conscious: ROI-driven decisions
- Incremental: Small, safe changes
Output Checklist
- Current system profile
- Bottlenecks identified and measured
- Scaling strategies ordered by effort
- Triggers defined for each action
- Phased plan (1x → 10x → 100x → 1000x)
- Cost estimates per phase
- Load testing plan
- Monitoring dashboard
- Rollback procedures