kleppmann-data-intensive
SKILL: Designing Data-Intensive Systems (Kleppmann)
Source: Designing Data-Intensive Applications by Martin Kleppmann
Domain: Distributed systems, data architecture, reliability engineering
Applies to: Building systems where data complexity (not computation) is the bottleneck
DECISION POINTS
1. Consistency Model Selection
IF: Strong consistency required (banking, inventory)
AND: Can tolerate higher latency + coordination overhead
THEN: Use linearizability with synchronous replication
IF: Operations need ordering but not global agreement
AND: User experience matters more than strict consistency
THEN: Use causal consistency (preserve cause-effect, allow concurrent ops)
IF: High availability required during network partitions
AND: Can resolve conflicts application-side
THEN: Accept eventual consistency with conflict resolution
IF: Read-your-writes is critical but global consistency isn't
AND: Users mostly operate on their own data
THEN: Route user reads to leader/use session consistency
2. Scaling Architecture Decisions
IF: tail latency > 500ms + replication lag > 1s
AND: Cache hit rate < 80%
THEN: Switch to local quorum reads, accept bounded staleness
IF: Write throughput bottleneck identified
AND: Operations can be partitioned by key
THEN: Implement horizontal partitioning with partition-local transactions
IF: Cross-partition queries frequent
AND: Eventual consistency acceptable for derived data
THEN: Use CQRS pattern (separate write/read paths)
IF: Coordination overhead dominates response time
AND: Operations can be made idempotent
THEN: Replace distributed locks with compare-and-set operations
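For the key-partitioning rule in this section, a minimal sketch of routing operations to a partition by key might look like the following. The names `partition_for` and `PartitionedStore` are hypothetical, the store is an in-memory stand-in for real partitions, and the number of partitions is assumed to be fixed up front.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Python's built-in hash() is salted per process, so it is not
    suitable for routing that must agree across machines.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedStore:
    """Hypothetical in-memory store: one dict per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key: str) -> dict:
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key: str, value) -> None:
        # Every operation on a key touches exactly one partition,
        # so a transaction on that key can stay partition-local.
        self._partition(key)[key] = value

    def get(self, key: str):
        return self._partition(key).get(key)
```

Hashing modulo the partition count moves most keys if the count changes, which is why this sketch assumes a fixed number of partitions that is larger than the number of nodes.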
3. Failure Handling Strategy
IF: Component failure detected (timeout/error)
AND: Operation might have succeeded
THEN: Make operation idempotent, use unique request IDs for retry
IF: Distributed resource coordination required
AND: Process pauses/network delays possible
THEN: Implement fencing tokens (resource rejects lower-numbered tokens)
IF: Multi-step workflow spans services
AND: Completed steps must be undone on failure (compensation, not atomic isolation)
THEN: Use saga pattern with compensating transactions, not 2PC
IF: Service dependency causing tail latency spikes
THEN: Implement circuit breaker + hedged requests after timeout threshold
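The first rule in this section (retry with unique request IDs when an operation might have succeeded) can be sketched as below. `FlakyLedger` and `debit_with_retry` are hypothetical names; the ledger simulates a service that applies the debit but loses the response, which is the case where a naive retry would double-charge.

```python
import uuid


class FlakyLedger:
    """Hypothetical service that deduplicates debits by request ID."""

    def __init__(self, balance: int, lost_responses: int = 0) -> None:
        self.balance = balance
        self._applied = {}  # request_id -> balance after the first execution
        self._lost_responses = lost_responses  # replies to drop (simulation)

    def debit(self, request_id: str, amount: int) -> int:
        # Apply the debit only the first time this request ID is seen.
        if request_id not in self._applied:
            self.balance -= amount
            self._applied[request_id] = self.balance
        # Simulate a reply lost after the work was already done.
        if self._lost_responses > 0:
            self._lost_responses -= 1
            raise TimeoutError("response lost after the debit was applied")
        return self._applied[request_id]


def debit_with_retry(ledger: FlakyLedger, amount: int, max_attempts: int = 3) -> int:
    # The ID is generated once and reused on every retry of this operation.
    request_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return ledger.debit(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError(f"gave up after {max_attempts} attempts")
```

A real service would persist the request-ID table in the same transaction as the debit and expire old entries; both are omitted here.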
FAILURE MODES
1. Split-Brain Coordination
Symptoms: Two nodes both believe they're the leader, conflicting writes accepted
Root Causes: Network partition + inadequate quorum checking + lease expiry race conditions
Detection Rule: If you see duplicate primary keys or "impossible" data states after network events
Fixes (ranked by speed/safety):
- Immediate: Enable fencing tokens, reject operations from stale leaders
- Short-term: Implement proper quorum (majority must agree on leader)
- Long-term: Use consensus algorithm (Raft/Paxos) for leader election
2. Cascade Failure Amplification
Symptoms: Single component failure causes system-wide outage, p99 latency spike across all services
Root Causes: Synchronous dependencies + no circuit breakers + retry storms + unbounded queues
Detection Rule: If failure rate increases exponentially rather than linearly with initial fault
Fixes (ranked by speed/safety):
- Immediate: Deploy circuit breakers, implement fail-fast with timeouts
- Short-term: Add backpressure mechanisms, bound all queues
- Long-term: Break synchronous dependencies, use asynchronous messaging
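As an illustration of the "immediate" fix above, here is a minimal circuit breaker sketch. The class name, thresholds, and injectable `clock` parameter are assumptions for the example; it is not thread-safe and is not modelled on any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("dependency presumed unhealthy")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and return a degraded response rather than retrying, since immediate retries are what feed a retry storm.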
3. Read-After-Write Disappearing Data
Symptoms: User writes data, immediately reads and sees old value, claims "data was lost"
Root Causes: Async replication lag + load balancer routes read to stale replica + no session affinity
Detection Rule: If user complaints about "lost data" correlate with write-then-read patterns
Fixes (ranked by speed/safety):
- Immediate: Route user reads to primary for N seconds after write
- Short-term: Implement read-your-writes consistency with logical timestamps
- Long-term: Use strongly consistent reads for user-critical paths
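The "route user reads to primary for N seconds after write" fix can be sketched as a small routing helper. `SessionRouter`, `pin_seconds`, and the `"primary"`/`"replica"` labels are hypothetical; the sketch assumes the pin window is chosen to exceed the observed replication lag.

```python
import time


class SessionRouter:
    """Pins a user's reads to the primary for a short window after a write."""

    def __init__(self, pin_seconds: float = 5.0, clock=time.monotonic) -> None:
        self.pin_seconds = pin_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_write = {}  # user_id -> time of that user's latest write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self._clock()

    def choose_target(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self.pin_seconds:
            # Replicas may not have this user's write yet.
            return "primary"
        return "replica"
```

Only the writing user is pinned, so other users' reads still go to replicas and the primary absorbs a small share of extra read load.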
4. Phantom Distributed Lock
Symptoms: Multiple processes acquire same lock simultaneously, resource corruption occurs
Root Causes: Lock service uses timeouts without fencing + GC pauses + network delays exceed lease time
Detection Rule: If you see lock violation errors or concurrent modification of "protected" resources
Fixes (ranked by speed/safety):
- Immediate: Implement fencing tokens, resource validates token before operation
- Short-term: Use compare-and-set instead of locks where possible
- Long-term: Redesign to avoid coordination, use immutable data structures
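A minimal sketch of the fencing-token fix: the lock service hands out a monotonically increasing token with each grant, and the protected resource rejects any write carrying a lower token than one it has already seen. `LockService` and `FencedStorage` are hypothetical in-memory stand-ins, and lease expiry is not modelled.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LockService:
    """Issues a strictly increasing fencing token with every lock grant."""

    def __init__(self) -> None:
        self._next_token = 0

    def acquire(self) -> int:
        self._next_token += 1
        return self._next_token


class FencedStorage:
    """Resource that validates the token before applying a write."""

    def __init__(self) -> None:
        self._highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> None:
        if token < self._highest_token:
            # The caller's lease was superseded while it was paused or delayed.
            raise StaleTokenError(f"token {token} < {self._highest_token}")
        self._highest_token = token
        self.data[key] = value
```

The check lives in the resource, not the client, because a paused client cannot know its lease has expired.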
5. Thundering Herd Cache Stampede
Symptoms: Cache expires, all requests hit database simultaneously, database overloads
Root Causes: Cache expiry + no request deduplication + synchronous cache population + high concurrency
Detection Rule: If database load spikes correlate with cache miss events
Fixes (ranked by speed/safety):
- Immediate: Implement cache request coalescing, only one thread populates
- Short-term: Use probabilistic cache refresh before expiry
- Long-term: Implement cache warming and gradual expiry spreading
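The request-coalescing fix can be sketched with a per-key in-flight marker, so that only one thread runs the expensive load while the others wait for its result. `CoalescingCache` and its `loader` callback are hypothetical, and expiry (TTL) is deliberately left out to keep the sketch focused on coalescing.

```python
import threading


class CoalescingCache:
    """Cache in which at most one thread populates a missing key at a time."""

    def __init__(self, loader) -> None:
        self._loader = loader  # hypothetical callable: key -> value
        self._values = {}
        self._inflight = {}  # key -> Event set when the current load ends
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                is_loader = event is None
                if is_loader:
                    event = threading.Event()
                    self._inflight[key] = event
            if is_loader:
                try:
                    value = self._loader(key)  # the only backend hit
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()  # wake waiters whether the load worked or not
            # Another thread is loading: wait, then re-check the cache.
            event.wait()
```

If the loading thread raises, the exception propagates to that caller only; one of the waiting threads then takes over the load on its next pass through the loop.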
WORKED EXAMPLES
Example: E-commerce Inventory Management
Scenario: User adds last item to cart, another user tries same item simultaneously
Decision Process:
- Identify coordination requirement: Need atomic decrement of inventory count
- Assess consistency needs: Overselling is unacceptable (lost revenue + customer complaints)
- Choose approach: Use linearizable reads/writes for inventory, eventual consistency for recommendations
Implementation:
-- Atomic inventory check with compare-and-set
UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = ? AND quantity >= 1 AND version = ?
Novice would miss: A separate SELECT followed by UPDATE leaves a race condition window
Expert catches: Version field prevents lost updates, quantity check prevents overselling
Fallback handling:
- IF update affects 0 rows: Re-read the row; return "out of stock" immediately if quantity is 0, otherwise retry with the current version
- IF operation times out: Retry with the same unique request ID so the decrement is applied at most once
- IF database unavailable: Fail fast rather than queue requests indefinitely
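To show how the conditional UPDATE above behaves from application code, here is a small runnable sketch using Python's built-in `sqlite3` module. The table layout, the `make_demo_db` and `reserve_one` helpers, and the sample row are assumptions for illustration; a production system would run this against its own database and driver.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory inventory table with one item left in stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "product_id TEXT PRIMARY KEY, "
        "quantity INTEGER NOT NULL, "
        "version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory VALUES ('sku-1', 1, 7)")
    conn.commit()
    return conn


def reserve_one(conn: sqlite3.Connection, product_id: str, expected_version: int) -> bool:
    """Decrement stock only if it is available and the version is unchanged."""
    cur = conn.execute(
        "UPDATE inventory "
        "SET quantity = quantity - 1, version = version + 1 "
        "WHERE product_id = ? AND quantity >= 1 AND version = ?",
        (product_id, expected_version),
    )
    conn.commit()
    # rowcount is 1 when the compare-and-set succeeded, 0 when it did not.
    return cur.rowcount == 1
```

Two buyers who both read version 7 and both call `reserve_one(conn, "sku-1", 7)` cannot both succeed: the first call bumps the version, so the second matches no rows. A zero row count can mean either a stale version or no stock, so the caller needs to re-read the row to tell the two apart.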
Example: Social Feed Consistency
Scenario: User posts update, immediately checks feed, doesn't see their post
Decision Process:
- Analyze user expectation: Seeing own posts matters, seeing others' posts can be delayed
- Identify load parameters: The limiting factor is write fan-out to followers, not raw write volume
- Choose consistency model: Causal consistency (preserve user's action order)
Implementation Strategy:
- Write to user's timeline synchronously (read-your-writes)
- Fan out to followers asynchronously (eventual consistency)
- Use logical timestamps to prevent time-travel anomalies
Quality validation:
- User always sees their own posts immediately
- Followers see posts within bounded time (monitor lag)
- No replies appear before the posts they respond to
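For the logical-timestamp step in the implementation strategy above, one common choice is a Lamport clock, sketched below. The source does not say which logical clock to use, so treating it as a Lamport clock is an assumption, as is the `LamportClock` class name.

```python
class LamportClock:
    """Logical clock: a counter that respects cause-before-effect ordering."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event (for example, creating a post)."""
        self.time += 1
        return self.time

    def observe(self, received: int) -> int:
        """Advance past a timestamp seen on an incoming event.

        Anything done after observing an event gets a strictly larger
        timestamp, so a reply always sorts after the post it answers.
        """
        self.time = max(self.time, received) + 1
        return self.time


# Usage: Alice posts, Bob sees the post and replies.
alice, bob = LamportClock(), LamportClock()
post_ts = alice.tick()
reply_ts = bob.observe(post_ts)

# Sorting by (timestamp, author) places the post before the reply,
# with the author name breaking ties between concurrent events.
feed = sorted([(reply_ts, "bob", "reply"), (post_ts, "alice", "post")])
```

Lamport timestamps give an order consistent with causality but cannot tell whether two events were concurrent; detecting concurrency needs something richer, such as version vectors.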
QUALITY GATES
- Consistency model explicitly chosen and anomalies catalogued
- Load parameters identified and instrumented (p99 latency, fan-out ratio, partition skew)
- Failure modes tested: component failures, network partitions, process pauses
- Coordination minimized: operations partitioned, idempotency implemented
- Tail latency measured: p99 and p999 tracked separately from averages
- Quorum mathematics verified: w + r > n for required consistency level
- Circuit breakers deployed: timeout and error rate thresholds configured
- Observability complete: can trace requests across service boundaries
- Chaos testing implemented: deliberate fault injection in non-production
- Rollback strategy proven: can revert changes without data loss
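The quorum gate above (w + r > n) can be checked mechanically. The sketch below pairs the arithmetic test with a brute-force enumeration confirming that the condition is exactly when every write set shares at least one node with every read set. Function names are hypothetical.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when any w-node write set must intersect any r-node read set."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Check the same property by trying every pair of node sets."""
    nodes = range(n)
    return all(
        bool(set(write_set) & set(read_set))
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For example, n = 3 with w = 2 and r = 2 overlaps, while w = 1 and r = 1 does not. Overlap means a read contacts at least one node holding the latest acknowledged write; it does not by itself make the system linearizable.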
NOT-FOR BOUNDARIES
This skill should NOT be used for:
- Pure computational problems (ML training, rendering) → Use parallel-processing patterns instead
- Single-machine applications with local data → Use database-patterns skill instead
- Systems where replicas may never converge and that is acceptable ("eventual consistency" in name only) → Use eventual-consistency-design skill instead
- Real-time systems with hard latency bounds → Use real-time-systems skill instead
Delegate to other skills:
- For message queue design → Use messaging-patterns skill
- For database schema design → Use data-modeling skill
- For microservices boundaries → Use service-decomposition skill
- For monitoring/alerting setup → Use observability skill