ddia-principles

SKILL.md

Designing Data-Intensive Applications — Distilled Guide

Source: Martin Kleppmann, Designing Data-Intensive Applications Central thesis: Data is the core challenge of modern applications — not compute.


Part I: Foundations of Data Systems

Chapter 1: Reliability, Scalability, Maintainability

Three Pillars

Pillar Definition Key Metric
Reliability System works correctly even when faults occur Fault ≠ Failure; tolerate faults, prevent failures
Scalability System handles load growth gracefully Measure with percentiles: p50, p95, p99, p999
Maintainability System is easy to operate, understand, evolve Operability + Simplicity + Evolvability

Fault Categories

  • Hardware: Random, independent (disk, RAM, power). Mitigate with redundancy (RAID, dual power).
  • Software: Systematic bugs affecting all nodes simultaneously (leap-second bug). Mitigate with process isolation, monitoring, chaos engineering.
  • Human: #1 cause of outages (config errors). Mitigate with good abstractions, sandboxes, canary deployments, fast rollback.

Scalability Patterns

  • Vertical (scale-up): Bigger machine. Simple but has ceiling.
  • Horizontal (scale-out): More machines (shared-nothing). Complex but unlimited.
  • Elastic: Auto-scale on load detection. Good for unpredictable workloads.

Twitter fan-out case study: 4.6k writes/s but 300k reads/s. Solution: pre-compute timelines (write fan-out) for most users; read-time merge for celebrities.

Performance: Use Percentiles, Not Averages

  • p50 = median. p99 = tail latency matters for user experience.
  • Amazon: 100ms delay = 1% revenue loss.
  • Tail latency amplification: One slow backend call slows entire parallel request.

Chapter 2: Data Models & Query Languages

Model Selection Guide

Model Best For Weakness
Relational Structured data, complex joins, ACID transactions Rigid schema, impedance mismatch with OOP
Document Hierarchical data, flexible schema, data locality Poor joins, many-to-many relationships
Graph Highly connected data, variable-depth traversals Less mature tooling, harder to partition

Schema Strategy

  • Schema-on-write (relational): Enforce structure at write time. Early error detection, migration cost.
  • Schema-on-read (document): Interpret structure at read time. Flexible but validation burden on app.

Normalization vs Denormalization

  • Normalize: Single source of truth, consistent updates, requires joins.
  • Denormalize: Faster reads, risks inconsistency, update anomalies.

Trend: Models converge — PostgreSQL supports JSON, MongoDB added joins. Choose based on access patterns, not ideology.


Chapter 3: Storage & Retrieval

Storage Engine Comparison

Feature B-Tree LSM-Tree
Write throughput Lower (in-place update + WAL) Higher (sequential append)
Read latency More predictable May check multiple SSTables
Write amplification Higher Lower
Space efficiency Fragmentation possible Better compression
Transaction support Simpler (lock on tree node) More complex
Used by PostgreSQL, MySQL, Oracle LevelDB, RocksDB, Cassandra

OLTP vs OLAP

Aspect OLTP OLAP
Access Random, few records Sequential scan, millions of rows
Users End users Analysts
Data Current state Historical events
Scale GB–TB TB–PB
Optimize for Low latency Throughput

Data Warehousing

  • ETL: Extract from OLTP → Transform → Load into warehouse.
  • Star schema: Central fact table (events) + dimension tables (attributes). Fact tables can have 100+ columns and petabyte scale.
  • Column-oriented storage: Store each column separately. Huge I/O savings when queries touch few columns. Enables bitmap encoding, run-length compression, vectorized processing.

Chapter 4: Encoding & Evolution

Format Comparison

Format Size (example) Schema Evolution Cross-language
JSON 81 bytes Implicit Manual Excellent
Thrift 59 bytes Required Field tags Good
Protobuf 33 bytes Required Field tags Excellent
Avro 32 bytes Required Name matching Good

Compatibility Rules

  • Backward compatible: New code reads old data. (Always required)
  • Forward compatible: Old code reads new data. (Required for rolling upgrades)
  • Rule: Only add/remove fields with default values. Never reuse deleted field tags.

Data Flow Patterns

  1. Via databases: Multiple code versions coexist during rolling deploys. Data outlives code.
  2. Via services (REST/RPC): Servers update before clients. Backward compat on requests, forward compat on responses.
  3. Via async messaging: Decouples producers/consumers. Supports independent version evolution.

Avoid: Language-specific serialization (Java Serializable, Python pickle) — vendor lock-in + security risk.


Part II: Distributed Data

Chapter 5: Replication

Replication Models

Model Writes Conflict Use Case
Single-leader One node None Most common (PostgreSQL, MySQL)
Multi-leader Multiple nodes Must resolve Multi-datacenter, offline clients
Leaderless Any node Must resolve Cassandra, Riak, Voldemort

Sync vs Async Replication

  • Sync: Durable, blocks on replica failure.
  • Async: Fast, risks data loss on leader failure.
  • Semi-sync: One replica sync, rest async. Practical compromise.

Replication Lag Problems & Solutions

Problem Symptom Solution
Read-after-write User doesn't see own write Read from leader for user's own data
Monotonic reads Data goes backward in time Stick user to one replica
Consistent prefix reads Causal order violated Write causally related data to same partition

Conflict Resolution

  • Last-Write-Wins (LWW): Simple but loses data. Only safe if keys are immutable.
  • Merge: Union values, concatenate, CRDT data structures.
  • Application-level: Return all versions ("siblings"), let app decide.
  • Version vectors: Track causal dependencies per replica.

Quorum: w + r > n

  • w = write acknowledgments, r = read queries, n = total replicas.
  • Sloppy quorum: Accept writes on non-home nodes during partitions (hinted handoff). Improves availability, weakens consistency.

Chapter 6: Partitioning (Sharding)

Partitioning Strategies

Strategy Pros Cons
Key-range Efficient range queries Hotspot risk on sequential keys
Hash Even distribution No range queries
Compound First part hashed, rest sorted More complex, best of both

Secondary Index Partitioning

  • Local (document-based): Each partition indexes its own data. Writes simple, reads scatter-gather.
  • Global (term-based): Index partitioned by term. Reads efficient, writes update multiple partitions.

Rebalancing

  • Fixed partition count: More partitions than nodes. Redistribute on node changes. (Riak, Elasticsearch)
  • Dynamic: Split/merge based on size. (HBase, RethinkDB)
  • Proportional to nodes: Fixed partitions per node. (Cassandra)

Request Routing

  • Round-robin to any node (node forwards if needed)
  • Routing layer (partition-aware proxy)
  • Client-aware (client knows partition map)
  • ZooKeeper: Authoritative partition → node mapping. Used by HBase, Kafka.

Chapter 7: Transactions

Isolation Levels (Weakest → Strongest)

Level Prevents Allows Implementation
Read Committed Dirty reads, dirty writes Non-repeatable reads, lost updates Row locks + old value copy
Snapshot Isolation + Non-repeatable reads Write skew, phantoms MVCC (multi-version)
Serializable Everything Nothing 2PL, serial execution, or SSI

Concurrency Anomalies

Anomaly Description Example
Dirty read See uncommitted data Reading half-written transfer
Dirty write Overwrite uncommitted data Two buyers "winning" same item
Lost update Read-modify-write race Two concurrent counter increments
Write skew Decision based on stale read Two doctors both going off-call
Phantom New rows change query result Meeting room double-booking

Serializable Implementations

  1. Serial execution: Single thread, in-memory. Fast but limited throughput. (VoltDB, Redis)
  2. Two-Phase Locking (2PL): Shared/exclusive locks held until commit. Strong but slow, deadlock-prone.
  3. SSI (Serializable Snapshot Isolation): Optimistic — execute freely, detect conflicts at commit. Best performance for read-heavy workloads. (PostgreSQL 9.1+)

Chapter 8: Troubles with Distributed Systems

The Three Unreliabilities

Networks: Async packet networks — no delivery guarantee, no timing guarantee. Cannot distinguish crash from network delay. Timeouts are the only failure detector, but no correct timeout value exists.

Clocks:

  • Wall clocks: Can jump backward (NTP correction). Never use for ordering events.
  • Monotonic clocks: Safe for elapsed time, not cross-node comparison.
  • Quartz drift: ~200ppm → 6ms error every 30 seconds.

Processes: GC pauses, VM suspension, page faults — threads stop without warning. A paused node doesn't know time passed.

Key Principles

  • Fault ≠ failure: Design for partial failures. Some nodes work while others don't.
  • Truth is defined by majority: Individual nodes cannot determine system state alone. Quorum votes decide.
  • Fencing tokens: Monotonically increasing tokens prevent zombie processes from corrupting state.
  • Safety vs liveness: Safety (bad things never happen) must hold always. Liveness (good things eventually happen) may have conditions.

Chapter 9: Consistency & Consensus

Consistency Models (Strongest → Weakest)

Model Guarantee Cost
Linearizability Behaves as if one copy, all ops atomic High latency, reduced availability during partition
Causal consistency Respects cause-effect ordering Better performance, partition-tolerant
Eventual consistency Replicas converge eventually Best performance, weakest guarantee

Linearizability Use Cases

  • Leader election / distributed locks
  • Uniqueness constraints (usernames, filenames)
  • Cross-channel coordination (message queue + storage)

Consensus Algorithms

  • 2PC: Coordinator-based, blocking on coordinator failure. Practical but fragile.
  • Paxos/Raft/Zab: Epoch-based leader election + quorum voting. Non-blocking. Used by etcd, ZooKeeper, Consul.
  • FLP impossibility: Consensus impossible in pure async systems with crashes. Practical algorithms use timeouts.

Total Order Broadcast ≡ Consensus ≡ Linearizable CAS

These three problems are mathematically equivalent. Solving one solves all.

ZooKeeper / etcd Pattern

  • Small consensus cluster (3–5 nodes) for coordination.
  • Linearizable atomic CAS operations.
  • Failure detection via session heartbeats.
  • Applications: leader election, partition assignment, distributed locks, service discovery.

Part III: Derived Data

Chapter 10: Batch Processing

Unix Philosophy → MapReduce

  • Each program does one thing well.
  • Output of one program = input of another.
  • Immutable inputs, deterministic processing.

MapReduce Pipeline

Input → Mapper (extract key-value) → Sort/Partition → Reducer (aggregate by key) → Output

Distributed Join Strategies

Join Type When How
Sort-merge Both inputs large Sort by join key, merge in reducer
Broadcast hash One input small (fits in RAM) Load small side as hash table
Partitioned hash Both inputs partitioned identically Per-partition hash join

Beyond MapReduce: Dataflow Engines (Spark, Flink, Tez)

  • Treat entire workflow as single job.
  • Pipeline intermediate results (avoid full materialization to HDFS).
  • Keep data in memory where possible.
  • Track computation lineage for fault recovery (RDDs).
  • Support iterative algorithms (graph processing via Pregel/BSP model).

Chapter 11: Stream Processing

Message Broker Models

Model Delivery Ordering Replay Use Case
AMQP/JMS Per-message ack, delete after No ordering guarantee No Task queues, async RPC
Log-based (Kafka) Offset-based, retained Per-partition ordering Yes Event streaming, CDC

Change Data Capture (CDC)

Extract database changes as event stream → keep derived systems (search indexes, caches, warehouses) in sync. Source of truth stays in database; derived views are consumers.

Event Sourcing

Model state as append-only sequence of business events (not DB operations). Events are immutable facts. Current state = fold over event history.

Stream Joins

Join Input A Input B State
Stream-Stream Events Events Time-windowed buffer
Stream-Table Events DB snapshot (via CDC) Local materialized table
Table-Table CDC stream CDC stream Derived materialized view

Time & Windowing

Window Description
Tumbling Fixed-size, non-overlapping (e.g., every 1 min)
Hopping Fixed-size, overlapping (e.g., 1 min window every 30s)
Sliding All events within time threshold of each other
Session Grouped by activity gap (e.g., 30 min inactivity)

Event time ≠ processing time. Always use event time for correctness. Handle late events with watermarks or correction publishes.

Processing Guarantees

  • Microbatching (Spark Streaming): ~1s latency, atomic small batches.
  • Checkpointing (Flink): Periodic snapshots with message barriers.
  • Idempotency: Deduplicate using message offsets or unique IDs.
  • End-to-end exactly-once requires idempotent output + deduplication.

Chapter 12: The Future of Data Systems

Data Integration Pattern

No single database does everything. Use event log as integration backbone:

  1. All writes go through authoritative event log.
  2. Derived systems (indexes, caches, ML models) consume the log.
  3. Deterministic, idempotent functions transform between layers.

Unbundling Databases

Separate concerns:

  • Record system: Captures authoritative writes.
  • Derived systems: Indexes, caches, materialized views consume change streams.
  • Enables gradual migration — run old and new systems in parallel.

End-to-End Exactly-Once

Low-level guarantees (TCP, DB transactions) don't ensure application correctness. Require:

  • Operation identifiers: UUID-based deduplication at application level.
  • Idempotent operations: Same effect whether executed once or many times.
  • Unique constraint enforcement: Via partitioned stream processing.

Async Constraint Enforcement

Instead of distributed transactions:

  1. Route requests by constraint field to partitioned log.
  2. Stream processor sequences competing requests.
  3. Reject violations, notify clients via output stream.
  4. Some applications tolerate temporary violations with compensating transactions.

Auditability

  • Treat data like immutable event log — enables reconstruction and verification.
  • Implement cryptographic audit trails (Merkle trees).
  • Periodically test backup restoration and data reconstruction.

Decision Framework: Quick Reference

Choosing a Data Model

Many-to-many relationships?     → Relational or Graph
Hierarchical / nested data?     → Document
Highly connected data?          → Graph
Flexible / evolving schema?     → Document (schema-on-read)
Strong consistency required?    → Relational (ACID)

Choosing a Storage Engine

Write-heavy workload?           → LSM-tree (RocksDB, Cassandra)
Read-heavy, predictable?        → B-tree (PostgreSQL, MySQL)
Analytical queries?             → Column store (ClickHouse, Redshift)
Full-text search?               → Inverted index (Elasticsearch)

Choosing a Replication Strategy

Single datacenter?              → Single-leader
Multi-datacenter?               → Multi-leader
Offline-first clients?          → Multi-leader or Leaderless
Maximum availability?           → Leaderless with sloppy quorum
Strong consistency?             → Single-leader with sync replication

Choosing an Isolation Level

Read-only analytics?            → Snapshot isolation
General OLTP?                   → Read committed (default in most DBs)
Financial / critical?           → Serializable (prefer SSI over 2PL)
High write contention?          → Serial execution (if data fits in RAM)

Choosing Batch vs Stream

Historical data reprocessing?   → Batch (Spark, Flink batch mode)
Real-time derived views?        → Stream (Kafka + Flink/Spark Streaming)
Both needed?                    → Unified engine (Flink) over Lambda architecture
Weekly Installs
2
GitHub Stars
126
First Seen
13 days ago
Installed on
mcpjam2
iflow-cli2
kilo2
replit2
junie2
windsurf2