latency-principles
Latency Principles
Based on "Latency: Reduce delay in software systems" by Pekka Enberg (Manning, 2026).
This skill provides a systematic framework for minimizing delay in software systems, covering the entire stack from Physics and Hardware to Application Architecture and User Experience.
Foundational Concepts
- What is Latency?: The time delay between a cause and its observed effect.
- Latency vs. Bandwidth: You can always add more bandwidth (more links), but you are stuck with bad latency unless you optimize the path.
- Impact on User Experience (UX): < 100ms (immediate), < 1s (instant), > 10s (slow).
- Measuring Correctly: Use percentiles (p95, p99) to capture tail latency. Avoid averages. Avoid Coordinated Omission by using fixed-interval benchmarking.
Modeling Performance
Use these laws to establish theoretical bounds and size systems:
- Little's Law:
Concurrency (C) = Throughput (T) * Latency (L)- Usage: Calculate required concurrency to support a target throughput at a given latency.
- Example: If p99 latency is 50ms and you need 1000 RPS, you need
1000 * 0.05 = 50concurrent execution units.
- Amdahl's Law:
Speedup = 1 / ((1 - P) + (P / N))P: Portion of program that is parallelizable.N: Number of processors.- Usage: Understand the limits of parallelization. If 50% of your code is serial, the max speedup is 2x, regardless of how many cores you add.
Measurement & Visualization
Visualizing latency distributions is critical for identifying tail behavior:
- Histograms: Show frequency of samples. Good for seeing the "mode" (most common latency) and the spread.
- HDR Histograms: Plot latency against percentiles (p50, p99, etc.) on a log scale. Essential for p99.9+ analysis.
- eCDF (Empirical Cumulative Distribution Function): A smooth curve showing the probability that a request completes within a given time. Directly answers SLA compliance questions.
Tooling:
- Use
scripts/ping_collector.pyto gather data without Coordinated Omission. - Use
scripts/visualize_latency.pyto generate Histogram, HDR, and eCDF plots.
Quick Decision Guide
| Symptom | Probable Cause | Recommended Strategy |
|---|---|---|
| High Avg Latency | Sequential processing / Slow I/O | Concurrency (Async I/O) or Partitioning |
| High Tail Latency (p99) | Lock contention / GC / Neighbor noise | Wait-free Sync (Atomics) or Request Hedging |
| Network Slowness | Distance / Protocol overhead | Colocation (Edge) or Binary Serialization (Protobuf) |
| Database Load | Hot keys / Complex queries | Caching (Read-through) or Materialized Views |
| Slow Writes | ACID guarantees / Indexing | Write-Behind Caching or Sharding |
| High CPU Usage | O(n^2) logic / JSON parsing | Algorithmic Fixes or Protobuf/FlatBuffers |
| Micro-stutters | GC pauses / OS interrupts | Object Pooling or Interrupt Affinity |
| Lock Contention | Mutex bottleneck | Wait-free Sync (Atomics) |
| "It feels slow" | UI blocking on network | Optimistic Updates or Prefetching |
| Measurement Looks Too Good | Coordinated Omission | Fixed-Interval Benchmarking |
Part 1: Fundamentals (Start Here)
Core theory and diagnostic approaches.
- references/principles.md: Definitions of Little's Law, Amdahl's Law, Tail Latency, Compounding Latency, and Latency Distribution.
- references/diagnostic_checklist.md: Step-by-step checklist for identifying bottlenecks across Hardware, OS, and Runtime.
Part 2: Data Layer (Access Optimization)
Optimizing data access is often the highest-leverage activity for reducing latency.
- Colocation (Pattern: Move Compute to Data):
- Edge Computing: Use Near Edge (points of presence) or Far Edge (on-device/IoT) to eliminate geographical distance.
- Intranode: Colocate protocol handlers with application threads. Turn off Nagle's Algorithm (
TCP_NODELAY) to prevent packet batching delays. - Kernel-bypass: Use techniques like DPDK to eliminate OS stack overhead.
- Replication (Pattern: Trade Consistency for Latency):
- Leaderless/Multi-Leader: Allows local writes to reduce write latency, at the cost of complex conflict resolution.
- Consistency Models: Choose Eventual Consistency or Read-your-writes to avoid synchronous coordination (Strong Consistency) overhead.
- Partitioning (Pattern: Divide and Conquer):
- Horizontal Sharding: Splitting data to increase parallel throughput.
- Request Routing: Use Direct Routing (client-side) to avoid the extra hop of a proxy.
- Mitigate Hot Partitions: Use over-partitioning to balance skewed workloads.
- Caching (Pattern: Memory is Faster than Disk):
- Write-Behind Caching: Asynchronous writes to the data store to hide write latency.
- Policy Selection: Use SIEVE or LRU for eviction.
- Materialized Views: Precompute complex queries to eliminate runtime processing work.
See references/data_patterns.md for detailed implementation strategies.
Part 3: Compute Layer (Logic Acceleration)
Optimizing processing logic and synchronization to eliminate overhead.
- Eliminating Work (Pattern: The Fastest Code is Code that Doesn't Run):
- Algorithmic Complexity: Replace O(n) scans with O(log n) trees or O(1) hash maps.
- Zero-Copy Serialization: Use FlatBuffers instead of JSON to eliminate the parsing/unpacking step.
- Memory Tuning: Avoid dynamic allocation (
malloc/new) in hot paths. Use Object Pooling or Stack Allocation to prevent GC pauses or allocator lock contention. - Precomputation: Move work from runtime to build-time or startup.
- Wait-Free Sync (Pattern: Avoid Context Switches):
- Mutual Exclusion Problems: Locks (Mutexes) cause expensive OS context switches (~µs).
- Atomics: Use hardware primitives (
CAS,fetch_add) for lock-free state updates. - Wait-Free Structures: Implement Ring Buffers (SPSC) using memory barriers to allow threads to communicate without ever blocking.
- Exploiting Concurrency (Pattern: Use Every Core):
- Thread-per-core: Pin threads to physical cores to maximize cache locality and eliminate scheduler overhead.
- Concurrency Models: Use Coroutines/Fibers for lightweight userspace multitasking or Actor Model for shared-nothing message passing.
- SIMD: Use "Single Instruction, Multiple Data" to parallelize arithmetic at the hardware level.
See references/compute_optimization.md for detailed implementation patterns.
Part 4: Hiding Latency (Perceived Speed)
When you can't make it faster, make it feel faster by masking delays.
- Asynchronous Processing (Pattern: Don't Block the Main Thread):
- Request Hedging: Send the same request to multiple replicas and use the first response to cut tail latency.
- Request Batching: Group small requests to amortize round-trip and header overhead.
- Backpressure: Prevents queuing latency by signaling producers to slow down when the system is saturated.
- Predictive Techniques (Pattern: Guess the Future):
- Prefetching: Load data (Sequential, Spatial, or Semantic based on user intent) before it's explicitly requested.
- Optimistic Updates: Update the UI immediately assuming success; reconcile or rollback if the server fails.
- Speculative Execution (Pattern: Execute Before Needed):
- Parallel Speculation: Execute multiple possible outcomes in parallel and keep the correct one (e.g., in a search engine).
- Prewarming: Spin up resources (Lambdas, VM instances) based on historical traffic patterns before they are needed.
See references/hiding_latency.md for detailed masking strategies.
Bundled Resources
Scripts
Use these for diagnostics and quick calculations:
- scripts/diagnose_latency.py: Interactive diagnostic checklist runner
- scripts/latency_constants.py: Latency constants for ballpark calculations
Code Examples
See code-examples/ for implementations of key techniques from the book.
Latency Constants (Quick Reference)
| Operation | Time | Order |
|---|---|---|
| CPU cycle (3 GHz) | 0.3 ns | 10⁻¹ |
| L1 cache access | 1 ns | 10⁰ |
| DRAM access | 100 ns | 10² |
| NVMe disk access | 10 μs | 10⁴ |
| SSD disk access | 100 μs | 10⁵ |
| Network NYC → London | 60 ms | 10⁷ |
Human Perception
| Perception | Time |
|---|---|
| Immediate (no delay perceived) | < 100 ms |
| Instant (feels fast) | < 1 s |
| Slow | > 10 s |