Netflix Chaos Engineering
Overview
Netflix pioneered chaos engineering in response to its migration to AWS, which began in 2008. Facing the reality that cloud infrastructure fails unpredictably, the company created Chaos Monkey, and eventually the entire Simian Army, to proactively inject failures and build confidence in system resilience.
The Pioneers
Casey Rosenthal (Father of Chaos Engineering)
Led Netflix's Chaos Engineering team from 2015, formalizing the discipline and co-authoring the definitive O'Reilly book. Now CEO of Verica. His key insight: chaos engineering is about building confidence, not breaking things.
Nora Jones
Co-pioneered chaos engineering at Netflix, co-authored the book, and later founded Jeli to apply these principles to incident analysis. Emphasized the human factors in resilience.
References
- Book: "Chaos Engineering: System Resiliency in Practice" (O'Reilly, 2020)
- Principles: https://principlesofchaos.org/
- Netflix Tech Blog: https://netflixtechblog.com/
Core Philosophy
"The best way to avoid failure is to fail constantly."
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
"We're not trying to break things. We're trying to build confidence."
Chaos engineering is NOT about breaking things randomly. It's a disciplined approach to discovering systemic weaknesses before they cause outages.
The Principles of Chaos Engineering
1. Build a Hypothesis around Steady State Behavior
Define what "normal" looks like in measurable terms
2. Vary Real-World Events
Inject failures that actually happen: server crashes, network issues, etc.
3. Run Experiments in Production
Staging environments hide real-world complexity
4. Automate Experiments to Run Continuously
One-time tests give false confidence
5. Minimize Blast Radius
Start small, expand as confidence grows
The Simian Army
Netflix's suite of chaos tools:
| Tool | Purpose |
|---|---|
| Chaos Monkey | Randomly terminates instances |
| Chaos Kong | Simulates entire region failure |
| Latency Monkey | Injects artificial delays |
| Conformity Monkey | Finds instances not following best practices |
| Janitor Monkey | Cleans up unused resources |
| Security Monkey | Finds security vulnerabilities |
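As an illustration of the kind of fault Latency Monkey injects, the sketch below wraps a function in a decorator that delays a fraction of calls. It is not Netflix's implementation; the names `inject_latency` and `fetch_profile` and the injectable `sleep` parameter are assumptions made for this example.

```python
import functools
import random
import time
from typing import Callable, Optional


def inject_latency(delay_seconds: float,
                   probability: float = 1.0,
                   rng: Optional[random.Random] = None,
                   sleep: Callable[[float], None] = time.sleep):
    """Decorator that delays a call with the given probability (hypothetical helper)."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Delay only a fraction of calls so the blast radius stays small.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Usage: record the delays instead of sleeping so the example runs instantly.
delays = []


@inject_latency(0.25, probability=1.0, sleep=delays.append)
def fetch_profile(user_id: int) -> dict:
    return {"user_id": user_id}


profile = fetch_profile(42)
```

Passing `sleep` as a parameter keeps the decorator testable: in real use the default `time.sleep` applies the delay, while a test can substitute a recorder.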
When Implementing
Always
- Define steady-state hypothesis before experimenting
- Start with smallest blast radius possible
- Have a "stop button" to halt experiments
- Run experiments in production (with safeguards)
- Automate experiments to run continuously
- Involve the whole team, not just SRE
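One way to provide the "stop button" from the list above is a shared `threading.Event` that the experiment loop polls. This is a minimal sketch under that assumption; `run_with_stop_button` and its parameters are hypothetical names, not part of any Netflix tool.

```python
import threading
import time
from typing import Callable


def run_with_stop_button(inject: Callable[[], None],
                         rollback: Callable[[], None],
                         stop: threading.Event,
                         duration_seconds: float,
                         poll_seconds: float = 0.01) -> str:
    """Run an experiment window that anyone holding `stop` can halt early."""
    inject()
    try:
        deadline = time.monotonic() + duration_seconds
        while time.monotonic() < deadline:
            # Event.wait returns True as soon as the stop button is pressed.
            if stop.wait(timeout=poll_seconds):
                return "halted"
        return "completed"
    finally:
        # Rollback runs whether the window completed, was halted, or raised.
        rollback()


# Usage: press the stop button from another thread shortly after starting.
events = []
stop = threading.Event()
threading.Timer(0.05, stop.set).start()
outcome = run_with_stop_button(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    stop=stop,
    duration_seconds=5.0,
)
```

Because rollback sits in a `finally` block, pressing the button always undoes the injected failure before control returns to the caller.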
Never
- Inject chaos without a hypothesis
- Start with catastrophic failures
- Run experiments without monitoring
- Run chaos experiments without stakeholder buy-in
- Treat chaos as a one-time activity
- Forget to document learnings
Prefer
- Gradual expansion of blast radius
- Automated experiments over manual
- Production over staging (with safeguards)
- Hypothesis-driven experiments
- Business metrics over technical metrics
Implementation Patterns
Chaos Experiment Structure
# chaos_experiment.py
# The anatomy of a chaos experiment
import time
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Dict


@dataclass
class SteadyStateHypothesis:
    """Define what 'normal' looks like"""
    name: str
    description: str
    probe: Callable[[], float]  # Returns a metric value
    tolerance_min: float
    tolerance_max: float

    def is_satisfied(self) -> bool:
        value = self.probe()
        return self.tolerance_min <= value <= self.tolerance_max


@dataclass
class ChaosAction:
    """The failure to inject"""
    name: str
    description: str
    execute: Callable[[], None]   # Inject the failure
    rollback: Callable[[], None]  # Undo the failure


@dataclass
class ChaosExperiment:
    """A complete chaos experiment"""
    name: str
    description: str
    hypothesis: SteadyStateHypothesis
    action: ChaosAction
    duration_seconds: int

    def run(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {
            'experiment': self.name,
            'started_at': datetime.now().isoformat(),
            'hypothesis_before': None,
            'hypothesis_during': None,
            'hypothesis_after': None,
            'success': False
        }

        # 1. Verify steady state BEFORE
        print("Checking steady state before experiment...")
        results['hypothesis_before'] = self.hypothesis.is_satisfied()
        if not results['hypothesis_before']:
            print("Steady state not satisfied before experiment. Aborting.")
            results['completed_at'] = datetime.now().isoformat()
            return results

        try:
            # 2. Inject the failure
            print(f"Injecting chaos: {self.action.name}")
            self.action.execute()

            # 3. Monitor during experiment (one check at the end of the window)
            print(f"Monitoring for {self.duration_seconds} seconds...")
            time.sleep(self.duration_seconds)
            results['hypothesis_during'] = self.hypothesis.is_satisfied()
        finally:
            # 4. Always rollback, even if injection or monitoring raised
            print(f"Rolling back: {self.action.name}")
            self.action.rollback()

        # 5. Verify steady state AFTER
        print("Checking steady state after rollback...")
        time.sleep(5)  # Allow recovery
        results['hypothesis_after'] = self.hypothesis.is_satisfied()

        # Success = hypothesis held during and after
        results['success'] = bool(
            results['hypothesis_during'] and
            results['hypothesis_after']
        )
        results['completed_at'] = datetime.now().isoformat()
        return results


# Example: Test resilience to instance termination
def create_instance_termination_experiment(
    instance_id: str,
    ec2,  # Assumed: a boto3 EC2 client, e.g. boto3.client('ec2')
    get_error_rate_percentage: Callable[[], float],  # Assumed: queries your monitoring system
) -> ChaosExperiment:
    def check_error_rate() -> float:
        # Query your monitoring system
        return get_error_rate_percentage()

    def terminate_instance() -> None:
        # Actually terminate the instance
        ec2.terminate_instances(InstanceIds=[instance_id])

    def noop_rollback() -> None:
        # Auto-scaling should replace the instance
        pass

    hypothesis = SteadyStateHypothesis(
        name="Error rate within tolerance",
        description="Error rate should remain below 1%",
        probe=check_error_rate,
        tolerance_min=0,
        tolerance_max=1.0
    )
    action = ChaosAction(
        name=f"Terminate instance {instance_id}",
        description="Simulate instance failure",
        execute=terminate_instance,
        rollback=noop_rollback
    )
    return ChaosExperiment(
        name="Instance Termination Resilience",
        description="Verify system handles instance loss gracefully",
        hypothesis=hypothesis,
        action=action,
        duration_seconds=300
    )
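The `run` method above samples the hypothesis once, at the end of the monitoring window. A variation worth considering is to poll the probe throughout the window and abort at the first violation, which limits how long users are exposed to a failing experiment. The sketch below assumes a hypothetical helper named `monitor_hypothesis` with an injectable `sleep`; it is one possible shape, not the original design.

```python
import time
from typing import Callable, List


def monitor_hypothesis(probe: Callable[[], float],
                       tolerance_min: float,
                       tolerance_max: float,
                       duration_seconds: float,
                       interval_seconds: float,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll the probe at a fixed interval and stop at the first violation."""
    samples: List[float] = []
    checks = max(1, int(duration_seconds // interval_seconds))
    for _ in range(checks):
        value = probe()
        samples.append(value)
        if not (tolerance_min <= value <= tolerance_max):
            # Abort early so the caller can roll back right away.
            return {"held": False, "samples": samples}
        sleep(interval_seconds)
    return {"held": True, "samples": samples}


# Usage: a scripted error-rate probe that spikes on the third reading.
readings = iter([0.2, 0.4, 2.5, 0.3])
result = monitor_hypothesis(
    probe=lambda: next(readings),
    tolerance_min=0.0,
    tolerance_max=1.0,
    duration_seconds=40,
    interval_seconds=10,
    sleep=lambda seconds: None,  # Skip real waiting in this example
)
```

In this run the third reading falls outside the tolerance, so monitoring stops before the fourth reading is ever taken.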
Chaos Monkey Implementation
# chaos_monkey.py
# Simplified Chaos Monkey - random instance termination
import random
import time
from collections import Counter
from datetime import datetime
from typing import List, Optional


class ChaosMonkey:
    """
    Netflix's Chaos Monkey: randomly terminates instances
    to ensure services can handle instance failures.
    """

    def __init__(self,
                 cloud_client,
                 excluded_services: Optional[List[str]] = None,
                 probability: float = 0.1,
                 schedule_start_hour: int = 9,
                 schedule_end_hour: int = 15):
        """
        Args:
            cloud_client: AWS/GCP/Azure client wrapper; assumed to expose
                list_instances() and terminate_instance(instance_id)
            excluded_services: Services to never touch
            probability: Chance of termination per run (0-1)
            schedule_start_hour: Only run after this hour
            schedule_end_hour: Stop running after this hour
        """
        self.client = cloud_client
        self.excluded = set(excluded_services or [])
        self.probability = probability
        self.start_hour = schedule_start_hour
        self.end_hour = schedule_end_hour
        self.termination_log: List[dict] = []

    def is_within_schedule(self) -> bool:
        """Only cause chaos during business hours (when humans can respond)"""
        now = datetime.now()  # Read the clock once so hour and weekday agree
        # Monday-Friday, 9am-3pm by default
        return now.weekday() < 5 and self.start_hour <= now.hour < self.end_hour

    def get_eligible_instances(self) -> List[dict]:
        """Get instances that can be terminated"""
        all_instances = self.client.list_instances()
        # Count instances per service once instead of rescanning per instance
        service_counts = Counter(i.get('service_name') for i in all_instances)
        eligible = []
        for instance in all_instances:
            service = instance.get('service_name', '')
            # Skip excluded services
            if service in self.excluded:
                continue
            # Skip if service has < 2 instances (no redundancy)
            if service_counts[service] < 2:
                continue
            # Skip if instance is too new (let it warm up)
            age_minutes = instance.get('age_minutes', 0)
            if age_minutes < 30:
                continue
            eligible.append(instance)
        return eligible

    def run(self) -> dict:
        """Execute one round of chaos"""
        # Check schedule
        if not self.is_within_schedule():
            return {'action': 'skipped', 'reason': 'outside schedule'}
        # Check probability (random() is in [0, 1), so 0 never fires and 1 always does)
        if random.random() >= self.probability:
            return {'action': 'skipped', 'reason': 'probability check'}
        # Get eligible instances
        eligible = self.get_eligible_instances()
        if not eligible:
            return {'action': 'skipped', 'reason': 'no eligible instances'}
        # Select random victim
        victim = random.choice(eligible)
        # Terminate!
        result = {
            'action': 'terminated',
            'instance_id': victim['id'],
            'service': victim.get('service_name'),
            'timestamp': datetime.now().isoformat()
        }
        self.client.terminate_instance(victim['id'])
        self.termination_log.append(result)
        return result

    def run_continuously(self, interval_seconds: int = 300):
        """Run chaos monkey on a schedule"""
        print("Chaos Monkey starting... 🐵")
        while True:
            result = self.run()
            if result['action'] == 'terminated':
                print(f"🔥 Terminated {result['instance_id']} "
                      f"({result['service']})")
            else:
                print(f"😴 Skipped: {result['reason']}")
            time.sleep(interval_seconds)
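The eligibility rules used by `get_eligible_instances` (excluded services, no redundancy, too new) can be checked in isolation before pointing anything at real infrastructure. The following sketch restates them as a pure function over plain dictionaries; the function name `eligible_instances`, its keyword parameters, and the sample fleet are assumptions for illustration.

```python
from collections import Counter
from typing import List, Set


def eligible_instances(instances: List[dict],
                       excluded: Set[str],
                       min_service_size: int = 2,
                       min_age_minutes: int = 30) -> List[dict]:
    """Return instances that are safe candidates for termination."""
    # Count instances per service once, up front.
    counts = Counter(i.get("service_name", "") for i in instances)
    return [
        i for i in instances
        if i.get("service_name", "") not in excluded
        and counts[i.get("service_name", "")] >= min_service_size
        and i.get("age_minutes", 0) >= min_age_minutes
    ]


# Usage: a made-up fleet that exercises each rule.
fleet = [
    {"id": "i-1", "service_name": "api", "age_minutes": 120},
    {"id": "i-2", "service_name": "api", "age_minutes": 5},        # too new
    {"id": "i-3", "service_name": "billing", "age_minutes": 300},  # excluded
    {"id": "i-4", "service_name": "billing", "age_minutes": 300},  # excluded
    {"id": "i-5", "service_name": "search", "age_minutes": 90},    # no redundancy
]
candidates = eligible_instances(fleet, excluded={"billing"})
candidate_ids = [i["id"] for i in candidates]
```

Running the filter as a dry run like this makes it easy to review which instances would be at risk before enabling real terminations.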
Steady State Metrics
# steady_state.py
# Define and monitor steady state
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BusinessMetric:
    """
    Netflix insight: measure BUSINESS metrics, not just technical ones.
    Users don't care about CPU; they care about streams starting.
    """
    name: str
    description: str
    query: Callable[[], float]
    unit: str
    # Steady state bounds
    min_healthy: float
    max_healthy: float


# Netflix's key business metric
# NOTE: `prometheus` is an assumed, user-supplied query client whose
# query(promql) method returns a float; the bounds are illustrative.
streams_per_second = BusinessMetric(
    name="streams_starting_per_second",
    description="Rate of successful stream starts",
    query=lambda: prometheus.query("rate(streams_started_total[1m])"),
    unit="streams/sec",
    min_healthy=50000,
    max_healthy=200000
)


class SteadyStateMonitor:
    """Monitor steady state during chaos experiments"""

    def __init__(self, metrics: List[BusinessMetric]):
        self.metrics = metrics
        self.baseline: Dict[str, dict] = {}

    def capture_baseline(self, duration_seconds: int = 300):
        """Capture baseline metrics before experiment"""
        samples = {m.name: [] for m in self.metrics}
        # Sample every 10 seconds; take at least one sample to avoid
        # dividing by zero when duration_seconds < 10
        for _ in range(max(1, duration_seconds // 10)):
            for metric in self.metrics:
                samples[metric.name].append(metric.query())
            time.sleep(10)
        # Calculate baseline statistics
        for metric in self.metrics:
            values = samples[metric.name]
            self.baseline[metric.name] = {
                'mean': sum(values) / len(values),
                'min': min(values),
                'max': max(values)
            }

    def check_steady_state(self) -> dict:
        """Check if all metrics are within healthy bounds"""
        results = {}
        all_healthy = True
        for metric in self.metrics:
            current = metric.query()
            healthy = metric.min_healthy <= current <= metric.max_healthy
            results[metric.name] = {
                'current': current,
                'healthy_range': (metric.min_healthy, metric.max_healthy),
                'is_healthy': healthy
            }
            if not healthy:
                all_healthy = False
        results['all_healthy'] = all_healthy
        return results

    def deviation_from_baseline(self) -> dict:
        """How far are we from baseline?"""
        deviations = {}
        for metric in self.metrics:
            current = metric.query()
            # Without a captured baseline, fall back to the current value (0% deviation)
            baseline = self.baseline.get(metric.name, {}).get('mean', current)
            if baseline != 0:
                deviation_pct = ((current - baseline) / baseline) * 100
            else:
                deviation_pct = 0
            deviations[metric.name] = {
                'current': current,
                'baseline': baseline,
                'deviation_percent': deviation_pct
            }
        return deviations
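Fixed bounds such as `min_healthy` and `max_healthy` have to be chosen by hand. One alternative, sketched here as an assumption rather than as Netflix's method, is to derive a tolerance band from the captured baseline samples as the mean plus or minus `k` sample standard deviations. The helper names `tolerance_band` and `within_band` are hypothetical.

```python
import statistics
from typing import List, Tuple


def tolerance_band(samples: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) as mean +/- k sample standard deviations."""
    mean = statistics.fmean(samples)
    # stdev needs at least two samples; treat a single sample as zero spread.
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean - k * spread, mean + k * spread


def within_band(value: float, band: Tuple[float, float]) -> bool:
    low, high = band
    return low <= value <= high


# Usage: baseline samples with mean 100 and sample stdev of about 1.58.
baseline_samples = [100.0, 102.0, 98.0, 101.0, 99.0]
band = tolerance_band(baseline_samples, k=3.0)
normal_reading_ok = within_band(97.0, band)
degraded_reading_ok = within_band(90.0, band)
```

With these samples the band is roughly 95.26 to 104.74, so a reading of 97 counts as steady state while a reading of 90 does not. The choice of `k` trades false alarms against missed degradations and should be tuned per metric.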
Blast Radius Control
# blast_radius.py
# Control the scope of chaos experiments
import random
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class BlastRadius(Enum):
    """Start small, expand as confidence grows"""
    SINGLE_INSTANCE = 1     # One instance
    SERVICE_PERCENTAGE = 2  # X% of a service
    ENTIRE_SERVICE = 3      # All instances of a service
    AVAILABILITY_ZONE = 4   # Entire AZ
    REGION = 5              # Entire region (Chaos Kong)


@dataclass
class ExperimentScope:
    """Define the scope of an experiment"""
    blast_radius: BlastRadius
    target_service: Optional[str] = None
    target_percentage: float = 0.1  # Fraction of the service (0.1 = 10%)
    target_az: Optional[str] = None
    target_region: Optional[str] = None

    def get_targets(self, all_instances: List[dict]) -> List[dict]:
        """Get instances within the blast radius"""
        # Instances of the target service, shared by the service-level scopes
        in_service = [i for i in all_instances
                      if i.get('service') == self.target_service]

        if self.blast_radius == BlastRadius.SINGLE_INSTANCE:
            # Just one random instance
            return [random.choice(in_service)] if in_service else []
        elif self.blast_radius == BlastRadius.SERVICE_PERCENTAGE:
            # X% of a service, but at least one instance
            count = max(1, int(len(in_service) * self.target_percentage))
            return random.sample(in_service, min(count, len(in_service)))
        elif self.blast_radius == BlastRadius.ENTIRE_SERVICE:
            return in_service
        elif self.blast_radius == BlastRadius.AVAILABILITY_ZONE:
            return [i for i in all_instances
                    if i.get('az') == self.target_az]
        elif self.blast_radius == BlastRadius.REGION:
            # Chaos Kong - nuclear option
            return [i for i in all_instances
                    if i.get('region') == self.target_region]
        return []


class GraduatedChaos:
    """Gradually increase blast radius as confidence grows"""

    def __init__(self, service: str):
        self.service = service
        self.current_level = BlastRadius.SINGLE_INSTANCE
        self.success_streak = 0
        self.required_successes = 5  # Before escalating

    def record_result(self, success: bool):
        if success:
            self.success_streak += 1
            if self.success_streak >= self.required_successes:
                self.escalate()
        else:
            self.success_streak = 0
            self.de_escalate()

    def escalate(self):
        """Increase blast radius"""
        levels = list(BlastRadius)
        current_idx = levels.index(self.current_level)
        if current_idx < len(levels) - 1:
            self.current_level = levels[current_idx + 1]
            self.success_streak = 0
            print(f"Escalating to {self.current_level.name}")

    def de_escalate(self):
        """Decrease blast radius after failure"""
        levels = list(BlastRadius)
        current_idx = levels.index(self.current_level)
        if current_idx > 0:
            self.current_level = levels[current_idx - 1]
            print(f"De-escalating to {self.current_level.name}")
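To see how the escalation rule behaves over a series of experiments, the sketch below replays a list of outcomes and records the blast radius level after each one. It is a simplified, self-contained restatement of the `GraduatedChaos` logic (it resets the streak on every escalation attempt, including at the top level); `LEVELS` and `trace_levels` are hypothetical names.

```python
from typing import Iterable, List

# Same ordering as the BlastRadius enum, smallest scope first.
LEVELS = [
    "SINGLE_INSTANCE",
    "SERVICE_PERCENTAGE",
    "ENTIRE_SERVICE",
    "AVAILABILITY_ZONE",
    "REGION",
]


def trace_levels(outcomes: Iterable[bool], required_successes: int = 5) -> List[str]:
    """Return the blast radius level in effect after each experiment outcome."""
    idx, streak, trace = 0, 0, []
    for succeeded in outcomes:
        if succeeded:
            streak += 1
            if streak >= required_successes:
                # Enough consecutive successes: widen the scope by one level.
                idx = min(idx + 1, len(LEVELS) - 1)
                streak = 0
        else:
            # Any failure resets the streak and narrows the scope by one level.
            streak = 0
            idx = max(idx - 1, 0)
        trace.append(LEVELS[idx])
    return trace


# Usage: two successes escalate, one failure drops back, two more re-escalate.
history = trace_levels(
    [True, True, True, False, True, True],
    required_successes=2,
)
```

The asymmetry is deliberate: escalation requires a streak of successes, while a single failure is enough to step back down.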
Mental Model
Netflix chaos engineering asks:
- What is steady state? Define normal in measurable terms
- What could go wrong? Real-world failures to simulate
- What's our hypothesis? System should maintain steady state
- How small can we start? Minimize blast radius
- Did we learn something? Every experiment should teach us
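The questions above map naturally onto a written experiment record, which also addresses the earlier advice to document learnings. The sketch below is one possible shape for such a record, serialized as JSON; the `ExperimentRecord` class, its field names, and the sample values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ExperimentRecord:
    """One entry in a hypothetical experiment log, one field per question."""
    steady_state: str        # What is steady state?
    failure_injected: str    # What could go wrong?
    hypothesis: str          # What's our hypothesis?
    blast_radius: str        # How small can we start?
    hypothesis_held: bool
    learnings: List[str] = field(default_factory=list)  # Did we learn something?


# Usage: made-up values showing a failed hypothesis and what it taught.
record = ExperimentRecord(
    steady_state="Error rate below 1%",
    failure_injected="Terminate one instance",
    hypothesis="Error rate stays below 1% during and after termination",
    blast_radius="SINGLE_INSTANCE",
    hypothesis_held=False,
    learnings=["Health check interval too long; replacement took 4 minutes"],
)
line = json.dumps(asdict(record), sort_keys=True)
```

A failed hypothesis is still a useful result, so the record keeps `learnings` separate from `hypothesis_held`.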
Signature Netflix Moves
- Chaos Monkey for random instance termination
- Steady state hypothesis before every experiment
- Business metrics over technical metrics
- Production experiments (with safeguards)
- Graduated blast radius expansion
- Automated, continuous chaos