chaos-engineering
Discovery Questions
Check .agents/qa-project-context.md first. If it exists, use it as context and skip questions already answered there.
Environment and readiness:
- Where will chaos experiments run? (Pre-production only, production with approval, never production)
- What is the team's monitoring maturity? Can you detect problems in real time?
- Has the team practiced incident response? Is there a runbook?
- Is there executive buy-in for chaos engineering? (Important for production experiments)
Architecture:
- What is the architecture? (Monolith, microservices, serverless, hybrid)
- What are the critical dependencies? (Database, cache, message queue, third-party APIs)
- Are there single points of failure? (Single database, single region, no redundancy)
- What redundancy and failover mechanisms exist?
Current resilience practices:
- Do services have health checks? What do they check?
- Are there circuit breakers, retry logic, or timeout configurations?
- What happens when a dependency is unavailable? (Graceful degradation, hard failure, unknown)
- Have you experienced unexpected outages? What failed?
Team and culture:
- Is the team comfortable with controlled failure? (Anxiety is normal and should be addressed)
- Who would be the chaos engineering champion? (Needs someone to own the practice)
- What is the appetite for starting? (Start small or dive in)
Core Principles
1. Hypothesis-driven: define expected behavior before injecting
Every chaos experiment starts with a hypothesis: "We believe that if [failure X occurs], the system will [expected behavior Y]." Without a hypothesis, you are just breaking things.
Example hypothesis: "We believe that if the primary database becomes unavailable, the application will serve cached data for read requests and queue write requests for up to 5 minutes without user-visible errors."
2. Start small: one service, controlled blast radius
The first chaos experiment should not be "shut down production." It should be "add 200ms latency to one non-critical service in staging." Increase scope gradually as confidence and tooling mature.
3. Monitoring is a prerequisite
If you cannot detect problems in real time, you cannot safely inject failures. Chaos experiments without monitoring are just outages with extra steps. Verify dashboards, alerts, and on-call processes before running any experiment.
4. Game days build muscle memory
Running chaos experiments in automated pipelines is valuable, but game days -- scheduled sessions where the team runs experiments together and practices response -- build the human skills that matter during real incidents.
Chaos Experiment Workflow
Every chaos experiment follows this five-step process.
Step 1: Define steady state hypothesis
Identify the metrics that define "normal" and predict what should happen during the experiment.
Experiment: Database failover
Steady state:
- Error rate: < 0.1%
- P95 latency: < 300ms
- Successful orders per minute: > 50
Hypothesis: When the primary database fails over to the replica,
- Error rate will stay below 2% and return to baseline within 30 seconds
- P95 latency will stay below 1s and return to baseline within 60 seconds
- No orders will be permanently lost
- The application will recover without manual intervention
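A steady-state definition like the one above is easiest to trust when it is checked mechanically right before injection. The following is a minimal sketch, not a definitive implementation: the `MetricSnapshot` shape and the `steadyStateViolations` helper are hypothetical names, and only the thresholds come from the example above.

```typescript
// Hypothetical snapshot of the metrics named in the steady-state definition.
interface MetricSnapshot {
  errorRate: number;       // fraction of failed requests (0.001 = 0.1%)
  p95LatencyMs: number;    // 95th percentile latency in milliseconds
  ordersPerMinute: number; // successful orders per minute
}

// Thresholds copied from the steady-state example above.
const STEADY_STATE = {
  maxErrorRate: 0.001,
  maxP95LatencyMs: 300,
  minOrdersPerMinute: 50,
};

// Returns the names of violated conditions; an empty list means steady state.
function steadyStateViolations(m: MetricSnapshot): string[] {
  const violations: string[] = [];
  if (m.errorRate >= STEADY_STATE.maxErrorRate) violations.push("error rate");
  if (m.p95LatencyMs >= STEADY_STATE.maxP95LatencyMs) violations.push("p95 latency");
  if (m.ordersPerMinute <= STEADY_STATE.minOrdersPerMinute) violations.push("orders per minute");
  return violations;
}
```

If the returned list is non-empty before injection, the system is not in steady state and the experiment should be postponed.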
Step 2: Introduce the variable
Inject the failure in a controlled way with a clear scope and duration.
Injection:
Target: primary database (PostgreSQL)
Method: block TCP port 5432 on the primary instance
Scope: single database instance
Duration: 60 seconds
Blast radius: staging environment only (first run)
Abort conditions:
- Error rate > 10% for > 2 minutes
- Any data corruption detected
- Manual abort by experiment owner
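The abort conditions above can be evaluated automatically against the observations collected during the run. This is a sketch under stated assumptions: `AbortSample` and `abortReason` are hypothetical names, samples are assumed to arrive in time order, and how the corruption and manual-abort flags get set is left to the surrounding tooling.

```typescript
// One observation taken during the experiment (hypothetical shape).
interface AbortSample {
  atSeconds: number;       // seconds since injection started
  errorRate: number;       // fraction of failed requests (0.10 = 10%)
  dataCorruption: boolean; // set by an integrity check
  manualAbort: boolean;    // set by the experiment owner
}

const ERROR_RATE_LIMIT = 0.1;  // "Error rate > 10%"
const SUSTAINED_SECONDS = 120; // "for > 2 minutes"

// Returns the reason to abort, or null to continue. Samples must be time-ordered.
function abortReason(samples: AbortSample[]): string | null {
  let breachStart: number | null = null;
  for (const s of samples) {
    if (s.manualAbort) return "manual abort";
    if (s.dataCorruption) return "data corruption detected";
    if (s.errorRate > ERROR_RATE_LIMIT) {
      if (breachStart === null) breachStart = s.atSeconds;
      if (s.atSeconds - breachStart > SUSTAINED_SECONDS) {
        return "error rate above 10% for more than 2 minutes";
      }
    } else {
      breachStart = null; // the breach must be continuous to count
    }
  }
  return null;
}
```

A short spike that recovers resets the timer, so only a sustained breach triggers the abort.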
Step 3: Observe
During the experiment, monitor all relevant metrics in real time. Assign observers to specific dashboards.
Observation assignments:
- Engineer A: application error rate and latency dashboard
- Engineer B: database metrics (connections, replication lag, failover status)
- Engineer C: application logs (search for database connection errors)
- Engineer D: business metrics (order count, payment processing)
Step 4: Analyze recovery and data integrity
After the experiment, analyze what happened versus what was expected.
Analysis checklist:
- Did the system behave as hypothesized? (Y/N, with details)
- How long was the impact? (Expected vs. actual duration)
- Were any errors visible to users?
- Was any data lost or corrupted?
- Did monitoring and alerting detect the problem correctly?
- How long before alerts fired?
- What was the recovery time?
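Several checklist items (impact duration, time to alert, recovery time) reduce to arithmetic over a recorded timeline. The sketch below assumes a time-ordered list of error-rate samples; `TimelinePoint`, `ImpactSummary`, and `summarizeImpact` are hypothetical names, not part of any tool mentioned here.

```typescript
interface TimelinePoint {
  atSeconds: number; // seconds since injection started
  errorRate: number; // fraction of failed requests
}

interface ImpactSummary {
  impactStartSeconds: number;
  recoveredAtSeconds: number | null;    // null if no recovery was observed
  impactDurationSeconds: number | null;
  timeToAlertSeconds: number | null;    // null if no alert fired
}

// Impact starts at the first sample above the steady-state error rate and ends
// at the first healthy sample after the last breach. Returns null if no impact.
function summarizeImpact(
  points: TimelinePoint[],
  steadyStateErrorRate: number,
  alertFiredAtSeconds: number | null,
): ImpactSummary | null {
  let start: number | null = null;
  let recoveredAt: number | null = null;
  for (const p of points) {
    if (p.errorRate > steadyStateErrorRate) {
      if (start === null) start = p.atSeconds;
      recoveredAt = null; // still (or again) impacted
    } else if (start !== null && recoveredAt === null) {
      recoveredAt = p.atSeconds;
    }
  }
  if (start === null) return null;
  return {
    impactStartSeconds: start,
    recoveredAtSeconds: recoveredAt,
    impactDurationSeconds: recoveredAt === null ? null : recoveredAt - start,
    timeToAlertSeconds: alertFiredAtSeconds === null ? null : alertFiredAtSeconds - start,
  };
}
```

Comparing `impactDurationSeconds` with the durations stated in the hypothesis gives the "expected vs. actual" answer directly. The resolution is limited by the sampling interval.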
Step 5: Fix and iterate
Document findings, fix resilience gaps, and schedule a re-run to verify the fix.
Findings document:
Experiment: Database failover (2026-03-20)
Hypothesis: Confirmed / Partially confirmed / Disproved
Findings:
- Connection pool did not detect stale connections for 45 seconds (expected: <10s)
- Retry logic worked correctly for read operations
- Write operations returned 500 errors for 38 seconds (expected: queued)
Action items:
- [ ] Configure connection pool health checks (idle connection validation)
- [ ] Implement write queue with 5-minute buffer for database unavailability
- [ ] Re-run experiment after fixes deployed (target: 2026-04-03)
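The three-way outcome in the findings document can be derived from the individual expectations instead of being chosen by feel. This is one possible convention, sketched with hypothetical names (`ExpectationCheck`, `classifyOutcome`): all expectations met is "Confirmed", none met is "Disproved", anything else is "Partially confirmed".

```typescript
type Outcome = "Confirmed" | "Partially confirmed" | "Disproved";

// One expectation from the hypothesis and whether the observed behavior met it.
interface ExpectationCheck {
  description: string;
  met: boolean;
}

function classifyOutcome(checks: ExpectationCheck[]): Outcome {
  if (checks.length === 0) throw new Error("at least one expectation is required");
  const metCount = checks.filter((c) => c.met).length;
  if (metCount === checks.length) return "Confirmed";
  if (metCount === 0) return "Disproved";
  return "Partially confirmed";
}

// The example findings above: one expectation met, two missed.
const databaseFailoverOutcome = classifyOutcome([
  { description: "stale connections detected in under 10s", met: false }, // took 45s
  { description: "read retries succeed", met: true },
  { description: "writes are queued, not failed", met: false }, // 500s for 38s
]);
```

With these inputs `databaseFailoverOutcome` is "Partially confirmed".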
Failure Injection Types
Network failures
| Failure | Tool | Use Case |
|---|---|---|
| Latency injection | tc, toxiproxy, Gremlin | Simulate slow network, distant regions |
| Packet loss | tc netem, Chaos Mesh | Simulate unreliable network |
| DNS failure | iptables, CoreDNS manipulation | Simulate DNS outage |
| Network partition | iptables, Chaos Mesh | Simulate split-brain scenarios |
| Bandwidth restriction | tc, toxiproxy | Simulate congested network |
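# Add 200ms (±50ms, normally distributed) latency to all egress traffic on eth0.
# netem on the root qdisc is not port-specific; limiting the fault to port 5432
# (PostgreSQL) requires a classful qdisc with a filter.
tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal
# Switch the fault to 5% packet loss (replace, because a root qdisc already exists)
tc qdisc replace dev eth0 root netem loss 5%
# Remove the injected fault
tc qdisc del dev eth0 root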
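# Illustrative toxiproxy setup for database latency.
# Note: toxiproxy's own config file is JSON and defines proxies only;
# toxics like the one below are added at runtime through the HTTP API or toxiproxy-cli.
- name: postgres-latency
  listen: 0.0.0.0:15432
  upstream: postgres:5432
  toxics:
    - name: latency
      type: latency
      attributes:
        latency: 200   # milliseconds
        jitter: 50     # milliseconds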
Service failures
| Failure | Method | Use Case |
|---|---|---|
| Service crash | Kill process, pod delete | Simulate unexpected crash |
| Service slowdown | CPU stress, thread pool exhaustion | Simulate overloaded service |
| Error injection | Return 500/503, throw exceptions | Simulate application errors |
| Memory pressure | stress-ng, Chaos Mesh | Simulate memory leaks |
# Kill a Kubernetes pod immediately (--grace-period=0 is only honored together with --force)
kubectl delete pod order-service-abc123 --grace-period=0 --force
# Stress CPU on a specific container (via Chaos Mesh)
# chaos-mesh-cpu-stress.yaml
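# LitmusChaos: pod-delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: staging                 # start in staging; target production only with approval
    applabel: app=order-service
    appkind: deployment            # assumes the target is a Deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total experiment time, in seconds
              value: '60'
            - name: CHAOS_INTERVAL         # seconds between successive pod deletions
              value: '30'
            - name: FORCE                  # 'false' = graceful pod deletion
              value: 'false'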
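The "Error injection" row in the table above can also be exercised in-process, without any infrastructure tooling. The wrapper below is a minimal sketch: `InjectedResponse` and `withErrorInjection` are hypothetical names, and the random source is injectable so that tests stay deterministic.

```typescript
interface InjectedResponse {
  status: number;
  body: string;
}

// Wraps a handler so that roughly `failureRate` of calls return 503 instead of
// reaching the real handler. `rand` defaults to Math.random.
function withErrorInjection(
  handler: () => InjectedResponse,
  failureRate: number,
  rand: () => number = Math.random,
): () => InjectedResponse {
  return () =>
    rand() < failureRate
      ? { status: 503, body: "injected fault" }
      : handler();
}
```

Keeping the failure rate low (for example 0.1) and scoping the wrapper to one endpoint keeps the blast radius small.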
Infrastructure failures
| Failure | Method | Use Case |
|---|---|---|
| Disk full | fallocate, dd | Simulate disk exhaustion |
| CPU exhaustion | stress-ng | Simulate CPU saturation |
| Memory exhaustion | stress-ng | Simulate OOM conditions |
| Clock skew | chrony manipulation, timedatectl | Simulate time drift |
# Allocate a 10GB file to consume disk space (size it to the volume's free space to actually fill the disk)
fallocate -l 10G /tmp/fill-disk.dat
# Stress 4 CPU cores for 60 seconds
stress-ng --cpu 4 --timeout 60s
# Consume 2GB of memory
stress-ng --vm 1 --vm-bytes 2G --timeout 60s
# Cleanup
rm /tmp/fill-disk.dat
Dependency failures
| Failure | Method | Use Case |
|---|---|---|
| API down | toxiproxy, mock server | Simulate third-party outage |
| Database unavailable | block port, kill process | Simulate database outage |
| Cache unavailable | block Redis port | Simulate cache miss storm |
| Message queue full | fill queue, block consumers | Simulate backpressure |
// Toxiproxy programmatic control for integration tests
import Toxiproxy from 'toxiproxy-node-client';

const toxiproxy = new Toxiproxy('http://localhost:8474');

test('application handles Redis unavailability gracefully', async () => {
  const proxy = await toxiproxy.get('redis');
  // Disable Redis
  await proxy.disable();
  try {
    const response = await fetch('http://localhost:3000/api/products');
    // Should still work, but slower (cache miss, hits database)
    expect(response.ok).toBe(true);
    const data = await response.json();
    expect(data.products.length).toBeGreaterThan(0);
    // Verify degraded performance is within acceptable range
    const latency = parseInt(response.headers.get('x-response-time') ?? '0', 10);
    expect(latency).toBeLessThan(5000); // 5 seconds max without cache
  } finally {
    // Always re-enable
    await proxy.enable();
  }
});
Tools
| Tool | Type | Best For |
|---|---|---|
| LitmusChaos | Kubernetes-native, open source | K8s environments, CI/CD integration |
| Chaos Mesh | Kubernetes-native, CNCF | K8s with fine-grained control |
| Gremlin | Managed platform | Teams wanting guided experiments |
| Chaos Monkey (Netflix) | Service-level, open source | Random instance termination |
| toxiproxy | Network proxy, open source | Network fault injection in tests |
| tc (traffic control) | Linux kernel | Network latency and packet loss |
| stress-ng | Linux utility | CPU, memory, disk stress testing |
| k6 | Load testing tool | Combined load + chaos scenarios |
Choosing a tool
Decision tree:
Running on Kubernetes?
→ Yes: LitmusChaos or Chaos Mesh (native integration)
→ No: Gremlin (managed) or tc + stress-ng (manual)
Need network fault injection in integration tests?
→ toxiproxy (lightweight, programmatic API)
Need to combine load testing with chaos?
→ k6 with xk6-disruptor extension
Need managed platform with UI and compliance?
→ Gremlin (commercial)
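The decision tree above is small enough to encode directly, which can be handy in onboarding scripts or checklists. This sketch simply mirrors the branches as written; `ToolNeeds` and `recommendTools` are hypothetical names.

```typescript
interface ToolNeeds {
  kubernetes: boolean;
  networkFaultsInIntegrationTests: boolean;
  loadPlusChaos: boolean;
  managedPlatform: boolean;
}

// Returns tool suggestions in the order the decision tree asks its questions.
function recommendTools(needs: ToolNeeds): string[] {
  const tools: string[] = [];
  if (needs.kubernetes) {
    tools.push("LitmusChaos", "Chaos Mesh");
  } else {
    tools.push("Gremlin", "tc + stress-ng");
  }
  if (needs.networkFaultsInIntegrationTests) tools.push("toxiproxy");
  if (needs.loadPlusChaos) tools.push("k6 + xk6-disruptor");
  // Gremlin may already be listed from the non-Kubernetes branch.
  if (needs.managedPlatform && tools.indexOf("Gremlin") === -1) tools.push("Gremlin");
  return tools;
}
```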
Game Day Planning
A game day is a scheduled session where the team runs chaos experiments together, practices incident response, and builds confidence in the system's resilience.
Preparation checklist
2 weeks before:
- [ ] Define 2-3 experiments to run (don't overload the schedule)
- [ ] Write hypotheses for each experiment
- [ ] Get approval from engineering leadership and affected teams
- [ ] Notify support team and stakeholders
- [ ] Verify monitoring and alerting are working
- [ ] Identify rollback procedures for each experiment
- [ ] Schedule 3-4 hour block (experiments + analysis + retro)
1 day before:
- [ ] Confirm all participants and their roles
- [ ] Test that fault injection tools work in the target environment
- [ ] Verify rollback procedures work (dry run)
- [ ] Prepare dashboards and observation assignments
- [ ] Brief the on-call team
- [ ] Confirm abort criteria for each experiment
Communication and roles
Communicate before (schedule, scope, abort authority), during (live updates every 15 minutes in a dedicated channel), and after (summary within 24 hours with findings and action items).
Assign roles per experiment: experiment owner (runs it, makes abort decisions), observers (application metrics, infrastructure metrics, logs, user experience), and a scribe (records timeline and decisions).
Post-game retrospective
For each experiment: was the hypothesis confirmed? What surprised us? What action items do we have? For the process: did monitoring detect problems? Did alerts fire? Were we comfortable with the blast radius? Close with action items (with owners and due dates) and schedule the next game day.
Starting Small: First Three Experiments
For teams new to chaos engineering, start with these three experiments in a pre-production environment.
Experiment 1: Slow database
Why first: Database latency is a common cause of user-facing slowness, and the experiment is easy to set up and reverse.
Hypothesis: When database latency increases by 500ms, the application
will remain functional with response times under 3 seconds.
Injection: Add 500ms latency to the database connection using toxiproxy.
Duration: 5 minutes.
Environment: staging.
What to observe:
- Application response times (should increase by ~500ms, not 10x)
- Connection pool behavior (should not exhaust connections)
- Timeout handling (requests should not hang indefinitely)
- Circuit breaker activation (if implemented)
- Cache effectiveness (cached reads should be unaffected)
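For teams that have no circuit breaker yet, the behavior to look for during this experiment can be illustrated with a small state machine. This is a simplified sketch and not a production implementation: `CircuitBreaker` and its methods are hypothetical names, the clock is injectable for testing, and the half-open state admits every call until a result is recorded.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // Whether a call to the dependency may be attempted right now.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAtMs >= this.resetAfterMs) {
      this.state = "half-open"; // let a trial call through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.now();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

During the slow-database experiment, timeouts would be reported through `recordFailure()`. Once the breaker opens, callers fail fast instead of piling up on the connection pool.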
Experiment 2: Third-party API returns 500s
Why second: Third-party dependencies fail regularly, and the application's handling of those failures is often untested.
Hypothesis: When the payment provider returns 500 errors, the
application will show a user-friendly error message and allow
retry without duplicate charges.
Injection: Configure mock/proxy to return 500 for payment API calls.
Duration: 10 minutes.
Environment: staging.
What to observe:
- Error message quality (user-friendly, not stack traces)
- Retry behavior (does the application retry? How many times?)
- Idempotency (retries don't create duplicate transactions)
- Fallback (is there an alternative payment path?)
- Monitoring (does the payment failure show up in alerts?)
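The idempotency check is the one most likely to hide a costly bug, so it helps to be explicit about what "no duplicate charges" means. The sketch below shows the idea with an in-memory store. `IdempotentPayments` is a hypothetical class, not a real payment provider API, and a real system would persist the keys.

```typescript
interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

class IdempotentPayments {
  private readonly results = new Map<string, ChargeResult>();
  private chargesCreated = 0;

  // Charges at most once per idempotency key; a retry with the same key
  // returns the stored result instead of creating a second charge.
  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    const existing = this.results.get(idempotencyKey);
    if (existing !== undefined) return existing;
    this.chargesCreated += 1;
    const created: ChargeResult = { chargeId: "ch_" + this.chargesCreated, amountCents };
    this.results.set(idempotencyKey, created);
    return created;
  }

  chargeCount(): number {
    return this.chargesCreated;
  }
}
```

In the experiment, the client should reuse the same key for every retry of one checkout attempt, and the observed charge count should stay at one per order.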
Experiment 3: Cache unavailable
Why third: Cache failures cause "thundering herd" problems where all traffic suddenly hits the database, often causing cascading failures.
Hypothesis: When Redis becomes unavailable, the application will fall
back to direct database queries with degraded but functional performance.
Injection: Disable the Redis proxy in toxiproxy, or block the Redis port with iptables.
Duration: 5 minutes.
Environment: staging.
What to observe:
- Database query volume (should increase but not overwhelm)
- Response times (should increase but remain under 5 seconds)
- Error rate (cache miss should not cause errors)
- Connection pool (database connections should not exhaust)
- Recovery (when cache returns, does the application resume normal behavior?)
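If database query volume does overwhelm the database in this experiment, one common mitigation is request coalescing: concurrent requests for the same key share a single database query. The sketch below illustrates the idea; `SingleFlight` is a hypothetical helper, not an API of any tool named here.

```typescript
class SingleFlight<T> {
  private readonly inFlight = new Map<string, Promise<T>>();

  // Concurrent calls for the same key share one loader invocation.
  load(key: string, loader: () => Promise<T>): Promise<T> {
    const pending = this.inFlight.get(key);
    if (pending !== undefined) return pending;

    const started = loader();
    this.inFlight.set(key, started);
    // Forget the entry once the load settles, whether it succeeded or failed,
    // so that later calls trigger a fresh load.
    const forget = (): void => {
      this.inFlight.delete(key);
    };
    started.then(forget, forget);
    return started;
  }
}
```

With this in place, a burst of cache misses for one hot key results in one database query instead of one per request. It does not reduce load from many distinct keys.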
Anti-Patterns
Chaos without monitoring
Injecting failures without the ability to observe their impact is not chaos engineering -- it is sabotage. You will not know if the experiment revealed a problem until a user complains.
Fix: Before any chaos experiment, verify that you can see error rates, latency, throughput, and dependency health in real time. If you cannot, invest in monitoring first.
Starting too big
The first chaos experiment should not be "kill the production database." Starting with high-impact experiments before the team has practiced with low-impact ones creates anxiety and potential real outages.
Fix: Start with staging. Start with non-critical services. Start with reversible injections (latency, not data corruption). Build confidence gradually. Graduate to production only after multiple successful staging experiments.
No rollback plan
"The experiment is only 60 seconds, we don't need a rollback plan." Then the fault injection tool crashes and the failure persists indefinitely.
Fix: Every experiment must have a documented rollback procedure that can be executed in under 30 seconds. Test the rollback before running the experiment. Have a second person ready to abort if the experiment owner is unable to.
Chaos in production without approval
Running chaos experiments in production without explicit approval from engineering leadership and affected teams destroys trust and careers.
Fix: Production chaos requires explicit, documented approval. Start with staging. Communicate broadly before production experiments. Get buy-in from all affected teams. Make the scope, duration, and abort criteria clear to everyone.
Running chaos experiments during incidents
Adding controlled failures to a system that is already experiencing problems makes diagnosis harder and extends the outage.
Fix: Cancel or postpone chaos experiments if the system is not in steady state. Check for active incidents before starting. If an unrelated incident starts during an experiment, abort the experiment immediately.
No follow-through on findings
The experiment revealed that the circuit breaker does not work correctly. The team says "interesting" and moves on. The finding is never fixed. The next real outage triggers the same failure.
Fix: Every chaos experiment finding gets a ticket with an owner and a due date. Re-run the experiment after the fix to verify. Track the backlog of chaos findings alongside production incident action items.
Done When
- Every experiment has a written hypothesis stating the expected system behavior before any fault is injected
- Blast radius is explicitly bounded (target scope, duration, and abort conditions defined) and experiments are not run directly in production until staging results are stable
- Steady-state baseline metrics are measured immediately before each injection so results have a valid comparison point
- Experiment results are documented with actual vs. expected behavior, recovery time, and whether data integrity was maintained
- Each weakness found has a remediation action item with an assigned owner, due date, and a scheduled re-run to verify the fix
Related Skills
| Skill | Relationship |
|---|---|
| performance-testing | Load testing complements chaos engineering; combine for realistic failure scenarios |
| release-readiness | Chaos experiment results feed into release confidence assessments |
| test-environments | Pre-production environments are the safe starting point for chaos experiments |
| testing-in-production | Production chaos is the advanced stage of testing in production |
| observability-driven-testing | Observability is a prerequisite for chaos engineering |
| qa-metrics | Chaos experiment results (recovery time, error impact) are quality metrics |