testing-chaos
SKILL.md
Chaos Testing
Run chaos engineering experiments to build resilient systems by intentionally injecting failures and observing system behavior.
When to use me
Use this skill when:
- Building highly available and resilient systems
- Testing failure recovery and auto-remediation
- Validating disaster recovery plans
- Ensuring graceful degradation under stress
- Testing monitoring and alerting systems
- Building confidence in production resilience
- Preparing for unexpected failure scenarios
What I do
- Failure injection experiments:
  - Network latency and packet loss
  - Service dependency failures
  - Resource exhaustion (CPU, memory, disk)
  - Database connection failures
  - Third-party API outages
- Resilience validation:
  - Circuit breaker pattern testing
  - Retry and backoff strategy validation
  - Fallback and default behavior testing
  - Load shedding and rate limiting
  - Failover and redundancy testing
- Coordination with other testing:
  - Builds on performance and load testing
  - Complements disaster recovery testing
  - Informs reliability and availability testing
  - Validates monitoring and observability
- Experiment design:
  - Hypothesis-driven experimentation
  - Blast radius containment
  - Progressive fault injection
  - Automated experiment orchestration
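To make the pairing of failure injection and circuit breaker testing listed above more concrete, here is a minimal sketch in TypeScript. It is an illustration under stated assumptions, not this skill's implementation: `CircuitBreaker`, `failingFor`, and every parameter name are hypothetical, and time is passed in explicitly so an experiment can step through the breaker's states deterministically.

```typescript
// Sketch only: CircuitBreaker and failingFor are illustrative names,
// not taken from this skill or from any library.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  // `now` is a timestamp in milliseconds supplied by the caller,
  // which lets a test control the clock.
  exec<T>(call: () => T, fallback: () => T, now: number): T {
    if (this.state === "open") {
      if (now - this.openedAt < this.resetTimeoutMs) {
        return fallback(); // fail fast while the breaker is open
      }
      this.state = "half-open"; // timeout elapsed: allow one trial call
    }
    try {
      const result = call();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = now;
      }
      return fallback();
    }
  }
}

// Injected fault: a dependency that throws for its first `n` calls,
// then recovers. `calls` reports how often it was actually invoked.
function failingFor(n: number): { call: () => string; calls: () => number } {
  let count = 0;
  return {
    call: () => {
      count += 1;
      if (count <= n) throw new Error("injected dependency failure");
      return "live";
    },
    calls: () => count,
  };
}
```

An experiment built on this sketch would assert three things: the breaker opens after the configured number of injected failures, the dependency is not called while the breaker is open, and the breaker closes again once a trial call succeeds.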
Examples
# Chaos engineering tools
npm run test:chaos:start # Start chaos experiments
npm run test:chaos:stop # Stop all chaos experiments
npm run test:chaos:status # Check experiment status
# Specific failure injections
npm run test:chaos:network # Network failure injection
npm run test:chaos:service # Service dependency failures
npm run test:chaos:resource # Resource exhaustion
npm run test:chaos:database # Database failures
# Integration with other tests
npm run test:performance -- --chaos # Performance under failure
npm run test:reliability -- --chaos # Reliability with faults
# Experiment scenarios
npm run test:chaos:scenario:api-outage # API dependency outage
npm run test:chaos:scenario:db-failover # Database failover
npm run test:chaos:scenario:latency-spike # Network latency spike
npm run test:chaos:scenario:memory-leak # Memory pressure
# Safety controls
npm run test:chaos:safety-check # Pre-experiment safety check
npm run test:chaos:rollback # Emergency rollback
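The commands above do not show what a pre-experiment safety check verifies. As a rough sketch of one possible gate, assuming the check looks at the target environment, baseline health, and the presence of rollback and abort mechanisms, the logic might resemble the following. The `PreflightInput` fields and `preflightBlockers` are hypothetical names chosen for illustration; the actual `test:chaos:safety-check` script may work differently.

```typescript
// Hypothetical inputs for a pre-experiment gate; field names are illustrative.
interface PreflightInput {
  environment: string;            // where the experiment would run
  allowedEnvironments: string[];  // environments approved for chaos
  baselineErrorRate: number;      // current error rate as a fraction (0..1)
  maxBaselineErrorRate: number;   // refuse to start above this rate
  rollbackReady: boolean;         // a rollback plan is configured
  abortSwitchReady: boolean;      // an automatic abort mechanism is configured
}

// Returns the reasons the experiment must not start.
// An empty array means no blocker was found.
function preflightBlockers(input: PreflightInput): string[] {
  const blockers: string[] = [];
  if (input.allowedEnvironments.indexOf(input.environment) === -1) {
    blockers.push(`environment "${input.environment}" is not approved for chaos`);
  }
  if (input.baselineErrorRate > input.maxBaselineErrorRate) {
    blockers.push("system is already unhealthy: baseline error rate too high");
  }
  if (!input.rollbackReady) {
    blockers.push("no rollback plan configured");
  }
  if (!input.abortSwitchReady) {
    blockers.push("no automatic abort mechanism configured");
  }
  return blockers;
}
```

Returning the full list of blockers, rather than stopping at the first one, lets the operator fix every problem in a single pass before retrying.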
Output format
Chaos Test Results:
──────────────────────────────
Experiment: Database Primary Node Failure
Hypothesis: System will failover to replica within 30 seconds
Blast Radius: Staging environment, canary deployment
Duration: 15 minutes
Experiment Execution:
1. Baseline metrics collected
2. Database primary node terminated (simulated)
3. System behavior observed for 10 minutes
4. Metrics compared to baseline
Results:
✅ Failover Time: 22 seconds (within 30s target)
✅ Data Consistency: No data loss detected
✅ User Impact: 15% error rate during failover (acceptable)
✅ Recovery: Automatic, no manual intervention required
✅ Monitoring: Alerts triggered within 45 seconds
System Behavior Under Failure:
- API response time increased from 150ms to 850ms during failover
- Error rate spiked to 15% for 45 seconds
- Read-only operations continued uninterrupted
- Write operations queued and retried successfully
Integration with Other Testing:
- Performance testing: Established baseline for comparison
- Reliability testing: Validated MTTR (Mean Time To Recovery)
- Monitoring testing: Alert effectiveness verified
- Disaster recovery: Automated failover confirmed
Safety Controls:
- Automatic rollback on critical failure thresholds
- Manual abort available throughout
- Canary deployment limited impact
- Post-experiment verification passed
Lessons Learned:
1. Need to improve connection pooling during failover
2. Alert thresholds should be adjusted for transient failures
3. User-facing error messages during failover need improvement
4. Database health checks could be more frequent
Recommendation:
- System demonstrates good resilience to database failures
- Implement suggested improvements from lessons learned
- Schedule regular chaos experiments (monthly)
- Expand blast radius gradually as confidence increases
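The pass/fail lines in a report like the one above come from comparing observed metrics against the limits stated in the hypothesis. A minimal sketch of that comparison follows, assuming every check is of the form "observed value must not exceed a limit". `HypothesisCheck`, `Verdict`, and `evaluateHypothesis` are hypothetical names, and the line format is simplified relative to the sample report.

```typescript
// Illustrative types; not the skill's actual report schema.
interface HypothesisCheck {
  name: string;
  observed: number; // measured during the experiment
  limit: number;    // maximum value the hypothesis allows
  unit: string;
}

interface Verdict {
  passed: boolean;
  lines: string[];
}

// A check passes when the observed value does not exceed its limit.
// The experiment passes only if every check passes.
function evaluateHypothesis(checks: HypothesisCheck[]): Verdict {
  const lines: string[] = [];
  let passed = true;
  for (const check of checks) {
    const ok = check.observed <= check.limit;
    if (!ok) passed = false;
    const mark = ok ? "✅" : "❌";
    lines.push(`${mark} ${check.name}: ${check.observed}${check.unit} (limit ${check.limit}${check.unit})`);
  }
  return { passed, lines };
}

// Using the failover figures from the sample report above.
const sampleVerdict = evaluateHypothesis([
  { name: "Failover Time", observed: 22, limit: 30, unit: "s" },
]);
console.log(sampleVerdict.lines.join("\n"));
// ✅ Failover Time: 22s (limit 30s)
```

Keeping the limit next to the observed value in each line makes it clear whether a result passed comfortably or only narrowly.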
Notes
- Start with small, controlled experiments
- Always have a rollback plan and automatic abort mechanisms
- Test in staging before production
- Document hypotheses and validate outcomes
- Use feature flags to control chaos injection
- Monitor system metrics closely during experiments
- Learn from failures and improve system design
- Chaos testing complements other testing; it does not replace it
- Consider business impact and schedule experiments appropriately
- Build a culture of resilience, not just technical fixes
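Two of the notes above, controlling injection with a feature flag and having an automatic abort mechanism, can be combined in one small component. The sketch below is one possible shape, not a prescribed design: `ChaosFlag` and its parameters are hypothetical, and it injects into every Nth request rather than sampling randomly so that the behavior is reproducible.

```typescript
// Sketch of a feature flag that gates fault injection and switches itself
// off when the observed error rate crosses an abort threshold.
// All names and parameters are illustrative assumptions.
class ChaosFlag {
  private enabled: boolean;
  private decisions = 0; // injection decisions made while enabled
  private total = 0;     // request outcomes recorded
  private failed = 0;    // recorded outcomes that were errors
  private readonly injectEvery: number;
  private readonly abortErrorRate: number;
  private readonly minSamples: number;

  constructor(enabled: boolean, injectEvery: number, abortErrorRate: number, minSamples: number) {
    this.enabled = enabled;
    this.injectEvery = injectEvery;
    this.abortErrorRate = abortErrorRate;
    this.minSamples = minSamples;
  }

  isEnabled(): boolean {
    return this.enabled;
  }

  // Inject into every Nth request while the flag is on.
  shouldInject(): boolean {
    if (!this.enabled) return false;
    this.decisions += 1;
    return this.decisions % this.injectEvery === 0;
  }

  // Record a user-visible outcome. Once enough samples exist and the
  // error rate exceeds the threshold, the flag turns itself off.
  record(ok: boolean): void {
    this.total += 1;
    if (!ok) this.failed += 1;
    const errorRate = this.failed / this.total;
    if (this.total >= this.minSamples && errorRate > this.abortErrorRate) {
      this.enabled = false; // automatic abort
    }
  }
}
```

The `minSamples` guard keeps a single early failure from aborting the experiment before there is enough data to judge the error rate.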
Repository: wojons/skills
GitHub Stars: 1
First Seen: Feb 28, 2026