Chaos Plan Command
This command helps design chaos engineering experiments and GameDay plans for a system.
Purpose
Generate comprehensive chaos engineering plans including:
- System resilience assessment
- Failure mode identification
- Experiment hypotheses and designs
- GameDay planning
- Safety measures and rollback procedures
Workflow
Phase 1: System Discovery
First, understand the system:
If a system/service name is provided:
- Search the codebase for service definitions
- Identify dependencies and integrations
- Look for existing resilience patterns (circuit breakers, retries)
- Check for monitoring and alerting configuration
Analyze architecture:
System Analysis Checklist:
□ Service boundaries and responsibilities
□ External dependencies (databases, APIs, queues)
□ Internal service dependencies
□ Data flows and critical paths
□ Current resilience patterns in place
□ Existing monitoring and observability
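As a rough illustration of the dependency items in the checklist, the sketch below walks a dependency map to find every service that would be affected if one component failed. The service names other than order-service, the `DEPENDS_ON` map, and the `affected_by` helper are all hypothetical; in practice the map would come from what discovery finds in the codebase.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "web-frontend": ["order-service"],
    "order-service": ["payment-service", "orders-db"],
    "payment-service": ["payments-api"],
    "orders-db": [],
    "payments-api": [],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

Calling `affected_by("payments-api", DEPENDS_ON)` on this sample map returns the three upstream services, which gives a first estimate of the components a fault in that dependency could reach.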
Phase 2: Failure Mode Identification
Identify potential failure modes:
Infrastructure failures:
- Instance/container crashes
- Network partitions between services
- Disk exhaustion
- Memory pressure
- CPU saturation
Application failures:
- Service unavailability
- Slow responses / latency spikes
- Error responses
- Connection pool exhaustion
- Resource leaks
Dependency failures:
- Database failover/unavailability
- Cache miss/unavailability
- External API timeout/failure
- Message queue backup
Data failures:
- Corrupted data
- Stale data (replication lag)
- Schema incompatibility
Phase 3: Steady State Definition
Define what "healthy" looks like:
Identify key metrics:
Steady State Metrics:
Request-based (RED):
- Request rate: [baseline] requests/sec
- Error rate: < [threshold]%
- Duration (p99): < [threshold]ms
Resource-based (USE):
- CPU utilization: < [threshold]%
- Memory utilization: < [threshold]%
- Queue depth: < [threshold]
Business metrics:
- [Metric 1]: [baseline/threshold]
- [Metric 2]: [baseline/threshold]
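The template above can be turned into a simple automated check. The following is a minimal sketch, assuming metrics arrive as a plain dictionary; the metric names, the threshold values, and the `steady_state_violations` helper are placeholders invented for illustration, not part of the command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    upper_bound: float  # the metric is healthy while strictly below this value


# Placeholder thresholds; real values come from your own baselines.
STEADY_STATE = [
    Threshold("error_rate_pct", 1.0),
    Threshold("latency_p99_ms", 500.0),
    Threshold("cpu_utilization_pct", 80.0),
]


def steady_state_violations(
    observed: dict[str, float], thresholds: list[Threshold]
) -> list[str]:
    """List every threshold that is breached or has no data."""
    violations = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None:
            # Missing data is treated as a violation: we cannot claim health.
            violations.append(f"{t.metric}: no data")
        elif value >= t.upper_bound:
            violations.append(f"{t.metric}: {value} >= {t.upper_bound}")
    return violations
```

An empty result means the system is within its steady state. Running the same check before, during, and after an experiment gives a consistent definition of "healthy" across all three phases.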
Phase 4: Experiment Design
Design experiments for identified failure modes:
For each priority failure mode, create:
Experiment: [Name]
Hypothesis:
"When [fault condition] occurs, [system component] will
[expected behavior] because [reasoning]."
Fault Injection:
- Type: [Latency/Error/Termination/Partition/etc.]
- Target: [Service/instance/dependency]
- Magnitude: [Degree of fault]
- Duration: [How long]
Blast Radius:
- Affected components: [List]
- User impact estimate: [Percentage/description]
Abort Conditions:
- Error rate > [threshold]
- Latency p99 > [threshold]
- [Business metric] breached
- Customer complaints received
Rollback Steps:
1. [Step to revert fault]
2. [Step to verify recovery]
Success Criteria:
□ [Metric] remains within [bounds]
□ [Recovery] happens within [time]
□ [Alerts] fire as expected
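One way to keep an experiment design machine-checkable is to capture the template as a small data structure. The sketch below is an assumption about how that might look; the field names, the sample experiment, and its limits are hypothetical. It deliberately aborts when a metric is missing, on the view that an experiment should not continue without visibility.

```python
from dataclasses import dataclass, field


@dataclass
class AbortCondition:
    metric: str
    limit: float  # abort when the observed value exceeds this limit


@dataclass
class Experiment:
    name: str
    hypothesis: str
    fault_type: str
    target: str
    duration_seconds: int
    abort_conditions: list[AbortCondition] = field(default_factory=list)

    def should_abort(self, observed: dict[str, float]) -> bool:
        """True if any abort condition is breached or its metric is missing."""
        return any(
            c.metric not in observed or observed[c.metric] > c.limit
            for c in self.abort_conditions
        )


# Hypothetical example values, for illustration only.
experiment = Experiment(
    name="Payment API latency",
    hypothesis=(
        "When the payment API responds slowly, order-service will time out "
        "and return a retryable error because its client enforces a deadline."
    ),
    fault_type="Latency",
    target="payment API dependency",
    duration_seconds=300,
    abort_conditions=[
        AbortCondition("error_rate_pct", 5.0),
        AbortCondition("latency_p99_ms", 2000.0),
    ],
)
```

During the run, a monitoring loop would call `experiment.should_abort(current_metrics)` on each sample and trigger the rollback steps as soon as it returns `True`.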
Phase 5: Experiment Prioritization
Prioritize experiments by:
Risk-based prioritization:
| Factor | Weight |
|---|---|
| Likelihood of failure | High |
| Impact if it occurs | High |
| Current uncertainty | Medium |
| Ease of testing | Low |
Recommended order:
1. High impact + high uncertainty
2. High impact + low uncertainty (validate assumptions)
3. Medium impact + high uncertainty
4. Lower priority items
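The weight table can also be applied as a simple weighted score. This is only a sketch under stated assumptions: the mapping of High/Medium/Low to 3/2/1, the 1 to 5 rating scale, and the sample ratings are all invented for illustration, and a weighted sum is one possible reading of the table rather than the tiered ordering listed above.

```python
# Assumed numeric mapping of the table's weights: High=3, Medium=2, Low=1.
WEIGHTS = {
    "likelihood": 3,
    "impact": 3,
    "uncertainty": 2,
    "ease_of_testing": 1,
}


def priority_score(ratings: dict[str, int]) -> int:
    """Weighted sum of 1-5 ratings; a higher score means run it sooner."""
    return sum(WEIGHTS[factor] * ratings[factor] for factor in WEIGHTS)


# Hypothetical ratings for three failure modes from Phase 2.
candidates = {
    "database failover": {
        "likelihood": 3, "impact": 5, "uncertainty": 4, "ease_of_testing": 2,
    },
    "cache unavailability": {
        "likelihood": 4, "impact": 3, "uncertainty": 2, "ease_of_testing": 5,
    },
    "disk exhaustion": {
        "likelihood": 2, "impact": 4, "uncertainty": 3, "ease_of_testing": 3,
    },
}

# Highest score first.
ranked = sorted(
    candidates, key=lambda name: priority_score(candidates[name]), reverse=True
)
```

With these sample ratings the scores are 34, 30, and 27, so database failover is tested first. The ranking is a starting point for discussion and should still be checked against the team's judgment.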
Phase 6: GameDay Planning
If multiple experiments or a team exercise are desired:
GameDay Plan: [Title]
Date: [Proposed date]
Duration: [Hours]
Participants: [Teams/roles needed]
Objectives:
1. Validate [resilience pattern/assumption]
2. Practice [incident response/coordination]
3. Test [runbook/recovery procedure]
Pre-GameDay Checklist:
□ Stakeholder approval
□ Participant briefing scheduled
□ Monitoring dashboards verified
□ Kill switches tested
□ Rollback procedures documented
□ Communication channels set up
Schedule:
[Time] - Pre-brief and role assignment
[Time] - Baseline capture
[Time] - Scenario 1: [Name]
[Time] - Debrief / break
[Time] - Scenario 2: [Name]
[Time] - Hot debrief
[Time] - Cleanup and verification
Scenarios:
Scenario 1: [Name]
- Objective: [What we're testing]
- Hypothesis: [Expected behavior]
- Injection: [Fault details]
- Duration: [Time]
- Success criteria: [Metrics]
Scenario 2: [Name]
[Same structure]
Safety:
- Kill switch: [How to immediately stop]
- Rollback: [How to revert all changes]
- Communication: [Primary channel]
- Escalation: [Who to contact if real incident]
Roles:
- GameDay Lead: [Responsibilities]
- Scenario Executor: [Responsibilities]
- Observers: [Responsibilities]
- Scribe: [Responsibilities]
Post-GameDay:
- Hot debrief: Same day
- Formal postmortem: Within 1 week
- Action items tracked in: [System]
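To fill in the [Time] slots of the schedule, a small helper can derive start times from a chosen start and per-item durations. This is a minimal sketch; the `build_schedule` helper, the durations, and the date are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def build_schedule(start: datetime, agenda: list[tuple[str, int]]) -> list[str]:
    """Turn (title, minutes) pairs into 'HH:MM - title' schedule lines."""
    lines = []
    current = start
    for title, minutes in agenda:
        lines.append(f"{current:%H:%M} - {title}")
        current += timedelta(minutes=minutes)
    return lines


# Agenda items mirror the schedule template; durations are assumptions.
AGENDA = [
    ("Pre-brief and role assignment", 30),
    ("Baseline capture", 15),
    ("Scenario 1", 45),
    ("Debrief / break", 15),
    ("Scenario 2", 45),
    ("Hot debrief", 30),
    ("Cleanup and verification", 30),
]

schedule = build_schedule(datetime(2025, 1, 15, 10, 0), AGENDA)
```

Starting at 10:00 with these durations, the final item (cleanup and verification) begins at 13:00, which makes it easy to see whether the plan fits the intended GameDay duration.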
Phase 7: Output Generation
Generate deliverables:
- Resilience Assessment - Current state analysis
- Experiment Catalog - Prioritized list of experiments
- Detailed Experiment Plans - Ready-to-execute designs
- GameDay Plan - If requested, full GameDay documentation
- Implementation Checklist - Steps to execute safely
Usage Examples
# Plan chaos for a specific service
/sd:chaos-plan order-service
# Plan with architecture context
/sd:chaos-plan @docs/architecture/payment-system.md
# Plan GameDay for entire system
/sd:chaos-plan "e-commerce platform" --gameday
Interactive Elements
Use AskUserQuestion to:
- Clarify system boundaries
- Validate failure mode priorities
- Confirm blast radius acceptability
- Review experiment designs
- Finalize GameDay parameters
Output
The command produces:
- Resilience Assessment Report
- Prioritized Experiment Catalog
- Detailed Experiment Designs
- GameDay Plan (if applicable)
- Safety Checklist
Related Skills
This command leverages:
- chaos-engineering-fundamentals - Experiment design principles
- resilience-patterns - Patterns to test and validate
- gameday-planning - GameDay execution guidance
- incident-response - Handling discovered issues
Related Agent
For ongoing chaos engineering consultation:
- chaos-engineer - Resilience testing expertise