chaos-plan

SKILL.md

Chaos Plan Command

This command helps design chaos engineering experiments and GameDay plans for a system.

Purpose

Generate comprehensive chaos engineering plans including:

  1. System resilience assessment
  2. Failure mode identification
  3. Experiment hypotheses and designs
  4. GameDay planning
  5. Safety measures and rollback procedures

Workflow

Phase 1: System Discovery

First, understand the system:

If a system/service name is provided:

  • Search the codebase for service definitions
  • Identify dependencies and integrations
  • Look for existing resilience patterns (circuit breakers, retries)
  • Check for monitoring and alerting configuration

Analyze architecture:

System Analysis Checklist:
□ Service boundaries and responsibilities
□ External dependencies (databases, APIs, queues)
□ Internal service dependencies
□ Data flows and critical paths
□ Current resilience patterns in place
□ Existing monitoring and observability

Phase 2: Failure Mode Identification

Identify potential failure modes:

Infrastructure failures:

  • Instance/container crashes
  • Network partitions between services
  • Disk exhaustion
  • Memory pressure
  • CPU saturation

Application failures:

  • Service unavailability
  • Slow responses / latency spikes
  • Error responses
  • Connection pool exhaustion
  • Resource leaks

Dependency failures:

  • Database failover/unavailability
  • Cache miss/unavailability
  • External API timeout/failure
  • Message queue backup

Data failures:

  • Corrupted data
  • Stale data (replication lag)
  • Schema incompatibility

Phase 3: Steady State Definition

Define what "healthy" looks like:

Identify key metrics:

Steady State Metrics:

Request-based (RED):
- Request rate: [baseline] requests/sec
- Error rate: < [threshold]%
- Duration (p99): < [threshold]ms

Resource-based (USE):
- CPU utilization: < [threshold]%
- Memory utilization: < [threshold]%
- Queue depth: < [threshold]

Business metrics:
- [Metric 1]: [baseline/threshold]
- [Metric 2]: [baseline/threshold]

Phase 4: Experiment Design

Design experiments for identified failure modes:

For each priority failure mode, create:

Experiment: [Name]

Hypothesis:
"When [fault condition] occurs, [system component] will
[expected behavior] because [reasoning]."

Fault Injection:
- Type: [Latency/Error/Termination/Partition/etc.]
- Target: [Service/instance/dependency]
- Magnitude: [Degree of fault]
- Duration: [How long]

Blast Radius:
- Affected components: [List]
- User impact estimate: [Percentage/description]

Abort Conditions:
- Error rate > [threshold]
- Latency p99 > [threshold]
- [Business metric] breached
- Customer complaints received

Rollback Steps:
1. [Step to revert fault]
2. [Step to verify recovery]

Success Criteria:
□ [Metric] remains within [bounds]
□ [Recovery] happens within [time]
□ [Alerts] fire as expected

Phase 5: Experiment Prioritization

Prioritize experiments by:

Risk-based prioritization:

Factor Weight
Likelihood of failure High
Impact if it occurs High
Current uncertainty Medium
Ease of testing Low

Recommended order:

  1. High impact + high uncertainty
  2. High impact + low uncertainty (validate assumptions)
  3. Medium impact + high uncertainty
  4. Lower priority items

Phase 6: GameDay Planning

If multiple experiments or team exercise desired:

GameDay Plan: [Title]

Date: [Proposed date]
Duration: [Hours]
Participants: [Teams/roles needed]

Objectives:
1. Validate [resilience pattern/assumption]
2. Practice [incident response/coordination]
3. Test [runbook/recovery procedure]

Pre-GameDay Checklist:
□ Stakeholder approval
□ Participant briefing scheduled
□ Monitoring dashboards verified
□ Kill switches tested
□ Rollback procedures documented
□ Communication channels set up

Schedule:
[Time] - Pre-brief and role assignment
[Time] - Baseline capture
[Time] - Scenario 1: [Name]
[Time] - Debrief / break
[Time] - Scenario 2: [Name]
[Time] - Hot debrief
[Time] - Cleanup and verification

Scenarios:

Scenario 1: [Name]
- Objective: [What we're testing]
- Hypothesis: [Expected behavior]
- Injection: [Fault details]
- Duration: [Time]
- Success criteria: [Metrics]

Scenario 2: [Name]
[Same structure]

Safety:
- Kill switch: [How to immediately stop]
- Rollback: [How to revert all changes]
- Communication: [Primary channel]
- Escalation: [Who to contact if real incident]

Roles:
- GameDay Lead: [Responsibilities]
- Scenario Executor: [Responsibilities]
- Observers: [Responsibilities]
- Scribe: [Responsibilities]

Post-GameDay:
- Hot debrief: Same day
- Formal postmortem: Within 1 week
- Action items tracked in: [System]

Phase 7: Output Generation

Generate deliverables:

  1. Resilience Assessment - Current state analysis
  2. Experiment Catalog - Prioritized list of experiments
  3. Detailed Experiment Plans - Ready-to-execute designs
  4. GameDay Plan - If requested, full GameDay documentation
  5. Implementation Checklist - Steps to execute safely

Usage Examples

# Plan chaos for a specific service
/sd:chaos-plan order-service

# Plan with architecture context
/sd:chaos-plan @docs/architecture/payment-system.md

# Plan GameDay for entire system
/sd:chaos-plan "e-commerce platform" --gameday

Interactive Elements

Use AskUserQuestion to:

  • Clarify system boundaries
  • Validate failure mode priorities
  • Confirm blast radius acceptability
  • Review experiment designs
  • Finalize GameDay parameters

Output

The command produces:

  1. Resilience Assessment Report
  2. Prioritized Experiment Catalog
  3. Detailed Experiment Designs
  4. GameDay Plan (if applicable)
  5. Safety Checklist

Related Skills

This command leverages:

  • chaos-engineering-fundamentals - Experiment design principles
  • resilience-patterns - Patterns to test and validate
  • gameday-planning - GameDay execution guidance
  • incident-response - Handling discovered issues

Related Agent

For ongoing chaos engineering consultation:

  • chaos-engineer - Resilience testing expertise
Weekly Installs
1
GitHub Stars
38
First Seen
10 days ago
Installed on
amp1
cline1
opencode1
cursor1
kimi-cli1
codex1