Chaos Plan Command

This command helps design chaos engineering experiments and GameDay plans for a system.

Purpose

Generate comprehensive chaos engineering plans including:

System resilience assessment
Failure mode identification
Experiment hypotheses and designs
GameDay planning
Safety measures and rollback procedures

Workflow

Phase 1: System Discovery

First, understand the system:

If a system/service name is provided:

Search the codebase for service definitions
Identify dependencies and integrations
Look for existing resilience patterns (circuit breakers, retries)
Check for monitoring and alerting configuration

Analyze architecture:

System Analysis Checklist:
□ Service boundaries and responsibilities
□ External dependencies (databases, APIs, queues)
□ Internal service dependencies
□ Data flows and critical paths
□ Current resilience patterns in place
□ Existing monitoring and observability

Phase 2: Failure Mode Identification

Identify potential failure modes:

Infrastructure failures:

Instance/container crashes
Network partitions between services
Disk exhaustion
Memory pressure
CPU saturation

Application failures:

Service unavailability
Slow responses / latency spikes
Error responses
Connection pool exhaustion
Resource leaks

Dependency failures:

Database failover/unavailability
Cache miss/unavailability
External API timeout/failure
Message queue backup

Data failures:

Corrupted data
Stale data (replication lag)
Schema incompatibility

Phase 3: Steady State Definition

Define what "healthy" looks like:

Identify key metrics:

Steady State Metrics:

Request-based (RED):
- Request rate: [baseline] requests/sec
- Error rate: < [threshold]%
- Duration (p99): < [threshold]ms

Resource-based (USE):
- CPU utilization: < [threshold]%
- Memory utilization: < [threshold]%
- Queue depth: < [threshold]

Business metrics:
- [Metric 1]: [baseline/threshold]
- [Metric 2]: [baseline/threshold]

Phase 4: Experiment Design

Design experiments for identified failure modes:

For each priority failure mode, create:

Experiment: [Name]

Hypothesis:
"When [fault condition] occurs, [system component] will
[expected behavior] because [reasoning]."

Fault Injection:
- Type: [Latency/Error/Termination/Partition/etc.]
- Target: [Service/instance/dependency]
- Magnitude: [Degree of fault]
- Duration: [How long]

Blast Radius:
- Affected components: [List]
- User impact estimate: [Percentage/description]

Abort Conditions:
- Error rate > [threshold]
- Latency p99 > [threshold]
- [Business metric] breached
- Customer complaints received

Rollback Steps:
1. [Step to revert fault]
2. [Step to verify recovery]

Success Criteria:
□ [Metric] remains within [bounds]
□ [Recovery] happens within [time]
□ [Alerts] fire as expected

Phase 5: Experiment Prioritization

Prioritize experiments by:

Risk-based prioritization:

Factor	Weight
Likelihood of failure	High
Impact if it occurs	High
Current uncertainty	Medium
Ease of testing	Low

Recommended order:

High impact + high uncertainty
High impact + low uncertainty (validate assumptions)
Medium impact + high uncertainty
Lower priority items

Phase 6: GameDay Planning

If multiple experiments or team exercise desired:

GameDay Plan: [Title]

Date: [Proposed date]
Duration: [Hours]
Participants: [Teams/roles needed]

Objectives:
1. Validate [resilience pattern/assumption]
2. Practice [incident response/coordination]
3. Test [runbook/recovery procedure]

Pre-GameDay Checklist:
□ Stakeholder approval
□ Participant briefing scheduled
□ Monitoring dashboards verified
□ Kill switches tested
□ Rollback procedures documented
□ Communication channels set up

Schedule:
[Time] - Pre-brief and role assignment
[Time] - Baseline capture
[Time] - Scenario 1: [Name]
[Time] - Debrief / break
[Time] - Scenario 2: [Name]
[Time] - Hot debrief
[Time] - Cleanup and verification

Scenarios:

Scenario 1: [Name]
- Objective: [What we're testing]
- Hypothesis: [Expected behavior]
- Injection: [Fault details]
- Duration: [Time]
- Success criteria: [Metrics]

Scenario 2: [Name]
[Same structure]

Safety:
- Kill switch: [How to immediately stop]
- Rollback: [How to revert all changes]
- Communication: [Primary channel]
- Escalation: [Who to contact if real incident]

Roles:
- GameDay Lead: [Responsibilities]
- Scenario Executor: [Responsibilities]
- Observers: [Responsibilities]
- Scribe: [Responsibilities]

Post-GameDay:
- Hot debrief: Same day
- Formal postmortem: Within 1 week
- Action items tracked in: [System]

Phase 7: Output Generation

Generate deliverables:

Resilience Assessment - Current state analysis
Experiment Catalog - Prioritized list of experiments
Detailed Experiment Plans - Ready-to-execute designs
GameDay Plan - If requested, full GameDay documentation
Implementation Checklist - Steps to execute safely

Usage Examples

# Plan chaos for a specific service
/sd:chaos-plan order-service

# Plan with architecture context
/sd:chaos-plan @docs/architecture/payment-system.md

# Plan GameDay for entire system
/sd:chaos-plan "e-commerce platform" --gameday

Interactive Elements

Use AskUserQuestion to:

Clarify system boundaries
Validate failure mode priorities
Confirm blast radius acceptability
Review experiment designs
Finalize GameDay parameters

Output

The command produces:

Resilience Assessment Report
Prioritized Experiment Catalog
Detailed Experiment Designs
GameDay Plan (if applicable)
Safety Checklist

Related Skills

This command leverages:

chaos-engineering-fundamentals - Experiment design principles
resilience-patterns - Patterns to test and validate
gameday-planning - GameDay execution guidance
incident-response - Handling discovered issues

Related Agent

For ongoing chaos engineering consultation:

chaos-engineer - Resilience testing expertise

chaos-plan

Chaos Plan Command

Purpose

Workflow

Phase 1: System Discovery

Phase 2: Failure Mode Identification

Phase 3: Steady State Definition

Phase 4: Experiment Design

Phase 5: Experiment Prioritization

Phase 6: GameDay Planning

Phase 7: Output Generation

Usage Examples

Interactive Elements

Output

Related Skills

Related Agent

More from melodic-software/claude-code-plugins

design-thinking

plantuml-syntax

system-prompt-engineering

architecture-documentation

data-modeling

resume-optimization