chaos-engineering-basics
Chaos Engineering Basics
Overview
Use this skill to design safe, evidence-driven fault injection experiments that verify system resilience under realistic failure conditions.
Scope Boundaries
- Use this skill when the task matches the trigger condition described in the skill's description.
- Do not use this skill when the primary task falls outside this skill's domain.
Inputs To Gather
- Critical user journeys and service dependency map.
- Current SLI/SLO and alert signal quality.
- Failure budget, allowed blast radius, and rollback authority.
- Existing runbooks and on-call escalation paths.
Deliverables
- Experiment charter (hypothesis, steady state, blast radius, abort criteria).
- Fault-injection plan (what fails, where, for how long, at what traffic share).
- Observation plan (metrics, logs, traces, and decision thresholds).
- Findings with remediation owners and re-test schedule.
Quick Start Example
Example experiment charter
- Hypothesis: "API p95 remains < 400ms when one cache node fails."
- Steady-state metrics: p95 latency, error rate, queue depth.
- Blast radius: 5% traffic, one AZ only, 10 minutes max.
- Abort immediately if:
  - error rate > 2x baseline for 3 minutes,
  - user checkout success drops below threshold,
  - paging alerts fire in unrelated services.
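The first abort criterion in the charter above ("error rate > 2x baseline for 3 minutes") can be sketched as a sustained-breach check. This is a minimal illustration, not a definitive implementation: the `ErrorRateGuardrail` class, its field names, and the one-sample-per-minute assumption are all hypothetical and would need to match how your monitoring actually reports error rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateGuardrail:
    """Hypothetical guardrail: abort on a sustained error-rate breach."""
    baseline: float          # steady-state error rate, e.g. 0.01 for 1%
    multiplier: float = 2.0  # "2x baseline" from the example charter
    window_samples: int = 3  # assumes one sample per minute, so 3 = "3 minutes"

    def should_abort(self, samples: list[float]) -> bool:
        """Return True if the most recent `window_samples` readings all breach."""
        if len(samples) < self.window_samples:
            return False  # not enough data yet to call the breach sustained
        limit = self.baseline * self.multiplier
        recent = samples[-self.window_samples:]
        return all(rate > limit for rate in recent)


guardrail = ErrorRateGuardrail(baseline=0.01)
# Last three readings all exceed 0.02, so the breach is sustained.
print(guardrail.should_abort([0.01, 0.03, 0.025, 0.022]))  # True
# One of the last three readings recovered, so the window resets.
print(guardrail.should_abort([0.03, 0.03, 0.015, 0.03]))   # False
```

Requiring every sample in the window to breach avoids aborting on a single noisy reading, at the cost of reacting no faster than the window length.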
Example decision rule
- pass: steady-state metrics remain inside pre-registered limits.
- fail: any hard guardrail breaches abort threshold.
- inconclusive: observability gaps prevent causal interpretation.
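The decision rule can be written as a small classifier. The sketch below is one possible reading: the `classify` function, its boolean inputs, and the precedence order (guardrail breach first, then observability gaps, then steady-state limits) are assumptions, since the rule itself does not say which condition wins when several apply.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


def classify(guardrail_breached: bool,
             observability_gaps: bool,
             within_limits: bool) -> Outcome:
    """Classify an experiment result (hypothetical precedence order)."""
    if guardrail_breached:
        # A hard guardrail breach is an observed fact, so it is treated as a fail
        # even if other signals were incomplete.
        return Outcome.FAIL
    if observability_gaps:
        # Without trustworthy signals, neither pass nor fail can be justified.
        return Outcome.INCONCLUSIVE
    if within_limits:
        return Outcome.PASS
    # Assumption: drifting outside pre-registered limits without a guardrail
    # breach still means the hypothesis did not hold.
    return Outcome.FAIL


print(classify(guardrail_breached=False, observability_gaps=False, within_limits=True).value)
print(classify(guardrail_breached=False, observability_gaps=True, within_limits=True).value)
```

Fixing the precedence before the run keeps the classification from being negotiated after the results are in.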
Quality Standard
- Steady state is measurable and agreed before injection.
- Abort/rollback conditions are explicit and executable.
- Blast radius is bounded by environment, traffic, and time.
- Experiment outcomes produce owned remediation actions.
- Re-test conditions are defined for failed assumptions.
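The pre-injection items in this standard (measurable steady state, explicit abort conditions, bounded blast radius) lend themselves to an automated check. The following is a rough sketch only: the `Charter` dataclass, its field names, and `quality_gaps` are hypothetical, and the last two items (owned remediation, re-test conditions) apply to findings and are left out.

```python
from dataclasses import dataclass


@dataclass
class Charter:
    """Hypothetical container for the charter fields named in this skill."""
    hypothesis: str
    steady_state_metrics: list[str]
    abort_conditions: list[str]
    environment: str        # e.g. "one AZ only"
    traffic_share: float    # fraction of traffic exposed, 0.05 means 5%
    max_minutes: int        # hard time limit for the injection


def quality_gaps(charter: Charter) -> list[str]:
    """Return the pre-injection quality checks the charter does not yet meet."""
    gaps = []
    if not charter.steady_state_metrics:
        gaps.append("no steady-state metrics defined")
    if not charter.abort_conditions:
        gaps.append("no explicit abort conditions")
    if not charter.environment.strip():
        gaps.append("blast radius not bounded by environment")
    if not 0 < charter.traffic_share <= 1:
        gaps.append("blast radius not bounded by traffic")
    if charter.max_minutes <= 0:
        gaps.append("blast radius not bounded by time")
    return gaps


charter = Charter(
    hypothesis="API p95 remains < 400ms when one cache node fails.",
    steady_state_metrics=["p95 latency", "error rate", "queue depth"],
    abort_conditions=["error rate > 2x baseline for 3 minutes"],
    environment="one AZ only",
    traffic_share=0.05,
    max_minutes=10,
)
print(quality_gaps(charter))  # [] for the Quick Start charter
```

An empty list only shows that the fields are filled in; whether the metrics are truly measurable and agreed still needs human review.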
Workflow
- Select one reliability assumption tied to a business-critical flow.
- Define steady-state metrics and hard guardrails.
- Design the smallest useful fault experiment with a bounded blast radius.
- Run the experiment under live observation with explicit abort authority.
- Classify result (pass/fail/inconclusive) and capture learning.
- Assign remediation and schedule follow-up verification.
Failure Conditions
- Stop when a steady-state metric or abort threshold is undefined.
- Stop when observability cannot detect degradation fast enough to act on the abort criteria.
- Escalate when proposed blast radius exceeds approved risk budget.
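These conditions can be expressed as a pre-flight gate that runs before any fault is injected. This is a minimal sketch under stated assumptions: the `preflight` function and all of its parameters are hypothetical, "detects quickly" is modelled as a detection time compared against a limit you choose, and the risk budget is reduced to a single traffic-share number.

```python
def preflight(
    steady_state_defined: bool,
    abort_threshold_defined: bool,
    detection_seconds: float,       # measured time for alerts to notice degradation
    max_detection_seconds: float,   # assumed limit, chosen by the team
    proposed_traffic_share: float,  # fraction of traffic the experiment would expose
    approved_traffic_share: float,  # fraction allowed by the approved risk budget
) -> str:
    """Return "stop", "escalate", or "proceed" for a proposed experiment."""
    if not (steady_state_defined and abort_threshold_defined):
        return "stop"
    if detection_seconds > max_detection_seconds:
        return "stop"
    if proposed_traffic_share > approved_traffic_share:
        return "escalate"
    return "proceed"


# 5% proposed against a 5% budget, with detection well inside the limit.
print(preflight(True, True, 60, 180, 0.05, 0.05))  # proceed
# Same experiment at 10% traffic exceeds the approved budget.
print(preflight(True, True, 60, 180, 0.10, 0.05))  # escalate
```

The stop checks run before the escalation check so that an experiment with missing definitions is never escalated for approval in an unreviewable state.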