create-runbook
Create Runbook
Overview
Generate an operational runbook with step-by-step procedures, copy-pasteable commands, troubleshooting decision trees, and clear escalation paths. Runbooks are designed to be followed under pressure by an on-call engineer who may not be familiar with the system, so every step must be explicit, every command must be runnable, and every decision point must have clear criteria.
Workflow
-
Read architecture context — Scan
.chalk/docs/engineering/for:- Architecture docs (service names, endpoints, infrastructure components)
- Previous runbooks (to match format and avoid duplication)
- Incident reports and postmortems (to identify procedures that should exist but do not)
- Monitoring and alerting docs (dashboard links, alert names)
Also check
.chalk/docs/root for any configuration or infrastructure documentation.
-
Parse the runbook scope — From
$ARGUMENTS, identify:- Which service or process the runbook covers
- What scenario it addresses (deployment, rollback, scaling, failover, data recovery, etc.)
- Whether this is a routine procedure or an emergency procedure If the scope is too broad (e.g., "runbook for everything"), ask the user to narrow to a specific service or scenario.
-
Gather operational details — Use
Bashto inspect the codebase for:- Service configuration files, deployment scripts, and infrastructure-as-code
- Environment variables and configuration references
- Health check endpoints, monitoring integrations
- Database connection details (without exposing credentials) This ensures commands reference real service names and endpoints, not generic placeholders.
-
Write procedures as numbered steps — Each step must:
- Describe what the step does and why
- Include a copy-pasteable command in a code block (if applicable)
- Show expected output so the operator can verify success
- State what to do if the step fails (go to troubleshooting, retry, or escalate)
-
Build the troubleshooting decision tree — For common failure modes:
- Start with the symptom the operator observes
- Branch based on observable conditions (log messages, status codes, metrics)
- Each branch leads to a specific action or escalation
- No dead ends: every branch must terminate in either a fix or an escalation
-
Define escalation criteria — Specify:
- When to escalate (time thresholds, severity conditions)
- Who to escalate to (team, role, or on-call channel — not individual names)
- What information to include in the escalation
-
Write verification steps — After each procedure, include steps to confirm success:
- Health check commands
- Expected metric values
- Log patterns that confirm normal operation
- How long to monitor before considering the procedure complete
-
Write rollback procedures — For every procedure that changes state, include:
- How to undo the change
- What data or state may be affected by the rollback
- Verification steps specific to the rollback
-
Validate commands — Review all commands in the runbook to ensure:
- No generic placeholders like
<your-service-name>or$SERVICEwithout definition - Environment-specific values are clearly documented at the top as prerequisites
- Commands use actual service names, endpoints, and paths from the codebase
- No generic placeholders like
-
Determine the next file number — List files in
.chalk/docs/engineering/to find the highest numbered file. Increment by 1. -
Write the file — Save to
.chalk/docs/engineering/<n>_runbook_<service_or_process>.md. -
Confirm — Present the runbook with a summary of procedures covered, prerequisites, and any gaps that need input from the team.
Runbook Structure
# Runbook: <Service/Process Name>
**Last Updated**: <YYYY-MM-DD>
**Owner**: <team or role>
**Review Cadence**: <quarterly / after each incident>
## Purpose
<When should an operator use this runbook? What scenario does it address?>
## Prerequisites
Before starting, ensure you have:
- [ ] Access to <system/tool> (request via <channel>)
- [ ] <CLI tool> installed (version <X.Y+>)
- [ ] Environment variables set:
```bash
export SERVICE_URL=<actual-url>
export DB_HOST=<actual-host>
- Familiarity with
Procedures
Procedure 1:
When to use: Estimated duration: Risk level: Low / Medium / High
-
Why:
<copy-pasteable command>Expected output:
<what the operator should see>If this fails: <go to Troubleshooting section X / retry / escalate>
-
<command>Expected output:
<expected result>
Procedure 2:
Troubleshooting
Symptom:
Is <condition A> true?
├── Yes → <Action or next check>
│ └── Did it resolve?
│ ├── Yes → Done. Verify with <command>
│ └── No → Escalate to <team>
└── No → Check <condition B>
├── <condition B> true → <Action>
└── <condition B> false → Escalate to <team>
Symptom:
Escalation
| Condition | Escalate To | Channel | Include |
|---|---|---|---|
| <Slack channel / PagerDuty / etc.> | |||
| Procedure exceeds minutes | Timeline of steps taken, current state | ||
| Data integrity concern | Affected records, scope of impact |
Verification
After completing any procedure, verify success:
-
Health check
<health check command>Expected:
<healthy response> -
Metrics check
- should return to within
- Dashboard:
-
Log check
<command to check for error patterns>Expected: No errors matching
<pattern>in the last -
Monitoring period: Watch for before considering the procedure complete.
Rollback
If the procedure needs to be undone:
-
<rollback command> -
Verify rollback
<verification command>Expected:
Rollback risks:
## Output
- **File**: `.chalk/docs/engineering/<n>_runbook_<service_or_process>.md`
- **Format**: Plain markdown, no YAML frontmatter
- **First line**: `# Runbook: <Service/Process Name>`
## Anti-patterns
- **Prose instead of steps** — "First you'll want to check the service health and then maybe restart it if needed" is not a runbook. "Step 1: Check service health. Step 2: If unhealthy, restart." Runbooks are followed under pressure. Use numbered steps, not paragraphs.
- **Non-copyable commands** — Commands with placeholders like `<your-service>`, `$REPLACE_ME`, or `[insert name here]` force the operator to think and substitute under pressure. Define all variables in the Prerequisites section and use actual values in commands.
- **Missing escalation path** — A runbook without escalation criteria leaves the operator stranded when the procedure does not work. Every runbook must answer: "What do I do if this doesn't fix it?"
- **No verification step** — Completing a procedure without verifying success is dangerous. The operator must be able to confirm the system is healthy before walking away. Include health checks, metric thresholds, and monitoring duration.
- **Outdated commands** — Runbooks that reference decommissioned services, old endpoints, or deprecated CLI flags are worse than no runbook at all. Include a review cadence and last-updated date. Flag commands that depend on specific versions.
- **Missing rollback** — Any procedure that changes system state must include instructions to undo it. If a procedure is irreversible, that must be stated explicitly so the operator understands the risk before proceeding.
- **Assuming expertise** — Runbooks are often used by on-call engineers who did not build the system. Do not assume familiarity with internals. Explain what each step does and why, not just how.
More from generaljerel/chalk-skills
python-clean-architecture
Clean architecture patterns for Python services — service layer, repository pattern, domain models, dependency injection, error hierarchy, and testing strategy
22create-handoff
Generate a handoff document after implementation work is complete — summarizes changes, risks, and review focus areas for the review pipeline. Use when done coding and ready to hand off for review.
16create-review
Bootstrap a local AI review pipeline and generate a paste-ready review prompt for any reviewer agent. Use after creating a handoff or when ready to get an AI code review.
15fix-findings
Fix findings from the active review session — reads reviewer findings files, applies fixes by priority, and updates the resolution log. Use after pasting reviewer output into findings files.
15fix-review
When the user asks to fix, address, or work on PR review comments — fetch review comments from a GitHub pull request and apply fixes to the local codebase. Requires gh CLI.
15review-changes
End-to-end review pipeline — creates a handoff, generates a review (self-review or paste-ready for another provider), then offers to fix findings. Use when you want to review your changes before pushing.
13