damage-control
SKILL.md
Table of Contents
- Overview
- When to Use
- Error Categories
- Escalation Ladder
- Quick Reference
- Module Reference
- Integration Pattern
- Exit Criteria
Damage Control
Overview
Adapts service-level error patterns from leyline:error-patterns to agent-level recovery in multi-agent coordination scenarios. Provides structured playbooks for the four most common agent failure modes.
When To Use
- Multi-agent coordination encounters agent failures
- Context windows overflow during complex tasks
- Merge conflicts arise from parallel agent work
- Partial failures need triage and salvage
- Orchestrators need standardized recovery procedures
When NOT To Use
- Single-agent work (use
leyline:error-patternsdirectly) - Service-level errors without agent coordination (API failures, rate limits)
- Simple retry scenarios handled by existing error patterns
Error Categories
Agent-level error categories extend the service-level taxonomy from leyline:error-patterns:
| Agent Category | Service Equivalent | Severity | Recovery |
|---|---|---|---|
| AGENT_CRASH | PERMANENT | Critical | Replace agent, reassign tasks |
| CONTEXT_OVERFLOW | RESOURCE | Error | Graceful handoff, context shed |
| MERGE_CONFLICT | TRANSIENT | Warning | Stop, resolve, resume |
| PARTIAL_FAILURE | Mixed | Varies | Triage by sub-category |
Escalation Ladder
Three-tier escalation with clear handoff points:
Tier 1: Agent Self-Recovery
Agent detects issue, attempts automated recovery
(retry, context shed, conflict resolution)
|
| Fails after 2 attempts
v
Tier 2: Lead Agent Intervention
Lead reassigns tasks, spawns replacement agents,
coordinates resolution across team
|
| Cannot resolve or CRITICAL severity
v
Tier 3: Human Escalation
Human notified with full context, recommended actions,
and current state summary
Quick Reference
| Scenario | Module | First Action |
|---|---|---|
| Agent stops responding | agent-crash-recovery |
Check heartbeat, reassign tasks |
| Response truncation | context-overflow |
Summarize state, create continuation |
| File conflicts after parallel work | merge-conflict-resolution |
Stop agents, lead resolves |
| Some tasks fail, others succeed | partial-failure-handling |
Triage by error category |
Module Reference
- agent-crash-recovery.md: Orphaned task detection, "replace don't wait" doctrine, state recovery
- context-overflow.md: Detection signals, graceful handoff, progressive context shedding
- merge-conflict-resolution.md: Post-execution conflict detection, resolution protocol, prevention
- partial-failure-handling.md: Error triage, per-category recovery, salvage protocol
Integration Pattern
# In your skill's frontmatter
dependencies: [leyline:damage-control]
Consumers should reference specific modules for targeted recovery:
# In orchestrator or coordination skill
On agent crash: follow `leyline:damage-control/modules/agent-crash-recovery.md`
On context overflow: follow `leyline:damage-control/modules/context-overflow.md`
Exit Criteria
- Failed agent identified and replaced (or escalated)
- Orphaned tasks reassigned to healthy agents
- Successful work preserved and committed
- Recovery actions logged for post-mortem
Weekly Installs
1
Repository
athola/claude-n…t-marketGitHub Stars
214
First Seen
Feb 27, 2026
Security Audits
Installed on
amp1
cline1
opencode1
cursor1
continue1
kimi-cli1