error-recovery
Error Recovery Protocol
When an error occurs, stop, think, and pick the recovery strategy that fits the error. Do not retry blindly.
Core principle: Every error carries a signal. Read the signal first, then act.
Error Classification
Classify every error into one of 4 categories — the recovery strategy depends on the category:
Transient Error
Retrying usually fixes it. Infrastructure or network related.
- Examples: timeout, rate limit (429), connection drop, temporary service outage
- Strategy: Wait & Retry with exponential backoff
Configuration Error
Environment or setup issue. Code is correct but setup is wrong.
- Examples: missing env variable, wrong file path, permission denied, missing dependency
- Strategy: Fix & Continue — identify the issue, fix it, re-run
Logic Error
Code or approach is wrong. Retrying produces the same error.
- Examples: KeyError, TypeError, wrong algorithm, expectation mismatch
- Strategy: Alternative Approach — try a different method
Permanent / External Error
Out of control, cannot be fixed. External service or permission boundary.
- Examples: 403 Forbidden, 404 Not Found, quota exceeded, API deprecated
- Strategy: Escalation — inform the user, ask for direction
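The four categories and their strategies can be sketched as a small lookup. This is only an illustration under assumptions: the `Category` enum, the `classify` function, and the exception-to-category mapping are hypothetical names and choices that mirror the examples listed above, not part of the skill itself.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    # Each value is the recovery strategy named for that category above.
    TRANSIENT = "Wait & Retry"
    CONFIG = "Fix & Continue"
    LOGIC = "Alternative Approach"
    PERMANENT = "Escalation"


# Assumed mapping, taken from the examples given for each category.
_STATUS_CATEGORIES = {
    429: Category.TRANSIENT,
    403: Category.PERMANENT,
    404: Category.PERMANENT,
}

_EXCEPTION_CATEGORIES = (
    ((TimeoutError, ConnectionError), Category.TRANSIENT),
    ((FileNotFoundError, PermissionError, ModuleNotFoundError), Category.CONFIG),
    ((KeyError, TypeError), Category.LOGIC),
)


def classify(
    error: Optional[BaseException] = None, status: Optional[int] = None
) -> Optional[Category]:
    """Return the category for an HTTP status or exception, or None if unknown."""
    if status in _STATUS_CATEGORIES:
        return _STATUS_CATEGORIES[status]
    for exc_types, category in _EXCEPTION_CATEGORIES:
        if isinstance(error, exc_types):
            return category
    return None  # undetermined: the protocol treats this as a reason to escalate


print(classify(status=429))       # Category.TRANSIENT
print(classify(KeyError("id")))   # Category.LOGIC
```

A type-based lookup is only a first guess. For instance, a missing environment variable read through `os.environ[...]` raises `KeyError` even though it is a configuration problem, so the surrounding context still has to be read.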
Retry Strategy
For transient errors, use exponential backoff:
Attempt 1 fails: retry immediately
Attempt 2 fails: wait 2 seconds, then retry
Attempt 3 fails: wait 4 seconds, then retry
Attempt 4 fails: move on or escalate
Maximum retries: 3 (4 attempts including the original). If all 3 retries fail → re-evaluate the category.
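The schedule above could be implemented roughly as follows. This is a minimal sketch, assuming the three-retry cap means waits of 0, 2, and 4 seconds before the retries. The function name `retry_with_backoff`, the `delays` and `sleep` parameters, and the choice of which exceptions count as transient are all hypothetical.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    delays: Sequence[float] = (0, 2, 4),  # assumed waits before retries 1, 2, 3
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation; on a transient error, retry once per entry in delays."""
    try:
        return operation()  # original attempt
    except transient as exc:
        last_error = exc
    for delay in delays:
        sleep(delay)  # a delay of 0 means "retry immediately"
        try:
            return operation()
        except transient as exc:
            last_error = exc
    # Every retry failed: the caller should re-evaluate the category.
    raise last_error
```

Errors outside `transient` propagate at once, which keeps logic and configuration errors from being retried. Passing `sleep` as a parameter makes the schedule testable without real waiting.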
Rate limit (429) special rule:
- If the response has a Retry-After header, wait that duration
- Otherwise wait 60 seconds, then retry
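As a rough sketch of the rate-limit rule, the helper below computes the wait from a response's headers. The name `rate_limit_wait` and the `now` parameter are hypothetical. It also assumes the header may arrive either as a number of seconds or as an HTTP date, and that the header name is spelled `Retry-After` in the mapping (a plain `dict` lookup is case-sensitive).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

DEFAULT_RATE_LIMIT_WAIT = 60.0  # seconds, the fallback stated in the rule above


def rate_limit_wait(headers: Mapping[str, str], now: Optional[datetime] = None) -> float:
    """Return how many seconds to wait after a 429 response."""
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_RATE_LIMIT_WAIT
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay given directly in seconds
    try:
        retry_at = parsedate_to_datetime(value)  # delay given as an HTTP date
    except (TypeError, ValueError):
        return DEFAULT_RATE_LIMIT_WAIT  # unparseable header: use the fallback
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(rate_limit_wait({"Retry-After": "120"}))  # 120.0
print(rate_limit_wait({}))                      # 60.0
```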
Decision Tree
             Error received
                   |
           Classify the error
                   |
+--------------------------------------+
| Transient? -> Wait & Retry (max 3)   |
| Config?    -> Fix & Continue         |
| Logic?     -> Alternative approach   |
| Permanent? -> Escalation             |
+--------------------------------------+
                   |
   Every strategy fails -> Escalation
Escalation Protocol
Escalate to the user when:
- 3 retries failed
- Permanent / external error
- 2 different strategies failed in a row
- Error category cannot be determined
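The four escalation triggers can be checked in one place. The sketch below is illustrative only: `RecoveryState`, its field names, and `escalation_reason` are hypothetical, and the category is assumed to be tracked as a plain string (or `None` when it cannot be determined).

```python
from dataclasses import dataclass
from typing import Optional

MAX_RETRIES = 3
MAX_FAILED_STRATEGIES = 2


@dataclass
class RecoveryState:
    category: Optional[str]                 # None when classification failed
    failed_retries: int = 0                 # retries that have failed so far
    consecutive_failed_strategies: int = 0  # different strategies failed in a row


def escalation_reason(state: RecoveryState) -> Optional[str]:
    """Return why the error must go to the user, or None to keep recovering."""
    if state.category is None:
        return "Error category cannot be determined"
    if state.category == "Permanent":
        return "Permanent / external error"
    if state.failed_retries >= MAX_RETRIES:
        return f"{MAX_RETRIES} retries failed"
    if state.consecutive_failed_strategies >= MAX_FAILED_STRATEGIES:
        return f"{MAX_FAILED_STRATEGIES} consecutive different strategies failed"
    return None


print(escalation_reason(RecoveryState("Transient", failed_retries=3)))
# 3 retries failed
```

Returning the reason rather than a bare boolean lets the same string be reused when the escalation message is written.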
ERROR ESCALATION
================================
Failed step : [step name]
Error : [error message summary]
Category : [Transient / Config / Logic / Permanent]
Tried : [what was attempted — short list]
Result : All strategies exhausted
================================
Options:
A) [Alternative approach suggestion]
B) [Simpler / partial solution]
C) Skip this step, continue
D) Stop the task
Partial Success
For bulk operations where some items succeed and some fail:
PARTIAL SUCCESS
================================
Successful : N / Total
Failed : M items
================================
Failed items:
- [item]: [reason]
Options:
A) Retry only failed items
B) Continue with successful items, skip failed
C) Cancel all
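One way to collect the numbers for this report is to run each item independently and record failures instead of stopping at the first one. The sketch below is an assumption about how that might look: `BulkResult`, `run_bulk`, and `summary` are hypothetical names, and the output keeps only the core lines of the template.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class BulkResult:
    succeeded: List[object] = field(default_factory=list)
    failed: List[Tuple[object, str]] = field(default_factory=list)  # (item, reason)

    def summary(self) -> str:
        total = len(self.succeeded) + len(self.failed)
        lines = [
            "PARTIAL SUCCESS",
            f"Successful : {len(self.succeeded)} / {total}",
            f"Failed : {len(self.failed)} items",
            "Failed items:",
        ]
        lines += [f"- {item}: {reason}" for item, reason in self.failed]
        return "\n".join(lines)


def run_bulk(items: Iterable[object], action: Callable[[object], None]) -> BulkResult:
    """Apply action to every item, recording failures instead of aborting."""
    result = BulkResult()
    for item in items:
        try:
            action(item)
            result.succeeded.append(item)
        except Exception as exc:  # keep going so one bad item does not sink the rest
            result.failed.append((item, str(exc)))
    return result
```

Option A then amounts to calling `run_bulk` again with only the items in `result.failed`.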
Error Log
Log every error and recovery attempt:
[ERROR LOG]
Step : [step name / number]
Error : [message]
Category : [type]
Attempt 1: [strategy] -> [result]
Attempt 2: [strategy] -> [result]
Result : Recovered / Escalated
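The log entry above maps naturally onto a small record type. The following is a sketch under assumptions: `ErrorLogEntry`, `record`, and `render` are hypothetical names, and the entry is rendered as plain text in the same layout as the template.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorLogEntry:
    step: str
    error: str
    category: str
    attempts: List[Tuple[str, str]] = field(default_factory=list)  # (strategy, result)
    result: str = "Escalated"  # switched to "Recovered" once a strategy works

    def record(self, strategy: str, outcome: str, recovered: bool = False) -> None:
        """Append one recovery attempt; mark the entry recovered if it worked."""
        self.attempts.append((strategy, outcome))
        if recovered:
            self.result = "Recovered"

    def render(self) -> str:
        lines = [
            "[ERROR LOG]",
            f"Step : {self.step}",
            f"Error : {self.error}",
            f"Category : {self.category}",
        ]
        for number, (strategy, outcome) in enumerate(self.attempts, start=1):
            lines.append(f"Attempt {number}: {strategy} -> {outcome}")
        lines.append(f"Result : {self.result}")
        return "\n".join(lines)


entry = ErrorLogEntry(step="fetch data", error="timeout", category="Transient")
entry.record("Wait & Retry", "timeout again")
entry.record("Wait & Retry", "succeeded", recovered=True)
print(entry.render())
```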
When to Skip
- Error is expected behavior (e.g., "file not found" when checking existence)
- User said "ignore errors, continue"
- One-off, non-repeatable task
Guardrails
- Never blind-retry a logic error — retrying won't help, change the approach.
- Always log every attempt — even successful recoveries need a record.
- Cross-skill: integrates with checkpoint-guardian (risk assessment before retry), memory-ledger (logs errors and fixes), and agent-reviewer (retrospective analysis).