Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

Building fault-tolerant multi-agent systems
Implementing LLM API integrations with proper error handling
Designing distributed workflows that need graceful degradation
Adding observability to failure scenarios
Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+

Key Configuration:

failure_threshold: Failures before opening (default: 5)
recovery_timeout: Seconds before attempting recovery (default: 30)
half_open_requests: Probes to allow in half-open (default: 1)
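
The state machine and the three configuration keys above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the contents of scripts/circuit-breaker.py: the class name, `CircuitOpenError`, and the injectable `clock` argument are assumptions made for the example, and the circuit closes on the first successful probe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            # Recovery timeout expired: start probing.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a successful probe, closes the circuit.
        self.state = State.CLOSED
        self._failures = 0

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

Callers would typically catch `CircuitOpenError` and return a fallback, which is the "reject immediately, return fallback" behaviour of the OPEN state.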

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 3        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

Tier          Workers  Queue  Timeout  Use Case
1 (Critical)  5        10     300s     Synthesis, quality gate
2 (Standard)  3        5      120s     Content analysis agents
3 (Optional)  2        3      60s      Enrichment, caching
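
One way to express these tiers is a semaphore per tier plus a bound on how many callers may wait. The sketch below uses asyncio and is only an illustration of the idea: `TierConfig`, `Bulkhead`, and `BulkheadFullError` are hypothetical names, not the API of scripts/bulkhead.py, and the timeout here covers execution only, not time spent waiting in the queue.

```python
import asyncio
from dataclasses import dataclass


class BulkheadFullError(Exception):
    """Raised when both the worker slots and the queue are full."""


@dataclass(frozen=True)
class TierConfig:
    workers: int
    queue: int
    timeout: float  # seconds


# Values taken from the tier table above.
TIERS = {
    1: TierConfig(workers=5, queue=10, timeout=300.0),
    2: TierConfig(workers=3, queue=5, timeout=120.0),
    3: TierConfig(workers=2, queue=3, timeout=60.0),
}


class Bulkhead:
    def __init__(self, config):
        self.config = config
        self._slots = asyncio.Semaphore(config.workers)
        self._in_flight = 0  # running + waiting callers

    async def run(self, make_coro):
        """Run make_coro() inside this tier, or reject if the tier is full.

        make_coro is a zero-argument callable returning a coroutine, so no
        coroutine is created for a rejected request.
        """
        if self._in_flight >= self.config.workers + self.config.queue:
            raise BulkheadFullError("tier is full")
        self._in_flight += 1
        try:
            async with self._slots:  # waits here while all workers are busy
                return await asyncio.wait_for(make_coro(), self.config.timeout)
        finally:
            self._in_flight -= 1
```

Because each tier owns its own semaphore, a saturated Tier 3 rejects or queues its own requests without taking slots away from Tier 1.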

3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.

+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^(attempt-1), max_delay) * jitter  |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
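
The delay calculation in the diagram can be sketched as a small function. Attempts are counted from 1 here, so the exponent is `attempt - 1` and the waits come out as roughly 1s, 2s, 4s as shown. The 60-second `max_delay` default, the `retry` helper, and the injectable `rng` and `sleep` parameters are assumptions for illustration, not the contents of scripts/retry-handler.py.

```python
import random
import time


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.random):
    """Delay in seconds before the retry that follows failed attempt `attempt` (1-indexed)."""
    if attempt < 1:
        raise ValueError("attempt is 1-indexed")
    capped = min(base * 2 ** (attempt - 1), max_delay)
    jitter = 0.5 + rng()  # uniform in [0.5, 1.5)
    return capped * jitter


def retry(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Call func() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

A real handler would also check whether the error is retryable before sleeping; this sketch retries on any exception to keep the backoff logic visible.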

Error Classification for Retries:

RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retryable only after truncating the input
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
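
A lookup over these categories might look like the sketch below. It splits the mixed sets by type (status codes, exception classes, provider error codes) so each can be checked directly. The `is_retryable` function, its keyword arguments, and the rule that a provider error code takes precedence over the HTTP status are assumptions for illustration; unknown errors are treated as non-retryable.

```python
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)
RETRYABLE_CODES = frozenset({
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # caller must truncate before retrying
})

NON_RETRYABLE_STATUS = frozenset({400, 401, 403, 404})
NON_RETRYABLE_CODES = frozenset({
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
})


def is_retryable(*, status=None, code=None, exc=None):
    """Return True if the failure is worth retrying.

    The provider error code is checked first because it is more specific
    than the HTTP status (an assumption of this sketch).
    """
    if code in NON_RETRYABLE_CODES:
        return False
    if code in RETRYABLE_CODES:
        return True
    if status in NON_RETRYABLE_STATUS:
        return False
    if status in RETRYABLE_STATUS:
        return True
    # Fall back to the exception type; unknown errors are not retried.
    return isinstance(exc, RETRYABLE_EXCEPTIONS)
```

For example, `is_retryable(status=429)` is true, while `is_retryable(status=500, code="invalid_api_key")` is false because the error code wins.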

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-5-20251101 (primary)                         |
|   2. gpt-5.2-mini (fallback)                                      |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
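
The chain above can be sketched as a loop over model callables followed by a cache lookup and a default. Everything named here (`ChainResult`, `run_fallback_chain`, the `(name, callable)` pairs, `cache_lookup`) is hypothetical; the model callables stand in for whatever client code actually calls each provider, and no real SDK is assumed.

```python
from dataclasses import dataclass


@dataclass
class ChainResult:
    source: str            # which step produced the answer
    text: str
    degraded: bool = False  # True when only the default response was available


def run_fallback_chain(prompt, models, cache_lookup,
                       default_text="Analysis unavailable"):
    """Try each model in order, then the cache, then a default response.

    models: list of (name, callable) pairs; each callable takes the prompt
            and returns text or raises on failure.
    cache_lookup: callable returning cached text or None on a miss.
    """
    errors = []
    for name, call_model in models:
        try:
            return ChainResult(source=name, text=call_model(prompt))
        except Exception as exc:
            errors.append((name, exc))  # kept for logging/metrics by the caller's tooling
    cached = cache_lookup(prompt)
    if cached is not None:
        return ChainResult(source="cache", text=cached)
    return ChainResult(source="default", text=default_text, degraded=True)
```

Returning the `source` alongside the text lets callers record how often requests end up on the fallback model, the cache, or the default.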

Token Budget Management:

+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
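
Two of the strategies above, dropping low-priority content and failing fast, can be sketched as follows. The function names, the `(priority, text)` representation, and the 4-characters-per-token estimate are assumptions for illustration only; a real guard would use the target model's tokenizer and is not necessarily how scripts/token-budget.py works. Summarizing and splitting are left out.

```python
class ContentTooLargeError(Exception):
    """Raised when required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    # Replace with the target model's real tokenizer in practice.
    return max(1, len(text) // 4)


def fit_to_budget(sections, budget, count=estimate_tokens):
    """Return the section texts that fit within `budget` tokens.

    `sections` is a list of (priority, text) pairs. Priority 0 is required;
    higher numbers are dropped first. Original order is preserved.
    """
    kept = dict(enumerate(sections))
    total = sum(count(text) for _, text in sections)
    # Visit candidates from least to most important.
    for index in sorted(kept, key=lambda i: sections[i][0], reverse=True):
        if total <= budget:
            break
        priority, text = sections[index]
        if priority == 0:
            break  # only required content is left
        del kept[index]
        total -= count(text)
    if total > budget:
        raise ContentTooLargeError(
            f"content too large: {total} > {budget} tokens"
        )
    return [text for _, text in kept.values()]
```

Raising a dedicated error keeps the "fail fast" case distinct from provider errors, so callers can decide whether to split the request instead.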

Quick Reference

Pattern          When to Use               Key Benefit
Circuit Breaker  External service calls    Prevent cascade failures
Bulkhead         Multi-tenant/multi-agent  Isolate failures
Retry + Backoff  Transient failures        Automatic recovery
Fallback Chain   Critical operations       Graceful degradation
Token Budget     LLM calls                 Cost control, prevent failures

OrchestKit Integration Points

Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
LLM Calls: All model invocations use fallback chain + retry logic
External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

references/circuit-breaker.md - Deep dive on circuit breaker pattern
references/bulkhead-pattern.md - Bulkhead isolation strategies
references/retry-strategies.md - Retry algorithms and error classification
references/llm-resilience.md - LLM-specific patterns
references/error-classification.md - How to categorize errors

Templates (Code Patterns)

scripts/circuit-breaker.py - Ready-to-use circuit breaker class
scripts/bulkhead.py - Semaphore-based bulkhead implementation
scripts/retry-handler.py - Configurable retry decorator
scripts/llm-fallback-chain.py - Multi-model fallback pattern
scripts/token-budget.py - Token budget guard implementation

Examples

examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example

Checklists

checklists/pre-deployment-resilience.md - Production readiness checklist
checklists/circuit-breaker-setup.md - Circuit breaker configuration guide

2026 Best Practices

Adaptive Thresholds: Use sliding windows, not fixed counters
Observability First: Every circuit trip = alert + metric + trace
Graceful Degradation: Always have a fallback, even if partial
Health Endpoints: Separate health check from circuit state
Chaos Testing: Regularly test failure scenarios in staging
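
As an illustration of the sliding-window idea in the first item, the sketch below tracks call outcomes over a time window and reports a failure rate instead of a running failure count. The class name, the 60-second window, the `min_calls` guard, and the 0.5 trip threshold are all assumptions chosen for the example.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # avoid tripping on a handful of calls
        self._clock = clock         # injectable so tests do not need to sleep
        self._events = deque()      # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self):
        """Fraction of failed calls in the window, or 0.0 with too little data."""
        self._evict()
        if len(self._events) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_trip(self, threshold=0.5):
        return self.failure_rate() >= threshold
```

Unlike a fixed counter, old failures age out on their own, so a brief burst of errors an hour ago does not count toward tripping the circuit now.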

Related Skills

observability-monitoring - Metrics and alerting for circuit breaker state changes
caching-strategies - Cache as fallback layer in degradation scenarios
error-handling-rfc9457 - Structured error responses for resilience failures
background-jobs - Async processing with retry and failure handling

Key Decisions

Decision                  Choice                        Rationale
Circuit breaker recovery  Half-open probe               Gradual recovery, prevents immediate re-failure
Retry algorithm           Exponential backoff + jitter  Prevents thundering herd, respects rate limits
Bulkhead isolation        Semaphore-based tiers         Simple, efficient, prioritizes critical operations
LLM fallback              Model chain with cache        Graceful degradation, cost optimization, availability

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

Prevent cascade failures when external services fail
Automatically recover when services come back online
Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

Isolate failures to prevent entire system crashes
Prioritize critical operations over optional ones
Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

Handle transient failures automatically
Avoid overwhelming recovering services
Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

Handle LLM API rate limits gracefully
Fall back to alternative models when primary fails
Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

Determine which errors should be retried
Categorize errors by severity and recoverability
Map HTTP status codes to resilience actions
