Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

  • Building fault-tolerant multi-agent systems
  • Implementing LLM API integrations with proper error handling
  • Designing distributed workflows that need graceful degradation
  • Adding observability to failure scenarios
  • Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe; success -> CLOSED, failure -> OPEN      |
|                                                                   |
+-------------------------------------------------------------------+

Key Configuration:

  • failure_threshold: Failures before opening (default: 5)
  • recovery_timeout: Seconds before attempting recovery (default: 30)
  • half_open_requests: Probes to allow in half-open (default: 1)
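
The state machine and the three settings above can be sketched as a small class. This is a minimal, single-threaded illustration and not the contents of scripts/circuit-breaker.py: the names `CircuitBreaker`, `CircuitOpenError`, `State`, and the injectable `clock` parameter are assumptions made for the example, and a production version would need locking around the state changes.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the wrapped function."""


class CircuitBreaker:
    # Hypothetical class; defaults mirror the Key Configuration list.
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            # Reject until the recovery timeout has elapsed, then start probing.
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

Callers would typically catch `CircuitOpenError` and return a fallback value, which corresponds to the "Reject immediately, return fallback" behavior of the OPEN state.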

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |   (empty) = Available slot                 |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 3        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

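| Tier         | Workers | Queue | Timeout | Use Case                |
|--------------|---------|-------|---------|-------------------------|
| 1 (Critical) | 5       | 10    | 300s    | Synthesis, quality gate |
| 2 (Standard) | 3       | 5     | 120s    | Content analysis agents |
| 3 (Optional) | 2       | 3     | 60s     | Enrichment, caching     |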
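
One way to express these tiers is an asyncio semaphore per tier plus a cap on how many calls may wait. The sketch below is illustrative only and is not the contents of scripts/bulkhead.py: `TierConfig`, `Bulkhead`, `BulkheadFullError`, and `TIERS` are hypothetical names, and it assumes the timeout covers execution time only, not time spent waiting in the queue.

```python
import asyncio
from dataclasses import dataclass


class BulkheadFullError(Exception):
    """Raised when a tier has no free worker slot and no queue space left."""


@dataclass(frozen=True)
class TierConfig:
    workers: int    # concurrent executions allowed
    queue: int      # additional callers allowed to wait
    timeout: float  # seconds allowed per execution


# Values copied from the tier configuration table above.
TIERS = {
    1: TierConfig(workers=5, queue=10, timeout=300.0),
    2: TierConfig(workers=3, queue=5, timeout=120.0),
    3: TierConfig(workers=2, queue=3, timeout=60.0),
}


class Bulkhead:
    # Create instances inside a running event loop (e.g. within a coroutine).
    def __init__(self, config):
        self.config = config
        self._slots = asyncio.Semaphore(config.workers)
        self._in_flight = 0  # running + queued calls

    async def run(self, make_coro):
        """Run the coroutine returned by make_coro() inside this tier."""
        cfg = self.config
        if self._in_flight >= cfg.workers + cfg.queue:
            raise BulkheadFullError("tier at capacity")
        self._in_flight += 1
        try:
            async with self._slots:  # waits here if all workers are busy
                return await asyncio.wait_for(make_coro(), cfg.timeout)
        finally:
            self._in_flight -= 1
```

Because each tier owns its own semaphore, a saturated Tier 3 rejects new optional work without taking slots away from Tier 1.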

3. Retry Strategies (reference: retry-strategies.md)

Retries transient failures with exponential backoff and jitter, and skips errors that cannot succeed on a retry.

+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^(attempt-1), max_delay) * jitter  |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
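
The delay calculation can be sketched as a single function. It assumes attempts are numbered from 1, so that with a 1-second base the nominal waits are 1s, 2s, and 4s as in the diagram. The function name `backoff_delay`, the 60-second `max_delay` default, and the injectable `rng` parameter are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (1-indexed)."""
    if attempt < 1:
        raise ValueError("attempt is 1-indexed")
    # Exponential growth, capped so late attempts do not wait unboundedly.
    capped = min(base * 2 ** (attempt - 1), max_delay)
    # Multiplicative jitter, uniform in [0.5, 1.5), spreads out simultaneous retries.
    jitter = 0.5 + rng()
    return capped * jitter
```

Applying the cap before the jitter means the longest possible wait is 1.5 times `max_delay`; if a hard ceiling matters, apply `min` again after multiplying.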

Error Classification for Retries:

RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry only after truncating the input
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
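
A retry loop might consult these sets as sketched below. The `ApiError` exception type, the `classify` and `call_with_retry` helpers, and the three action labels are hypothetical; real client libraries expose status codes and error codes under their own attribute names. The sketch treats `context_length_exceeded` as its own action, since retrying the unchanged request would fail again, and treats unknown errors as non-retryable, which is a conservative assumption.

```python
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS = {400, 401, 403, 404}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded"}
NON_RETRYABLE_CODES = {"invalid_api_key", "content_policy_violation",
                       "invalid_request_error"}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, status=None, code=None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def classify(exc):
    """Return 'retry', 'truncate_and_retry', or 'fail' for an exception."""
    status = getattr(exc, "status", None)
    code = getattr(exc, "code", None)
    # Checked first: providers may report this code alongside a 400 status.
    if code == "context_length_exceeded":
        return "truncate_and_retry"
    if status in NON_RETRYABLE_STATUS or code in NON_RETRYABLE_CODES:
        return "fail"
    if status in RETRYABLE_STATUS or code in RETRYABLE_CODES:
        return "retry"
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return "retry"
    return "fail"  # unknown errors are not retried in this sketch


def call_with_retry(fn, max_attempts=4, sleep=time.sleep,
                    delay=lambda attempt: 2.0 ** (attempt - 1)):
    """Call fn(), retrying only errors that classify() marks as 'retry'."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or classify(exc) != "retry":
                raise
            sleep(delay(attempt))
```

The default `delay` here omits jitter to keep the example short; in practice it would be replaced by a jittered backoff function.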

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-5-20250929 (primary)                         |
|   2. gpt-5.2-mini (fallback)                                      |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
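
The chain above can be sketched as a loop over model callables followed by a cache lookup and a default. This is an illustration under stated assumptions, not the contents of scripts/llm-fallback-chain.py: `run_fallback_chain` is a hypothetical helper, each model is represented by a plain callable rather than a real SDK client, and an exact-match dictionary stands in for the semantic cache.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def run_fallback_chain(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    cache: Dict[str, str],
    default: str = "Analysis unavailable",
) -> dict:
    """Try each model in order, then the cache, then a default response."""
    errors: List[str] = []
    for name, call in models:
        try:
            text = call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next model in the chain.
            errors.append(f"{name}: {exc}")
            continue
        cache[prompt] = text  # remember successes for later degraded lookups
        return {"source": name, "text": text, "degraded": False, "errors": errors}
    cached = cache.get(prompt)
    if cached is not None:
        return {"source": "cache", "text": cached, "degraded": True, "errors": errors}
    return {"source": "default", "text": default, "degraded": True, "errors": errors}
```

Returning the `source` and `degraded` fields alongside the text lets callers and dashboards distinguish a primary answer from a cached or default one.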

Token Budget Management:

+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
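
Strategies 2 and 4 from the diagram (drop low-priority content, then fail fast) can be sketched without calling a model. The sketch is not the contents of scripts/token-budget.py: `fit_to_budget`, `estimate_tokens`, `ContentTooLargeError`, and the `reserve_for_output` parameter are hypothetical, and the four-characters-per-token estimate is a rough assumption that should be replaced by the tokenizer of the model in use.

```python
class ContentTooLargeError(Exception):
    """Raised when required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Rough heuristic only; real tokenizers give different counts.
    return max(1, len(text) // 4)


def fit_to_budget(sections, context_limit=16_000, reserve_for_output=2_000,
                  count=estimate_tokens):
    """Return section texts that fit the budget.

    sections: list of (text, required) pairs in prompt order. Optional
    sections are dropped from the end until the total fits.
    """
    budget = context_limit - reserve_for_output
    kept = list(sections)
    total = sum(count(text) for text, _ in kept)
    for i in range(len(kept) - 1, -1, -1):
        if total <= budget:
            break
        text, required = kept[i]
        if not required:
            total -= count(text)
            del kept[i]
    if total > budget:
        # Nothing optional is left to drop, so fail before calling the model.
        raise ContentTooLargeError(f"{total} tokens exceeds budget of {budget}")
    return [text for text, _ in kept]
```

Summarizing earlier context and splitting into multiple requests (strategies 1 and 3) would slot in before the final error, but both depend on application-specific logic and are left out here.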

Quick Reference

| Pattern         | When to Use              | Key Benefit                    |
|-----------------|--------------------------|--------------------------------|
| Circuit Breaker | External service calls   | Prevent cascade failures       |
| Bulkhead        | Multi-tenant/multi-agent | Isolate failures               |
| Retry + Backoff | Transient failures       | Automatic recovery             |
| Fallback Chain  | Critical operations      | Graceful degradation           |
| Token Budget    | LLM calls                | Cost control, prevent failures |

OrchestKit Integration Points

  1. Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
  2. LLM Calls: All model invocations use fallback chain + retry logic
  3. External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
  4. Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

  • references/circuit-breaker.md - Deep dive on circuit breaker pattern
  • references/bulkhead-pattern.md - Bulkhead isolation strategies
  • references/retry-strategies.md - Retry algorithms and error classification
  • references/llm-resilience.md - LLM-specific patterns
  • references/error-classification.md - How to categorize errors

Templates (Code Patterns)

  • scripts/circuit-breaker.py - Ready-to-use circuit breaker class
  • scripts/bulkhead.py - Semaphore-based bulkhead implementation
  • scripts/retry-handler.py - Configurable retry decorator
  • scripts/llm-fallback-chain.py - Multi-model fallback pattern
  • scripts/token-budget.py - Token budget guard implementation

Examples

  • examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example

Checklists

  • checklists/pre-deployment-resilience.md - Production readiness checklist
  • checklists/circuit-breaker-setup.md - Circuit breaker configuration guide

2026 Best Practices

  1. Adaptive Thresholds: Use sliding windows, not fixed counters
  2. Observability First: Every circuit trip = alert + metric + trace
  3. Graceful Degradation: Always have a fallback, even if partial
  4. Health Endpoints: Separate health check from circuit state
  5. Chaos Testing: Regularly test failure scenarios in staging
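
To illustrate the first practice, a sliding-window check can replace a fixed failure counter so that old failures age out on their own. The class below is a hypothetical sketch: the name `SlidingWindowFailureRate`, the 60-second window, the minimum call count, and the 50% threshold are all assumptions chosen for the example.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10,
                 failure_rate_threshold=0.5, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.failure_rate_threshold = failure_rate_threshold
        self._clock = clock
        self._events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, ok):
        """Record the outcome of one call."""
        now = self._clock()
        self._events.append((now, ok))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_trip(self):
        """True when the recent failure rate reaches the threshold."""
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_calls:
            return False  # too few samples to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.failure_rate_threshold
```

The `min_calls` guard keeps a single failure on a quiet service from tripping the circuit.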

Related Skills

  • observability-monitoring - Metrics and alerting for circuit breaker state changes
  • caching-strategies - Cache as fallback layer in degradation scenarios
  • error-handling-rfc9457 - Structured error responses for resilience failures
  • background-jobs - Async processing with retry and failure handling

Key Decisions

| Decision                 | Choice                       | Rationale                                             |
|--------------------------|------------------------------|-------------------------------------------------------|
| Circuit breaker recovery | Half-open probe              | Gradual recovery, prevents immediate re-failure       |
| Retry algorithm          | Exponential backoff + jitter | Prevents thundering herd, respects rate limits        |
| Bulkhead isolation       | Semaphore-based tiers        | Simple, efficient, prioritizes critical operations    |
| LLM fallback             | Model chain with cache       | Graceful degradation, cost optimization, availability |

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

  • Prevent cascade failures when external services fail
  • Automatically recover when services come back online
  • Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

  • Isolate failures to prevent entire system crashes
  • Prioritize critical operations over optional ones
  • Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

  • Handle transient failures automatically
  • Avoid overwhelming recovering services
  • Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

  • Handle LLM API rate limits gracefully
  • Fall back to alternative models when primary fails
  • Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

  • Determine which errors should be retried
  • Categorize errors by severity and recoverability
  • Map HTTP status codes to resilience actions