# Production Hardening
Scan a codebase for resilience anti-patterns, generate a prioritized hardening playbook, and optionally implement the fixes.
Announce at start: "I'm using the production-hardening skill."
## When to Use
- Before promoting a service to production
- When asked to "harden", "add resilience", or "make production-ready"
- When a post-incident review reveals missing retry/breaker/timeout patterns
- When migrating from prototype to production-grade code
## When NOT to Use

- Security vulnerability scanning → use `security-auditor`
- Performance benchmarking / load testing
- Feature development unrelated to resilience
## Modes
| Mode | Trigger | Output |
|---|---|---|
| Analyze | "audit resilience", "review for production" | .opencode/docs/HARDENING-PLAYBOOK.md |
| Implement | "make it resilient", "add retries to X", "harden this" | Code changes + updated playbook |
| Targeted | "add circuit breaker to the EDNA client" | Scoped code change |
Default to Analyze if intent is ambiguous. Always ask before switching to Implement mode on a large codebase.
## Phase 1: Discovery
Gather codebase context before scanning. Run these in parallel where possible:
### 1.1 Detect stack
- Language/runtime: package.json (Node), requirements.txt / pyproject.toml (Python), go.mod (Go), Cargo.toml (Rust)
- IaC framework: sst.config.ts (SST), serverless.yml, cdk.json, terraform/, sam/template.yaml
- Cloud provider: AWS (Lambda, SQS, EventBridge, DynamoDB), GCP, Azure
- Deployment target: Lambda, ECS/Fargate, Kubernetes, bare EC2
### 1.2 Map external dependencies

Search for all outbound network calls. Patterns to find:
- axios.get/post/put/delete/patch, fetch(), http.request(), got(), ky()
- new DynamoDBClient, new SQSClient, new EventBridgeClient (AWS SDK)
- gRPC/protobuf clients
- database drivers: pg, mysql2, mongoose, prisma, drizzle, typeorm
- message queue producers/consumers
- Any URL string in env vars or config
Classify each dependency:
| Category | Examples |
|---|---|
| Internal APIs | Other microservices owned by the same team |
| External APIs | Third-party REST/SOAP/GraphQL (EDNA, PeopleSoft, vendor APIs) |
| Databases | DynamoDB, RDS, Redis, Mongo |
| Event buses | EventBridge, SQS, SNS, Kafka |
| Auth providers | JWKS endpoints, OAuth token endpoints, SAML IdPs |
| ML/AI services | SageMaker, Bedrock, self-hosted model endpoints |
### 1.3 Map error handling

Search for error suppression patterns. Anti-patterns to flag (see the corrected example after this list):

- catch blocks that log but don't rethrow or return typed errors
- `catch (e) { return null }` or `catch (e) { return defaultValue }`
- `Promise.all` without fallback for partial failures
- `.catch(() => {})` empty catch
- try/catch wrapping the entire function body with a generic error
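A minimal sketch of the most damaging pattern here, the silent swallow, next to one way to fix it. `fetchProfile`, `logger`, and `UpstreamError` are hypothetical stand-ins, not names from any scanned codebase:

```ts
// Illustrative only: names below are placeholders for demonstration.
class UpstreamError extends Error {
  constructor(readonly dependency: string, opts?: { cause?: unknown }) {
    super(`upstream failure: ${dependency}`, opts);
  }
}
declare const logger: { error(msg: string, ctx?: object): void };
declare function fetchProfile(id: string): Promise<{ id: string; name: string }>;

// Anti-pattern: the catch converts an outage into fake success (null).
async function getProfileBad(id: string) {
  try {
    return await fetchProfile(id);
  } catch {
    return null; // caller cannot distinguish "no profile" from "dependency down"
  }
}

// Hardened: log with context and rethrow a typed error the caller can classify.
async function getProfile(id: string) {
  try {
    return await fetchProfile(id);
  } catch (e) {
    logger.error('profile fetch failed', { id, error: e });
    throw new UpstreamError('profile-service', { cause: e });
  }
}
```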
### 1.4 Identify existing resilience patterns

Check whether the codebase already uses any resilience libraries. Libraries to detect:
- cockatiel (retry, breaker, timeout, bulkhead, wrap)
- opossum (circuit breaker)
- p-retry, p-timeout, p-queue, p-limit
- @aws-lambda-powertools/* (batch, idempotency, parameters, logger)
- axios-retry
- got (built-in retry)
- resilience4j (Java)
- polly (C#)
- Custom implementations: search for "retry", "circuit", "breaker", "backoff", "jitter"
## Phase 2: Anti-Pattern Scan
For each external dependency found in Phase 1, check against the anti-pattern catalog in `config/anti-patterns.yaml`. Read that file before scanning.
### Scan checklist (per dependency call)
| # | Check | Severity | What to look for |
|---|---|---|---|
| 1 | No timeout | CRITICAL | HTTP calls without explicit timeout; default socket hang |
| 2 | No retry | CRITICAL | Single-shot calls to flaky externals; no backoff |
| 3 | Retry without jitter | HIGH | Fixed-interval retry causing thundering herd |
| 4 | Retry without max | HIGH | Unbounded retries burning Lambda duration |
| 5 | No circuit breaker | HIGH | Repeated calls to a known-down dependency |
| 6 | Silent error swallow | CRITICAL | catch that returns fake success / default value |
| 7 | Wrong HTTP status on infra failure | CRITICAL | Network timeout → 401/403 instead of 503 |
| 8 | No DLQ | HIGH | SQS queue without dead-letter queue configured |
| 9 | DLQ without alarm | MEDIUM | DLQ exists but no CloudWatch alarm on depth > 0 |
| 10 | DLQ without redrive | MEDIUM | No operational procedure to redrive DLQ messages |
| 11 | No idempotency | HIGH | Consumer lacks idempotency key; retries cause duplicates |
| 12 | No event replay | MEDIUM | EventBridge without Archive enabled |
| 13 | Unbounded concurrency | MEDIUM | Lambda without reserved concurrency; SQS without maxConcurrency |
| 14 | Missing structured logging | MEDIUM | console.log instead of structured logger |
| 15 | No health probes | MEDIUM | No mechanism to check dependency health before routing |
| 16 | Shared mutable state | HIGH | In-process breaker state in Lambda (cold start resets) |
### Severity classification
- CRITICAL: Will cause user-visible outage or data loss in production
- HIGH: Will cause degraded service or incorrect behavior under failure
- MEDIUM: Missing operational capability; increases MTTR
## Phase 3: Report (Analyze Mode)

Generate `.opencode/docs/HARDENING-PLAYBOOK.md` with this structure:

```markdown
# Production Hardening Playbook — {service-name}
Generated: {date}
Stack: {language} / {iac} / {cloud}
## Audit Summary
| Area | Status | Findings |
|------|--------|----------|
| Timeouts | ❌ / ⚠️ / ✅ | Count |
| Retries | ... | ... |
| Circuit breakers | ... | ... |
| Error handling | ... | ... |
| DLQ / dead letters | ... | ... |
| Idempotency | ... | ... |
| Observability | ... | ... |
| Concurrency limits | ... | ... |
## Findings
### CRITICAL
1. [Finding title]
- **File**: path/to/file.ts#L42
- **Anti-pattern**: [from catalog]
- **Impact**: [what breaks in production]
- **Fix**: [specific code change]
### HIGH
...
### MEDIUM
...
## Recommended Architecture
[Describe the resilience layer architecture: shared package, policy composition,
error taxonomy, state management approach]
## Recommended Libraries
[From config/libraries.yaml — validated, production-grade libraries only]
## Prioritized Implementation Plan
| # | Ticket | Priority | Effort | Depends On |
|---|--------|----------|--------|------------|
| 1 | ... | P0 | S/M/L | — |
## References
```
## Phase 4: Implement (Implement Mode)
When the user asks to implement, follow this order:
### 4.1 Error taxonomy first

Create a typed error hierarchy that classifies failures:

- `TransientError` → retry (network timeout, 429, 502, 503, 504)
- `DependencyDownError` → circuit breaker (repeated transient failures)
- `PermanentError` → fail fast, no retry (400, 404, 422, auth failure)

**Key rule**: infrastructure failures (DNS, socket, TLS, JWKS timeout) must NEVER map to client auth errors (401/403). They must map to 503. A minimal sketch follows.
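A sketch of this hierarchy, assuming plain TypeScript classes; the `httpStatus` field and the `classify` helper are illustrative assumptions, not prescribed by the catalog:

```ts
// Illustrative taxonomy sketch. Class names match the list above.
export abstract class ResilienceError extends Error {
  abstract readonly httpStatus: number; // status the *service* should surface
}

export class TransientError extends ResilienceError {
  readonly httpStatus = 503; // retryable; never reported as a client error
}

export class DependencyDownError extends ResilienceError {
  readonly httpStatus = 503; // breaker open: dependency known to be down
}

export class PermanentError extends ResilienceError {
  constructor(message: string, readonly httpStatus: number = 400) {
    super(message);
  }
}

// Classify an upstream result. No HTTP status at all (DNS, socket, TLS,
// JWKS timeout) is an infrastructure failure → 503, never 401/403.
export function classify(status: number | undefined, message: string): ResilienceError {
  if (status === undefined) return new TransientError(message);
  if (status === 429 || status === 502 || status === 503 || status === 504) {
    return new TransientError(message);
  }
  return new PermanentError(message, status);
}
```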
### 4.2 Resilience policy layer

Create a shared package/module containing composable policies.

Node.js/TypeScript preferred stack (an SQS batch example follows the table):
| Library | Purpose | Why |
|---|---|---|
| cockatiel v3.2+ | retry, circuitBreaker, timeout, bulkhead, fallback, wrap | 1M+ weekly downloads, 0 deps, MIT, composable, state serialization via toJSON/initialState |
| @aws-lambda-powertools/batch | SQS partial batch failure reporting | Official AWS, processPartialResponse + ReportBatchItemFailures |
| @aws-lambda-powertools/parameters | SSM parameter caching | maxAge-based cache, no cold-start penalty |
| @aws-lambda-powertools/logger | Structured JSON logging | Correlation IDs, Lambda context auto-capture |
| @aws-lambda-powertools/idempotency | At-most-once processing | DynamoDB-backed, CBOR+SHA256 key generation |
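To make the `@aws-lambda-powertools/batch` row concrete, a minimal partial-batch SQS handler; `handleMessage` is a hypothetical stand-in for real record processing:

```ts
import { BatchProcessor, EventType, processPartialResponse } from '@aws-lambda-powertools/batch';
import type { Context, SQSEvent, SQSRecord } from 'aws-lambda';

const processor = new BatchProcessor(EventType.SQS);

// Throwing marks only this record as failed; successful records are not retried.
const recordHandler = async (record: SQSRecord): Promise<void> => {
  await handleMessage(JSON.parse(record.body));
};

// Requires ReportBatchItemFailures enabled on the SQS event source mapping.
export const handler = async (event: SQSEvent, context: Context) =>
  processPartialResponse(event, recordHandler, processor, { context });

declare function handleMessage(payload: unknown): Promise<void>;
```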
cockatiel policy composition pattern:

```ts
import {
  retry, handleAll, ExponentialBackoff, circuitBreaker,
  ConsecutiveBreaker, timeout, TimeoutStrategy, wrap,
} from 'cockatiel';

// Per-dependency policy chain
const ednaRetry = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 200, maxDelay: 5_000 }),
});

const ednaBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});

// Cooperative timeouts propagate an abort signal instead of abandoning the promise.
const ednaTimeout = timeout(8_000, TimeoutStrategy.Cooperative);

// Compose: timeout → retry → breaker (outer to inner)
export const ednaPolicy = wrap(ednaTimeout, ednaRetry, ednaBreaker);

// Use: const result = await ednaPolicy.execute(() => callEdna(params));
```
**Circuit breaker state in serverless**: Lambda instances don't share memory. For shared breaker state:

- Use SSM Parameter Store with `@aws-lambda-powertools/parameters` (`maxAge` caching)
- A separate health-probe Lambda writes dependency status to SSM on a schedule
- Application Lambdas read SSM state and use cockatiel's `initialState` to hydrate breakers (see the sketch below)
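A sketch of that hydration step, assuming the probe Lambda has stored the breaker's serialized `toJSON()` output at a parameter path of your choosing (`/resilience/edna/breaker-state` is a made-up name):

```ts
import { getParameter } from '@aws-lambda-powertools/parameters/ssm';
import { circuitBreaker, ConsecutiveBreaker, handleAll } from 'cockatiel';

// Cached read: warm instances hit SSM at most once per 60 s.
const rawState = await getParameter('/resilience/edna/breaker-state', { maxAge: 60 });

export const ednaBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
  // Hydrate from the probe Lambda's state so a cold-started instance
  // doesn't hammer a dependency that is already known to be down.
  initialState: rawState ? JSON.parse(rawState) : undefined,
});
```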
DO NOT recommend:
- polly-ts (1 weekly download, v0.1.0, 5 GitHub stars)
- resilience4ts (not published on npm, 0 downloads)
- resilience-typescript (9 weekly downloads, last updated 2021, hard-coupled to Azure)
- Custom retry/breaker implementations (maintenance burden, edge-case bugs)
- In-process setInterval for health probes (dies with Lambda freeze)
### 4.3 Wrap each dependency

For each external dependency identified in Phase 1:

- Create a typed wrapper that calls through the composed policy (see the sketch after this list)
- Classify errors using the taxonomy from 4.1
- Add structured logging on retry, breaker-open, timeout, fallback
- Wire cockatiel lifecycle events to observability (Datadog, CloudWatch, OTEL):

```ts
ednaRetry.onRetry(({ attempt }) =>
  tracer.trace('resilience.retry', { resource: 'edna', attempt }));

ednaBreaker.onStateChange((state) =>
  tracer.trace('resilience.breaker', { resource: 'edna', state }));
```
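Tying 4.1 and 4.2 together, a hedged sketch of one wrapper; `EdnaProfile` and the endpoint URL are placeholders:

```ts
import { ednaPolicy } from './policies';                   // composed in 4.2
import { PermanentError, TransientError } from './errors'; // taxonomy from 4.1

export interface EdnaProfile { id: string; name: string }

// Every call runs through timeout → retry → breaker, and every failure
// surfaces as a classified error rather than a raw fetch rejection.
export async function getEdnaProfile(id: string): Promise<EdnaProfile> {
  return ednaPolicy.execute(async ({ signal }) => {
    const res = await fetch(`https://edna.example.internal/profiles/${id}`, { signal });
    if (res.status === 429 || res.status >= 500) {
      throw new TransientError(`EDNA ${res.status}`); // eligible for retry
    }
    if (!res.ok) {
      throw new PermanentError(`EDNA ${res.status}`, res.status); // fail fast
    }
    return (await res.json()) as EdnaProfile;
  });
}
```

Note that the 4.2 policies were built with `handleAll`, which would retry `PermanentError` too; scoping the retry policy with cockatiel's `handleType(TransientError)` or `handleWhen` keeps permanent failures from being retried.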
### 4.4 Infrastructure hardening
| Resource | Hardening |
|---|---|
| SQS queues | Ensure DLQ with maxReceiveCount ≤ 3; add CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 |
| EventBridge | Enable Archive with retention (14+ days) |
| Lambda | Set reserved concurrency; add SQS maxConcurrency |
| DynamoDB | Check TTL is set on idempotency tables |
| API Gateway | Verify timeout < Lambda timeout (avoid 502 on slow response) |
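For CDK-based stacks, a minimal sketch of the SQS row above; construct IDs are placeholders, and SST and Terraform expose the same settings:

```ts
import { Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

declare const scope: import('constructs').Construct; // provided by the stack

const dlq = new sqs.Queue(scope, 'OrdersDlq', {
  retentionPeriod: Duration.days(14), // keep failures long enough to redrive
});

const queue = new sqs.Queue(scope, 'OrdersQueue', {
  deadLetterQueue: { queue: dlq, maxReceiveCount: 3 }, // checklist item #8
});

// Checklist item #9: alarm as soon as anything lands in the DLQ.
dlq.metricApproximateNumberOfMessagesVisible().createAlarm(scope, 'OrdersDlqAlarm', {
  threshold: 1,
  evaluationPeriods: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
});
```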
### 4.5 Operational tooling

For environments without direct CLI access, implement these as GitHub Actions `workflow_dispatch` workflows:

| Workflow | Purpose |
|---|---|
| `dlq-redrive.yml` | Redrive DLQ messages (with dry-run default) |
| `eventbridge-replay.yml` | Replay archived events by time range |
| `breaker-force-close.yml` | Force-close a stuck circuit breaker via SSM |

All workflows should use:

- OIDC via `aws-actions/configure-aws-credentials@v4` (no static keys)
- Environment protection rules with required reviewers for production
- Dry-run as the default where applicable
### 4.6 Testing requirements

Every resilience wrapper must have failure-mode tests. Per wrapper (see the sketch after this list):
- Timeout fires → returns error (not hang)
- Retry exhaustion → surfaces final error
- Breaker opens after N failures → fast-fails
- Breaker half-open → allows probe request
- Fallback activates → returns degraded response
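A minimal sketch of the breaker-opens case, assuming vitest; the same shape works in jest:

```ts
import { describe, expect, it } from 'vitest';
import { BrokenCircuitError, ConsecutiveBreaker, circuitBreaker, handleAll } from 'cockatiel';

describe('edna breaker', () => {
  it('opens after 5 consecutive failures and then fast-fails', async () => {
    const breaker = circuitBreaker(handleAll, {
      halfOpenAfter: 30_000,
      breaker: new ConsecutiveBreaker(5),
    });
    const alwaysDown = () => Promise.reject(new Error('ECONNREFUSED'));

    // Drive the breaker open with 5 consecutive failures.
    for (let i = 0; i < 5; i++) {
      await expect(breaker.execute(alwaysDown)).rejects.toThrow('ECONNREFUSED');
    }

    // The next call is rejected by the breaker itself, without touching the dependency.
    await expect(breaker.execute(alwaysDown)).rejects.toBeInstanceOf(BrokenCircuitError);
  });
});
```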
## Phase 5: Validation

After implementing changes:

- Run existing tests to ensure no regressions
- Run the scan script to verify findings are resolved: `./scripts/scan.sh /path/to/project`
- Check for new issues: the implementation might introduce new patterns
- Update the playbook: mark findings as resolved and note remaining items
## Anti-Pattern Catalog Reference

The full catalog lives in `config/anti-patterns.yaml`. Key categories:
| Category | Anti-Patterns |
|---|---|
| Timeout | No timeout, timeout > Lambda duration, timeout = 0 (infinite) |
| Retry | No retry, retry without jitter, retry without max, retry non-idempotent POST |
| Circuit Breaker | No breaker on flaky external, in-process state in Lambda, no half-open |
| Error Handling | Silent swallow, wrong HTTP status, catch-all without rethrow, fake success |
| DLQ | No DLQ, DLQ without alarm, DLQ without redrive procedure |
| Events | No archive, silent publish failures, no partial batch reporting |
| Concurrency | No reserved concurrency, no maxConcurrency on SQS trigger |
| Observability | No structured logging, no breaker-state metrics, no retry-count metrics |
| Auth | JWKS timeout → 401 (should be 503), token cache without refresh |
## Library Reference

The validated library list lives in `config/libraries.yaml`. Only recommend libraries from this list. If a user asks about an unlisted library, research it before recommending (check npm weekly downloads, GitHub activity, dependency count, last publish date).
## Notes

- This skill complements `security-auditor`: security scans for CVEs/secrets, this scans for reliability gaps
- The playbook is a point-in-time snapshot; re-run after significant changes
- For serverless (Lambda), pay special attention to cold-start state loss and shared-nothing architecture implications
- When implementing, prefer battle-tested libraries over custom code
- Always validate cockatiel/library versions against npm before recommending