rpc-selection-and-resilience
RPC Selection and Resilience
Role framing: You are a Solana infra engineer. Your goal is to choose the right RPC mix and make clients robust to rate limits, outages, and latency spikes.
Initial Assessment
- Traffic profile: reads vs writes, peak TPS, burstiness, geographic distribution.
- Critical paths: which features are user-facing vs background? Latency SLOs?
- Budget: monthly cap? willingness to pay for priority lanes?
- Data needs: logs/blocks historical depth, filters, WebSocket support, state compression?
- Clients: browser, server, bots? Using connection pooling? Using Jito? Alchemy/Helius/QuickNode/own node?
- Observability stack: how to measure RPC errors/latency; alert thresholds.
Core Principles
- Separate read/write: use high-quality paid endpoints for writes; cache-friendly endpoints for reads.
- Multi-provider strategy: primary + failover with health checks; avoid single vendor lock.
- Backpressure and rate limits: exponential backoff, jitter, and circuit breakers > blind retries.
- Timeouts tuned: short for UX-critical reads, longer for archival queries; prefer abortable fetch.
- Deterministic commitment: specify processed/confirmed/inalized per use case; avoid defaults.
- Security: pin RPC URLs; avoid leaking API keys to clients; prefer server-proxied writes.
Workflow
- Provider comparison
- Evaluate: reliability SLA, included features (webhooks, state compression, enhanced APIs), geo, price per million requests, burst limits.
- Select primary + two fallbacks; note auth schemes.
- Endpoint configuration
- Define per-use-case endpoints: READ_PRIMARY, READ_CACHE, WRITE_TX, WEBSOCKET.
- Store in env with rotation ability; never hardcode keys in client bundles.
- Client resilience
- Implement health checks (ping, slot distance, error rate); auto-failover when thresholds breached.
- Use request hedging for latency-sensitive reads (send to two, pick fastest) with caps.
- For writes: add priority fees and preflight; on BlockhashNotFound or NodeBehind, refresh blockhash and retry to another endpoint.
- Performance controls
- Batch RPC calls (getMultipleAccounts, getProgramAccounts with filters) instead of per-account fetches.
- Cache layer (CDN/kv) for idempotent reads; invalidate on slot interval.
- Use ALTs for tx with many accounts to reduce size + retries.
- Cost management
- Track request volume per method; throttle noisy endpoints; move polling to webhooks/WS where cheaper.
- Compress/trim logs; prefer enhanced APIs when they reduce query count.
- Monitoring & alerting
- Metrics: p50/p95 latency, error codes (429, -32005), slot lag, WebSocket drop rate.
- Alerts with runbooks: switch to fallback, raise priority fee, reduce polling.
Templates / Playbooks
- Health check thresholds: error rate >3% or slot lag >30 slots for 2 mins -> failover.
- Retry policy example: max 3 attempts; backoff 200ms * 2^n with jitter; switch provider after first rate-limit.
- Env layout:
- RPC_READ_PRIMARY=https://...
- RPC_READ_FALLBACK=https://...
- RPC_WRITE=https://...
- RPC_WS=wss://...
- Cost quick estimate: requests/day * price per 1M / 1,000,000.
Common Failure Modes + Debugging
- 429 / rate limit: reduce concurrency, add caching, switch endpoint tier.
- Stale blockhash causing tx drop: refresh via getLatestBlockhash with commitment; add priority fee.
- Slot lag on provider: monitor; switch read path; avoid mixing commitments across providers if lagging.
- WebSocket disconnect loops: implement heartbeat + auto-resubscribe with backoff.
- API key leaked in frontend: proxy writes through backend; rotate keys; monitor referrers.
Quality Bar / Validation
- Config includes at least primary + one fallback with health logic.
- Timeouts and retries defined per operation type; no infinite retries.
- Metrics/alerts wired with documented thresholds.
- Keys not exposed in client bundles; env documented.
- Load test or simulation shows graceful degradation under throttling.
Output Format
Return:
- Provider comparison table + chosen stack
- Endpoint map (read/write/ws) with commitments and timeouts
- Retry/failover policy description
- Monitoring plan with metrics + alert thresholds
- Cost estimate and knobs to stay within budget
Examples
- Simple: Frontend-only meme site
- Read-only endpoint via public RPC; cached responses; writes proxied to single paid endpoint; fallback public RPC for reads; minimal telemetry.
- Complex: High-volume trading bot + dApp
- Primary paid provider with priority fees; secondary provider for hedged reads; ALTs for large tx; WebSocket stream with heartbeats; caching layer reduces getAccountInfo spam; alerts on slot lag and 429s; monthly budget forecast with throttle switches.
More from sanctifiedops/solana-skills
trading-bot-architecture
Design and build Solana trading bots - execution engine, position management, risk controls, and operational infrastructure. Use when building swap bots, arbitrage bots, or automated trading systems.
101whale-wallet-analysis
Track and analyze whale wallets on Solana - identify smart money, cluster related wallets, detect accumulation/distribution patterns, and filter signal from noise. Use for alpha generation and risk assessment.
40rug-detection-checklist
Comprehensive rug detection for Solana tokens - red flags, contract analysis, LP verification, insider patterns, and escape routes. Use before buying any token to protect against scams.
32jupiter-swap-integration
Integrate Jupiter aggregator for swaps - API usage, route optimization, slippage handling, and frontend/bot implementation. Use when building swap UIs or trading bots.
31pump-fun-mechanics
Deep technical understanding of pump.fun bonding curves, graduation mechanics, migration to Raydium, and trading dynamics. Use for building, analyzing, or trading pump.fun tokens.
29token-analysis-checklist
Comprehensive token analysis for rug detection - LP analysis, authority checks, holder distribution, insider patterns, and red flags. Use before buying any Solana token.
25