# Multi-Provider LLM Proxy Chain Debugging

## Problem
When building an API proxy that falls back across multiple LLM providers (e.g., Claude Max -> Google AI Studio -> OpenRouter), silent failures can occur where the proxy reports "success" but the downstream consumer (bot) receives error bodies instead of actual LLM responses.
## Context / Trigger Conditions
- Bot responds instantly (<500ms for what should be a multi-second LLM call)
- Bot produces empty or garbage responses (it parsed a JSON error as the "response")
- Provider status shows `lastSuccess: null` despite `requests > 0` and `circuit: CLOSED`
- Low `outputTokens` count (e.g., 23 tokens for what should be a paragraph)
- Fallback chain stops at the wrong provider: the first fallback fails but doesn't cascade
## Three-Layer Debugging Pattern

### Layer 1: Response Routing Logic
**Symptom:** Provider returns non-2xx but the proxy treats it as success.

**Root cause:** The proxy only retries specific status codes (e.g., 429, 500-504) and pipes all other responses (400, 401, 403, 404) to the consumer as "successful."

**Fix:** Use a simple 2xx/non-2xx split: only pipe 2xx responses to the consumer. Everything else should drain the response, record a failure, and try the next provider.
```javascript
// BAD: allowlist of retriable statuses
if (isRetriable(status)) { fallback(); }
else { pipe_to_consumer(); } // 400, 403 etc. get piped as "success"!

// GOOD: only 2xx is success
if (status >= 200 && status < 300) { pipe_to_consumer(); }
else { record_failure(); try_next_provider(); }
```
**Key insight:** In a proxy context, the proxy is the "client" to multiple providers. A 400 from Google doesn't mean the user's request is bad — it might mean Google's compatibility layer can't handle a field that OpenRouter can. Always fall back.
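The 2xx-only rule plus cascading fallback can be sketched as follows. This is a minimal illustration, not the proxy's actual code: `provider.send()` and the returned `{ status }` shape are assumptions.

```javascript
// Only 2xx counts as success; everything else cascades to the next provider.
function isSuccess(status) {
  return status >= 200 && status < 300;
}

// Walk the provider chain in order, recording every non-2xx as a failure.
async function tryProviders(providers, request) {
  const failures = [];
  for (const provider of providers) {
    const res = await provider.send(request);
    if (isSuccess(res.status)) {
      return { provider: provider.name, res, failures };
    }
    failures.push({ provider: provider.name, status: res.status });
  }
  throw new Error(`All providers failed: ${JSON.stringify(failures)}`);
}
```

Note that a 400 or 403 ends up in `failures` and triggers the next provider, instead of being piped to the consumer.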
### Layer 2: URL Path Construction
**Symptom:** Providers return 404 or unexpected 400 errors.

**Root cause:** URL double-prefix bug. If incoming requests use `/v1/chat/completions` and the provider `baseUrl` includes `/api/v1`, naive concatenation produces `/api/v1/v1/chat/completions`.

**Fix:** Strip the incoming `/v1` prefix before prepending the provider's base path.
```javascript
// BAD: double-prefix
targetUrl.pathname = basePath + targetUrl.pathname;
// Produces: /api/v1/v1/chat/completions

// GOOD: strip incoming prefix first
const strippedPath = reqUrl.replace(/^\/v1(?=\/|$)/, '');
const targetUrl = new URL(basePath + strippedPath, providerUrl.origin);
// Produces: /api/v1/chat/completions
```
**Debug tip:** Log the full constructed URL for each provider attempt.
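A self-contained version of the stripping logic, using Node's WHATWG `URL` API (the function name and example hosts are illustrative):

```javascript
// Build the provider target URL without doubling the version prefix.
// providerBase is the provider's base URL, e.g. "https://provider.example/api/v1".
function buildTargetUrl(incomingPath, providerBase) {
  const providerUrl = new URL(providerBase);
  const basePath = providerUrl.pathname.replace(/\/$/, '');        // drop trailing slash
  const strippedPath = incomingPath.replace(/^\/v1(?=\/|$)/, ''); // drop leading /v1 only
  return new URL(basePath + strippedPath, providerUrl.origin);
}
```

The lookahead `(?=\/|$)` matters: it strips `/v1` and `/v1/...` but leaves paths like `/v1beta/...` untouched, which is exactly what Google's `/v1beta/openai` routing needs.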
### Layer 3: Provider-Specific Field Compatibility
**Symptom:** URL is correct, but the provider returns 400 with a field validation error.

**Root cause:** Providers enforce different strictness levels in their OpenAI compatibility layers. Google AI Studio's `/v1beta/openai` endpoint rejects unknown fields (e.g., `store: true` from OpenAI's conversation persistence feature); OpenRouter is more lenient.
**Fix:** Strip provider-incompatible fields in `buildProviderBody()`:

```javascript
if (provider.name === 'google') {
  delete clone.store;
  // Add other Google-incompatible fields as discovered
}
```
Known incompatible fields for Google AI Studio:

- `store` (OpenAI conversation persistence)
- Other OpenAI-specific fields may also be rejected
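One way to generalize the per-provider stripping is a deny-list table. This is a sketch, not the proxy's actual `buildProviderBody()`; only `store` for Google is confirmed above, and the function name is made up for illustration:

```javascript
// Per-provider deny-list of request body fields. Extend as new
// incompatibilities are discovered; only google/store is confirmed.
const INCOMPATIBLE_FIELDS = {
  google: ['store'],
};

// Return a shallow clone with the provider's rejected fields removed,
// leaving the original body untouched for the other providers in the chain.
function stripIncompatibleFields(providerName, body) {
  const clone = { ...body };
  for (const field of INCOMPATIBLE_FIELDS[providerName] ?? []) {
    delete clone[field];
  }
  return clone;
}
```

Cloning before deleting matters in a fallback chain: if Google fails for another reason, the next provider should still receive the full original body.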
## Bonus: Error Body Visibility
**Symptom:** Error logs show garbled bytes instead of readable error messages.

**Root cause:** The proxy forwards the original `accept-encoding: gzip` header, so providers respond with compressed bodies. Reading the error body as a string then yields binary garbage.
**Fix:** Strip `accept-encoding` from forwarded headers:

```javascript
delete headers['accept-encoding'];
```
This keeps error bodies human-readable in logs and delivers successful streaming responses uncompressed (slightly more bandwidth, much better debuggability).
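A header-sanitization pass can fold this in with other hop-specific headers. Dropping `accept-encoding` is the fix described above; also dropping `host` and `content-length` is an assumption (they are commonly re-derived per provider), so adjust for your proxy:

```javascript
// Headers to drop before forwarding a request to a provider.
const STRIP_HEADERS = new Set(['accept-encoding', 'host', 'content-length']);

// Case-insensitive filter over a plain headers object.
function sanitizeForwardHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    if (!STRIP_HEADERS.has(name.toLowerCase())) out[name] = value;
  }
  return out;
}
```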
## Verification
After fixing, verify with these checks:
- Provider status endpoint: at least one provider shows `lastSuccess` not null
- Bot response timing: LLM calls take seconds (real inference), not milliseconds
- Bot response content: actual coherent text, not JSON error fragments
- Circuit breaker states: the working provider is CLOSED, failed providers are OPEN
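The most misleading state — CLOSED with traffic but no recorded success — can be flagged automatically. A sketch over the status payload, assuming the field names shown in the example below (`circuit`, `requests`, `lastSuccess`):

```javascript
// Flag providers in the misleading state: circuit CLOSED, traffic seen,
// but no success ever recorded (i.e., failures were piped as "success").
function findSuspectProviders(statuses) {
  return statuses
    .filter(s => s.circuit === 'CLOSED' && s.requests > 0 && s.lastSuccess === null)
    .map(s => s.name);
}
```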
## Example: Full Debugging Flow
1. Bot shows empty responses.
2. Check `/debug/provider-status`:
   - `claude-proxy`: OPEN (3 failures), expected (tunnel down)
   - `google`: CLOSED, `failures: 0`, `requests: 3`, `lastSuccess: null`, the bug!
3. Enable error body logging → gzip garbage.
4. Strip `accept-encoding` → `"Unknown name 'store': Cannot find field"`.
5. Strip the `store` field for Google → 200 OK, real response.
6. Verify: `lastSuccess` is set and the bot responds with real text.
## Notes
- Circuit breaker false negative: a provider can show `circuit: CLOSED` with `lastSuccess: null` when all its responses were non-2xx but the proxy piped them through as "success" (so `recordFailure` was never called). This is the most misleading symptom.
- Google vs. OpenRouter strictness: Google rejects unknown fields; OpenRouter ignores them. When adding new providers, test with the actual request body from your application, not a minimal curl example.
- Free tier rate limits: OpenRouter free tier keys can hit "Key limit exceeded" (403) even though the `/models` endpoint works fine. This is per-endpoint rate limiting.