Multi-Provider LLM Proxy Chain Debugging

Problem

When building an API proxy that falls back across multiple LLM providers (e.g., Claude Max -> Google AI Studio -> OpenRouter), silent failures can occur where the proxy reports "success" but the downstream consumer (bot) receives error bodies instead of actual LLM responses.

Context / Trigger Conditions

  • Bot responds instantly (<500ms for what should be a multi-second LLM call)
  • Bot produces empty or garbage responses (it parsed a JSON error as the "response")
  • Provider status shows lastSuccess: null despite requests > 0 and circuit: CLOSED
  • Low outputTokens count (e.g., 23 tokens for what should be a paragraph)
  • Fallback chain stops at the wrong provider — first fallback fails but doesn't cascade

Three-Layer Debugging Pattern

Layer 1: Response Routing Logic

Symptom: Provider returns non-2xx but proxy treats it as success.

Root cause: The proxy retries only an allowlist of status codes (e.g., 429, 500-504) and pipes every other non-2xx response (400, 401, 403, 404) to the consumer as if it were successful.

Fix: Use a simple 2xx/non-2xx split — only pipe 2xx responses to the consumer. Everything else should drain the response, record a failure, and try the next provider.

// BAD: allowlist of retriable statuses
if (isRetriable(status)) { fallback(); }
else { pipeToConsumer(); } // 400, 403, etc. get piped as "success"!

// GOOD: only 2xx is success
if (status >= 200 && status < 300) { pipeToConsumer(); }
else { recordFailure(); tryNextProvider(); }

Key insight: In a proxy context, the proxy is the "client" to multiple providers. A 400 from Google doesn't mean the user's request is bad — it might mean Google's compatibility layer can't handle a field that OpenRouter can. Always fallback.
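The 2xx-only rule generalizes into a small fallback loop. A minimal sketch, assuming a `makeRequest` callback and a per-provider `failures` counter (both our names, not the proxy's actual internals):

```javascript
// Try providers in order; only a 2xx response is returned to the consumer.
// Any other outcome records a failure and cascades to the next provider.
async function proxyWithFallback(providers, makeRequest) {
  for (const provider of providers) {
    const res = await makeRequest(provider);
    if (res.status >= 200 && res.status < 300) {
      return res; // only 2xx is piped to the consumer
    }
    provider.failures = (provider.failures ?? 0) + 1; // feed the circuit breaker
    // drain/ignore the error body here, then fall through to the next provider
  }
  throw new Error('All providers failed');
}
```

Note the loop never inspects *why* a provider failed before cascading; per the key insight above, a 400 from one provider is still grounds to try the next.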

Layer 2: URL Path Construction

Symptom: Providers return 404 or unexpected 400 errors.

Root cause: URL double-prefix bug. If incoming requests use /v1/chat/completions and the provider baseUrl includes /api/v1, naive concatenation produces /api/v1/v1/chat/completions.

Fix: Strip the incoming /v1 prefix before prepending the provider's base path.

// BAD: double-prefix
targetUrl.pathname = basePath + targetUrl.pathname;
// Produces: /api/v1/v1/chat/completions

// GOOD: strip incoming prefix first
const strippedPath = reqUrl.replace(/^\/v1(?=\/|$)/, '');
const targetUrl = new URL(basePath + strippedPath, providerUrl.origin);
// Produces: /api/v1/chat/completions

Debug tip: Log the full constructed URL for each provider attempt.
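The path construction can be exercised in isolation. A small sketch (the `buildTargetUrl` helper name is ours):

```javascript
// Strip a leading /v1 segment from the incoming path, then join it
// with the provider's base path to avoid the double-prefix bug.
function buildTargetUrl(incomingPath, providerBaseUrl) {
  const base = new URL(providerBaseUrl); // e.g. https://example.com/api/v1
  const stripped = incomingPath.replace(/^\/v1(?=\/|$)/, '');
  return new URL(base.pathname.replace(/\/$/, '') + stripped, base.origin);
}
```

Logging `buildTargetUrl(...).href` for each attempt gives exactly the debug output recommended above.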

Layer 3: Provider-Specific Field Compatibility

Symptom: URL is correct, but provider returns 400 with field validation error.

Root cause: Different providers have different strictness levels for the OpenAI compatibility layer. Google AI Studio's /v1beta/openai endpoint rejects unknown fields (e.g., store: true from OpenAI's conversation persistence feature). OpenRouter is more lenient.

Fix: Strip provider-incompatible fields in buildProviderBody():

if (provider.name === 'google') {
  delete clone.store;
  // Add other Google-incompatible fields as discovered
}

Known incompatible fields for Google AI Studio:

  • store (OpenAI conversation persistence)
  • Other OpenAI-specific fields may also be rejected
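Putting the stripping together, a sketch of `buildProviderBody()` (the `GOOGLE_UNSUPPORTED` list is our assumption; extend it as new rejections are discovered):

```javascript
// Fields Google AI Studio's OpenAI-compat endpoint is known to reject.
const GOOGLE_UNSUPPORTED = ['store'];

// Clone the request body and drop provider-incompatible fields
// before forwarding, so one provider's strictness can't sink the request.
function buildProviderBody(provider, body) {
  const clone = structuredClone(body);
  if (provider.name === 'google') {
    for (const field of GOOGLE_UNSUPPORTED) delete clone[field];
  }
  return clone;
}
```

Cloning before deletion matters: the original body must stay intact for the next provider in the fallback chain.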

Bonus: Error Body Visibility

Symptom: Error logs show garbled bytes instead of readable error messages.

Root cause: Proxy forwards the original accept-encoding: gzip header, so providers respond with compressed bodies. When you try to read the error body as a string, you get binary garbage.

Fix: Strip accept-encoding from forwarded headers:

delete headers['accept-encoding'];

This keeps error bodies human-readable in logs and delivers successful streaming responses uncompressed (slightly more bandwidth, much better debuggability).
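A sketch of the header filtering (the `forwardableHeaders` helper is hypothetical; adapt it to however your proxy copies headers):

```javascript
// Copy incoming headers for forwarding, dropping accept-encoding so
// upstream responses come back uncompressed and log-readable.
function forwardableHeaders(incoming) {
  const out = {};
  for (const [name, value] of Object.entries(incoming)) {
    if (name.toLowerCase() === 'accept-encoding') continue;
    out[name] = value;
  }
  return out;
}
```

Matching case-insensitively is the important detail: header names arrive in whatever casing the client sent.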

Verification

After fixing, verify with these checks:

  1. Provider status endpoint: At least one provider shows lastSuccess not null
  2. Bot response timing: LLM calls take seconds (real inference), not milliseconds
  3. Bot response content: Actual coherent text, not JSON error fragments
  4. Circuit breaker states: Working provider is CLOSED, failed providers are OPEN
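The first check can be scripted. A sketch assuming the `/debug/provider-status` response shape shown in this document's example (adjust path and fields to your proxy):

```javascript
// Flag providers that took traffic with a CLOSED circuit but never
// recorded a success -- the misleading false-positive described above.
async function findSuspiciousProviders(proxyBase) {
  const res = await fetch(`${proxyBase}/debug/provider-status`);
  const providers = await res.json();
  return Object.entries(providers)
    .filter(([, p]) => p.requests > 0 && p.circuit === 'CLOSED' && !p.lastSuccess)
    .map(([name]) => name);
}
```

An empty result means every provider that took traffic either recorded a success or correctly tripped its breaker.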

Example: Full Debugging Flow

1. Bot shows empty responses
2. Check /debug/provider-status:
   - claude-proxy: OPEN (3 failures) — expected, tunnel down
   - google: CLOSED, failures:0, requests:3, lastSuccess:null — BUG!
3. Enable error body logging → gzip garbage
4. Strip accept-encoding → "Unknown name 'store': Cannot find field"
5. Strip 'store' field for Google → 200 OK, real response
6. Verify: lastSuccess is set, bot responds with real text

Notes

  • Circuit breaker false negative: A provider can show circuit: CLOSED with lastSuccess: null when all its responses were non-2xx but the proxy piped them as "success" (never called recordFailure). This is the most misleading symptom.
  • Google vs OpenRouter strictness: Google rejects unknown fields; OpenRouter ignores them. When adding new providers, test with the actual request body from your application, not a minimal curl example.
  • Free tier rate limits: OpenRouter free tier keys can hit "Key limit exceeded" (403) even though /models endpoint works fine. This is per-endpoint rate limiting.
