# LLM Gateway
A unified API gateway that routes LLM requests across providers and self-hosted models — with rate limiting, cost tracking, caching, and failover.
## When to Use This Skill
Use this skill when:
- Running multiple LLM backends (OpenAI, Anthropic, vLLM, Ollama) behind a single endpoint
- Enforcing per-team or per-user rate limits and spend budgets
- Implementing automatic fallback when a provider is down
- Adding semantic caching to reduce API costs by 20–50%
- Centralizing API key management instead of distributing keys to every app
## Prerequisites
- Docker and Docker Compose
- A PostgreSQL or SQLite database (for LiteLLM state)
- LLM API keys (OpenAI, Anthropic, etc.) or self-hosted vLLM endpoints
- Optional: Redis for caching and rate limiting
## LiteLLM Proxy — Quick Start

LiteLLM is a widely used open-source LLM gateway that exposes an OpenAI-compatible API.

```bash
# Run with Docker
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --detailed_debug
```
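Once the proxy is up, any OpenAI-compatible client can talk to it. A minimal sketch using only Python's standard library; the gateway URL and the virtual key `sk-virtual-key` are placeholders:

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000"  # LiteLLM proxy started above

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request to the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # a LiteLLM virtual key
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request("gpt-4o-mini", "Say hello", "sk-virtual-key")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the gateway speaks the OpenAI wire format, the official `openai` SDK also works by pointing its `base_url` at the proxy.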
## LiteLLM Configuration

```yaml
# litellm-config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 10000
      tpm: 2000000
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Self-hosted vLLM instances (load balanced)
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-1:8000/v1
      api_key: fake  # vLLM does not check the key, but the field is required
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-2:8000/v1  # second replica, auto load balanced
      api_key: fake

  # Same alias, cheaper underlying model: gpt-4o traffic is
  # load balanced across both deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: least-busy  # or: latency-based-routing, simple-shuffle
  num_retries: 3
  retry_after: 5
  allowed_fails: 2
  cooldown_time: 60

  # Fallback configuration: if every deployment of a model fails, retry here
  fallbacks:
    - gpt-4o: ["claude-sonnet-4-6"]
    - claude-sonnet-4-6: ["gpt-4o"]

litellm_settings:
  # Semantic caching (plain `type: redis` is exact-match only;
  # similarity_threshold requires the semantic cache)
  cache: true
  cache_params:
    type: redis-semantic
    host: redis
    port: 6379
    similarity_threshold: 0.90  # serve from cache above 90% semantic similarity

  # Logging
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: postgresql://litellm:password@postgres:5432/litellm
  store_model_in_db: true
```
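To make the `fallbacks` semantics concrete, here is an illustrative sketch of the retry order that configuration implies; this is not LiteLLM's internal code, just the logic the config describes:

```python
def fallback_chain(model: str, fallbacks: dict[str, list[str]]) -> list[str]:
    """Order of models the router will try: the primary first, then its fallbacks.

    Illustrative only; mirrors the semantics of the `fallbacks` block above.
    """
    chain = [model]
    for candidate in fallbacks.get(model, []):
        if candidate not in chain:  # guard against cycles like gpt-4o <-> claude
            chain.append(candidate)
    return chain

# The fallback map from the YAML config above
FALLBACKS = {
    "gpt-4o": ["claude-sonnet-4-6"],
    "claude-sonnet-4-6": ["gpt-4o"],
}
```

With this map, a failed `gpt-4o` request is retried once against `claude-sonnet-4-6` and vice versa; the dedup check prevents the two entries from looping.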
## Docker Compose: Full Gateway Stack

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  postgres-data:
  redis-data:
```
## Virtual Keys & Rate Limiting

```bash
# Create a virtual API key for a team (via the LiteLLM API).
# max_budget is in USD; rpm_limit/tpm_limit are requests and tokens per minute;
# budget_duration takes a duration string such as "30d" (budget resets each period).
# JSON does not allow inline comments, hence the notes up here.
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-backend",
    "key_alias": "backend-team-key",
    "models": ["gpt-4o-mini", "llama-3.1-8b"],
    "max_budget": 100,
    "budget_duration": "30d",
    "rpm_limit": 100,
    "tpm_limit": 500000
  }'

# View spend per key
curl http://localhost:4000/spend/keys \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```
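The same key-generation call can be scripted. A sketch with Python's standard library; field names follow the `/key/generate` request above, while the URL and master key are placeholders:

```python
import json
import urllib.request

def build_key_request(master_key: str, team_id: str, models: list[str],
                      max_budget: float, rpm_limit: int) -> urllib.request.Request:
    """Build the /key/generate call shown above for a team-scoped virtual key."""
    payload = {
        "team_id": team_id,
        "models": models,        # models this key may call
        "max_budget": max_budget,  # USD
        "rpm_limit": rpm_limit,
    }
    return urllib.request.Request(
        "http://localhost:4000/key/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_key_request("sk-master", "team-backend",
                            ["gpt-4o-mini", "llama-3.1-8b"], 100.0, 100)
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # response includes the generated key
```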
## Nginx Load Balancer (Alternative/Complement)

```nginx
# nginx.conf — least-connections balancing across vLLM replicas

# limit_req_zone must be declared at the http-context level
limit_req_zone $http_authorization zone=per_key:10m rate=100r/m;

upstream vllm_backends {
    least_conn;
    server vllm-1:8000 max_fails=3 fail_timeout=30s;
    server vllm-2:8000 max_fails=3 fail_timeout=30s;
    server vllm-3:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    server_name llm-api.internal;

    # Rate limiting, keyed on the Authorization header (zone declared above)
    limit_req zone=per_key burst=20 nodelay;

    location /v1/ {
        proxy_pass http://vllm_backends;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;  # long timeout for streaming
        proxy_buffering off;      # required for SSE streaming
        proxy_cache_bypass 1;     # skip any configured proxy cache
    }
}
```
## Monitoring Gateway Health

```bash
# Check LiteLLM health
curl http://localhost:4000/health

# Model-level health
curl http://localhost:4000/health/liveliness

# Spend by model
curl http://localhost:4000/spend/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

# Active virtual keys
curl http://localhost:4000/key/list \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```
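The spend endpoints return JSON that is easy to aggregate in a script. A sketch that ranks models by total spend; note that the row shape (`model`, `total_spend`) is an assumption for illustration, not LiteLLM's documented response schema:

```python
def top_spenders(spend_rows: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Sum spend per model and return the n most expensive models.

    Assumes rows shaped like {"model": ..., "total_spend": ...} (hypothetical).
    """
    totals: dict[str, float] = {}
    for row in spend_rows:
        totals[row["model"]] = totals.get(row["model"], 0.0) + row["total_spend"]
    # Sort by accumulated spend, highest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```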
## Common Issues

| Issue | Cause | Fix |
|---|---|---|
| `ConnectionRefusedError` to backend | Backend not reachable | Check `api_base` URL; verify the backend is healthy |
| Rate limit errors (429) | Budget/RPM exceeded | Increase limits or rotate to a fallback model |
| Slow streaming responses | `proxy_buffering` enabled | Set `proxy_buffering off` in Nginx |
| High cache miss rate | Threshold too strict | Lower `similarity_threshold` to 0.85 |
| Postgres connection errors | DB not ready | Add `depends_on` with `condition: service_healthy` |
## Best Practices

- Use virtual keys per team/app — never expose raw provider API keys.
- Enable `cache: true` with Redis for repeated or similar queries; this can cut costs 30–50%.
- Set `num_retries: 3` with fallbacks to handle provider outages gracefully.
- Log all requests to Langfuse or OpenTelemetry for cost attribution and debugging.
- Use the `least-busy` routing strategy for self-hosted models to avoid GPU saturation.
## Related Skills

- vllm-server - Backend inference server
- llm-inference-scaling - Auto-scaling backends
- llm-caching - Semantic cache patterns
- llm-cost-optimization - Cost management