# LLM Gateway
A unified API gateway that routes LLM requests across providers and self-hosted models — with rate limiting, cost tracking, caching, and failover.
## When to Use This Skill
Use this skill when:
- Running multiple LLM backends (OpenAI, Anthropic, vLLM, Ollama) behind a single endpoint
- Enforcing per-team or per-user rate limits and spend budgets
- Implementing automatic fallback when a provider is down
- Adding semantic caching to reduce API costs by 20–50%
- Centralizing API key management instead of distributing keys to every app
## Prerequisites
- Docker and Docker Compose
- A PostgreSQL or SQLite database (for LiteLLM state)
- LLM API keys (OpenAI, Anthropic, etc.) or self-hosted vLLM endpoints
- Optional: Redis for caching and rate limiting
## LiteLLM Proxy — Quick Start

LiteLLM is a widely used open-source LLM gateway that exposes an OpenAI-compatible API.

```bash
# Run with Docker
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --detailed_debug
```
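Once the proxy is up, any OpenAI-compatible client can talk to it. A minimal sketch using only Python's standard library; the gateway URL and the virtual key `sk-virtual-key` are placeholders:

```python
import json
import urllib.request

GATEWAY = "http://localhost:4000"  # LiteLLM proxy started above

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request to the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # a LiteLLM virtual key
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request("gpt-4o-mini", "Say hello", "sk-virtual-key")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the gateway speaks the OpenAI wire format, the official `openai` SDK also works by pointing its `base_url` at the proxy.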
## LiteLLM Configuration

```yaml
# litellm-config.yaml
model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 10000
      tpm: 2000000
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Self-hosted vLLM instances (load balanced)
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-1:8000/v1
      api_key: fake  # vLLM does not check the key, but the field is required
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-2:8000/v1  # second replica, auto load balanced
      api_key: fake

  # Same alias, cheaper underlying model: gpt-4o traffic is
  # load balanced across both deployments
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: least-busy  # or: latency-based-routing, simple-shuffle
  num_retries: 3
  retry_after: 5
  allowed_fails: 2
  cooldown_time: 60

  # Fallback configuration: if every deployment of a model fails, retry here
  fallbacks:
    - gpt-4o: ["claude-sonnet-4-6"]
    - claude-sonnet-4-6: ["gpt-4o"]

litellm_settings:
  # Semantic caching (plain `type: redis` is exact-match only;
  # similarity_threshold requires the semantic cache)
  cache: true
  cache_params:
    type: redis-semantic
    host: redis
    port: 6379
    similarity_threshold: 0.90  # serve from cache above 90% semantic similarity

  # Logging
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: postgresql://litellm:password@postgres:5432/litellm
  store_model_in_db: true
```
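To make the `fallbacks` semantics concrete, here is an illustrative sketch of the retry order that configuration implies; this is not LiteLLM's internal code, just the logic the config describes:

```python
def fallback_chain(model: str, fallbacks: dict[str, list[str]]) -> list[str]:
    """Order of models the router will try: the primary first, then its fallbacks.

    Illustrative only; mirrors the semantics of the `fallbacks` block above.
    """
    chain = [model]
    for candidate in fallbacks.get(model, []):
        if candidate not in chain:  # guard against cycles like gpt-4o <-> claude
            chain.append(candidate)
    return chain

# The fallback map from the YAML config above
FALLBACKS = {
    "gpt-4o": ["claude-sonnet-4-6"],
    "claude-sonnet-4-6": ["gpt-4o"],
}
```

With this map, a failed `gpt-4o` request is retried once against `claude-sonnet-4-6` and vice versa; the dedup check prevents the two entries from looping.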
## Docker Compose: Full Gateway Stack

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  postgres-data:
  redis-data:
```
## Virtual Keys & Rate Limiting

```bash
# Create a virtual API key for a team (via the LiteLLM API).
# max_budget is in USD; rpm_limit/tpm_limit are requests and tokens per minute;
# budget_duration takes a duration string such as "30d" (budget resets each period).
# JSON does not allow inline comments, hence the notes up here.
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-backend",
    "key_alias": "backend-team-key",
    "models": ["gpt-4o-mini", "llama-3.1-8b"],
    "max_budget": 100,
    "budget_duration": "30d",
    "rpm_limit": 100,
    "tpm_limit": 500000
  }'

# View spend per key
curl http://localhost:4000/spend/keys \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```
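The same key-generation call can be scripted. A sketch with Python's standard library; field names follow the `/key/generate` request above, while the URL and master key are placeholders:

```python
import json
import urllib.request

def build_key_request(master_key: str, team_id: str, models: list[str],
                      max_budget: float, rpm_limit: int) -> urllib.request.Request:
    """Build the /key/generate call shown above for a team-scoped virtual key."""
    payload = {
        "team_id": team_id,
        "models": models,        # models this key may call
        "max_budget": max_budget,  # USD
        "rpm_limit": rpm_limit,
    }
    return urllib.request.Request(
        "http://localhost:4000/key/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_key_request("sk-master", "team-backend",
                            ["gpt-4o-mini", "llama-3.1-8b"], 100.0, 100)
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # response includes the generated key
```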
## Nginx Load Balancer (Alternative/Complement)

```nginx
# nginx.conf — least-connections balancing across vLLM replicas

# limit_req_zone must be declared at the http-context level
limit_req_zone $http_authorization zone=per_key:10m rate=100r/m;

upstream vllm_backends {
    least_conn;
    server vllm-1:8000 max_fails=3 fail_timeout=30s;
    server vllm-2:8000 max_fails=3 fail_timeout=30s;
    server vllm-3:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    server_name llm-api.internal;

    # Rate limiting, keyed on the Authorization header (zone declared above)
    limit_req zone=per_key burst=20 nodelay;

    location /v1/ {
        proxy_pass http://vllm_backends;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;  # long timeout for streaming
        proxy_buffering off;      # required for SSE streaming
        proxy_cache_bypass 1;     # skip any configured proxy cache
    }
}
```
## Monitoring Gateway Health

```bash
# Check LiteLLM health
curl http://localhost:4000/health

# Model-level health
curl http://localhost:4000/health/liveliness

# Spend by model
curl http://localhost:4000/spend/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

# Active virtual keys
curl http://localhost:4000/key/list \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
```
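The spend endpoints return JSON that is easy to aggregate in a script. A sketch that ranks models by total spend; note that the row shape (`model`, `total_spend`) is an assumption for illustration, not LiteLLM's documented response schema:

```python
def top_spenders(spend_rows: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Sum spend per model and return the n most expensive models.

    Assumes rows shaped like {"model": ..., "total_spend": ...} (hypothetical).
    """
    totals: dict[str, float] = {}
    for row in spend_rows:
        totals[row["model"]] = totals.get(row["model"], 0.0) + row["total_spend"]
    # Sort by accumulated spend, highest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```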
## Common Issues

| Issue | Cause | Fix |
|---|---|---|
| `ConnectionRefusedError` to backend | Backend not reachable | Check `api_base` URL; verify the backend is healthy |
| Rate limit errors (429) | Budget/RPM exceeded | Increase limits or rotate to a fallback model |
| Slow streaming responses | `proxy_buffering` enabled | Set `proxy_buffering off` in Nginx |
| High cache miss rate | Threshold too strict | Lower `similarity_threshold` to 0.85 |
| Postgres connection errors | DB not ready | Add `depends_on` with `condition: service_healthy` |
## Best Practices

- Use virtual keys per team/app — never expose raw provider API keys.
- Enable `cache: true` with Redis for repeated or similar queries; this can cut costs 30–50%.
- Set `num_retries: 3` with fallbacks to handle provider outages gracefully.
- Log all requests to Langfuse or OpenTelemetry for cost attribution and debugging.
- Use the `least-busy` routing strategy for self-hosted models to avoid GPU saturation.
## Related Skills

- vllm-server - Backend inference server
- llm-inference-scaling - Auto-scaling backends
- llm-caching - Semantic cache patterns
- llm-cost-optimization - Cost management