# Multi-Tenant LLM Hosting
Host many teams/customers on shared inference infrastructure without sacrificing security, performance, or cost governance.
## When to Use This Skill
- Building an internal LLM platform shared by multiple teams
- Hosting LLM inference for external customers with isolation requirements
- Implementing per-tenant quotas, billing, and rate limiting
- Designing request routing for multi-model, multi-tenant environments
- Preventing noisy-neighbor issues on shared GPU infrastructure
## Prerequisites
- Kubernetes cluster with GPU node pools
- API gateway or LLM gateway (LiteLLM, Envoy, Kong)
- Prometheus + Grafana for per-tenant observability
- Redis or equivalent for rate limiting state
- Billing system or cost attribution database
## Isolation Model
- Strong tenant identity on every request
- Per-tenant API keys and scoped model access
- Namespace or workload isolation for high-risk tenants
- Strict data retention and log partitioning controls
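A minimal sketch of the first two controls, resolving a hashed API key to a tenant identity and its allowed models. The Redis key layout here is an assumption for illustration, not a prescribed schema:

```python
# tenant_identity.py - hypothetical API-key-to-tenant resolution at the gateway
import hashlib
from typing import Optional

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def resolve_tenant(api_key: str) -> Optional[str]:
    """Map a presented API key to a tenant ID. Only a SHA-256 digest of each
    key is stored, so raw credentials never live in Redis."""
    digest = hashlib.sha256(api_key.encode()).hexdigest()
    return redis_client.get(f"apikey:{digest}")  # value is the tenant_id, or None


def model_allowed(tenant_id: str, model: str) -> bool:
    """Scoped model access: a tenant may only call models on its allowlist."""
    return redis_client.sismember(f"tenant:{tenant_id}:models", model)
```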
## vLLM Multi-Model Serving

```yaml
# vllm-deployment.yaml - Multi-model serving with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpt4o-equivalent
  namespace: llm-serving
  labels:
    app: vllm
    model-tier: premium
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
      model-tier: premium
  template:
    metadata:
      labels:
        app: vllm
        model-tier: premium
      annotations:
        prometheus.io/scrape: "true"
        # vLLM's OpenAI server exposes Prometheus metrics at /metrics
        # on the API port itself, not on a separate port
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.1
          args:
            - "--model=/models/llama-3.1-70b"
            - "--tensor-parallel-size=2"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.90"
            - "--max-num-seqs=128"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: inference
          resources:
            requests:
              nvidia.com/gpu: 2
              cpu: "8"
              memory: "64Gi"
            limits:
              nvidia.com/gpu: 2
              cpu: "16"
              memory: "128Gi"
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: premium-model-weights
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-type: a100
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-economy
  namespace: llm-serving
  labels:
    app: vllm
    model-tier: economy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model-tier: economy
  template:
    metadata:
      labels:
        app: vllm
        model-tier: economy
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.1
          args:
            - "--model=/models/llama-3.1-8b"
            - "--max-model-len=4096"
            - "--gpu-memory-utilization=0.85"
            - "--max-num-seqs=256"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: inference
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "64Gi"
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: economy-model-weights
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
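The router in the Request Routing section addresses each backend by cluster DNS name (for example `http://vllm-gpt4o-equivalent:8000`), which assumes a Service in front of each Deployment. A minimal sketch of those Services; names, selectors, and ports mirror the Deployments above:

```yaml
# vllm-services.yaml - hypothetical Services matching the Deployments above
apiVersion: v1
kind: Service
metadata:
  name: vllm-gpt4o-equivalent
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model-tier: premium
  ports:
    - name: inference
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-economy
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model-tier: economy
  ports:
    - name: inference
      port: 8000
      targetPort: 8000
```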
## Per-Tenant Quota Configuration

```yaml
# tenant-quotas-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-quotas
  namespace: llm-serving
data:
  quotas.yaml: |
    tenants:
      acme-corp:
        tier: enterprise
        models_allowed:
          - llama-3.1-70b
          - llama-3.1-8b
          - nomic-embed-text
        rate_limits:
          requests_per_minute: 300
          tokens_per_minute: 500000
          concurrent_requests: 50
        budget:
          daily_limit_usd: 500.00
          monthly_limit_usd: 10000.00
          alert_threshold_percent: 80
        priority: high
      startup-xyz:
        tier: standard
        models_allowed:
          - llama-3.1-8b
          - nomic-embed-text
        rate_limits:
          requests_per_minute: 60
          tokens_per_minute: 100000
          concurrent_requests: 10
        budget:
          daily_limit_usd: 50.00
          monthly_limit_usd: 1000.00
          alert_threshold_percent: 80
        priority: medium
      internal-dev:
        tier: free
        models_allowed:
          - llama-3.1-8b
        rate_limits:
          requests_per_minute: 20
          tokens_per_minute: 50000
          concurrent_requests: 5
        budget:
          daily_limit_usd: 10.00
          monthly_limit_usd: 200.00
          alert_threshold_percent: 90
        priority: low
```
## Namespace Isolation for High-Risk Tenants

```yaml
# tenant-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme-corp
  labels:
    tenant: acme-corp
    isolation: strict
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme-corp
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # kubernetes.io/metadata.name is set automatically on every namespace,
    # so no manual namespace labeling is required
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-gateway
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-serving
      ports:
        - port: 8000
          protocol: TCP
    # Allow DNS resolution via cluster DNS pods in kube-system
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-acme-corp
spec:
  hard:
    requests.cpu: "16"
    requests.memory: "64Gi"
    limits.cpu: "32"
    limits.memory: "128Gi"
    requests.nvidia.com/gpu: "4"
    pods: "20"
```
## Request Routing and Rate Limiting

```python
# gateway_router.py
"""Multi-tenant request router with rate limiting and model routing."""
import time
import json

import httpx
import redis
import yaml
from fastapi import FastAPI, HTTPException, Header, Request

app = FastAPI()
redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)

# Load tenant config (mounted from the tenant-quotas ConfigMap)
with open("/etc/config/quotas.yaml") as f:
    TENANT_CONFIG = yaml.safe_load(f)["tenants"]

MODEL_ENDPOINTS = {
    "llama-3.1-70b": "http://vllm-gpt4o-equivalent:8000",
    "llama-3.1-8b": "http://vllm-economy:8000",
    "nomic-embed-text": "http://embedding-service:8000",
}


def check_rate_limit(tenant_id: str, config: dict) -> bool:
    """Check and update the per-minute request counter for a tenant."""
    key = f"ratelimit:{tenant_id}:{int(time.time() // 60)}"
    current = redis_client.incr(key)
    if current == 1:
        # 2-minute TTL so the key outlives its window, then expires on its own
        redis_client.expire(key, 120)
    return current <= config["rate_limits"]["requests_per_minute"]


def check_concurrent(tenant_id: str, config: dict) -> bool:
    """Check the concurrent request limit."""
    key = f"concurrent:{tenant_id}"
    current = int(redis_client.get(key) or 0)
    return current < config["rate_limits"]["concurrent_requests"]


def check_budget(tenant_id: str, config: dict) -> bool:
    """Check whether the tenant is within its daily budget."""
    key = f"spend:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    current_spend = float(redis_client.get(key) or 0)
    return current_spend < config["budget"]["daily_limit_usd"]


def record_usage(tenant_id: str, model: str, prompt_tokens: int, completion_tokens: int):
    """Record token usage and cost for billing."""
    # Cost rates per 1K tokens
    rates = {
        "llama-3.1-70b": {"prompt": 0.004, "completion": 0.012},
        "llama-3.1-8b": {"prompt": 0.0005, "completion": 0.0015},
        "nomic-embed-text": {"prompt": 0.0001, "completion": 0.0},
    }
    rate = rates.get(model, {"prompt": 0.001, "completion": 0.003})
    cost = (prompt_tokens * rate["prompt"] + completion_tokens * rate["completion"]) / 1000

    # Update daily spend (48h TTL keeps yesterday's key around for reporting)
    spend_key = f"spend:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    redis_client.incrbyfloat(spend_key, cost)
    redis_client.expire(spend_key, 172800)

    # Record for billing export
    billing_key = f"billing:{tenant_id}:{time.strftime('%Y-%m')}"
    redis_client.rpush(billing_key, json.dumps({
        "timestamp": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost,
    }))


@app.post("/v1/chat/completions")
async def chat_completions(
    request: Request,
    x_tenant_id: str = Header(...),
    x_api_key: str = Header(...),
):
    """Route a chat completion request with tenant controls."""
    if x_tenant_id not in TENANT_CONFIG:
        raise HTTPException(status_code=403, detail="Unknown tenant")
    # NOTE: x_api_key must also be verified against the tenant's issued keys
    # (see the identity sketch in the Isolation Model section); omitted here.
    config = TENANT_CONFIG[x_tenant_id]
    body = await request.json()
    model = body.get("model", "llama-3.1-8b")

    # Check model access
    if model not in config["models_allowed"]:
        raise HTTPException(status_code=403, detail=f"Model {model} not allowed for tenant")

    # Check rate limit
    if not check_rate_limit(x_tenant_id, config):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Check concurrent requests
    if not check_concurrent(x_tenant_id, config):
        raise HTTPException(status_code=429, detail="Concurrent request limit exceeded")

    # Check budget
    if not check_budget(x_tenant_id, config):
        raise HTTPException(status_code=402, detail="Daily budget exceeded")

    # Route to model endpoint
    endpoint = MODEL_ENDPOINTS.get(model)
    if not endpoint:
        raise HTTPException(status_code=404, detail=f"Model {model} not available")

    # Track concurrent requests
    concurrent_key = f"concurrent:{x_tenant_id}"
    redis_client.incr(concurrent_key)
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{endpoint}/v1/chat/completions",
                json=body,
                headers={"Content-Type": "application/json"},
            )
            # Propagate upstream failures instead of parsing an error body
            if response.status_code != 200:
                raise HTTPException(status_code=response.status_code, detail=response.text)
            result = response.json()

        # Record usage
        usage = result.get("usage", {})
        record_usage(
            x_tenant_id, model,
            usage.get("prompt_tokens", 0),
            usage.get("completion_tokens", 0),
        )
        return result
    finally:
        redis_client.decr(concurrent_key)
```
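The quota file defines `tokens_per_minute`, but the router above only enforces `requests_per_minute`. A minimal sketch of a TPM check that could sit alongside it; the key layout and post-hoc accounting are assumptions, not part of the original router:

```python
# token_rate_limit.py - hypothetical TPM enforcement companion to the router
import time

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def check_token_limit(tenant_id: str, config: dict) -> bool:
    """Reject a request if the tenant has already consumed its token budget
    in the current one-minute window."""
    key = f"tokenlimit:{tenant_id}:{int(time.time() // 60)}"
    used = int(redis_client.get(key) or 0)
    return used < config["rate_limits"]["tokens_per_minute"]


def record_tokens(tenant_id: str, total_tokens: int) -> None:
    """Add actual usage to the window counter after the upstream call,
    since completion length is only known post-hoc."""
    key = f"tokenlimit:{tenant_id}:{int(time.time() // 60)}"
    new_total = redis_client.incrby(key, total_tokens)
    if new_total == total_tokens:  # first write in this window: set a TTL
        redis_client.expire(key, 120)
```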
## Rate Limiting with Envoy

```yaml
# envoy-ratelimit.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-ratelimit-config
  namespace: llm-serving
data:
  config.yaml: |
    domain: llm-gateway
    descriptors:
      # Per-tenant rate limits
      - key: tenant_id
        value: acme-corp
        rate_limit:
          unit: minute
          requests_per_unit: 300
      - key: tenant_id
        value: startup-xyz
        rate_limit:
          unit: minute
          requests_per_unit: 60
      - key: tenant_id
        value: internal-dev
        rate_limit:
          unit: minute
          requests_per_unit: 20
      # Global rate limit as a safety net
      - key: global
        rate_limit:
          unit: second
          requests_per_unit: 100
```
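This config only defines the limits; Envoy still has to populate the `tenant_id` descriptor on each request. A minimal sketch of the route-level rate limit action that copies a tenant header into the descriptor (the `x-tenant-id` header name is assumed to match the router above):

```yaml
# Route-level rate limit action (sketch): map the tenant header to the
# tenant_id descriptor that the rate limit service matches on.
rate_limits:
  - actions:
      - request_headers:
          header_name: "x-tenant-id"
          descriptor_key: "tenant_id"
```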
## Billing Integration

```python
# billing_export.py
"""Export tenant usage data for billing systems."""
import json
from datetime import datetime
from typing import Dict

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def generate_tenant_invoice(tenant_id: str, month: str) -> Dict:
    """Generate a monthly invoice for a tenant."""
    billing_key = f"billing:{tenant_id}:{month}"
    records = redis_client.lrange(billing_key, 0, -1)

    usage_by_model = {}
    total_cost = 0.0
    total_requests = 0

    for record_json in records:
        record = json.loads(record_json)
        model = record["model"]
        if model not in usage_by_model:
            usage_by_model[model] = {
                "requests": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
                "cost_usd": 0.0,
            }
        usage_by_model[model]["requests"] += 1
        usage_by_model[model]["prompt_tokens"] += record["prompt_tokens"]
        usage_by_model[model]["completion_tokens"] += record["completion_tokens"]
        usage_by_model[model]["cost_usd"] += record["cost_usd"]
        total_cost += record["cost_usd"]
        total_requests += 1

    return {
        "tenant_id": tenant_id,
        "billing_period": month,
        "generated_at": datetime.utcnow().isoformat(),
        "summary": {
            "total_requests": total_requests,
            "total_cost_usd": round(total_cost, 4),
        },
        "usage_by_model": usage_by_model,
    }


def get_tenant_spend_today(tenant_id: str) -> float:
    """Get current-day spend for budget alerts."""
    key = f"spend:{tenant_id}:{datetime.utcnow().strftime('%Y-%m-%d')}"
    return float(redis_client.get(key) or 0)
```
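A monthly export job can discover every tenant with activity by scanning the billing keys written by `record_usage()`. A minimal sketch; the `SCAN`-based discovery is an addition, while the key layout comes from the router above:

```python
def export_all_invoices(month: str) -> list:
    """Build invoices for every tenant with billing records in `month`."""
    invoices = []
    # billing:{tenant_id}:{YYYY-MM} keys were written by record_usage()
    for key in redis_client.scan_iter(match=f"billing:*:{month}"):
        tenant_id = key.split(":")[1]
        invoices.append(generate_tenant_invoice(tenant_id, month))
    return invoices
```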
## Noisy-Neighbor Controls

- Per-tenant RPM/TPM limits
- Concurrency caps and queue isolation
- Fair scheduling with weighted priority classes
- Backpressure and graceful degradation policies (see the admission sketch after the priority classes below)
```yaml
# priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-enterprise
value: 1000
globalDefault: false
description: "Enterprise tenant workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-standard
value: 500
globalDefault: false
description: "Standard tenant workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-free
value: 100
globalDefault: false
description: "Free tier tenant workloads"
```
## Per-Tenant Monitoring

```yaml
# tenant-alerts.yaml
groups:
  - name: tenant-alerts
    rules:
      - alert: TenantBudgetWarning
        expr: |
          llm_tenant_daily_spend_usd
          / llm_tenant_daily_budget_usd > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} at 80% of daily budget"
      - alert: TenantRateLimitHitting
        expr: |
          rate(llm_rate_limit_rejections_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Tenant {{ $labels.tenant }} hitting rate limits"
      - alert: TenantErrorRateHigh
        expr: |
          rate(llm_tenant_errors_total[5m])
          / rate(llm_tenant_requests_total[5m]) > 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} error rate above 10%"
```
## Security Baseline
- Encrypt data in transit and at rest.
- Disallow cross-tenant cache leakage.
- Restrict debug data access by role.
- Audit all privileged administrative actions.
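For the cache-leakage rule, the standard control is to scope every cache key by tenant; the troubleshooting table below points at the same fix. A minimal sketch with a hypothetical helper, not taken from the original:

```python
# cache_keys.py - hypothetical tenant-scoped cache key derivation
import hashlib
import json


def cache_key(tenant_id: str, model: str, messages: list) -> str:
    """Hash the prompt, but always prefix with the tenant so one tenant's
    cached completion can never be served to another."""
    digest = hashlib.sha256(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()
    return f"cache:{tenant_id}:{model}:{digest}"
```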
## Operational Runbook
- Onboard tenant with policy template.
- Issue virtual key and quota profile.
- Validate observability and billing tags.
- Run tenant-specific load/safety tests.
- Enable production traffic with canary limits.
## Troubleshooting
| Symptom | Check | Fix |
|---|---|---|
| Tenant getting 429 errors | Rate limit counters in Redis | Increase RPM/TPM limits or upgrade tier |
| One tenant slowing others | Concurrent request counts per tenant | Reduce concurrency cap for offending tenant |
| Billing data missing | Redis billing keys and export job logs | Check billing export CronJob and Redis connectivity |
| Tenant cannot access model | Tenant config in ConfigMap | Add model to models_allowed list |
| Cross-tenant data leakage | Cache key prefixes and namespace isolation | Ensure cache keys include tenant_id prefix |
| Budget alerts not firing | Prometheus scrape targets and alert rules | Verify metric export and Alertmanager config |
## Related Skills
- llm-gateway - Key management and traffic routing
- llm-cost-optimization - Cost controls and optimization tactics
- zero-trust - Identity-centric network and access patterns
- gpu-kubernetes-operations - GPU cluster management
- llm-inference-scaling - Autoscaling inference workloads