# Multi-Tenant LLM Hosting
Host many teams/customers on shared inference infrastructure without sacrificing security, performance, or cost governance.
## When to Use This Skill
- Building an internal LLM platform shared by multiple teams
- Hosting LLM inference for external customers with isolation requirements
- Implementing per-tenant quotas, billing, and rate limiting
- Designing request routing for multi-model, multi-tenant environments
- Preventing noisy-neighbor issues on shared GPU infrastructure
## Prerequisites
- Kubernetes cluster with GPU node pools
- API gateway or LLM gateway (LiteLLM, Envoy, Kong)
- Prometheus + Grafana for per-tenant observability
- Redis or equivalent for rate limiting state
- Billing system or cost attribution database
## Isolation Model
- Strong tenant identity on every request
- Per-tenant API keys and scoped model access
- Namespace or workload isolation for high-risk tenants
- Strict data retention and log partitioning controls
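A minimal sketch of the first two controls, resolving a hashed API key to a tenant identity and its allowed models. The Redis key layout here is an assumption for illustration, not a prescribed schema:

```python
# tenant_identity.py - hypothetical API-key-to-tenant resolution at the gateway
import hashlib
from typing import Optional

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def resolve_tenant(api_key: str) -> Optional[str]:
    """Map a presented API key to a tenant ID. Only a SHA-256 digest of each
    key is stored, so raw credentials never live in Redis."""
    digest = hashlib.sha256(api_key.encode()).hexdigest()
    return redis_client.get(f"apikey:{digest}")  # value is the tenant_id, or None


def model_allowed(tenant_id: str, model: str) -> bool:
    """Scoped model access: a tenant may only call models on its allowlist."""
    return redis_client.sismember(f"tenant:{tenant_id}:models", model)
```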
## vLLM Multi-Model Serving

```yaml
# vllm-deployment.yaml - Multi-model serving with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpt4o-equivalent
  namespace: llm-serving
  labels:
    app: vllm
    model-tier: premium
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
      model-tier: premium
  template:
    metadata:
      labels:
        app: vllm
        model-tier: premium
      annotations:
        prometheus.io/scrape: "true"
        # vLLM's OpenAI server exposes Prometheus metrics at /metrics
        # on the API port itself, not on a separate port
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.1
          args:
            - "--model=/models/llama-3.1-70b"
            - "--tensor-parallel-size=2"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.90"
            - "--max-num-seqs=128"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: inference
          resources:
            requests:
              nvidia.com/gpu: 2
              cpu: "8"
              memory: "64Gi"
            limits:
              nvidia.com/gpu: 2
              cpu: "16"
              memory: "128Gi"
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: premium-model-weights
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-type: a100
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-economy
  namespace: llm-serving
  labels:
    app: vllm
    model-tier: economy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model-tier: economy
  template:
    metadata:
      labels:
        app: vllm
        model-tier: economy
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.1
          args:
            - "--model=/models/llama-3.1-8b"
            - "--max-model-len=4096"
            - "--gpu-memory-utilization=0.85"
            - "--max-num-seqs=256"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
              name: inference
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "64Gi"
          volumeMounts:
            - name: model-weights
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: economy-model-weights
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```
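The router in the Request Routing section addresses each backend by cluster DNS name (for example `http://vllm-gpt4o-equivalent:8000`), which assumes a Service in front of each Deployment. A minimal sketch of those Services; names, selectors, and ports mirror the Deployments above:

```yaml
# vllm-services.yaml - hypothetical Services matching the Deployments above
apiVersion: v1
kind: Service
metadata:
  name: vllm-gpt4o-equivalent
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model-tier: premium
  ports:
    - name: inference
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-economy
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model-tier: economy
  ports:
    - name: inference
      port: 8000
      targetPort: 8000
```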
## Per-Tenant Quota Configuration

```yaml
# tenant-quotas-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-quotas
  namespace: llm-serving
data:
  quotas.yaml: |
    tenants:
      acme-corp:
        tier: enterprise
        models_allowed:
          - llama-3.1-70b
          - llama-3.1-8b
          - nomic-embed-text
        rate_limits:
          requests_per_minute: 300
          tokens_per_minute: 500000
          concurrent_requests: 50
        budget:
          daily_limit_usd: 500.00
          monthly_limit_usd: 10000.00
          alert_threshold_percent: 80
        priority: high
      startup-xyz:
        tier: standard
        models_allowed:
          - llama-3.1-8b
          - nomic-embed-text
        rate_limits:
          requests_per_minute: 60
          tokens_per_minute: 100000
          concurrent_requests: 10
        budget:
          daily_limit_usd: 50.00
          monthly_limit_usd: 1000.00
          alert_threshold_percent: 80
        priority: medium
      internal-dev:
        tier: free
        models_allowed:
          - llama-3.1-8b
        rate_limits:
          requests_per_minute: 20
          tokens_per_minute: 50000
          concurrent_requests: 5
        budget:
          daily_limit_usd: 10.00
          monthly_limit_usd: 200.00
          alert_threshold_percent: 90
        priority: low
```
## Namespace Isolation for High-Risk Tenants

```yaml
# tenant-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme-corp
  labels:
    tenant: acme-corp
    isolation: strict
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme-corp
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # kubernetes.io/metadata.name is set automatically on every namespace,
    # so no manual namespace labeling is required
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-gateway
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llm-serving
      ports:
        - port: 8000
          protocol: TCP
    # Allow DNS resolution via cluster DNS pods in kube-system
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-acme-corp
spec:
  hard:
    requests.cpu: "16"
    requests.memory: "64Gi"
    limits.cpu: "32"
    limits.memory: "128Gi"
    requests.nvidia.com/gpu: "4"
    pods: "20"
```
## Request Routing and Rate Limiting

```python
# gateway_router.py
"""Multi-tenant request router with rate limiting and model routing."""
import time
import json

import httpx
import redis
import yaml
from fastapi import FastAPI, HTTPException, Header, Request

app = FastAPI()
redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)

# Load tenant config (mounted from the tenant-quotas ConfigMap)
with open("/etc/config/quotas.yaml") as f:
    TENANT_CONFIG = yaml.safe_load(f)["tenants"]

MODEL_ENDPOINTS = {
    "llama-3.1-70b": "http://vllm-gpt4o-equivalent:8000",
    "llama-3.1-8b": "http://vllm-economy:8000",
    "nomic-embed-text": "http://embedding-service:8000",
}


def check_rate_limit(tenant_id: str, config: dict) -> bool:
    """Check and update the per-minute request counter for a tenant."""
    key = f"ratelimit:{tenant_id}:{int(time.time() // 60)}"
    current = redis_client.incr(key)
    if current == 1:
        # 2-minute TTL so the key outlives its window, then expires on its own
        redis_client.expire(key, 120)
    return current <= config["rate_limits"]["requests_per_minute"]


def check_concurrent(tenant_id: str, config: dict) -> bool:
    """Check the concurrent request limit."""
    key = f"concurrent:{tenant_id}"
    current = int(redis_client.get(key) or 0)
    return current < config["rate_limits"]["concurrent_requests"]


def check_budget(tenant_id: str, config: dict) -> bool:
    """Check whether the tenant is within its daily budget."""
    key = f"spend:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    current_spend = float(redis_client.get(key) or 0)
    return current_spend < config["budget"]["daily_limit_usd"]


def record_usage(tenant_id: str, model: str, prompt_tokens: int, completion_tokens: int):
    """Record token usage and cost for billing."""
    # Cost rates per 1K tokens
    rates = {
        "llama-3.1-70b": {"prompt": 0.004, "completion": 0.012},
        "llama-3.1-8b": {"prompt": 0.0005, "completion": 0.0015},
        "nomic-embed-text": {"prompt": 0.0001, "completion": 0.0},
    }
    rate = rates.get(model, {"prompt": 0.001, "completion": 0.003})
    cost = (prompt_tokens * rate["prompt"] + completion_tokens * rate["completion"]) / 1000

    # Update daily spend (48h TTL keeps yesterday's key around for reporting)
    spend_key = f"spend:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    redis_client.incrbyfloat(spend_key, cost)
    redis_client.expire(spend_key, 172800)

    # Record for billing export
    billing_key = f"billing:{tenant_id}:{time.strftime('%Y-%m')}"
    redis_client.rpush(billing_key, json.dumps({
        "timestamp": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost,
    }))


@app.post("/v1/chat/completions")
async def chat_completions(
    request: Request,
    x_tenant_id: str = Header(...),
    x_api_key: str = Header(...),
):
    """Route a chat completion request with tenant controls."""
    if x_tenant_id not in TENANT_CONFIG:
        raise HTTPException(status_code=403, detail="Unknown tenant")
    # NOTE: x_api_key must also be verified against the tenant's issued keys
    # (see the identity sketch in the Isolation Model section); omitted here.
    config = TENANT_CONFIG[x_tenant_id]
    body = await request.json()
    model = body.get("model", "llama-3.1-8b")

    # Check model access
    if model not in config["models_allowed"]:
        raise HTTPException(status_code=403, detail=f"Model {model} not allowed for tenant")

    # Check rate limit
    if not check_rate_limit(x_tenant_id, config):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Check concurrent requests
    if not check_concurrent(x_tenant_id, config):
        raise HTTPException(status_code=429, detail="Concurrent request limit exceeded")

    # Check budget
    if not check_budget(x_tenant_id, config):
        raise HTTPException(status_code=402, detail="Daily budget exceeded")

    # Route to model endpoint
    endpoint = MODEL_ENDPOINTS.get(model)
    if not endpoint:
        raise HTTPException(status_code=404, detail=f"Model {model} not available")

    # Track concurrent requests
    concurrent_key = f"concurrent:{x_tenant_id}"
    redis_client.incr(concurrent_key)
    try:
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(
                f"{endpoint}/v1/chat/completions",
                json=body,
                headers={"Content-Type": "application/json"},
            )
            # Propagate upstream failures instead of parsing an error body
            if response.status_code != 200:
                raise HTTPException(status_code=response.status_code, detail=response.text)
            result = response.json()

        # Record usage
        usage = result.get("usage", {})
        record_usage(
            x_tenant_id, model,
            usage.get("prompt_tokens", 0),
            usage.get("completion_tokens", 0),
        )
        return result
    finally:
        redis_client.decr(concurrent_key)
```
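The quota file defines `tokens_per_minute`, but the router above only enforces `requests_per_minute`. A minimal sketch of a TPM check that could sit alongside it; the key layout and post-hoc accounting are assumptions, not part of the original router:

```python
# token_rate_limit.py - hypothetical TPM enforcement companion to the router
import time

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def check_token_limit(tenant_id: str, config: dict) -> bool:
    """Reject a request if the tenant has already consumed its token budget
    in the current one-minute window."""
    key = f"tokenlimit:{tenant_id}:{int(time.time() // 60)}"
    used = int(redis_client.get(key) or 0)
    return used < config["rate_limits"]["tokens_per_minute"]


def record_tokens(tenant_id: str, total_tokens: int) -> None:
    """Add actual usage to the window counter after the upstream call,
    since completion length is only known post-hoc."""
    key = f"tokenlimit:{tenant_id}:{int(time.time() // 60)}"
    new_total = redis_client.incrby(key, total_tokens)
    if new_total == total_tokens:  # first write in this window: set a TTL
        redis_client.expire(key, 120)
```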
## Rate Limiting with Envoy

```yaml
# envoy-ratelimit.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-ratelimit-config
  namespace: llm-serving
data:
  config.yaml: |
    domain: llm-gateway
    descriptors:
      # Per-tenant rate limits
      - key: tenant_id
        value: acme-corp
        rate_limit:
          unit: minute
          requests_per_unit: 300
      - key: tenant_id
        value: startup-xyz
        rate_limit:
          unit: minute
          requests_per_unit: 60
      - key: tenant_id
        value: internal-dev
        rate_limit:
          unit: minute
          requests_per_unit: 20
      # Global rate limit as a safety net
      - key: global
        rate_limit:
          unit: second
          requests_per_unit: 100
```
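This config only defines the limits; Envoy still has to populate the `tenant_id` descriptor on each request. A minimal sketch of the route-level rate limit action that copies a tenant header into the descriptor (the `x-tenant-id` header name is assumed to match the router above):

```yaml
# Route-level rate limit action (sketch): map the tenant header to the
# tenant_id descriptor that the rate limit service matches on.
rate_limits:
  - actions:
      - request_headers:
          header_name: "x-tenant-id"
          descriptor_key: "tenant_id"
```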
## Billing Integration

```python
# billing_export.py
"""Export tenant usage data for billing systems."""
import json
from datetime import datetime
from typing import Dict

import redis

redis_client = redis.Redis(host="redis", port=6379, decode_responses=True)


def generate_tenant_invoice(tenant_id: str, month: str) -> Dict:
    """Generate a monthly invoice for a tenant."""
    billing_key = f"billing:{tenant_id}:{month}"
    records = redis_client.lrange(billing_key, 0, -1)

    usage_by_model = {}
    total_cost = 0.0
    total_requests = 0

    for record_json in records:
        record = json.loads(record_json)
        model = record["model"]
        if model not in usage_by_model:
            usage_by_model[model] = {
                "requests": 0,
                "prompt_tokens": 0,
                "completion_tokens": 0,
                "cost_usd": 0.0,
            }
        usage_by_model[model]["requests"] += 1
        usage_by_model[model]["prompt_tokens"] += record["prompt_tokens"]
        usage_by_model[model]["completion_tokens"] += record["completion_tokens"]
        usage_by_model[model]["cost_usd"] += record["cost_usd"]
        total_cost += record["cost_usd"]
        total_requests += 1

    return {
        "tenant_id": tenant_id,
        "billing_period": month,
        "generated_at": datetime.utcnow().isoformat(),
        "summary": {
            "total_requests": total_requests,
            "total_cost_usd": round(total_cost, 4),
        },
        "usage_by_model": usage_by_model,
    }


def get_tenant_spend_today(tenant_id: str) -> float:
    """Get current-day spend for budget alerts."""
    key = f"spend:{tenant_id}:{datetime.utcnow().strftime('%Y-%m-%d')}"
    return float(redis_client.get(key) or 0)
```
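A monthly export job can discover every tenant with activity by scanning the billing keys written by `record_usage()`. A minimal sketch; the `SCAN`-based discovery is an addition, while the key layout comes from the router above:

```python
def export_all_invoices(month: str) -> list:
    """Build invoices for every tenant with billing records in `month`."""
    invoices = []
    # billing:{tenant_id}:{YYYY-MM} keys were written by record_usage()
    for key in redis_client.scan_iter(match=f"billing:*:{month}"):
        tenant_id = key.split(":")[1]
        invoices.append(generate_tenant_invoice(tenant_id, month))
    return invoices
```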
## Noisy-Neighbor Controls

- Per-tenant RPM/TPM limits
- Concurrency caps and queue isolation
- Fair scheduling with weighted priority classes
- Backpressure and graceful degradation policies (see the admission sketch after the priority classes below)
```yaml
# priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-enterprise
value: 1000
globalDefault: false
description: "Enterprise tenant workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-standard
value: 500
globalDefault: false
description: "Standard tenant workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-free
value: 100
globalDefault: false
description: "Free tier tenant workloads"
```
## Per-Tenant Monitoring

```yaml
# tenant-alerts.yaml
groups:
  - name: tenant-alerts
    rules:
      - alert: TenantBudgetWarning
        expr: |
          llm_tenant_daily_spend_usd
          / llm_tenant_daily_budget_usd > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} at 80% of daily budget"
      - alert: TenantRateLimitHitting
        expr: |
          rate(llm_rate_limit_rejections_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Tenant {{ $labels.tenant }} hitting rate limits"
      - alert: TenantErrorRateHigh
        expr: |
          rate(llm_tenant_errors_total[5m])
          / rate(llm_tenant_requests_total[5m]) > 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} error rate above 10%"
```
## Security Baseline
- Encrypt data in transit and at rest.
- Disallow cross-tenant cache leakage.
- Restrict debug data access by role.
- Audit all privileged administrative actions.
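For the cache-leakage rule, the standard control is to scope every cache key by tenant; the troubleshooting table below points at the same fix. A minimal sketch with a hypothetical helper, not taken from the original:

```python
# cache_keys.py - hypothetical tenant-scoped cache key derivation
import hashlib
import json


def cache_key(tenant_id: str, model: str, messages: list) -> str:
    """Hash the prompt, but always prefix with the tenant so one tenant's
    cached completion can never be served to another."""
    digest = hashlib.sha256(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()
    return f"cache:{tenant_id}:{model}:{digest}"
```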
## Operational Runbook
- Onboard tenant with policy template.
- Issue virtual key and quota profile.
- Validate observability and billing tags.
- Run tenant-specific load/safety tests.
- Enable production traffic with canary limits.
## Troubleshooting
| Symptom | Check | Fix |
|---|---|---|
| Tenant getting 429 errors | Rate limit counters in Redis | Increase RPM/TPM limits or upgrade tier |
| One tenant slowing others | Concurrent request counts per tenant | Reduce concurrency cap for offending tenant |
| Billing data missing | Redis billing keys and export job logs | Check billing export CronJob and Redis connectivity |
| Tenant cannot access model | Tenant config in ConfigMap | Add model to models_allowed list |
| Cross-tenant data leakage | Cache key prefixes and namespace isolation | Ensure cache keys include tenant_id prefix |
| Budget alerts not firing | Prometheus scrape targets and alert rules | Verify metric export and Alertmanager config |
## Related Skills
- llm-gateway - Key management and traffic routing
- llm-cost-optimization - Cost controls and optimization tactics
- zero-trust - Identity-centric network and access patterns
- gpu-kubernetes-operations - GPU cluster management
- llm-inference-scaling - Autoscaling inference workloads