LLM Inference Scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.

When to Use This Skill

Use this skill when:

Handling unpredictable LLM API traffic that needs automatic scale-up and scale-down
Managing a fleet of vLLM or TGI inference pods on Kubernetes
Reducing inference costs with spot/preemptible GPU instances
Implementing queue-based autoscaling for batch inference jobs
Building a multi-model serving platform that shares GPU resources

Prerequisites

Kubernetes cluster with GPU nodes (NVIDIA operator installed)
KEDA (Kubernetes Event-driven Autoscaling) installed
Prometheus with GPU metrics (dcgm-exporter or gpu-operator)
Helm 3+ for chart deployments

GPU Node Setup

# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set dcgmExporter.enabled=true \
  --set devicePlugin.enabled=true

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
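
To confirm that the device plugin actually schedules GPUs, a throwaway pod that runs nvidia-smi can help. The sketch below is not part of the original setup: the pod name gpu-smoke-test is hypothetical, and the CUDA image tag is an assumption that should be swapped for one compatible with your driver version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; pick one that matches your driver
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # request a single GPU from the device plugin
```

If the pod completes and `kubectl logs gpu-smoke-test` lists a GPU, scheduling works and the pod can be deleted.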

vLLM Deployment with GPU Resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-num-seqs"
        - "128"
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "20Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
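
The autoscaling triggers later in this skill depend on Prometheus scraping vLLM, which serves its metrics at /metrics on the API port. A minimal sketch of one way to enable that, assuming Prometheus discovers pods through the conventional prometheus.io/* annotations (clusters that use ServiceMonitor objects need a different approach):

```yaml
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"      # same port as the OpenAI-compatible API
        prometheus.io/path: "/metrics"  # vLLM Prometheus endpoint
```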

KEDA Autoscaling on Prometheus Metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300          # seconds before scaling to 0; scale-down from N to 1 is paced by the HPA
  pollingInterval: 15          # seconds between trigger checks
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"           # target of about 10 waiting requests per replica
      # model_name is the label vLLM attaches to its own metrics
      query: |
        sum(vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"})
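  - type: prometheus
    metricType: Value           # compare the averaged ratio directly; the default AverageValue would divide it by replica count
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_gpu_cache_usage
      threshold: "0.8"          # scale up if average KV cache usage exceeds 80%
      query: |
        avg(vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"})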
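
While at least one replica is kept running, scale-down is carried out by the HPA that KEDA creates, so its pace can be tuned through the ScaledObject's advanced.horizontalPodAutoscalerConfig section. A sketch of such a fragment follows; the window and policy values are assumptions to adjust against your model load time, not recommendations from the original setup.

```yaml
# Fragment to merge into the vllm-scaledobject spec
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # assumed: require 5 min of low load first
          policies:
          - type: Pods
            value: 1                        # assumed: remove at most 1 replica
            periodSeconds: 60               # per 60-second window
```

Removing replicas slowly is usually the cheaper mistake here, since a replica that is removed too early has to reload model weights before it can serve again.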

Queue-Based Scaling (Redis + KEDA)

# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: inference-worker
          image: myapp/inference-worker:latest
          env:
          - name: REDIS_URL
            value: redis://redis:6379
          - name: QUEUE_NAME
            value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: inference-jobs
      listLength: "5"           # 1 worker per 5 queued jobs
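
The worker template above requests no GPU. If the worker loads the model itself rather than calling a vLLM service, it would likely need a GPU limit and the same toleration as the vLLM Deployment. A sketch of the fragment to merge into the ScaledJob, with one GPU per worker as an assumption:

```yaml
spec:
  jobTargetRef:
    template:
      spec:
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        containers:
        - name: inference-worker
          resources:
            limits:
              nvidia.com/gpu: 1     # assumed: one GPU per worker
```

With GPU-bound workers, maxReplicaCount is effectively capped by the number of GPUs the node groups can provide.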

Spot Instance Strategy

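# Mixed node pool: on-demand + spot GPUs
# Read only when cluster autoscaler runs with --expander=priority
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # the priority expander requires this exact name
  namespace: kube-system                       # same namespace as the cluster autoscaler
data:
  priorities: |
    50:  # highest value wins, so spot groups are tried first
    - .*spot.*
    10:
    - .*on-demand.*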
---
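# Node affinity for spot with on-demand fallback (pod spec fragment, e.g. spec.template.spec of the Deployment)
# Assumes nodes carry a node.kubernetes.io/lifecycle label; EKS managed node groups
# set eks.amazonaws.com/capacityType (SPOT / ON_DEMAND) instead
spec: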
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [spot]
      - weight: 20
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [on-demand]

Cluster Autoscaler for GPU Nodes

# AWS EKS: enable cluster autoscaler for GPU node group
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste

# Exclude a specific GPU node from scale-down
# (safe-to-evict is a pod annotation and has no effect on nodes)
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
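
The cluster autoscaler reads cluster-autoscaler.kubernetes.io/safe-to-evict from pod annotations. A sketch of how it might be set on the vLLM pod template so that a node hosting a serving replica is not removed during scale-down; whether to block scale-down this way is a cost trade-off, since such nodes stay up until the pod itself goes away.

```yaml
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```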

Scaling Metrics to Monitor

# Prometheus queries for scaling decisions
# Requests waiting in vLLM queue
sum(vllm:num_requests_waiting) by (model_name)

# GPU KV cache utilization (0-1 scale; >0.8 = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)

# Tokens per second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model_name)

# P99 time-to-first-token (aggregate buckets across pods before taking the quantile)
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
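
These queries can also back alerts, so that a backlog is noticed even when autoscaling has hit maxReplicaCount. A sketch of a Prometheus rule group, where the group name, alert names, thresholds, and durations are all assumptions to tune:

```yaml
groups:
- name: vllm-scaling              # hypothetical group name
  rules:
  - alert: VLLMQueueBacklog       # hypothetical alert name
    expr: sum(vllm:num_requests_waiting) > 10
    for: 5m                       # assumed: ignore short bursts
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue has stayed above 10 for 5 minutes"
  - alert: VLLMKVCacheSaturated   # hypothetical alert name
    expr: avg(vllm:gpu_cache_usage_perc) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Average vLLM KV cache usage has stayed above 80% for 10 minutes"
```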

Common Issues

Issue	Cause	Fix
Pods stuck in `Pending`	No GPU nodes available	Check cluster autoscaler logs; verify node group limits
Scale-up too slow	Cluster autoscaler delay + model load time	Pre-warm replicas; increase `minReplicaCount`
GPU fragmentation	Multiple small models on large GPUs	Use MIG partitioning or consolidate model sizes
Spot eviction causes errors	Spot instance reclamation	Add `PodDisruptionBudget`; use graceful shutdown
KEDA not scaling	Prometheus query returns no data	Test query in Prometheus UI first
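
For the graceful-shutdown fix in the table, one possible shape is a preStop delay plus a termination grace period long enough for in-flight generations to finish. The values below are assumptions (AWS gives roughly two minutes of spot interruption notice), the sleep command presumes the image ships that utility, and the whole approach only helps if a node termination handler drains the node when the notice arrives.

```yaml
# Fragment to merge into the vllm-llama-8b Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 110   # assumed: fits inside a 2-minute spot notice
      containers:
      - name: vllm
        lifecycle:
          preStop:
            exec:
              # give the Service time to stop routing new requests to this pod
              command: ["sleep", "15"]
```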

Best Practices

Set minReplicaCount: 1 to avoid cold starts; scale to 0 only for batch jobs.
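Use a PodDisruptionBudget with minAvailable: 1 so that node drains, including those triggered by a spot termination handler, never evict the last replica; a PDB cannot block the reclamation itself.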
Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.
Separate model families across node pools (A10G for 7B, A100 for 70B).
Use Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.
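
The PodDisruptionBudget practice above can be sketched as follows for the vLLM Deployment in this skill. The object name is hypothetical; the selector reuses the Deployment's labels.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama-8b-pdb         # hypothetical name
spec:
  minAvailable: 1                 # keep at least one serving replica during drains
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
```

With a single replica, this budget blocks every voluntary eviction, so node drains stall until a second replica is running.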

Related Skills

vllm-server - vLLM configuration and tuning
gpu-server-management - GPU node setup
model-serving-kubernetes - KServe
kubernetes-ops - Core Kubernetes
llm-cost-optimization - Cost strategies
