LLM Inference Scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.

When to Use This Skill

Use this skill when:

  • LLM API traffic is unpredictable and you need to scale up/down automatically
  • Managing a fleet of vLLM or TGI inference pods on Kubernetes
  • Reducing inference costs with spot/preemptible GPU instances
  • Implementing queue-based autoscaling for batch inference jobs
  • Building a multi-model serving platform that shares GPU resources

Prerequisites

  • Kubernetes cluster with GPU nodes (NVIDIA operator installed)
  • KEDA (Kubernetes Event-Driven Autoscaler) installed
  • Prometheus with GPU metrics (dcgm-exporter or gpu-operator)
  • Helm 3+ for chart deployments

GPU Node Setup

# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set devicePlugin.enabled=true

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia

vLLM Deployment with GPU Resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # pin a specific release tag in production
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-num-seqs"
        - "128"
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "20Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token

KEDA Autoscaling on Prometheus Metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
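  cooldownPeriod: 300          # applies only when scaling to zero; with minReplicaCount: 1,
                               # scale-down timing follows the HPA stabilization window (300 s by default)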
  pollingInterval: 15
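  # The deployment="..." selectors below assume Prometheus relabeling adds a "deployment"
  # label; vLLM itself only attaches model_name to its metrics.
  triggers: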
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"           # target of ~10 waiting requests per replica (the sum is divided by replica count)
      query: |
        sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
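  - type: prometheus
    metricType: Value           # compare the average directly; the default AverageValue would divide it by replica count
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_gpu_cache_usage
      threshold: "0.8"          # scale up if average KV cache usage exceeds 80%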
      query: |
        avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})
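
How fast replicas are added and removed between minReplicaCount and maxReplicaCount is decided by the HPA that KEDA creates. A minimal sketch of tuning it through the ScaledObject's `advanced.horizontalPodAutoscalerConfig.behavior` block follows; every window and policy value here is an illustrative assumption, not a tuned recommendation.

```yaml
# Fragment: merge under spec: of the vllm-scaledobject ScaledObject.
# All numbers are assumptions to adjust for your model load time and traffic.
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 0      # react to queue growth immediately
        policies:
        - type: Pods
          value: 2                         # add at most 2 replicas per 60 s
          periodSeconds: 60
      scaleDown:
        stabilizationWindowSeconds: 300    # require 5 min of low load before removing pods
        policies:
        - type: Pods
          value: 1                         # remove at most 1 replica per 120 s
          periodSeconds: 120
```

A slow scale-down policy is usually the safer default for LLM serving, since each removed replica has to reload model weights if traffic returns.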

Queue-Based Scaling (Redis + KEDA)

# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: inference-worker
          image: myapp/inference-worker:latest
          env:
          - name: REDIS_URL
            value: redis://redis:6379
          - name: QUEUE_NAME
            value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: inference-jobs
      listLength: "5"           # 1 worker per 5 queued jobs

Spot Instance Strategy

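# Mixed node pool: on-demand + spot GPUs
# Read by cluster autoscaler only when it runs with --expander=priority.
# The ConfigMap must be named cluster-autoscaler-priority-expander and live
# in the autoscaler's namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:  # highest value wins = preferred
    - .*spot.*
    10:
    - .*on-demand.*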
---
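# Node affinity for spot with on-demand fallback (fragment: merge into the Deployment's pod template spec).
# node.kubernetes.io/lifecycle must be set on your node groups; EKS managed node groups
# expose eks.amazonaws.com/capacityType (SPOT / ON_DEMAND) instead.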
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [spot]
      - weight: 20
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [on-demand]
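
Spot reclamation arrives with a short warning (about two minutes on AWS), so a pod on a spot node needs time to finish in-flight generations before it is killed. The sketch below shows pod-template settings for that. It assumes something drains the node when the interruption notice arrives (for example a managed node group or a termination handler), that the container image provides `sh`, and that 120 s and 15 s are reasonable values for your workload.

```yaml
# Fragment: merge into spec.template.spec of the vllm-llama-8b Deployment.
terminationGracePeriodSeconds: 120   # assumption: covers the longest in-flight generation
containers:
- name: vllm
  lifecycle:
    preStop:
      exec:
        # Delay SIGTERM so the pod is removed from Service endpoints
        # before the server stops accepting requests.
        command: ["sh", "-c", "sleep 15"]
```

Keep the grace period below the provider's interruption notice; anything longer is cut off when the node disappears.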

Cluster Autoscaler for GPU Nodes

# AWS EKS: enable cluster autoscaler for the GPU node group
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=priority   # needed for the priority ConfigMap under "Spot Instance Strategy"; least-waste ignores it

# Protect a GPU node from autoscaler scale-down
# (safe-to-evict is a pod annotation; the node-level annotation is scale-down-disabled)
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

Scaling Metrics to Monitor

# Prometheus queries for scaling decisions
# Requests waiting in vLLM queue
sum(vllm:num_requests_waiting) by (model_name)

# GPU KV cache utilization (>80% = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)

# Tokens per second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model_name)

# P99 time-to-first-token
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| Pods stuck in Pending | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase minReplicaCount |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add PodDisruptionBudget; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test query in Prometheus UI first |
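
The PodDisruptionBudget suggested for spot evictions could look like the sketch below. The name `vllm-llama-8b-pdb` is hypothetical; the selector reuses the labels from the vLLM Deployment. A PodDisruptionBudget only limits voluntary disruptions such as node drains, so it helps when nodes are drained ahead of reclamation and does nothing for an abrupt node loss.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama-8b-pdb        # hypothetical name
spec:
  minAvailable: 1                # keep at least one replica serving during drains
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
```

With a single replica this budget allows zero disruptions and blocks drains entirely, so it is most useful once at least two replicas are running.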

Best Practices

  • Set minReplicaCount: 1 to avoid cold starts; scale to 0 only for batch jobs.
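  • Use a PodDisruptionBudget with minAvailable: 1 so node drains ahead of spot reclamation never evict the last serving replica; run at least two replicas, or the budget blocks every drain.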
  • Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.
  • Separate model families across node pools (A10G for 7B, A100 for 70B).
  • Use Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.
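
For the VPA recommendation, one cautious starting point is a recommendation-only VerticalPodAutoscaler restricted to CPU and memory, leaving replica count to KEDA and GPU requests untouched. This sketch assumes the VPA components are installed in the cluster; the object name is hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-llama-8b-vpa        # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-8b
  updatePolicy:
    updateMode: "Off"            # only publish recommendations; never restart pods
  resourcePolicy:
    containerPolicies:
    - containerName: vllm
      controlledResources: ["cpu", "memory"]   # leave nvidia.com/gpu alone
```

Read the recommendations with `kubectl describe vpa vllm-llama-8b-vpa` and apply them to the Deployment by hand; automatic update modes restart pods, which forces a model reload.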
