LLM Inference Scaling
Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.
When to Use This Skill
Use this skill when:
- LLM API traffic is unpredictable and you need to scale up/down automatically
- Managing a fleet of vLLM or TGI inference pods on Kubernetes
- Reducing inference costs with spot/preemptible GPU instances
- Implementing queue-based autoscaling for batch inference jobs
- Building a multi-model serving platform that shares GPU resources
Prerequisites
- Kubernetes cluster with GPU nodes (NVIDIA operator installed)
- KEDA (Kubernetes Event-Driven Autoscaler) installed
- Prometheus with GPU metrics (dcgm-exporter or gpu-operator)
- Helm 3+ for chart deployments
GPU Node Setup
# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set devicePlugin.enabled=true
# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia
vLLM Deployment with GPU Resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-num-seqs"
            - "128"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "20Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
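Downloading and loading model weights can take much longer than the 60-second initialDelaySeconds used above. One way to handle that, sketched here as an assumption and not as part of the original manifest, is a startupProbe on the same /health endpoint. The periodSeconds and failureThreshold values are illustrative and should be sized to the load time you actually measure.

```yaml
# Container-level fragment for the vllm container (sits alongside readinessProbe).
# Readiness checks only begin once this probe has succeeded.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # illustrative: tolerates up to 10 minutes of startup
```

If the probe exhausts its failure budget, the kubelet restarts the container, so a threshold that is too low can cause restart loops on slow model pulls.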
KEDA Autoscaling on Prometheus Metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300   # seconds; only governs scaling to zero, so it is inactive while minReplicaCount is 1
  pollingInterval: 15   # seconds between metric checks
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_num_requests_waiting
        threshold: "10"   # target of 10 waiting requests per replica
        # The deployment label is assumed to come from your scrape relabeling;
        # vLLM itself labels its metrics with model_name.
        query: |
          sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
    - type: prometheus
      metricType: Value   # compare the fleet-wide average directly instead of dividing it by the replica count
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: vllm_gpu_cache_usage
        threshold: "0.8"   # scale up if KV cache >80% full
        query: |
          avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})
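KEDA's cooldownPeriod applies to the final step down to zero replicas. Scaling between 1 and maxReplicaCount is handled by the HPA that KEDA creates, so scale-down pacing for a deployment with minReplicaCount: 1 is tuned through the HPA behavior block. The fragment below is a sketch with illustrative values, not settings taken from the original configuration.

```yaml
# Fragment of the ScaledObject spec (advanced is a sibling of triggers)
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # illustrative: require 5 min of low load first
          policies:
            - type: Pods
              value: 1             # remove at most one replica...
              periodSeconds: 120   # ...every two minutes
        scaleUp:
          stabilizationWindowSeconds: 0     # react to queue growth immediately
```

Slow scale-down is usually the safer default for GPU pods, since a replica that is removed too eagerly has to pay the full model load time again when traffic returns.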
Queue-Based Scaling (Redis + KEDA)
# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: inference-worker
            image: myapp/inference-worker:latest
            env:
              - name: REDIS_URL
                value: redis://redis:6379
              - name: QUEUE_NAME
                value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: inference-jobs
        listLength: "5" # 1 worker per 5 queued jobs
Spot Instance Strategy
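# Mixed node pool: on-demand + spot GPUs
# Read by the cluster autoscaler only when it runs with --expander=priority
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # the priority expander looks for this exact name
  namespace: kube-system                       # must match the autoscaler's namespace
data:
  priorities: |
    50:   # higher number = higher priority, so spot groups are tried first
      - .*spot.*
    10:
      - .*on-demand.*
---
# Node affinity for spot with on-demand fallback (pod spec fragment)
# Assumes nodes carry a node.kubernetes.io/lifecycle label set by your node group configuration
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [spot]
        - weight: 20
          preference:
            matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values: [on-demand]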
Cluster Autoscaler for GPU Nodes
# AWS EKS: enable cluster autoscaler for GPU node group
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste
# Exclude a specific GPU node from scale-down
# (safe-to-evict is a pod annotation; the node-level annotation is scale-down-disabled)
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
Scaling Metrics to Monitor
# Prometheus queries for scaling decisions
# Requests waiting in vLLM queue (vLLM labels its metrics with model_name)
sum(vllm:num_requests_waiting) by (model_name)
# GPU KV cache utilization (>80% = bottleneck)
avg(vllm:gpu_cache_usage_perc) by (pod)
# Tokens per second throughput
sum(rate(vllm:generation_tokens_total[5m])) by (model_name)
# P99 time-to-first-token, aggregated across pods
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Pods stuck in Pending | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase minReplicaCount |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add PodDisruptionBudget; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test query in Prometheus UI first |
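The spot eviction fix in the table can be sketched as two pieces: a PodDisruptionBudget and a graceful-shutdown window on the pod. The resource name, grace period and preStop delay below are illustrative assumptions. A PodDisruptionBudget only guards voluntary evictions such as node drains, so it helps with spot reclamation only when something drains the node on the interruption notice (for example a node termination handler).

```yaml
# PodDisruptionBudget: keep at least one vLLM pod running during node drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama-8b-pdb   # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
---
# Pod template fragment for the Deployment: give in-flight requests time to finish
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # illustrative value
      containers:
        - name: vllm
          lifecycle:
            preStop:
              exec:
                # illustrative delay so the pod leaves Service endpoints before SIGTERM
                command: ["sh", "-c", "sleep 15"]
```

With a single replica, minAvailable: 1 blocks every drain of that pod's node, so this budget is most useful once the deployment runs two or more replicas.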
Best Practices
- Set minReplicaCount: 1 to avoid cold starts; scale to 0 only for batch jobs.
- Use PodDisruptionBudget with minAvailable: 1 to survive spot evictions.
- Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.
- Separate model families across node pools (A10G for 7B, A100 for 70B).
- Use Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.
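The shared-PVC practice above could look roughly like the following. This is a sketch under assumptions: the claim name, size and access mode are hypothetical, ReadWriteMany needs a storage class that supports it, and the mount path assumes the vllm-openai image keeps its Hugging Face cache in the default location under /root.

```yaml
# PersistentVolumeClaim for cached model weights (hypothetical name and size)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]   # shared across replicas; depends on your storage class
  resources:
    requests:
      storage: 100Gi
---
# Pod template fragment for the Deployment: mount the cache into the vllm container
spec:
  template:
    spec:
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
      containers:
        - name: vllm
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface   # assumed default cache path
```

The first pod to start still downloads the weights; later replicas find them in the cache and skip the download.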
Related Skills
- vllm-server - vLLM configuration and tuning
- gpu-server-management - GPU node setup
- model-serving-kubernetes - KServe
- kubernetes-ops - Core Kubernetes
- llm-cost-optimization - Cost strategies