Model Serving on Kubernetes

Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.

When to Use This Skill

Use this skill when:

  • Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
  • Implementing canary deployments and A/B testing for ML models
  • Autoscaling inference pods based on request rate or GPU metrics
  • Deploying LLMs with Triton or KServe on Kubernetes
  • Managing multiple model versions with traffic splitting

Prerequisites

  • Kubernetes 1.28+ with GPU nodes
  • KServe installed (or Triton standalone)
  • kubectl and helm configured
  • NVIDIA GPU Operator installed on cluster
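
Before installing anything, it can help to confirm the GPU Operator has actually advertised `nvidia.com/gpu` capacity on your nodes. A minimal sketch that parses `kubectl get nodes -o json` output — the helper function is illustrative, not part of any tooling:

```python
def gpu_capacity(nodes_json: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count from `kubectl get nodes -o json`."""
    out = {}
    for node in nodes_json.get("items", []):
        alloc = node["status"].get("allocatable", {})
        gpus = int(alloc.get("nvidia.com/gpu", "0"))
        if gpus:
            out[node["metadata"]["name"]] = gpus
    return out

# Usage (requires cluster access):
# import json, subprocess
# raw = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                      capture_output=True, text=True).stdout
# print(gpu_capacity(json.loads(raw)))
```

An empty result here means GPU pods will stay Pending, so it is worth checking before deploying any GPU-backed InferenceService.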

KServe Installation

# Install KServe with Helm
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update

helm install kserve kserve/kserve \
  --namespace kserve \
  --create-namespace \
  --set kserve.controller.gateway.ingressGateway.className=nginx

# Verify
kubectl get pods -n kserve
kubectl get crd | grep kserve

Basic InferenceService (KServe)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: models
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

# Deploy the InferenceService
kubectl apply -f inference-service.yaml

# Get inference service URL
kubectl get inferenceservice sklearn-iris -n models
# NAME           URL                                          READY   ...
# sklearn-iris   http://sklearn-iris.models.example.com       True

# Test prediction
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
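
The same prediction call can be made from Python with only the standard library. A sketch against the KServe v1 protocol — the URL reuses the example hostname above, and both helpers are hypothetical:

```python
import json
from urllib import request

def build_predict_request(url: str, instances: list) -> request.Request:
    """Build a KServe v1 :predict HTTP request for the given feature rows."""
    body = json.dumps({"instances": instances}).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

def parse_predictions(response_bytes: bytes) -> list:
    """Extract the 'predictions' list from a KServe v1 response body."""
    return json.loads(response_bytes)["predictions"]

req = build_predict_request(
    "http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict",
    [[6.8, 2.8, 4.8, 1.4]],
)
# To actually call the service (requires network access to the cluster ingress):
# preds = parse_predictions(request.urlopen(req).read())
```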

GPU-Enabled LLM InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct"
      - "--port"
      - "8080"                      # vLLM defaults to 8000; match containerPort
      - "--tensor-parallel-size"
      - "1"
      - "--gpu-memory-utilization"
      - "0.90"
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "20Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "24Gi"
          cpu: "8"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
      env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
    nodeSelector:
      nvidia.com/gpu.present: "true"
  transformer:                      # optional pre/post-processing step
    containers:
    - name: kserve-container
      image: kserve/kserve-transformer:latest   # replace with your transformer image

Canary Deployment (Traffic Splitting)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version, 80% to stable
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct-v2"  # new model version
      resources:
        limits:
          nvidia.com/gpu: "1"

# Gradually increase canary traffic
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'

# Promote canary to stable
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
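
The patch commands above can be scripted into a staged ramp. A sketch that walks traffic through fixed steps — `check_canary_healthy()` is a hypothetical gate you would back with your own metrics:

```python
import json
import subprocess

def canary_patch(percent: int) -> list:
    """Build the JSON-patch body used by `kubectl patch` for canaryTrafficPercent."""
    return [{"op": "replace",
             "path": "/spec/predictor/canaryTrafficPercent",
             "value": percent}]

def ramp_canary(name: str, namespace: str, steps=(20, 50, 100)) -> None:
    """Increase canary traffic step by step (requires cluster access)."""
    for pct in steps:
        subprocess.run([
            "kubectl", "patch", "inferenceservice", name, "-n", namespace,
            "--type=json", "-p", json.dumps(canary_patch(pct)),
        ], check=True)
        # Gate each step on canary metrics (p99 latency, error rate):
        # if not check_canary_healthy(name):   # hypothetical helper
        #     rollback(name)                   # hypothetical helper
        #     return
```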

Autoscaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
  namespace: models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-predictor   # RawDeployment mode; in Serverless mode Knative's autoscaler handles scaling
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: kserve_request_count
      threshold: "10"
      query: |
        sum(rate(kserve_request_count_total{namespace="models",
            service="llama-3-8b"}[1m]))
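
With the Prometheus scaler, KEDA sizes replicas roughly as ceil(metric value / threshold), clamped to the configured bounds. A quick sketch of that arithmetic using the values above:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Approximate KEDA's scaling decision: ceil(value / threshold), clamped."""
    wanted = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(35, 10))  # 35 req/s at threshold 10 → 4 replicas
```

This is why the threshold should reflect the request rate one replica can sustain, not an arbitrary round number.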

NVIDIA Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.05-py3
        args:
        - "tritonserver"
        - "--model-store=s3://my-model-store/models"
        - "--model-control-mode=poll"        # auto-load new model versions
        - "--repository-poll-secs=30"
        - "--metrics-port=8002"
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # Metrics
        resources:
          limits:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30

Triton Model Repository Structure

s3://my-model-store/models/
├── text-classifier/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx          # new version; auto-loaded
├── embedding-model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx

# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 1000
}
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [-1] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
  { name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
  { kind: KIND_GPU count: 2 }   # 2 model instances on GPU
]
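
Dynamic batching trades a small queueing delay for throughput: the batch Triton can form is bounded by how many requests arrive within max_queue_delay. A rough sketch of that trade-off — illustrative arithmetic, not Triton's actual scheduler:

```python
def expected_batch(rate_per_sec: float, max_queue_delay_us: int,
                   max_batch_size: int) -> int:
    """Requests arriving within the queue-delay window, capped at max_batch_size."""
    arrivals = rate_per_sec * max_queue_delay_us / 1_000_000
    return max(1, min(max_batch_size, int(arrivals)))

# At 2000 req/s with a 1000 µs window, batches of ~2 form; raising the
# delay to 10 ms would allow batches of ~20 at the cost of added latency.
print(expected_batch(2000, 1000, 64))  # → 2
```

This is why a 1000 µs delay only pays off at high request rates; low-traffic models see little batching benefit from it.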

Model Management Commands

# List models in the repository (Triton's repository API is POST-based)
curl -X POST http://triton:8000/v2/repository/index

# Load a new model version
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load

# Unload a model
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload

# KServe — watch rollout status
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| InferenceService not ready | Model loading or OOM | Check predictor pod logs; increase memory limits |
| Canary stuck at 0% | Knative routing issue | Check `kubectl get ksvc -n models` |
| Triton missing model | S3 permissions or path | Verify IAM role; check `--model-store` path |
| Low GPU utilization | Dynamic batching off | Enable `dynamic_batching` in Triton config |
| Autoscaler not triggering | Prometheus query wrong | Test the query in the Prometheus UI |

Best Practices

  • Use canary deployments for all model updates — roll back in seconds if metrics degrade.
  • Enable Triton dynamic batching — it can increase GPU throughput 5–10× for small models.
  • Store models in S3/GCS with versioned paths (s3://bucket/model/v1/, v2/).
  • Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
  • Monitor p99 latency and error rates per model version during canary rollouts.
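
The last practice can be made concrete: compare canary and stable metrics before each traffic increase. A hedged sketch of a promotion gate — the thresholds and the metric plumbing are yours to supply:

```python
def canary_ok(stable_error_rate: float, canary_error_rate: float,
              stable_p99_ms: float, canary_p99_ms: float,
              max_error_delta: float = 0.01,
              max_latency_ratio: float = 1.2) -> bool:
    """Allow promotion only if the canary is not meaningfully worse than stable."""
    if canary_error_rate > stable_error_rate + max_error_delta:
        return False
    if canary_p99_ms > stable_p99_ms * max_latency_ratio:
        return False
    return True

print(canary_ok(0.002, 0.004, 850, 900))  # → True  (within both budgets)
print(canary_ok(0.002, 0.050, 850, 900))  # → False (error rate regressed)
```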
