Model Serving on Kubernetes

Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.

When to Use This Skill

Use this skill when:

  • Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
  • Implementing canary deployments and A/B testing for ML models
  • Autoscaling inference pods based on request rate or GPU metrics
  • Deploying LLMs with Triton or KServe on Kubernetes
  • Managing multiple model versions with traffic splitting

Prerequisites

  • Kubernetes 1.28+ with GPU nodes
  • KServe installed (or Triton standalone)
  • kubectl and helm configured
  • NVIDIA GPU Operator installed on cluster
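
Before installing anything, it can help to confirm the GPU Operator has actually advertised `nvidia.com/gpu` capacity on your nodes. A minimal sketch that parses `kubectl get nodes -o json` output — the helper function is illustrative, not part of any tooling:

```python
def gpu_capacity(nodes_json: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count from `kubectl get nodes -o json`."""
    out = {}
    for node in nodes_json.get("items", []):
        alloc = node["status"].get("allocatable", {})
        gpus = int(alloc.get("nvidia.com/gpu", "0"))
        if gpus:
            out[node["metadata"]["name"]] = gpus
    return out

# Usage (requires cluster access):
# import json, subprocess
# raw = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                      capture_output=True, text=True).stdout
# print(gpu_capacity(json.loads(raw)))
```

An empty result here means GPU pods will stay Pending, so it is worth checking before deploying any GPU-backed InferenceService.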

KServe Installation

# Install KServe with Helm
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update

helm install kserve kserve/kserve \
  --namespace kserve \
  --create-namespace \
  --set kserve.controller.gateway.ingressGateway.className=nginx

# Verify
kubectl get pods -n kserve
kubectl get crd | grep kserve

Basic InferenceService (KServe)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: models
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

# Deploy the InferenceService
kubectl apply -f inference-service.yaml

# Get inference service URL
kubectl get inferenceservice sklearn-iris -n models
# NAME           URL                                          READY   ...
# sklearn-iris   http://sklearn-iris.models.example.com       True

# Test prediction
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
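
The same prediction call can be made from Python with only the standard library. A sketch against the KServe v1 protocol — the URL reuses the example hostname above, and both helpers are hypothetical:

```python
import json
from urllib import request

def build_predict_request(url: str, instances: list) -> request.Request:
    """Build a KServe v1 :predict HTTP request for the given feature rows."""
    body = json.dumps({"instances": instances}).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

def parse_predictions(response_bytes: bytes) -> list:
    """Extract the 'predictions' list from a KServe v1 response body."""
    return json.loads(response_bytes)["predictions"]

req = build_predict_request(
    "http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict",
    [[6.8, 2.8, 4.8, 1.4]],
)
# To actually call the service (requires network access to the cluster ingress):
# preds = parse_predictions(request.urlopen(req).read())
```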

GPU-Enabled LLM InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct"
      - "--port"
      - "8080"                      # vLLM defaults to 8000; match containerPort
      - "--tensor-parallel-size"
      - "1"
      - "--gpu-memory-utilization"
      - "0.90"
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "20Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "24Gi"
          cpu: "8"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
      env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
    nodeSelector:
      nvidia.com/gpu.present: "true"
  transformer:                      # optional pre/post-processing step
    containers:
    - name: kserve-container
      image: kserve/kserve-transformer:latest   # replace with your transformer image

Canary Deployment (Traffic Splitting)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version, 80% to stable
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct-v2"  # new model version
      resources:
        limits:
          nvidia.com/gpu: "1"

# Gradually increase canary traffic
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'

# Promote canary to stable
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
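
The patch commands above can be scripted into a staged ramp. A sketch that walks traffic through fixed steps — `check_canary_healthy()` is a hypothetical gate you would back with your own metrics:

```python
import json
import subprocess

def canary_patch(percent: int) -> list:
    """Build the JSON-patch body used by `kubectl patch` for canaryTrafficPercent."""
    return [{"op": "replace",
             "path": "/spec/predictor/canaryTrafficPercent",
             "value": percent}]

def ramp_canary(name: str, namespace: str, steps=(20, 50, 100)) -> None:
    """Increase canary traffic step by step (requires cluster access)."""
    for pct in steps:
        subprocess.run([
            "kubectl", "patch", "inferenceservice", name, "-n", namespace,
            "--type=json", "-p", json.dumps(canary_patch(pct)),
        ], check=True)
        # Gate each step on canary metrics (p99 latency, error rate):
        # if not check_canary_healthy(name):   # hypothetical helper
        #     rollback(name)                   # hypothetical helper
        #     return
```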

Autoscaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
  namespace: models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-predictor   # RawDeployment mode; in Serverless mode Knative's autoscaler handles scaling
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: kserve_request_count
      threshold: "10"
      query: |
        sum(rate(kserve_request_count_total{namespace="models",
            service="llama-3-8b"}[1m]))
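
With the Prometheus scaler, KEDA sizes replicas roughly as ceil(metric value / threshold), clamped to the configured bounds. A quick sketch of that arithmetic using the values above:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Approximate KEDA's scaling decision: ceil(value / threshold), clamped."""
    wanted = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(35, 10))  # 35 req/s at threshold 10 → 4 replicas
```

This is why the threshold should reflect the request rate one replica can sustain, not an arbitrary round number.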

NVIDIA Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.05-py3
        args:
        - "tritonserver"
        - "--model-store=s3://my-model-store/models"
        - "--model-control-mode=poll"        # auto-load new model versions
        - "--repository-poll-secs=30"
        - "--metrics-port=8002"
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # Metrics
        resources:
          limits:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30

Triton Model Repository Structure

s3://my-model-store/models/
├── text-classifier/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx          # new version; auto-loaded
├── embedding-model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx

# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 1000
}
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [-1] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
  { name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
  { kind: KIND_GPU count: 2 }   # 2 model instances on GPU
]
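
Dynamic batching trades a small queueing delay for throughput: the batch Triton can form is bounded by how many requests arrive within max_queue_delay. A rough sketch of that trade-off — illustrative arithmetic, not Triton's actual scheduler:

```python
def expected_batch(rate_per_sec: float, max_queue_delay_us: int,
                   max_batch_size: int) -> int:
    """Requests arriving within the queue-delay window, capped at max_batch_size."""
    arrivals = rate_per_sec * max_queue_delay_us / 1_000_000
    return max(1, min(max_batch_size, int(arrivals)))

# At 2000 req/s with a 1000 µs window, batches of ~2 form; raising the
# delay to 10 ms would allow batches of ~20 at the cost of added latency.
print(expected_batch(2000, 1000, 64))  # → 2
```

This is why a 1000 µs delay only pays off at high request rates; low-traffic models see little batching benefit from it.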

Model Management Commands

# List models in the repository (Triton's repository API is POST-based)
curl -X POST http://triton:8000/v2/repository/index

# Load a new model version
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load

# Unload a model
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload

# KServe — watch rollout status
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| InferenceService not ready | Model loading or OOM | Check predictor pod logs; increase memory limits |
| Canary stuck at 0% | Knative routing issue | Check `kubectl get ksvc -n models` |
| Triton missing model | S3 permissions or path | Verify IAM role; check `--model-store` path |
| Low GPU utilization | Dynamic batching off | Enable `dynamic_batching` in Triton config |
| Autoscaler not triggering | Prometheus query wrong | Test the query in the Prometheus UI |

Best Practices

  • Use canary deployments for all model updates — roll back in seconds if metrics degrade.
  • Enable Triton dynamic batching — it can increase GPU throughput 5–10× for small models.
  • Store models in S3/GCS with versioned paths (s3://bucket/model/v1/, v2/).
  • Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
  • Monitor p99 latency and error rates per model version during canary rollouts.
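
The last practice can be made concrete: compare canary and stable metrics before each traffic increase. A hedged sketch of a promotion gate — the thresholds and the metric plumbing are yours to supply:

```python
def canary_ok(stable_error_rate: float, canary_error_rate: float,
              stable_p99_ms: float, canary_p99_ms: float,
              max_error_delta: float = 0.01,
              max_latency_ratio: float = 1.2) -> bool:
    """Allow promotion only if the canary is not meaningfully worse than stable."""
    if canary_error_rate > stable_error_rate + max_error_delta:
        return False
    if canary_p99_ms > stable_p99_ms * max_latency_ratio:
        return False
    return True

print(canary_ok(0.002, 0.004, 850, 900))  # → True  (within both budgets)
print(canary_ok(0.002, 0.050, 850, 900))  # → False (error rate regressed)
```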
