AI/ML Infrastructure

Model serving with KubeAI, GPU scheduling, and inference patterns.

Model Deployment Options

Feature	KubeAI	Ollama Operator	LlamaStack
Backend	vLLM (GPU optimized)	Ollama (easy)	Multi-backend
Scale from zero	Yes	No	No
OpenAI API	Native	Compatible	Compatible
Best for	Production GPU	CPU/mixed	Full AI stack

KubeAI Setup

Model CRD

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3-8b
  namespace: kubeai
spec:
  features: [TextGeneration]
  url: "ollama://llama3.1:8b"
  engine: OLlama
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0      # Scale to zero
  maxReplicas: 3
  targetRequests: 10  # Scale up threshold

Resource Profiles

Profile	GPUs	VRAM	Use Case
`cpu`	0	-	Embeddings, small models
`nvidia-gpu-l4:1`	1x L4	24GB	8B models
`nvidia-gpu-h100:1`	1x H100	80GB	70B models
`nvidia-gpu-h100:2`	2x H100	160GB	Large models

Custom Resource Profile

resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      nvidia.com/gpu.product: "NVIDIA-L4"
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      nvidia.com/gpu: "1"
      cpu: "8"
      memory: "32Gi"

Accessing Models

OpenAI-Compatible API

# Port-forward
kubectl port-forward svc/kubeai -n kubeai 8000:80

# List models
curl http://localhost:8000/openai/v1/models

# Chat completion
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

In-Cluster Access

env:
  - name: OPENAI_API_BASE
    value: "http://kubeai.kubeai.svc/openai/v1"
  - name: OPENAI_API_KEY
    value: "not-needed"  # KubeAI doesn't require auth

SDK Usage

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://kubeai.kubeai.svc/openai/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "llama-3-8b",
  messages: [{ role: "user", content: "Hello!" }],
});

GPU Operator

NVIDIA GPU Operator manages GPU drivers and device plugins.

Verify GPU Nodes

# Check GPU nodes
kubectl get nodes -l nvidia.com/gpu.product

# Check GPU allocations
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Check device plugin
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

GPU Pod Scheduling

spec:
  containers:
    - name: gpu-app
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-L4"

Model Selection Guide

Model	Size	GPU Req	Best For
`llama3.1:8b`	8B	L4 x1	General, coding
`llama3.1:70b`	70B	H100 x2	Complex reasoning
`qwen2.5-coder`	7B	L4 x1	Code generation
`nomic-embed-text`	137M	CPU	Embeddings
`deepseek-r1`	1.5B	CPU	Light reasoning

Ollama Operator (Alternative)

Simpler setup for Ollama models:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi4
  namespace: ollama-operator-system
spec:
  image: phi4
  resources:
    limits:
      nvidia.com/gpu: "1"

Access:

kubectl port-forward svc/ollama-model-phi4 -n ollama-operator-system 11434:11434
ollama run phi4

Validation Commands

# Check KubeAI models
kubectl get models -n kubeai
kubectl describe model <name> -n kubeai

# Check model pods
kubectl get pods -n kubeai -l app.kubernetes.io/name=kubeai

# Check GPU utilization
kubectl exec -n kubeai <pod> -- nvidia-smi

# Test API
curl http://kubeai.kubeai.svc/openai/v1/models

Troubleshooting

Model not starting

# Check model status
kubectl describe model <name> -n kubeai

# Check pod events
kubectl get events -n kubeai --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n kubeai -l model=<name>

Out of memory (OOM)

Reduce model parameters:

spec:
  args:
    - --max-model-len=4096      # Reduce from 8192
    - --gpu-memory-utilization=0.8  # Reduce from 0.9

Slow first response

Set minReplicas to keep model warm:

spec:
  minReplicas: 1  # Always keep one running

Best Practices

Use scale-from-zero - Set minReplicas: 0 to save resources
Right-size GPU profiles - Don't over-allocate expensive GPUs
Use vLLM for production - Better throughput than Ollama
Monitor GPU memory - Set appropriate gpu-memory-utilization
Keep frequently-used models warm - minReplicas: 1
Use OpenAI-compatible API - Easy integration with existing code

ai-ml-infra