GKE Inference Quickstart (GIQ)

Purpose

This skill guides the deployment of AI/ML inference workloads on GKE using GIQ. It uses gcloud container ai profiles manifests create to generate optimized Kubernetes manifests based on Google's benchmarks and best practices.

When to Use

  • Goal: Deploy an AI model (e.g., Llama, Gemma, Mistral) to GKE.
  • Goal: Generate a Kubernetes manifest for inference.
  • Context: User asks about "GIQ", "Inference Quickstart", or "AI benchmarks" on GKE.

Prerequisites

  • A GKE cluster (preferably with GPU/TPU node pools, though GIQ can help identify requirements).
  • gcloud CLI installed and authenticated (for discovery commands); a quick verification is sketched after this list.
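
Before proceeding, a quick sanity check (the cluster name and location are placeholders; use --zone instead of --region for a zonal cluster):

# Confirm gcloud is authenticated
gcloud auth list

# Fetch kubectl credentials for the target cluster
gcloud container clusters get-credentials <CLUSTER_NAME> --region=<REGION>

# Confirm the cluster is reachable
kubectl get nodes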

Workflow

1. Discovery: Find Models and Hardware

Before generating a manifest, you often need to pick a valid combination of Model, Model Server, and Accelerator.

List all supported models:

gcloud container ai profiles models list
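
The model list can be long; standard shell filtering works if you are hunting for a particular family (the grep pattern here is just an example):

gcloud container ai profiles models list | grep -i llama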

Find valid accelerators, model servers, and benchmark profiles for a specific model:

# Replace <MODEL_NAME> with a model from the list above (e.g., 'gemma-2-9b-it')
gcloud container ai profiles list --model=<MODEL_NAME>

The output includes the benchmark profiles for each valid (model, server, accelerator) combination, so you can compare costs and latency targets before choosing one.

2. Generate Manifest

Use the gcloud container ai profiles manifests create command. Supported flags and options evolve with gcloud releases, so verify them against the CLI itself rather than relying on cached documentation.
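
To see exactly which flags your installed gcloud version supports:

gcloud container ai profiles manifests create --help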

Parameters:

  • --model: The model ID (e.g., gemma-2-9b-it).
  • --model-server: The inference server (e.g., vllm, tgi, triton, tensorrt-llm).
  • --accelerator-type: The accelerator type (e.g., nvidia-l4, nvidia-tesla-a100).
  • --target-ntpot-milliseconds: (Optional) Target normalized time per output token (NTPOT), in milliseconds.

Example Command:

gcloud container ai profiles manifests create \
  --model=gemma-2-9b-it \
  --model-server=vllm \
  --accelerator-type=nvidia-l4 \
  --target-ntpot-milliseconds=50 > inference-workload.yaml
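
Before applying, a client-side dry run can catch obvious problems in the generated YAML:

kubectl apply --dry-run=client -f inference-workload.yaml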

3. Review and Deploy

  1. Save: The example command above redirects the generated manifest to inference-workload.yaml; confirm the file exists and is not empty.
  2. Review: Check the manifest for placeholders or additional requirements (such as PVCs or secrets).
    • Note: Gated models (e.g., Llama, Gemma) require a Hugging Face token supplied as a Kubernetes secret; see the example after these steps.
  3. Deploy:
    kubectl apply -f inference-workload.yaml
    
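If the manifest references a Hugging Face token, create the secret before deploying. A minimal sketch, assuming the manifest expects a secret named hf-secret with a key hf_api_token (check the generated YAML for the exact names):

# HF_TOKEN holds your Hugging Face access token
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN}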

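After applying, verify the rollout and smoke-test the endpoint. The service name, port, and API path below are placeholders; a vLLM deployment typically exposes an OpenAI-compatible API on port 8000, but check the Service in your generated manifest:

# Watch pods come up (pulling model weights can take several minutes)
kubectl get pods -w

# Forward the service port locally
kubectl port-forward service/<SERVICE_NAME> 8000:8000

# In another terminal, send a test request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<MODEL_NAME>", "prompt": "Hello", "max_tokens": 16}'
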
Troubleshooting

  • Invalid Combination: If the manifest creation fails with an invalid combination error, re-run the discovery commands in Step 1 to verify the tuple (model, server, accelerator).
  • Quota Issues: Ensure the target region has sufficient quota for the requested accelerator (e.g., NVIDIA_L4_GPUS); a CLI quota check is sketched below.
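
A quick way to inspect regional accelerator quota from the CLI (the --flatten/--format expression below is one common pattern, not the only one; quota metric names such as NVIDIA_L4_GPUS come from the Compute Engine quotas page):

gcloud compute regions describe <REGION> \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)" \
  --filter="quotas.metric~GPUS"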
