GKE Inference Quickstart (GIQ)
Purpose
This skill guides the deployment of AI/ML inference workloads on GKE using GIQ. It uses gcloud container ai profiles manifests create to generate optimized Kubernetes manifests based on Google's best practices and benchmarks.
When to Use
- Goal: Deploy an AI model (e.g., Llama, Gemma, Mistral) to GKE.
- Goal: Generate a Kubernetes manifest for inference.
- Context: User asks about "GIQ", "Inference Quickstart", or "AI benchmarks" on GKE.
Prerequisites
- A GKE cluster (preferably with GPU/TPU node pools, though GIQ can help identify requirements).
- gcloud CLI installed and authenticated (for discovery commands).
Workflow
1. Discovery: Find Models and Hardware
Before generating a manifest, you often need to pick a valid combination of Model, Model Server, and Accelerator.
List all supported models:
gcloud container ai profiles models list
Find valid accelerators and servers for a specific model:
# Replace <MODEL_NAME> with a model from the list above (e.g., 'gemma-2-9b-it')
gcloud container ai profiles list --model=<MODEL_NAME>
View benchmarks/profiles (optional): the output of the same list command also includes benchmark data (such as cost and latency targets) for each profile, which can help you choose an accelerator.
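Putting the discovery commands together, a typical session looks like this (the model name and grep filter are illustrative; pick a model from your own models list output):

```shell
# List supported models and narrow down to a family of interest
gcloud container ai profiles models list | grep -i gemma

# Show valid model-server/accelerator combinations (with benchmark data)
# for one model from that list
gcloud container ai profiles list --model=gemma-2-9b-it
```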
2. Generate Manifest
Use the gcloud container ai profiles manifests create command. This ensures you are using the latest supported flags and options directly from the CLI.
Parameters:
- --model: The model ID (e.g., gemma-2-9b-it).
- --model-server: The inference server (e.g., vllm, tgi, triton, tensorrt-llm).
- --accelerator-type: The accelerator type (e.g., nvidia-l4, nvidia-tesla-a100).
- --target-ntpot-milliseconds: (Optional) Target Normalized Time Per Output Token in ms.
Example Command:
gcloud container ai profiles manifests create \
--model=gemma-2-9b-it \
--model-server=vllm \
--accelerator-type=nvidia-l4 \
--target-ntpot-milliseconds=50 > inference-workload.yaml
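Before moving on, you can sanity-check the generated file with a client-side dry run; this validates the manifest locally without changing anything on the cluster (a sketch, assuming kubectl is installed and configured):

```shell
# Validate the generated manifest locally; nothing is applied to the cluster
kubectl apply --dry-run=client -f inference-workload.yaml
```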
3. Review and Deploy
- Save: The example command above saves the output to inference-workload.yaml. Ensure you have this file.
- Review: Check for any placeholders or specific requirements (like PVCs or secrets).
- Note: Some models require Hugging Face tokens. Create the required Kubernetes secret before deploying.
- Deploy:
kubectl apply -f inference-workload.yaml
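If the chosen model is gated on Hugging Face, the generated manifest typically references a Kubernetes secret holding the token. The secret and key names below (hf-secret, hf_api_token) and the pod label are assumptions for illustration; match whatever names the manifest's secretKeyRef and labels actually use. A sketch of the secret creation and rollout check:

```shell
# Create the token secret BEFORE applying the manifest.
# NOTE: secret/key names are assumptions; check the manifest's
# env valueFrom.secretKeyRef for the real names.
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token="${HF_TOKEN}"

# After kubectl apply, watch the pods come up and inspect events on failures
kubectl get pods -w
kubectl describe pod -l app=vllm   # label is an example; use your manifest's labels
```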
Troubleshooting
- Invalid Combination: If the manifest creation fails with an invalid combination error, re-run the discovery commands in Step 1 to verify the tuple (model, server, accelerator).
- Quota Issues: Ensure the target region has sufficient quota for the requested accelerator (e.g., NVIDIA_L4_GPUS).
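To check regional accelerator quota before deploying, you can inspect the region's quota metrics (the region name is an example, and the --flatten/--format projection is a sketch of one common pattern):

```shell
# List quota metrics, limits, and current usage for a region;
# look for the accelerator metric (e.g., NVIDIA_L4_GPUS) in the output.
gcloud compute regions describe us-central1 \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"
```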