# vllm-deploy-k8s: vLLM Kubernetes Deployment
A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.
## What this skill does
- Deploy vLLM as a Kubernetes Deployment + Service with NVIDIA GPU support
- Check if a vLLM deployment already exists before deploying
- Check if the Hugging Face token secret exists, and ask the user for their token if not
- Use the `vllm/vllm-openai:latest` image by default (the user can specify a different version)
- Provide sensible default configuration that users can customize (model, replicas, GPU count, extra vLLM flags, etc.)
## Prerequisites

- `kubectl` configured with access to a Kubernetes cluster
- NVIDIA GPU Operator or device plugin installed on cluster nodes
- Hugging Face token (required for gated models like Llama, optional for public models)
## Deployment Steps

### Step 1: Check HF token secret
Before deploying, check if the `hf-token` Kubernetes secret exists in the target namespace:
```bash
kubectl get secret hf-token -n <namespace>
```
- If the secret exists: proceed to Step 2.
- If the secret does not exist: ask the user to provide their Hugging Face token, then create the secret:

  ```bash
  kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
  ```
This is required for gated models (e.g., `meta-llama/Meta-Llama-3.1-8B`). For public models, the secret is optional but recommended to avoid rate limits.
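If a declarative manifest is preferred over the imperative command, a minimal sketch of an equivalent Secret follows; the key name `HF_TOKEN` matches what the `--from-literal` command above creates:

```yaml
# Equivalent declarative Secret; apply with:
#   kubectl apply -f hf-token-secret.yaml -n <namespace>
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:                        # stringData accepts the token unencoded
  HF_TOKEN: <user-provided-token>
```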
### Step 2: Check if deployment already exists
Before applying, check if a vLLM deployment already exists:
```bash
kubectl get deployment vllm -n <namespace>
```
- If it exists: inform the user that the deployment already exists. Show the current image and status. Ask the user if they want to update it or skip.
- If it does not exist: proceed to deploy.
### Step 3: Deploy
Apply the template YAML files to deploy vLLM:
```bash
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
```
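For reference, here is a minimal sketch of what `templates/vllm-service.yaml` plausibly contains, assuming the defaults documented below (service name `vllm-svc`, selector `app=vllm`, port `8000`); the actual template file is authoritative:

```yaml
# Assumed shape of the ClusterIP Service fronting the vLLM pods,
# based on the names used elsewhere in this document.
apiVersion: v1
kind: Service
metadata:
  name: vllm-svc
  labels:
    app: vllm
spec:
  type: ClusterIP
  selector:
    app: vllm          # must match the Deployment's pod labels
  ports:
    - name: http
      port: 8000       # Service port
      targetPort: 8000 # containerPort in the Deployment
      protocol: TCP
```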
### Step 4: Wait and verify
Wait for the deployment to roll out:
```bash
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
```
Verify the pod is running and ready:
```bash
kubectl get pods -n <namespace> -l app=vllm
```
Confirm the pod shows `READY 1/1` and `STATUS Running`. If the pod is not ready yet, wait and check again. If it's in `CrashLoopBackOff` or `Error`, check the logs with `kubectl logs -n <namespace> -l app=vllm`.
### Step 5: Print deployment summary
Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
🎉 **vLLM Deployment Successful!**
| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |
**To test the API, run these two commands in your terminal:**
**1. Open a port-forward** (this connects your local port `<port>` to the vLLM service inside the cluster):

```bash
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
```
**2. In a separate terminal**, send a test request to the OpenAI-compatible API:
```bash
curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
```
If everything is working, you'll get a JSON response with the model's reply.
## Default Configuration
The templates use the following defaults:
| Parameter | Default Value |
|---|---|
| Image | `vllm/vllm-openai:latest` |
| Model | `Qwen/Qwen2.5-1.5B-Instruct` |
| Port | 8000 |
| Replicas | 1 |
| GPU count | 1 |
| GPU memory utilization | 0.85 |
| Tensor parallel size | 1 |
| CPU request / limit | 12 / 128 |
| Memory request / limit | 100Gi / 400Gi |
| Shared memory (dshm) | 80Gi |
## Customization
When the user requests changes, modify the template YAML files before applying. The following can be customized:
- Image version: Change `image: vllm/vllm-openai:<version>` in `templates/vllm-deployment.yaml` (default: `latest`). Use a specific version tag like `v0.17.1` if the user requests it.
- Model: Change the model name in the `vllm serve` command inside the Deployment `args`.
- Extra vLLM flags: Append additional flags to the `vllm serve` command in the Deployment `args` (e.g., `--max-model-len 4096`, `--kv-cache-dtype fp8`, `--enforce-eager`, `--generation-config vllm`).
- Replicas: Change `replicas:` in the Deployment spec.
- GPU count: Change `nvidia.com/gpu` in both `requests` and `limits` under resources.
- Tensor parallel size: Change the `--tensor-parallel-size` flag to match the GPU count.
- CPU/Memory resources: Change `cpu` and `memory` values under `requests` and `limits`.
- Port: Change `containerPort` in the Deployment, `port`/`targetPort` in the Service, the `port` in all health probes (liveness, readiness, startup), AND add `--port <port>` to the `vllm serve` command in args. All four must match.
- Namespace: Apply to a specific namespace using `-n <namespace>`.
- Shared memory size: Change the `sizeLimit` of the `dshm` emptyDir volume.
Edit the template files using the Edit tool, then apply the modified templates.
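To orient those edits, here is a hypothetical sketch of the overall shape of `templates/vllm-deployment.yaml`, assuming the defaults from the table above and the common vLLM-on-Kubernetes pattern (HF token injected from the `hf-token` secret, `/dev/shm` backed by an in-memory `emptyDir`, HTTP probes against vLLM's `/health` endpoint). The real template file in this skill is the source of truth:

```yaml
# Assumed shape of templates/vllm-deployment.yaml -- a sketch, not the
# actual file. Comments mark the customization points from the list above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1                                # <- Replicas
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest     # <- Image version
          command: ["/bin/sh", "-c"]
          args:                              # <- Model / extra vLLM flags / --port
            - >-
              vllm serve Qwen/Qwen2.5-1.5B-Instruct
              --gpu-memory-utilization 0.85
              --tensor-parallel-size 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
                  optional: true             # pod still starts for public models
          ports:
            - containerPort: 8000            # <- Port (keep in sync with Service and probes)
          resources:
            requests:
              cpu: "12"
              memory: 100Gi
              nvidia.com/gpu: "1"            # <- GPU count (requests AND limits)
            limits:
              cpu: "128"
              memory: 400Gi
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 10
            failureThreshold: 60             # tolerate slow model downloads
          livenessProbe:
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            periodSeconds: 5
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 80Gi                  # <- Shared memory size
```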
## Status Check

```bash
kubectl get deployment,svc,pods -n <namespace> -l app=vllm
```
## Cleanup
When the user asks to clean up or delete the vLLM deployment, run the following steps:
- Delete the Deployment and Service:

  ```bash
  kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
  kubectl delete -f templates/vllm-service.yaml -n <namespace>
  ```
- Ask the user if they also want to delete the HF token secret. If yes:

  ```bash
  kubectl delete secret hf-token -n <namespace>
  ```
- Verify everything is cleaned up:

  ```bash
  kubectl get deployment,svc,pods -n <namespace> -l app=vllm
  ```
- Print a summary message to the user:

  ```
  vLLM deployment has been cleaned up from namespace <namespace>.
  Deleted: Deployment/vllm, Service/vllm-svc
  HF token secret: <kept/deleted>
  ```
## Troubleshooting
- Pod stuck in `Pending`: No GPU nodes available. Check `kubectl describe pod <pod-name>` for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- Pod `OOMKilled`: Increase `memory` limits in the Deployment, or use a smaller model.
- `ImagePullBackOff`: Check the image name and tag. Verify the node has access to Docker Hub / the container registry.
- Startup probe failures (`CrashLoopBackOff`): Model download may be slow. Check logs with `kubectl logs <pod-name>`. Ensure the `hf-token` secret exists for gated models. Increase `failureThreshold` on the startup probe if needed (see the sketch after this list).
- `HF_TOKEN` not working: Verify the secret exists with `kubectl get secret hf-token -n <namespace>` and check that the token is valid.
- GPU not detected in container: Ensure the `nvidia.com/gpu` resource is requested and the NVIDIA device plugin is running on the node.
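For the startup-probe case, a sketch of a more tolerant probe stanza, assuming the probe targets vLLM's `/health` endpoint on port 8000 (adjust to match your template):

```yaml
# Hypothetical startupProbe tuned for slow model downloads.
# Max startup wait = failureThreshold x periodSeconds = 120 x 10s = 20 min.
startupProbe:
  httpGet:
    path: /health      # health endpoint of the vLLM OpenAI-compatible server
    port: 8000
  periodSeconds: 10
  failureThreshold: 120
```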