# vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source serving engine built on PagedAttention and continuous batching.

## When to Use This Skill
Use this skill when:
- Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
- Building an OpenAI-compatible API endpoint for self-hosted models
- Optimizing LLM throughput and latency for production traffic
- Running multi-GPU inference with tensor or pipeline parallelism
- Deploying quantized models to reduce GPU memory requirements
## Prerequisites

- NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
- Docker or Python 3.9+ with pip
- 40GB+ VRAM for 70B models; 8GB+ for 7B models
- `nvidia-container-toolkit` for Docker GPU passthrough
## Quick Start

```bash
# Install vLLM
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
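Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client works. A minimal sketch in Python that builds the same request body as the curl call above (sending it requires a running server; `requests` or the `openai` SDK would do the POST):

```python
import json

def chat_payload(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = chat_payload("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload))
```

With the `openai` Python SDK you would point `base_url` at `http://localhost:8000/v1` and pass the key given to `--api-key`.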
## Docker Deployment

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key
```
### Docker Compose (Production)

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
```
## Key Configuration Options

### Multi-GPU Tensor Parallelism

```bash
# Split one model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
```
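A rough way to size `--tensor-parallel-size`: weight memory is split approximately evenly across GPUs, with KV cache and CUDA overhead on top. A back-of-envelope sketch, assuming bf16 weights (2 bytes per parameter):

```python
def weight_gib_per_gpu(params_billion: float, bytes_per_param: float,
                       tp_size: int) -> float:
    """Approximate per-GPU weight memory in GiB under tensor parallelism."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / tp_size / 2**30

# Llama-3.1-70B in bf16 on 4 GPUs: roughly 32.6 GiB of weights per GPU,
# before KV cache, activations, and CUDA kernel overhead.
print(round(weight_gib_per_gpu(70, 2, 4), 1))
```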
### Quantization (Lower VRAM)

```bash
# AWQ quantization (70B on 2x A100 40GB)
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

# GPTQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

# FP8 (native on H100-class GPUs)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8
```
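The same arithmetic shows why quantization helps: 4-bit AWQ or GPTQ shrinks weights roughly 4x versus fp16. A sketch (it ignores quantization overhead such as scales and zero-points, so real footprints are slightly larger):

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate total weight memory in GiB at a given precision."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

fp16 = weight_gib(70, 16)  # ~130.4 GiB: does not fit 2x A100 40GB
awq4 = weight_gib(70, 4)   # ~32.6 GiB: fits, with room left for KV cache
print(round(fp16, 1), round(awq4, 1))
```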
### Structured Output & Tools

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines
```
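With guided decoding enabled, vLLM's OpenAI-compatible server accepts extension fields such as `guided_json` in the request body to constrain output to a JSON schema. Field names have varied across vLLM versions, so treat this request-body sketch as an assumption to check against your version's docs:

```python
def guided_json_payload(model: str, prompt: str, schema: dict) -> dict:
    """Chat request constraining output to a JSON schema (vLLM extension field)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "guided_json": schema,  # vLLM-specific; not part of the OpenAI spec
    }

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}
payload = guided_json_payload(
    "meta-llama/Llama-3.1-8B-Instruct",
    "Which city is the Eiffel Tower in? Reply as JSON.",
    schema,
)
print(sorted(payload.keys()))
```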
### LoRA Adapters

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
                 code-lora=/path/to/code-lora \
  --max-lora-rank 64
```
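Registered adapters are selected per request by passing the adapter name as the `model` field; passing the base model's name serves the unmodified model. A sketch of the two request bodies (adapter names taken from the `--lora-modules` flags above):

```python
def lora_payload(model_or_adapter: str, prompt: str) -> dict:
    """vLLM routes to a LoRA adapter when its registered name is the model."""
    return {
        "model": model_or_adapter,
        "messages": [{"role": "user", "content": prompt}],
    }

base = lora_payload("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
sql = lora_payload("sql-lora", "List all users older than 30.")
print(base["model"], sql["model"])
```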
## Performance Tuning

```bash
# Maximize throughput for batch workloads
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4
# --max-num-seqs: max concurrent sequences
# --max-num-batched-tokens: tokens per scheduling batch
# --gpu-memory-utilization: use 95% of VRAM
# --swap-space: CPU swap space in GiB

# Minimize latency for interactive use
vllm serve <model> \
  --max-num-seqs 32 \
  --enforce-eager  # disable CUDA graph capture (faster startup, lower memory)
```
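These knobs trade KV-cache space against concurrency: every in-flight token holds `2 * n_layers * n_kv_heads * head_dim * bytes_per_elem` of cache (keys plus values). A sketch with Llama-3.1-8B's published shapes (32 layers, 8 KV heads, head dim 128, fp16 cache; verify against the model config):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(32, 8, 128)  # 131072 bytes = 128 KiB per token
seq_32k = per_tok * 32768 / 2**30         # 4.0 GiB for one full 32k sequence
print(per_tok, round(seq_32k, 1))
```

At `--max-num-seqs 256` with long contexts, the cache budget left by `--gpu-memory-utilization` becomes the real concurrency limit.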
## Benchmarking

```bash
# Benchmark tools ship with the vllm package
pip install vllm

# Run an offline batch over a prompt file
python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl

# Throughput benchmark with vllm bench
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128
```
## Monitoring

```bash
# Check running server stats (Prometheus text format)
curl http://localhost:8000/metrics
```

Key metrics to watch:

- `vllm:num_requests_running` - active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
- `vllm:generation_tokens_total` - generated-token counter (derive tokens/s with `rate()`)
- `vllm:time_to_first_token_seconds` - TTFT latency histogram
- `vllm:e2e_request_latency_seconds` - end-to-end latency histogram
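A small parser for the Prometheus text format makes these metrics scriptable. A sketch tested against a sample scrape line (fetching the live endpoint would use `urllib.request` against `/metrics`):

```python
import re

def read_gauge(metrics_text: str, name: str):
    """Return the first sample value for a metric in Prometheus text output."""
    pattern = rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$"
    m = re.search(pattern, metrics_text, flags=re.MULTILINE)
    return float(m.group(1)) if m else None

sample = 'vllm:gpu_cache_usage_perc{model_name="llama"} 0.42\n'
print(read_gauge(sample, "vllm:gpu_cache_usage_perc"))  # 0.42
```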
## Common Issues

| Issue | Cause | Fix |
|---|---|---|
| CUDA out of memory | Model too large for VRAM | Add `--quantization awq` or reduce `--gpu-memory-utilization` |
| Slow cold start | Model not cached | Pre-pull with `huggingface-cli download <model>` |
| Low throughput | Too few concurrent requests | Increase `--max-num-seqs` |
| KV cache full errors | Context length too long | Lower `--max-model-len` |
| Tokenizer error | Tokenizer mismatch | Use `--tokenizer` to specify the correct tokenizer |
## Best Practices

- Use `--gpu-memory-utilization 0.90` to leave headroom for CUDA kernels.
- Pin model versions with `--revision` for reproducible deployments.
- Set `HF_HUB_OFFLINE=1` in production to prevent unexpected downloads.
- Prefer AWQ or GPTQ quantization before adding tensor parallelism: lower VRAM first.
- Enable `--enable-chunked-prefill` for long-context workloads.
- Monitor `gpu_cache_usage_perc`; sustained values above 95% cause request queuing.
## Related Skills

- llm-inference-scaling - Auto-scaling vLLM deployments
- gpu-server-management - GPU driver setup
- llm-gateway - Load balancing across vLLM instances
- llm-cost-optimization - Cost management
- model-serving-kubernetes - K8s deployment