vllm-server

SKILL.md

vLLM Server Management

Deploy production-grade LLM inference servers with vLLM — the fastest open-source LLM serving engine with PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

  • Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
  • Building an OpenAI-compatible API endpoint for self-hosted models
  • Optimizing LLM throughput and latency for production traffic
  • Running multi-GPU inference with tensor or pipeline parallelism
  • Deploying quantized models to reduce GPU memory requirements

Prerequisites

  • NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
  • Docker or Python 3.9+ with pip
  • 40GB+ VRAM for 70B models; 8GB+ for 7B models
  • nvidia-container-toolkit for Docker GPU passthrough

Quick Start

# Install vLLM
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Docker Deployment

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key

Docker Compose (Production)

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:

Key Configuration Options

Multi-GPU Tensor Parallelism

# Split one model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

Quantization (Lower VRAM)

# AWQ quantization (70B on 2x A100 40GB)
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

# GPTQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

# FP8 (H100 NVL native)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8

Structured Output & Tools

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines

LoRA Adapters

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
                 code-lora=/path/to/code-lora \
  --max-lora-rank 64

Performance Tuning

# Maximize throughput for batch workloads
vllm serve <model> \
  --max-num-seqs 256 \          # max concurrent sequences
  --max-num-batched-tokens 8192 \ # tokens per batch
  --gpu-memory-utilization 0.95 \ # use 95% VRAM
  --swap-space 4                  # CPU swap (GiB)

# Minimize latency for interactive use
vllm serve <model> \
  --max-num-seqs 32 \
  --enforce-eager              # disable CUDA graph capture

Benchmarking

# Install benchmark tool
pip install vllm

# Run throughput benchmark
python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl

# Benchmark with vllm bench
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

Monitoring

# Check running server stats
curl http://localhost:8000/metrics  # Prometheus metrics

# Key metrics to watch:
# vllm:num_requests_running       - active requests
# vllm:gpu_cache_usage_perc       - KV cache utilization
# vllm:generation_tokens_per_s    - throughput
# vllm:time_to_first_token_ms     - TTFT latency
# vllm:e2e_request_latency_seconds - end-to-end latency

Common Issues

Issue Cause Fix
CUDA out of memory Model too large for VRAM Add --quantization awq or reduce --gpu-memory-utilization
Slow cold start Model not cached Pre-pull with huggingface-cli download <model>
Low throughput Too few concurrent requests Increase --max-num-seqs
KV cache full errors Context length too long Set --max-model-len lower
tokenizer error Tokenizer mismatch Use --tokenizer to specify correct tokenizer

Best Practices

  • Use --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.
  • Pin model versions with --revision for reproducible deployments.
  • Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.
  • Use AWQ or GPTQ quantization before tensor parallelism — lower VRAM first.
  • Enable --enable-chunked-prefill for long-context workloads.
  • Monitor gpu_cache_usage_perc — above 95% causes queuing.

Related Skills

Weekly Installs
2
GitHub Stars
13
First Seen
4 days ago
Installed on
opencode2
antigravity2
claude-code2
github-copilot2
codex2
zencoder2