vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source LLM serving engine built around PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
Building an OpenAI-compatible API endpoint for self-hosted models
Optimizing LLM throughput and latency for production traffic
Running multi-GPU inference with tensor or pipeline parallelism
Deploying quantized models to reduce GPU memory requirements

Prerequisites

NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
Docker or Python 3.9+ with pip
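About 2 bytes of VRAM per parameter for FP16/BF16 weights, plus KV-cache headroom: roughly 16GB+ for 7B-8B models and 140GB+ (split across GPUs) for 70B models; a 4-bit quantized 70B model fits in about 40GB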
nvidia-container-toolkit for Docker GPU passthrough
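
As a rough sizing aid, weight memory is approximately the parameter count times the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate only: the function name and the bytes-per-parameter table are illustrative assumptions, and real usage is higher once the KV cache, activations, and CUDA overhead are counted.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Illustrative only; KV cache, activations, and runtime overhead are extra.

# Approximate storage cost per parameter (assumed values for illustration).
BYTES_PER_PARAM = {
    "fp16": 2.0,  # also applies to bf16
    "fp8": 1.0,
    "int4": 0.5,  # e.g. 4-bit AWQ/GPTQ, ignoring scales and zero points
}


def estimate_weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (8, 70):
        for dtype in ("fp16", "int4"):
            print(f"{size}B {dtype}: ~{estimate_weight_gb(size, dtype):.0f} GB")
```

Dividing the result by the number of GPUs gives a first guess at the per-GPU share under tensor parallelism.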

Quick Start

# Install vLLM
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
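
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat` are made up for this example, and the host, key, and model simply mirror the curl call above.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the Quick Start running, `send_chat(build_chat_request("http://localhost:8000", "your-secret-key", "meta-llama/Llama-3.1-8B-Instruct", "Hello!"))` should return the model's reply as a string.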

Docker Deployment

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key

Docker Compose (Production)

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
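      retries: 3
      # Assumed value: large models can take minutes to download and load; tune as needed
      start_period: 10m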

volumes:
  model-cache:
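
A deployment script may want to block until the server is ready before sending traffic. The sketch below polls the same `/health` route that the Compose healthcheck probes; the function names, the 10-minute default timeout, and the 5-second interval are assumptions chosen for illustration.

```python
import time
import urllib.request
from typing import Callable


def http_health_check(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and connection failures
        return False


def wait_until_healthy(
    check: Callable[[], bool] = http_health_check,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or `timeout_s` of waiting has elapsed."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Passing `check` and `sleep` as parameters keeps the loop testable without a live server.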

Key Configuration Options

Multi-GPU Tensor Parallelism

# Split one model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
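
When picking a value for `--tensor-parallel-size`, note that vLLM generally expects the model's attention head count to divide evenly across the tensor-parallel ranks. The helper below is a small sketch of that check; the function name is hypothetical, and the head count of 64 is an assumption for a Llama-3.1-70B-class model that should be confirmed against the model's `config.json`.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, gpu_count: int) -> list[int]:
    """List tensor-parallel sizes up to gpu_count that divide the head count evenly."""
    return [
        size
        for size in range(1, gpu_count + 1)
        if num_attention_heads % size == 0
    ]


if __name__ == "__main__":
    # 64 attention heads is assumed here for a Llama-3.1-70B-class model.
    print(valid_tensor_parallel_sizes(num_attention_heads=64, gpu_count=8))
```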

Quantization (Lower VRAM)

# AWQ quantization (70B on 2x A100 40GB)
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

# GPTQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

# FP8 (hardware-accelerated on Hopper and Ada Lovelace GPUs, e.g. H100)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8

Structured Output & Tools

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines
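
With tool calling enabled on the server, clients pass tool definitions in the standard OpenAI `tools` field. The sketch below only builds the request body; `get_weather` is a hypothetical tool invented for this example, and the function name `build_tool_call_payload` is likewise an assumption.

```python
import json


def build_tool_call_payload(model: str, question: str) -> dict:
    """Build a chat completions body that offers one tool to the model."""
    # `get_weather` is a hypothetical tool used only for illustration.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


if __name__ == "__main__":
    body = build_tool_call_payload(
        "meta-llama/Llama-3.1-8B-Instruct", "What is the weather in Paris?"
    )
    print(json.dumps(body, indent=2))
```

POST the resulting body to `/v1/chat/completions`; when the model decides to call the tool, the reply carries a `tool_calls` entry in place of plain text content.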

LoRA Adapters

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
                 code-lora=/path/to/code-lora \
  --max-lora-rank 64
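
Each adapter registered through `--lora-modules` is addressed per request by passing its name in the `model` field. The sketch below shows both sides under that assumption: generating the server arguments from a mapping and building a completions body that selects one adapter. Both helper names are hypothetical, and the paths are the placeholders used above.

```python
def lora_module_args(adapters: dict[str, str], max_rank: int = 64) -> list[str]:
    """Build `vllm serve` arguments that register each adapter by name."""
    args = ["--enable-lora", "--lora-modules"]
    args += [f"{name}={path}" for name, path in adapters.items()]
    args += ["--max-lora-rank", str(max_rank)]
    return args


def completion_body(adapter_name: str, prompt: str) -> dict:
    """Select an adapter for one request by using its registered name as `model`."""
    return {"model": adapter_name, "prompt": prompt, "max_tokens": 64}


if __name__ == "__main__":
    adapters = {
        "sql-lora": "/path/to/sql-lora",
        "code-lora": "/path/to/code-lora",
    }
    print(" ".join(lora_module_args(adapters)))
    print(completion_body("sql-lora", "Write a query that counts users."))
```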

Performance Tuning

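# Maximize throughput for batch workloads
#   --max-num-seqs            max concurrent sequences
#   --max-num-batched-tokens  max tokens scheduled per batch
#   --gpu-memory-utilization  fraction of VRAM vLLM may use (here 95%)
#   --swap-space              CPU swap space (GiB)
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4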

# Minimize latency for interactive use
# Leave CUDA graphs enabled (the default). --enforce-eager disables them,
# which saves memory and startup time but usually slows decoding.
vllm serve <model> \
  --max-num-seqs 32
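
Sensible values for `--max-num-seqs` depend on how many tokens of KV cache fit in the memory left after the weights are loaded. A rough worked estimate follows; the layer and head counts are assumptions for a Llama-3.1-8B-class model and should be read from the model's `config.json`, and both function names are illustrative.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(cache_budget_gib: float, bytes_per_token: int) -> int:
    """Number of whole tokens that fit in a KV cache budget given in GiB."""
    return int(cache_budget_gib * 1024**3 // bytes_per_token)


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head size 128, FP16 cache.
    per_token = kv_cache_bytes_per_token(32, 8, 128)
    print(per_token)                      # 131072 bytes, i.e. 128 KiB per token
    print(tokens_that_fit(4, per_token))  # 32768 tokens in a 4 GiB budget
```

Dividing the token capacity by the typical sequence length gives an upper bound on how many sequences can be resident at once.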

Benchmarking

# The batch runner and benchmark CLI ship with vLLM itself
pip install vllm

# Run offline batch inference over a JSONL file of OpenAI-style batch requests
python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl

# Benchmark with vllm bench
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128
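
The `prompts.jsonl` file passed to `run_batch` is expected to follow the OpenAI batch format, with one JSON request per line. Below is a minimal sketch of a generator for such a file; the helper names and the `max_tokens` value are assumptions for illustration.

```python
import json


def make_batch_lines(model: str, prompts: list[str]) -> list[str]:
    """Return one OpenAI-batch-style JSON line per prompt."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"request-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path: str, lines: list[str]) -> None:
    """Write the request lines to a JSONL file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling `write_batch_file("prompts.jsonl", make_batch_lines("meta-llama/Llama-3.1-8B-Instruct", ["Hello!"]))` produces an input file for the batch command above.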

Monitoring

# Check running server stats
curl http://localhost:8000/metrics  # Prometheus metrics

# Key metrics to watch:
# vllm:num_requests_running       - active requests
# vllm:gpu_cache_usage_perc       - KV cache utilization
# vllm:generation_tokens_total    - generated tokens (counter; take a rate for throughput)
# vllm:time_to_first_token_seconds - TTFT latency (histogram)
# vllm:e2e_request_latency_seconds - end-to-end latency
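
For quick checks without a Prometheus server, the text returned by `/metrics` can be parsed directly. The sketch below is a simplified parser: it keeps one value per metric name (later samples overwrite earlier ones) and does not handle every corner of the exposition format. The sample text is hand-written for illustration, not captured from a real server.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each metric name to the value of its last sample line."""
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labeled sample: name{label="value"} 1.0
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1 :]
        else:
            # Unlabeled sample: name 1.0
            name, _, value_part = line.partition(" ")
        samples[name] = float(value_part.split()[0])
    return samples


# Hand-written sample in Prometheus text format (illustrative values).
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""

if __name__ == "__main__":
    for name, value in parse_prometheus_text(SAMPLE).items():
        print(f"{name} = {value}")
```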

Common Issues

Issue	Cause	Fix
`CUDA out of memory`	Model too large for VRAM	Serve a quantized checkpoint (e.g. AWQ), raise `--tensor-parallel-size`, or lower `--max-model-len`
Slow cold start	Model not cached	Pre-pull with `huggingface-cli download <model>`
Low throughput	Too few concurrent requests	Increase `--max-num-seqs`
KV cache full errors	Context length too long	Set `--max-model-len` lower
`tokenizer error`	Tokenizer mismatch	Use `--tokenizer` to specify correct tokenizer

Best Practices

Use --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.
Pin model versions with --revision for reproducible deployments.
Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.
Try AWQ or GPTQ quantization before adding tensor parallelism: a smaller VRAM footprint may let the model fit on fewer GPUs.
Enable --enable-chunked-prefill for long-context workloads.
Monitor vllm:gpu_cache_usage_perc; sustained values above 95% indicate KV cache pressure, which leads to request queuing.

Related Skills

llm-inference-scaling - Auto-scaling vLLM deployments
gpu-server-management - GPU driver setup
llm-gateway - Load balancing across vLLM instances
llm-cost-optimization - Cost management
model-serving-kubernetes - K8s deployment
