vllm-deploy-simple
vLLM Simple Deployment
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
What this skill does
This skill provides a streamlined workflow to:
- Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server with configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate the deployment is working correctly
- Support virtual environment isolation
Prerequisites
- Python 3.10+
- A GPU (NVIDIA CUDA or AMD ROCm; recommended), a Google TPU, or a CPU
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)
Usage
Create a venv
If the user did not specify a venv path, or asked to deploy in the current environment, create a venv with uv using Python 3.12 in the current folder. If uv is not found, create the virtual environment at the same path with Python's built-in venv module.
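A minimal sketch of that fallback logic, assuming Python 3.12 is installed; the .venv path here is illustrative:
# Sketch only: prefer uv, fall back to the stdlib venv module
if command -v uv >/dev/null 2>&1; then
    uv venv --python 3.12 .venv
else
    python3 -m venv .venv
fi
source .venv/bin/activate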
Run the complete workflow (suggested)
If the user did not specify the venv path, model, or port, use the default options:
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
Or with custom options:
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv
# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000
# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6
# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
This will:
- Activate the virtual environment (if specified)
- Detect hardware backend (CUDA/ROCm/TPU/CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server in the background (see the sketch after this list)
- Wait for the server to be ready
- Test the API with a sample request
- Display the server status
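Under the hood, the start step typically boils down to launching vLLM's OpenAI-compatible server in the background. This is a rough sketch, not the script's exact invocation (vllm serve and --gpu-memory-utilization are the standard vLLM CLI names; the log and PID paths follow the Notes section below):
# Sketch only: background launch with the default options
nohup vllm serve "Qwen/Qwen2.5-1.5B-Instruct" \
    --port 8000 \
    --gpu-memory-utilization 0.8 \
    > "$VENV_PATH/tmp/vllm-server.log" 2>&1 &
echo $! > "$VENV_PATH/tmp/vllm-server.pid"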
Run individual commands (for step-by-step usage or troubleshooting)
Install vLLM:
scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
Start the server:
scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
Test the API:
scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000
Stop the server:
scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
Check server status:
scripts/quickstart.sh status
Restart the server:
scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
Configuration
The script supports the following command-line options:
scripts/quickstart.sh [command] [OPTIONS]
Commands:
install - Install vLLM and dependencies
start - Start the vLLM server
stop - Stop the vLLM server
test - Test the OpenAI-compatible API
status - Show server status
restart - Restart the server
all - Run complete workflow (default)
Options:
--model MODEL Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
--port PORT Port to run server on (default: 8000)
--venv VENV_PATH Virtual environment path (default: .)
--gpu_memory_utilization FRACTION Fraction of GPU memory to use (default: 0.8)
Hardware Backend Detection
The script automatically detects your hardware and installs the appropriate vLLM version:
- NVIDIA CUDA: Detected via the nvidia-smi command
- AMD ROCm: Detected via the /dev/kfd and /dev/dri devices
- Google TPU: Detected via the TPU_NAME environment variable or the gcloud command
- CPU: Fallback if no GPU/TPU is detected
For Google TPU, the script installs vllm-tpu instead of the standard vllm package.
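The detection order above might look roughly like this in shell (a sketch following the stated checks, not the script's actual code):
# Sketch only: probe backends in the documented order
if command -v nvidia-smi >/dev/null 2>&1; then
    backend="cuda"
elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
    backend="rocm"
elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
    backend="tpu"   # installs vllm-tpu instead of vllm
else
    backend="cpu"
fi
echo "Detected backend: $backend"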
API Testing
The test script sends a simple chat completion request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Say hello!"}],
"max_tokens": 50
}'
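If jq is available, the reply text can be extracted directly, since the response follows the standard OpenAI chat-completions schema:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }' | jq -r '.choices[0].message.content'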
Troubleshooting
Virtual environment not found:
- Ensure the path provided with --venv exists and is a valid virtual environment
- Check that the activation script exists (bin/activate on Linux/macOS or Scripts/activate on Windows)
- If needed, install uv and create a new virtual environment: uv venv /path/to/venv (suggested); or with the standard library: python3 -m venv /path/to/venv
Server won't start:
- Check if the port is already in use: lsof -i :8000
- Verify GPU availability: nvidia-smi (for NVIDIA) or rocm-smi (for AMD)
- Check the vLLM installation: python -c "import vllm; print(vllm.__version__)"
- Review the server logs at $VENV_PATH/tmp/vllm-server.log
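For example, the two most common failure points can be inspected in one go (port and log path as per the defaults above):
lsof -i :8000                                 # is something already bound to the port?
tail -n 50 "$VENV_PATH/tmp/vllm-server.log"   # last lines of the server log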
API returns errors:
- Wait a few seconds for the model to load
- Check the server logs: cat $VENV_PATH/tmp/vllm-server.log
- Verify the server is running: scripts/quickstart.sh status
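vLLM's OpenAI-compatible server also exposes a /health endpoint that returns HTTP 200 once the engine is ready, which is handy while the model is still loading, plus /v1/models to confirm what is being served:
# Returns 200 once the engine is up
curl -i http://localhost:8000/health
# List the models the server is serving
curl -s http://localhost:8000/v1/models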
Out of memory:
- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the --gpu_memory_utilization parameter
- Close other GPU-intensive applications
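For example, both mitigations combined, using this script's own options (Qwen/Qwen2.5-0.5B-Instruct is the smaller sibling of the default model):
scripts/quickstart.sh --model "Qwen/Qwen2.5-0.5B-Instruct" --gpu_memory_utilization 0.5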
Wrong backend detected:
- For NVIDIA: Ensure nvidia-smi is in your PATH
- For AMD: Check that the ROCm drivers are properly installed
- For TPU: Set the TPU_NAME environment variable or install gcloud
Notes
- The server runs in the background and logs to $VENV_PATH/tmp/vllm-server.log
- The PID is stored in $VENV_PATH/tmp/vllm-server.pid for easy management
- The first run will download the model (~3 GB for Qwen2.5-1.5B-Instruct)
- Subsequent runs will use the cached model
- The script automatically detects and uses uv if available, otherwise falls back to pip
- Virtual environment support allows isolation from system Python packages
- Arguments can be specified in any order (e.g., scripts/quickstart.sh --port 8080 start --venv /path/to/venv)
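If the stop command ever fails, the PID file makes a manual stop straightforward (a sketch using the paths above):
# Kill the server by its recorded PID, then clean up the PID file
kill "$(cat "$VENV_PATH/tmp/vllm-server.pid")" && rm -f "$VENV_PATH/tmp/vllm-server.pid"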