# vllm-bench-random-synthetic

**vLLM Benchmark with Random Synthetic Data**

Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
## When to use
- User wants to quickly benchmark vLLM serving performance
- User wants to measure throughput and latency metrics without downloading datasets
- User wants to test a vLLM deployment with synthetic workload
- User wants baseline performance numbers for a specific model
## Prerequisites

- vLLM must be installed (`pip install vllm`)
- A vLLM server must be running (or can be started as part of the benchmark)
- For GPU models, an NVIDIA GPU with appropriate drivers must be available
## Quick Start

The simplest way to run the benchmark:

```bash
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10
```
Note:

- Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.
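If you want to benchmark through the completions API instead, the matching pairing would be `--backend vllm` (or `openai`) with `/v1/completions`; a minimal sketch:

```bash
# Completions-style variant of the quick-start benchmark
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/completions \
  --dataset-name random \
  --num-prompts 10
```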
## Parameters

| Parameter | Description | Default |
|---|---|---|
| `--backend` | Backend type: `vllm`, `openai`, `openai-chat` | `vllm` |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | `/v1/completions` or `/v1/chat/completions` |
| `--dataset-name` | Dataset to use | `random` (synthetic) |
| `--num-prompts` | Number of requests to send | `10` |
| `--port` | Server port | `8000` |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | `./` |
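The `random` dataset also has length knobs of its own; recent vLLM versions expose `--random-input-len` and `--random-output-len` (flag names assumed from recent releases; confirm with `vllm bench serve --help` for your version). A sketch:

```bash
# Control synthetic prompt and output lengths
# (--random-input-len / --random-output-len assumed from recent vLLM;
#  verify against `vllm bench serve --help` for your version)
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --random-input-len 512 \
  --random-output-len 128
```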
## Expected Output

When successful, you will see output like:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
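As a sanity check, the headline numbers are internally consistent: 2212 generated tokens over 5.78 s is 2212 / 5.78 ≈ 382.7 tok/s (close to the reported 382.89, which uses the unrounded duration), and 10 requests over 5.78 s is ≈ 1.73 req/s.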
## Advanced Usage

### With more prompts for better statistics

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100
```
### Save results to file

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results/
```
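The saved result is a JSON file; a quick way to skim the headline metrics is `jq`. The field names below are assumptions based on recent vLLM result files and may differ in your version:

```bash
# Skim headline metrics from saved results
# (field names are assumed; inspect the JSON directly if a key is missing)
jq '{request_throughput, output_throughput, mean_ttft_ms, mean_tpot_ms}' \
  ./benchmark-results/*.json
```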
### Custom port and concurrency

```bash
vllm bench serve \
  --backend openai-chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100 \
  --port 8001 \
  --max-concurrency 4
```
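To find where added concurrency stops buying throughput and starts costing latency, a simple sweep over `--max-concurrency` values works well. A sketch, saving one result file per level:

```bash
# Sweep concurrency levels; compare the saved results afterwards
for c in 1 2 4 8; do
  vllm bench serve \
    --backend openai-chat \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 100 \
    --max-concurrency "$c" \
    --save-result \
    --result-dir ./benchmark-results/
done
```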
## Model Recommendations

For quick testing (small models, fast):

- `Qwen/Qwen2.5-1.5B-Instruct` (recommended for quick tests)
- `facebook/opt-125m`
- `facebook/opt-350m`

For realistic benchmarks (medium models):

- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
## Workflow

1. **Check if vLLM is installed**: Run `vllm --version` to verify
2. **Check if a server is already running**: Run `curl http://localhost:8000/health` to check
3. **Start the vLLM server if needed**: Run `vllm serve <model-name>` (wait for "Application startup complete")
4. **Run the benchmark**: Execute `vllm bench serve` with appropriate parameters
5. **Review results**: Check throughput and latency metrics
6. **Clean up**: If the agent skill started the vLLM server (not a pre-existing one), stop it after benchmark completion using `kill <PID>`, as in the sketch below
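The steps above can be strung together in a single script. A minimal sketch, assuming port 8000 and a model small enough to load quickly:

```bash
#!/usr/bin/env bash
# End-to-end sketch: start a server, wait for it, benchmark, then clean up.
# Assumes port 8000 and the model below; adjust for your deployment.
set -euo pipefail

MODEL="Qwen/Qwen2.5-1.5B-Instruct"

vllm serve "$MODEL" &
SERVER_PID=$!

# Poll the health endpoint until the server is up (up to ~5 minutes)
for _ in $(seq 1 60); do
  curl -sf http://localhost:8000/health >/dev/null && break
  sleep 5
done

vllm bench serve \
  --backend openai-chat \
  --model "$MODEL" \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100

# Stop only the server this script started, never a pre-existing one
kill "$SERVER_PID"
```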
## Troubleshooting

**Server not responding:**

- Check if the server is running: `curl http://localhost:8000/health`
- Verify the port matches: use the `--port` flag if the server is on a different port

**Model not found:**

- Ensure the model name matches exactly between server and benchmark
- Check Hugging Face access: `export HF_TOKEN=<your_token>` if needed

**Out of memory:**

- Use a smaller model (e.g., `Qwen/Qwen2.5-1.5B-Instruct`)
- Reduce `--num-prompts` or `--max-concurrency`

**Connection refused:**

- The server may still be starting (wait for "Application startup complete")
- Check firewall or network settings
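When diagnosing "connection refused", it can also help to confirm that something is actually listening on the expected port (Linux example):

```bash
# Check whether anything is listening on the benchmark port
ss -ltnp | grep ':8000' || echo "nothing listening on 8000"
```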
## Notes

- The `random` dataset generates synthetic prompts automatically
- Benchmark duration scales with `--num-prompts`
- For production benchmarking, use at least 100 prompts for stable statistics
- Results may vary based on hardware, model size, and system load
- The first run may be slower due to model loading and compilation
- **Important**: If the agent skill starts a vLLM server for benchmarking, it must stop the server after the benchmark completes to free up resources. Do not stop pre-existing servers that were already running before the benchmark.