perf-test-flagos

Accuracy + Performance Test

Start a vLLM server (vllm serve) with the target model, then run accuracy benchmarks (once FlagEval is available) and performance benchmarks (vllm bench serve) across multiple traffic profiles.

Skill Components

perf-test/
├── SKILL.md                            # This file — execution flow
├── scripts/
│   ├── run_benchmark.py                # Run single benchmark profile (JSON output)
│   └── run_all_benchmarks.py           # Run all 5 profiles, collect + summarize (JSON)
└── references/
    └── benchmark-profiles.md           # Profile definitions, metrics, vllm bench usage

Reused from env-verify:

  • env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy before benchmarking (optional pre-check)

Prerequisites

  • Running container with the software stack installed
  • model-verify completed: the model path, TP size, and recommended stack (full vs base) are all known

If invoked standalone, ask for container name, model path, TP size, and stack config. If invoked from /flagrelease, these are passed as context.

Execution Flow

Step 1: Start vLLM Server

Use the stack recommended by model-verify. Read references/benchmark-profiles.md for the vllm serve command pattern.

docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000 \
    <EXTRA_ARGS>
'
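
For example, a full-stack launch with FlagGems enabled might look like the following (the container name, model path, FLAGCX_PATH value, and TP size are illustrative; substitute the values reported by model-verify):

docker exec -d flagos bash -c '
export USE_FLAGGEMS=1
export FLAGCX_PATH=/usr/local/flagcx
export VLLM_PLUGINS=fl
vllm serve /data/models/Qwen2.5-7B-Instruct \
    --tensor-parallel-size 8 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000
'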

Wait for server ready (poll /health, timeout 300s):

docker exec <CONTAINER> bash -c '
ready=0
for i in $(seq 1 150); do
    # -f makes curl exit nonzero on HTTP errors; /health may return an empty 200 body,
    # so check the exit status rather than grepping the response
    if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
        echo "SERVER_READY"; ready=1; break
    fi
    sleep 2
done
[ "$ready" -eq 1 ] || echo "SERVER_TIMEOUT"
'

If SERVER_TIMEOUT is printed instead of SERVER_READY, report the error and exit.
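
To make startup failures diagnosable, redirect the detached server's output to a log file inside the container when launching it (the log path below is illustrative):

docker exec -d <CONTAINER> bash -c '
# same environment exports and vllm serve arguments as in Step 1,
# with stdout/stderr captured for later inspection
vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --port 8000 \
    <EXTRA_ARGS> \
    > /tmp/vllm_serve.log 2>&1
'

On timeout, inspect the tail of the log:

docker exec <CONTAINER> tail -n 50 /tmp/vllm_serve.log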

Step 2: Get Model Name from Server

docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
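
If jq is available in the container, an equivalent one-liner (assumes jq is installed; it is not required by the skill):

docker exec <CONTAINER> bash -c 'curl -s http://localhost:8000/v1/models | jq -r ".data[0].id"'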

Part A: Accuracy Test (FlagEval) — PLACEHOLDER

STATUS: FlagEval test client not yet available.

When FlagEval becomes available, update this section with:

  • Docker image URL or pip package name
  • Supported benchmarks (MMLU, GSM8K, HumanEval, etc.)
  • Required arguments and configuration
  • Expected output format
  • Pass/fail criteria (accuracy thresholds)

Current behavior: Report accuracy test as SKIPPED.


Part B: Performance Benchmarks

Step 3: Run All Benchmark Profiles

Copy scripts into the container and run:

docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/

docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
    --model <MODEL_NAME> \
    --tokenizer <MODEL_PATH> \
    --port 8000 \
    --output-dir /data/results/perf

The script runs all 5 default profiles (see references/benchmark-profiles.md), saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report with a summary table.

Important: a single profile failure does NOT abort the run; the remaining profiles still execute, as the sketch below shows.
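
Conceptually the runner behaves like the loop below. This is a minimal sketch: the profile names and the run_benchmark.py flags (--profile, --model, --port, --output) are assumptions here; see references/benchmark-profiles.md and the script itself for the actual interface.

docker exec <CONTAINER> bash -c '
# Hypothetical sketch of the runner loop: each profile gets its own 600s budget,
# and a failure is recorded without aborting the remaining profiles.
for profile in profile_1 profile_2 profile_3 profile_4 profile_5; do
    timeout 600 python3 /tmp/run_benchmark.py \
        --profile "$profile" \
        --model <MODEL_NAME> \
        --port 8000 \
        --output /data/results/perf/${profile}.json \
        || echo "PROFILE_FAILED: $profile"
done
'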

Step 4: Stop Server

docker exec <CONTAINER> bash -c 'pkill -f "vllm serve" || true'
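
Confirm the port has actually been freed before any follow-up run; the server can take a few seconds to exit:

docker exec <CONTAINER> bash -c '
for i in $(seq 1 30); do
    # curl -f exits nonzero once the server is down
    curl -sf http://localhost:8000/health >/dev/null 2>&1 || { echo "SERVER_STOPPED"; break; }
    sleep 1
done
'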

Step 5: Produce Report

{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "perf-test",
  "model": "<MODEL_PATH>",
  "tensor_parallel_size": 8,
  "flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
  "accuracy": {
    "status": "SKIPPED",
    "reason": "FlagEval test client not yet available"
  },
  "performance": {
    "status": "PASS | PARTIAL | FAIL",
    "profiles_passed": "5/5",
    "profiles": [ "...per-profile results..." ],
    "summary_table": "...markdown table..."
  }
}

Present the summary table to the user:

| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ...     | ...   | ...    | ...     | ...   | ...   | ...      | ...      | ...     | ...    |

Status logic:

  • PASS — all profiles completed
  • PARTIAL — some passed, some failed
  • FAIL — server didn't start or all profiles failed
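
The same logic in script form, a minimal sketch where passed and total would be read from the combined JSON report (variable names are illustrative):

passed=5; total=5
if [ "$passed" -eq "$total" ] && [ "$total" -gt 0 ]; then
    status=PASS
elif [ "$passed" -gt 0 ]; then
    status=PARTIAL
else
    status=FAIL
fi
echo "status=$status"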

Error Handling

| Failure | Behavior |
|---------|----------|
| Server fails to start | Report error; exit |
| vllm bench serve not found | Report vllm version issue |
| Single profile fails | Report error; continue remaining profiles |
| Single profile times out | Kill after 600s; report partial; continue |
| Server crashes mid-benchmark | Capture logs; report which profile caused the crash |
| OOM during high concurrency | Report; suggest reducing num_prompts |

Timeout Rules

| Operation | Timeout |
|-----------|---------|
| Server startup | 300s |
| Per-profile benchmark | 600s |