ais-bench
AISBench Benchmark Tool
AISBench Benchmark is a model evaluation tool built on OpenCompass. It supports both accuracy and performance evaluation of AI models on Ascend NPUs.
Overview
- Accuracy Evaluation: Accuracy verification of service-deployed models and local models on various QA and reasoning benchmark datasets, covering text, multimodal, and other scenarios.
- Performance Evaluation: Latency and throughput evaluation of service-deployed models, stress testing for peak performance, steady-state performance evaluation, and real business traffic simulation.
Supported Scenarios
| Scenario | Description |
|---|---|
| Accuracy Evaluation | Model accuracy on text/multimodal datasets |
| Performance Evaluation | Latency, throughput, stress testing |
| Steady-State Performance | Measure sustained performance under a stable load |
| Real Traffic Simulation | Simulate real business traffic patterns |
| Multi-turn Dialogue | Evaluate multi-turn conversation models |
| Function Call (BFCL) | Function calling capability evaluation |
Supported Benchmarks
- Text: GSM8K, MMLU, Ceval, FewCLUE series, dapo_math, leval
- Multimodal: docvqa, infovqa, ocrbench_v2, omnidocbench, mmmu, mmmu_pro, mmstar, videomme, textvqa, videobench, vocalsound
- Multi-turn Dialogue: sharegpt, mtbench
- Function Call: BFCL (Berkeley Function Calling Leaderboard)
Installation
Environment Requirements
Python Version: Only Python 3.10, 3.11, or 3.12 is supported.
# Create conda environment
conda create --name ais_bench python=3.10 -y
conda activate ais_bench
Install from Source
git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
Verify installation:
ais_bench -h
Optional Dependencies
# For service-deployed model evaluation (vLLM, Triton, etc.)
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
# For Huggingface multimodal / vLLM offline inference
pip3 install -r requirements/hf_vl_dependency.txt
# For BFCL Function Calling evaluation
pip3 install -r requirements/datasets/bfcl_dependencies.txt --no-deps
Quick Start
Basic Command Structure
ais_bench --models <model_task> --datasets <dataset_task> [--summarizer example]
- `--models`: Specifies the model task configuration
- `--datasets`: Specifies the dataset task configuration
- `--summarizer`: Specifies the result presentation task (default: `example`)
Find Configuration Files
# Locate the configuration files for the specified tasks
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search
Example: Service Model Accuracy Evaluation
- Start the vLLM inference service (follow the vLLM documentation).
- Prepare the dataset:
  - Download GSM8K from OpenCompass
  - Extract it to `ais_bench/datasets/gsm8k/`
- Modify the model configuration (`vllm_api_general_chat.py`):

from ais_bench.benchmark.models import VLLMCustomAPIChat

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="",
        model="",
        stream=False,
        request_rate=0,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8080,
        url="",
        max_out_len=512,
        batch_size=1,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=False,
        )
    )
]

- Run the evaluation:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
Output Results
dataset version metric mode vllm_api_general_chat
----------------------- -------- -------- ----- ----------------------
demo_gsm8k 401e4c accuracy gen 62.50
Model Task Types
Service-Deployed Models
| Model Type | Description |
|---|---|
| `vllm_api_general_chat` | General vLLM API chat model |
| `vllm_api_function_call_chat` | Function calling model (BFCL) |
| `triton_api_*` | Triton inference service |
Local Models
| Model Type | Description |
|---|---|
| `hf_*` | HuggingFace models |
| `vllm_offline_*` | vLLM offline inference |
Performance Evaluation
Key Metrics
| Metric | Description |
|---|---|
| TTFT | Time to First Token |
| TPOT | Time Per Output Token |
| Throughput | Tokens per second |
| Latency | Request latency (P50, P90, P99) |
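The sketch below is a minimal illustration of how these metrics are typically derived from per-request timestamps; it is not AISBench's actual implementation, and the request fields `start`, `first_token_time`, `end`, and `out_tokens` are assumed names:

# Illustrative only: field names and computations are assumptions,
# not AISBench internals.
import statistics

def percentile(sorted_vals, p):
    # Nearest-rank percentile over a pre-sorted list.
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, idx))]

def summarize(requests):
    # Each request: dict with start, first_token_time, end (seconds), out_tokens.
    ttft = [r["first_token_time"] - r["start"] for r in requests]
    tpot = [(r["end"] - r["first_token_time"]) / max(1, r["out_tokens"] - 1)
            for r in requests]
    latency = sorted(r["end"] - r["start"] for r in requests)
    total_tokens = sum(r["out_tokens"] for r in requests)
    span = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    return {
        "ttft_mean": statistics.mean(ttft),
        "tpot_mean": statistics.mean(tpot),
        "throughput_tok_per_s": total_tokens / span,
        "latency_p50": percentile(latency, 50),
        "latency_p90": percentile(latency, 90),
        "latency_p99": percentile(latency, 99),
    }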
Performance Test Example
ais_bench --models vllm_api_general_chat --datasets custom_performance \
--mode performance --concurrency 100
Steady-State Performance
Steady-state testing holds a stable load to measure the system's true sustained performance:
ais_bench --models vllm_api_general_chat --datasets sharegpt \
--stable-stage --duration 300
Real Traffic Simulation
ais_bench --models vllm_api_general_chat --datasets custom \
--rps-distribution rps_config.json
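The schema of `rps_config.json` is defined by AISBench and not reproduced here. Purely to illustrate the idea of a time-varying request-rate profile, a hypothetical config could be generated like this (every key below is invented for the sketch):

# Hypothetical sketch only: the real rps_config.json schema is defined
# by AISBench; these keys are invented to illustrate a rate profile.
import json

profile = {
    "stages": [
        {"duration_s": 60, "rps": 5},    # warm-up at 5 requests/s
        {"duration_s": 300, "rps": 50},  # peak business traffic
        {"duration_s": 60, "rps": 10},   # cool-down
    ]
}
with open("rps_config.json", "w") as f:
    json.dump(profile, f, indent=2)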
Multi-task Evaluation
Multiple Models
ais_bench --models model1 model2 model3 --datasets dataset1
Multiple Datasets
ais_bench --models model1 --datasets dataset1 dataset2 dataset3
Parallel Execution
ais_bench --models model1 model2 --datasets dataset1 dataset2 --parallel 4
Custom Datasets
Performance Custom Dataset
Create a JSONL file with custom requests:
{"input": "Your prompt here", "max_output_length": 512}
Accuracy Custom Dataset
Refer to the Custom Dataset Guide.
Output Structure
outputs/default/20250628_151326/
├── configs/ # Combined configuration
├── logs/ # Execution logs
│ ├── eval/ # Evaluation logs
│ └── infer/ # Inference logs
├── predictions/ # Raw inference results
├── results/ # Calculated scores
└── summary/ # Final summaries
├── summary_*.csv
├── summary_*.md
└── summary_*.txt
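Given this layout, final scores can be read straight from the summary CSVs. A minimal sketch, assuming only the directory tree above (column contents vary by task):

# Minimal sketch: locate and print summary CSVs under an output dir.
# Assumes only the documented layout above; column names may differ.
import csv
import glob
import sys

out_dir = sys.argv[1]  # e.g. outputs/default/20250628_151326
for path in glob.glob(f"{out_dir}/summary/summary_*.csv"):
    print(f"== {path} ==")
    with open(path, newline="") as f:
        for row in csv.reader(f):
            print("  " + " | ".join(row))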
Task Management Interface
During execution, a real-time task management interface displays:
- Task name and progress
- Time cost and status
- Log path
- Extended parameters
Controls:
- `P` key: Pause/resume screen refresh
- `Ctrl+C`: Exit
Common CLI Options
| Option | Description |
|---|---|
| `--models` | Model task name(s) |
| `--datasets` | Dataset task name(s) |
| `--summarizer` | Result summarizer |
| `--search` | List config file paths |
| `--debug` | Print detailed logs |
| `--mode` | Evaluation mode (accuracy/performance) |
| `--parallel` | Number of parallel tasks |
| `--resume` | Resume from breakpoint |
| `--failed-only` | Re-run failed cases only |
Advanced Features
Breakpoint Resume
ais_bench --models model1 --datasets dataset1 --resume outputs/default/20250628_151326
Failed Case Re-run
ais_bench --models model1 --datasets dataset1 --failed-only --resume outputs/default/20250628_151326
Multi-file Dataset Merge
For datasets like MMLU with multiple files:
ais_bench --models model1 --datasets mmlu_merged
Repeated Inference for pass@k
ais_bench --models model1 --datasets dataset1 --repeat-n 5
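Repeating inference yields the n samples per prompt that pass@k estimation needs. For reference, the standard unbiased estimator computes pass@k from n samples of which c are correct; the sketch below illustrates the formula and is not AISBench's scorer:

# Unbiased pass@k estimator: probability that at least one of k samples
# drawn (without replacement) from n total, c of them correct, is correct.
# pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 samples per prompt (--repeat-n 5), 2 of them correct:
print(pass_at_k(5, 2, 1))  # 0.4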
Troubleshooting
Installation Issues
- Python version mismatch: Use Python 3.10, 3.11, or 3.12
- Dependency conflicts: Use a conda environment
- bfcl_eval pathlib issue: Install with the `--no-deps` flag
Runtime Issues
- Model connection failed: Check `host_ip`, `host_port`, and the service status (see the reachability check below)
- Dataset not found: Download the dataset to `ais_bench/datasets/`
- Memory issues: Reduce `batch_size` or use a smaller dataset
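For connection failures, it helps to confirm the endpoint is reachable before editing the model configuration; a quick check (host and port are example values matching the sample config):

# Quick reachability check for the inference service endpoint.
# Replace host/port with the values from your model config.
import socket

host, port = "localhost", 8080
try:
    with socket.create_connection((host, port), timeout=3):
        print(f"{host}:{port} is reachable")
except OSError as e:
    print(f"cannot reach {host}:{port}: {e}")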
Helper Scripts
Quick utility scripts for common operations:
| Script | Description |
|---|---|
| scripts/check_env.sh | Verify environment setup |
| scripts/run_accuracy_test.sh | Quick accuracy test runner |
| scripts/run_performance_test.sh | Quick performance test runner |
| scripts/parse_results.py | Parse and summarize results |
# Check environment
bash scripts/check_env.sh
# Quick accuracy test
bash scripts/run_accuracy_test.sh vllm_api_general_chat demo_gsm8k --host-port 8080
# Quick performance test
bash scripts/run_performance_test.sh vllm_api_general_chat sharegpt --concurrency 100
# Parse results
python scripts/parse_results.py outputs/default/20250628_151326
References
Detailed documentation for specific use cases:
- Model Configuration Reference: All model types (vLLM, MindIE, Triton, TGI, offline) with parameter explanations
- CLI Reference: Complete CLI options for accuracy and performance evaluation
Templates
Ready-to-use templates for custom evaluation:
| Template | Description |
|---|---|
| assets/model_config_template.py | Model configuration template |
| assets/custom_qa_template.jsonl | QA dataset template |
| assets/custom_mcq_template.csv | Multiple choice dataset template |
| assets/custom_meta_template.json | Dataset metadata template |