# qwen3-tts-profile

qwen3-tts-rs profiling & benchmarking: run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.
## Prerequisites

- Docker with `--gpus all` support
- `qwen3-tts:latest` Docker image (has Rust toolchain + CUDA)
- Model weights in `test_data/models/` (`1.7B-CustomVoice` is the default)
- `tokenizer.json` must be in the model directory
## Docker Execution Pattern

The CUDA toolchain lives inside the Docker container, so all cargo commands must
run there. The workspace is bind-mounted at `/workspace`:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
```
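Since every mode below reuses this pattern with a different cargo command, it can be wrapped in a small helper. A minimal sketch (the `container_argv`/`run_in_container` helpers are illustrative, not part of the repo; the toolchain path matches the aarch64 image above):

```python
import os
import subprocess

# Toolchain path baked into the Docker image (aarch64; adjust for x86_64).
TOOLCHAIN = "/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin"

def container_argv(command: str) -> list[str]:
    """Compose the docker argv that runs `command` inside qwen3-tts:latest."""
    inner = f"export PATH={TOOLCHAIN}:$PATH && {command}"
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--entrypoint", "/bin/bash",
        "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace",
        "qwen3-tts:latest", "-c", inner,
    ]

def run_in_container(command: str) -> int:
    """Run the command in the container; returns its exit code."""
    return subprocess.run(container_argv(command)).returncode
```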
## Profiling Modes

### 1. Chrome Trace (default: best for span hierarchy)

Produces `trace.json` for viewing in `chrome://tracing` or https://ui.perfetto.dev.

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
      --model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
```
Output: `trace.json` (~12 MB for 3 sentences). Contains spans:

- `generate_frames`: full generation loop
- `code_predictor` / `code_predictor_inner`: per-frame acoustic code generation
- `talker_step`: per-frame transformer forward pass
- `sampling` / `top_k` / `top_p`: per-frame token sampling
- `gpu_sync` trace events: mark every `to_vec1()` GPU→CPU sync
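Besides the visual timeline, per-span totals can be summed directly from the trace file. A minimal sketch, assuming the trace is a Chrome-format event list using complete events (`ph == "X"`, `dur` in microseconds); if the file uses B/E begin/end pairs instead, the events would need to be matched up by thread:

```python
import json
from collections import defaultdict

def span_totals(trace_path: str) -> dict[str, float]:
    """Sum total duration in milliseconds per span name from a Chrome trace.

    Assumes complete events (ph == "X") with `dur` in microseconds.
    """
    with open(trace_path) as f:
        data = json.load(f)
    # Chrome traces are either a bare event array or {"traceEvents": [...]}.
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals: dict[str, float] = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X":
            totals[ev["name"]] += ev.get("dur", 0) / 1000.0  # us -> ms
    return dict(totals)
```

Sorting the result by value gives a quick "where did the time go" summary without opening Perfetto.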
### 2. Per-Stage Timing (no profiling feature needed)

The `e2e_bench` binary reports stage breakdowns (prefill / generation / decode)
even without the profiling feature:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --release --features=cuda,cli --bin e2e_bench -- \
      --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```
### 3. Streaming TTFA (Time to First Audio)

```bash
# Add --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --iterations 3 --warmup 1 --streaming
```
### 4. JSON Output

```bash
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --json-output results.json --iterations 3
```
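The JSON output is handy for diffing runs before and after an optimization. A sketch of such a comparison; the schema assumed here (a list of objects with `label` and `rtf` fields) is a guess, so adjust the field names to whatever `e2e_bench` actually writes:

```python
import json

def compare_runs(baseline_path: str, current_path: str) -> list[str]:
    """Report per-label RTF changes between two e2e_bench JSON outputs.

    Assumes each file holds a list of {"label": ..., "rtf": ...} records;
    verify against the real results.json schema before relying on this.
    """
    def by_label(path: str) -> dict[str, float]:
        with open(path) as f:
            return {r["label"]: r["rtf"] for r in json.load(f)}

    base, cur = by_label(baseline_path), by_label(current_path)
    lines = []
    for label in sorted(base.keys() & cur.keys()):
        delta = cur[label] - base[label]
        flag = "  <-- regression" if delta > 0.05 else ""
        lines.append(f"{label}: {base[label]:.3f} -> {cur[label]:.3f}{flag}")
    return lines
```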
## GPU Sync Audit

List all `to_vec1()` GPU→CPU synchronization points:

```bash
bash scripts/audit-gpu-syncs.sh
```
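The core of such an audit is just locating every `to_vec1` call site in the Rust sources. A rough stand-in sketch (the real `scripts/audit-gpu-syncs.sh` may also track other sync points such as `to_scalar()` or filter out test code):

```python
from pathlib import Path

def audit_gpu_syncs(src_root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, source line) for every to_vec1 call
    found in .rs files under src_root."""
    hits = []
    for path in sorted(Path(src_root).rglob("*.rs")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if "to_vec1" in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```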
## Interpreting Results

### Stage Breakdown Table

| Label  | Words | Wall (ms) | Audio (s) |   RTF | Tok/s | Mem (MB) | Prefill   | Generate     | Decode       |
|--------|------:|----------:|----------:|------:|------:|---------:|-----------|--------------|--------------|
| short  |    13 |    5235.2 |      3.68 | 1.423 |   8.8 |      858 | 21ms (1%) | 2724ms (71%) | 1109ms (29%) |
| medium |    53 |   23786.3 |     34.00 | 0.700 |  17.9 |      859 | 20ms (0%) | 22694ms (95%)| 1057ms (4%)  |
| long   |   115 |   43797.4 |     60.96 | 0.718 |  17.4 |      864 | 19ms (0%) | 41861ms (96%)| 1886ms (4%)  |
Key metrics:

- **RTF**: wall time divided by audio duration; RTF < 1.0 means faster than real-time.
- **Prefill**: should be <50 ms on GPU. If high, check embedding/attention.
- **Generation**: dominates; ~18 GPU→CPU syncs per frame (16 code_predictor + 2 sampling).
- **Decode**: ConvNeXt decoder; scales with frame count (~4% for long text).
- **Tok/s**: semantic tokens per second; higher is better.
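RTF is just wall-clock time over produced audio, so the table values can be reproduced directly:

```python
def rtf(wall_ms: float, audio_s: float) -> float:
    """Real-time factor: seconds of wall time per second of audio.
    Values below 1.0 mean generation is faster than real-time."""
    return (wall_ms / 1000.0) / audio_s

# Values from the stage-breakdown table above:
print(round(rtf(5235.2, 3.68), 3))    # short  -> 1.423
print(round(rtf(23786.3, 34.00), 3))  # medium -> 0.7
```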
### Chrome Trace Analysis

In Perfetto / chrome://tracing:

- Look for gaps between `talker_step` and `code_predictor` spans: that's CPU overhead.
- Check whether `sampling` (top_k + top_p) is significant relative to the model forward passes.
- The `gpu_sync` events mark where the GPU stalls waiting for the CPU.
## Optimization Targets

The ~18 `to_vec1()` calls per frame are the main bottleneck:

- 16 in code_predictor (one argmax per acoustic code group)
- 2 in sampling (reading the sampled token)

Batch these to reduce GPU→CPU round-trips.
## Model Variants

| Model | Dir | Notes |
|---|---|---|
| 1.7B-CustomVoice | `test_data/models/1.7B-CustomVoice` | Default benchmark target |
| 1.7B-Base | `test_data/models/1.7B-Base` | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | `test_data/models/1.7B-VoiceDesign` | Text-described voices |
## Reference Baseline (1.7B-CustomVoice, CUDA)

From January 2025 on DGX (A100):

- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill ~20 ms; decode ~1-2 s; generation 71-96% of wall time