Qwen3-TTS
Build text-to-speech applications using Qwen3-TTS from Alibaba Qwen. Reference the local repository at D:\code\qwen3-tts for source code and examples.
Quick Reference
| Task | Model | Method |
|---|---|---|
| Custom voice with preset speakers | CustomVoice | generate_custom_voice() |
| Design new voice via description | VoiceDesign | generate_voice_design() |
| Clone voice from audio sample | Base | generate_voice_clone() |
| Encode/decode audio | Tokenizer | encode() / decode() |
Environment Setup
# Create fresh environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
# Install package
pip install -U qwen-tts
# Optional: FlashAttention 2 for reduced GPU memory
pip install -U flash-attn --no-build-isolation
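Before loading any model, it can help to confirm that PyTorch sees the GPU; a minimal check, assuming a CUDA build of PyTorch is installed:
import torch
print(torch.cuda.is_available())  # should print True before using device_map="cuda:0"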
Available Models
| Model | Features |
|---|---|
| Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 preset speakers, instruction control |
| Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Create voices from natural language descriptions |
| Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning, fine-tuning base |
| Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | Smaller custom voice model |
| Qwen/Qwen3-TTS-12Hz-0.6B-Base | Smaller base model for cloning/fine-tuning |
| Qwen/Qwen3-TTS-Tokenizer-12Hz | Audio encoder/decoder |
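If GPU memory is tight, the 0.6B variants are the natural substitute; a minimal sketch, assuming they load with the same from_pretrained arguments as the 1.7B examples below:
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",  # smaller variant; same API as 1.7B assumed
    device_map="cuda:0",
    dtype=torch.bfloat16,
)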
Task Workflows
1. Custom Voice Generation
Use preset speakers with optional style instructions.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
# Single generation
wavs, sr = model.generate_custom_voice(
    text="Hello, how are you today?",
    language="English",  # Or "Auto" for auto-detection
    speaker="Ryan",
    instruct="Speak with enthusiasm",  # Optional style control
)
sf.write("output.wav", wavs[0], sr)

# Batch generation
wavs, sr = model.generate_custom_voice(
    text=["First sentence.", "Second sentence."],
    language=["English", "English"],
    speaker=["Ryan", "Aiden"],
    instruct=["Happy tone", "Calm tone"],
)
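The batch call returns one waveform per input; assuming wavs is a list of arrays, as the wavs[0] indexing above suggests, each result can be written out in turn:
for i, wav in enumerate(wavs):
    sf.write(f"batch_{i}.wav", wav, sr)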
Available Speakers:
| Speaker | Description | Native Language |
|---|---|---|
| Vivian | Bright, edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Low, mellow mature male | Chinese |
| Dylan | Youthful Beijing male | Chinese (Beijing) |
| Eric | Lively Chengdu male | Chinese (Sichuan) |
| Ryan | Dynamic male with rhythmic drive | English |
| Aiden | Sunny American male | English |
| Ono_Anna | Playful Japanese female | Japanese |
| Sohee | Warm Korean female | Korean |
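To audition the presets, one can loop over the speaker names from the table above; a sketch reusing the CustomVoice model loaded earlier (the sample text and filenames are illustrative):
speakers = ["Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric",
            "Ryan", "Aiden", "Ono_Anna", "Sohee"]
for name in speakers:
    wavs, sr = model.generate_custom_voice(
        text="A quick sample of this voice.",
        language="Auto",  # auto-detect rather than hard-coding each native language
        speaker=name,
    )
    sf.write(f"preview_{name}.wav", wavs[0], sr)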
2. Voice Design
Create new voices from natural language descriptions.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wavs, sr = model.generate_voice_design(
    text="Welcome to our presentation today.",
    language="English",
    instruct="Professional male voice, warm baritone, confident and clear",
)
sf.write("designed_voice.wav", wavs[0], sr)
3. Voice Cloning
Clone a voice from a reference audio sample (3+ seconds recommended).
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Direct cloning
wavs, sr = model.generate_voice_clone(
    text="This is the cloned voice speaking.",
    language="English",
    ref_audio="path/to/reference.wav",  # Or URL or (numpy_array, sr) tuple
    ref_text="Transcript of the reference audio.",
)
sf.write("cloned.wav", wavs[0], sr)

# Reusable clone prompt (for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="path/to/reference.wav",
    ref_text="Transcript of the reference audio.",
)
wavs, sr = model.generate_voice_clone(
    text="Another sentence with the same voice.",
    language="English",
    voice_clone_prompt=prompt,
)
4. Voice Design + Clone Workflow
Design a voice, then reuse it across multiple generations.
# Step 1: Design the voice
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
ref_text = "Sample text for the reference audio."
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct="Young energetic male, tenor range",
)

# Step 2: Create reusable clone prompt
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)
# Step 3: Generate multiple outputs with consistent voice
for i, sentence in enumerate(["First line.", "Second line.", "Third line."]):
    wavs, sr = clone_model.generate_voice_clone(
        text=sentence,
        language="English",
        voice_clone_prompt=prompt,
    )
    sf.write(f"line_{i}.wav", wavs[0], sr)
5. Audio Tokenization
Encode and decode audio for transport or processing.
from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf
tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

# Encode audio (accepts path, URL, numpy array, or base64)
enc = tokenizer.encode("path/to/audio.wav")

# Decode back to waveform
wavs, sr = tokenizer.decode(enc)
sf.write("reconstructed.wav", wavs[0], sr)
Generation Parameters
Common parameters for all generate_* methods:
wavs, sr = model.generate_custom_voice(
    text="...",
    language="Auto",
    speaker="Ryan",
    max_new_tokens=2048,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.9,
    repetition_penalty=1.05,
)
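These are standard sampling knobs; lowering temperature and top_k should trade expressiveness for stability (an assumption carried over from autoregressive decoding in general, not a documented guarantee of this library):
wavs, sr = model.generate_custom_voice(
    text="A more conservative rendering.",
    language="Auto",
    speaker="Ryan",
    do_sample=True,
    top_k=20,         # narrower candidate pool than the default 50
    temperature=0.7,  # flatter sampling than the default 0.9
)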
Web UI Demo
Launch local Gradio demo:
# CustomVoice demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
# VoiceDesign demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000
# Base (voice clone) demo - requires HTTPS for microphone
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000 \
    --ssl-certfile cert.pem --ssl-keyfile key.pem --no-ssl-verify
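cert.pem and key.pem are not shipped with the package; for local testing, a self-signed pair can be generated with standard OpenSSL (which is presumably why --no-ssl-verify is passed above):
# Self-signed certificate for local HTTPS testing
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem \
    -days 365 -nodes -subj "/CN=localhost"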
Supported Languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Pass language="Auto" for automatic detection, or specify explicitly for best quality.
References
- Fine-tuning guide: See references/finetuning.md for training custom speakers
- API details: See references/api-reference.md for complete method signatures
- Local repo: D:\code\qwen3-tts contains source code and examples