qwen3-tts

SKILL.md

Qwen3-TTS

Build text-to-speech applications using Qwen3-TTS from Alibaba Qwen. Reference the local repository at D:\code\qwen3-tts for source code and examples.

Quick Reference

Task Model Method
Custom voice with preset speakers CustomVoice generate_custom_voice()
Design new voice via description VoiceDesign generate_voice_design()
Clone voice from audio sample Base generate_voice_clone()
Encode/decode audio Tokenizer encode() / decode()

Environment Setup

# Create fresh environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install package
pip install -U qwen-tts

# Optional: FlashAttention 2 for reduced GPU memory
pip install -U flash-attn --no-build-isolation

Available Models

Model Features
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice 9 preset speakers, instruction control
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign Create voices from natural language descriptions
Qwen/Qwen3-TTS-12Hz-1.7B-Base Voice cloning, fine-tuning base
Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice Smaller custom voice model
Qwen/Qwen3-TTS-12Hz-0.6B-Base Smaller base model for cloning/fine-tuning
Qwen/Qwen3-TTS-Tokenizer-12Hz Audio encoder/decoder

Task Workflows

1. Custom Voice Generation

Use preset speakers with optional style instructions.

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Single generation
wavs, sr = model.generate_custom_voice(
    text="Hello, how are you today?",
    language="English",  # Or "Auto" for auto-detection
    speaker="Ryan",
    instruct="Speak with enthusiasm",  # Optional style control
)
sf.write("output.wav", wavs[0], sr)

# Batch generation
wavs, sr = model.generate_custom_voice(
    text=["First sentence.", "Second sentence."],
    language=["English", "English"],
    speaker=["Ryan", "Aiden"],
    instruct=["Happy tone", "Calm tone"],
)

Available Speakers:

Speaker Description Native Language
Vivian Bright, edgy young female Chinese
Serena Warm, gentle young female Chinese
Uncle_Fu Low, mellow mature male Chinese
Dylan Youthful Beijing male Chinese (Beijing)
Eric Lively Chengdu male Chinese (Sichuan)
Ryan Dynamic male with rhythmic drive English
Aiden Sunny American male English
Ono_Anna Playful Japanese female Japanese
Sohee Warm Korean female Korean

2. Voice Design

Create new voices from natural language descriptions.

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="Welcome to our presentation today.",
    language="English",
    instruct="Professional male voice, warm baritone, confident and clear",
)
sf.write("designed_voice.wav", wavs[0], sr)

3. Voice Cloning

Clone a voice from a reference audio sample (3+ seconds recommended).

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Direct cloning
wavs, sr = model.generate_voice_clone(
    text="This is the cloned voice speaking.",
    language="English",
    ref_audio="path/to/reference.wav",  # Or URL or (numpy_array, sr) tuple
    ref_text="Transcript of the reference audio.",
)
sf.write("cloned.wav", wavs[0], sr)

# Reusable clone prompt (for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="path/to/reference.wav",
    ref_text="Transcript of the reference audio.",
)
wavs, sr = model.generate_voice_clone(
    text="Another sentence with the same voice.",
    language="English",
    voice_clone_prompt=prompt,
)

4. Voice Design + Clone Workflow

Design a voice, then reuse it across multiple generations.

# Step 1: Design the voice
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_text = "Sample text for the reference audio."
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct="Young energetic male, tenor range",
)

# Step 2: Create reusable clone prompt
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)

# Step 3: Generate multiple outputs with consistent voice
for sentence in ["First line.", "Second line.", "Third line."]:
    wavs, sr = clone_model.generate_voice_clone(
        text=sentence,
        language="English",
        voice_clone_prompt=prompt,
    )

5. Audio Tokenization

Encode and decode audio for transport or processing.

from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

# Encode audio (accepts path, URL, numpy array, or base64)
enc = tokenizer.encode("path/to/audio.wav")

# Decode back to waveform
wavs, sr = tokenizer.decode(enc)
sf.write("reconstructed.wav", wavs[0], sr)

Generation Parameters

Common parameters for all generate_* methods:

wavs, sr = model.generate_custom_voice(
    text="...",
    language="Auto",
    speaker="Ryan",
    max_new_tokens=2048,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.9,
    repetition_penalty=1.05,
)

Web UI Demo

Launch local Gradio demo:

# CustomVoice demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

# VoiceDesign demo
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000

# Base (voice clone) demo - requires HTTPS for microphone
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem --ssl-keyfile key.pem --no-ssl-verify

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Pass language="Auto" for automatic detection, or specify explicitly for best quality.

References

Weekly Installs
8
GitHub Stars
2
First Seen
Jan 25, 2026
Installed on
opencode8
codex8
github-copilot8
claude-code8
antigravity8
gemini-cli8