Text2Speech Skill
Generate high-quality text-to-speech audio using Qwen3-TTS models.
Prerequisites
- Python 3.8+
- requests package
- Access to TTSWeb API (https://mc.agaii.org/TTS)
Installation
Via npm (Node.js)
npm install -g @catfishw/text2speech-skill
Via pip (Python)
pip install git+https://github.com/CatfishW/TTSAgentSkill.git
Direct Usage
python3 -m text2speech_skill.cli --help
Quick Start
Speak with Preset Speaker
text2speech speak "Hello world" -s vivian -o hello.wav
Design Custom Voice
text2speech design "Welcome to the future" \
-d "futuristic female AI assistant, clear and professional" \
-o welcome.wav
Clone Voice from Audio
text2speech clone "This is my cloned voice speaking" \
-a reference.wav \
-r "original transcript of reference audio" \
-o cloned.wav
Clone with Preset Timbre
text2speech clone "Hello" -t ryan -o output.wav
Commands
speak
Text-to-speech with preset speaker voices.
text2speech speak <text> [options]
Options:
-s, --speaker Speaker name (default: vivian)
-l, --language Language code (default: Auto)
-i, --instruct Style instruction (e.g., "speak cheerfully")
-o, --output Output audio file (required)
Speakers: vivian, ryan, aiden, dylan, eric, ono_anna, serena, sohee, uncle_fu
Examples:
text2speech speak "Hello" -s vivian -o hello.wav
text2speech speak "Bonjour" -s serena -l French -o bonjour.wav
text2speech speak "Hi" -s ryan -i "speak like a news anchor" -o hi.wav
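When speak is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags and speaker names listed above; the helper names build_speak_command and run_speak are hypothetical and are not part of the package.

```python
import shutil
import subprocess
from typing import List, Optional

# Speaker names copied from the list in this document.
SPEAKERS = {
    "vivian", "ryan", "aiden", "dylan", "eric",
    "ono_anna", "serena", "sohee", "uncle_fu",
}


def build_speak_command(
    text: str,
    output: str,
    speaker: str = "vivian",
    language: Optional[str] = None,
    instruct: Optional[str] = None,
) -> List[str]:
    """Return the argv list for `text2speech speak` (hypothetical helper)."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    cmd = ["text2speech", "speak", text, "-s", speaker]
    if language:
        cmd += ["-l", language]
    if instruct:
        cmd += ["-i", instruct]
    cmd += ["-o", output]
    return cmd


def run_speak(text: str, output: str, **options: Optional[str]) -> None:
    """Run the command; assumes the text2speech executable is on PATH."""
    if shutil.which("text2speech") is None:
        raise RuntimeError("text2speech executable not found on PATH")
    subprocess.run(build_speak_command(text, output, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text or the style instruction contains spaces or quotes.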
design
Create voice from natural language description.
text2speech design <text> -d <description> [options]
Options:
-d, --description Voice description (required)
-l, --language Language code
-o, --output Output audio file (required)
Examples:
text2speech design "Hello" -d "old man with raspy voice" -o oldman.wav
text2speech design "Welcome" -d "young energetic female, enthusiastic" -o welcome.wav
clone
Clone voice from reference audio or preset timbre.
text2speech clone <text> [options]
Options:
-a, --audio Reference audio file
-t, --timbre Preset timbre speaker (alternative to audio)
-r, --ref-text Reference transcript (for ICL mode)
-x, --x-vector-only Use x-vector only mode
-i, --instruct Style instruction
-l, --language Language code
-o, --output Output audio file (required)
Examples:
# Clone from audio with transcript (ICL mode)
text2speech clone "Hello" -a ref.wav -r "original text" -o out.wav
# Clone from audio (x-vector only, faster)
text2speech clone "Hello" -a ref.wav -x -o out.wav
# Clone using preset timbre
text2speech clone "Hello" -t ryan -o out.wav
batch-speak
Batch process multiple text files.
text2speech batch-speak <input_dir> <output_dir> [options]
Options:
-s, --speaker Speaker name (default: vivian)
-l, --language Language code
-i, --instruct Style instruction
Input: Directory containing .txt files
Output: Audio files + batch_report.json
Example:
mkdir -p texts output
echo "Hello" > texts/1.txt
echo "World" > texts/2.txt
text2speech batch-speak texts/ output/ -s vivian
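If the texts come from a program rather than from hand-written files, the input directory can be generated. This is a sketch under two assumptions: batch-speak only needs one .txt file per utterance, as described above, and zero-padded file names are a convenient way to keep outputs in a predictable order. The helpers plan_batch_files and write_batch_files are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_batch_files(texts: Iterable[str]) -> Dict[str, str]:
    """Map file names (0001.txt, 0002.txt, ...) to contents, skipping blanks."""
    plan: Dict[str, str] = {}
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            plan[f"{len(plan) + 1:04d}.txt"] = cleaned + "\n"
    return plan


def write_batch_files(plan: Dict[str, str], input_dir: str) -> Path:
    """Write the planned files into input_dir, creating it if needed."""
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    for name, content in plan.items():
        (directory / name).write_text(content, encoding="utf-8")
    return directory
```

For example, write_batch_files(plan_batch_files(["Hello", "World"]), "texts") produces the same layout as the shell commands above, after which text2speech batch-speak texts/ output/ -s vivian can be run as usual.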
batch-clone
Batch clone voice for multiple texts.
text2speech batch-clone <input_dir> <output_dir> -a <audio> [options]
Options:
-a, --audio Reference audio (required)
-r, --ref-text Reference transcript
-l, --language Language code
Example:
text2speech batch-clone texts/ output/ -a reference.wav -r "transcript"
encode
Encode audio to tokens (tokenizer).
text2speech encode <audio> [-o output.json]
Example:
text2speech encode audio.wav -o tokens.json
cat tokens.json | jq '.count'
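The same check can be done from Python when jq is not available. This sketch assumes only what the jq example above implies, namely that the output JSON has a top-level count field; the rest of the file layout is not documented here, so nothing else is read. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def token_count(payload: Dict[str, Any]) -> int:
    """Return the top-level 'count' field of a parsed tokens file."""
    if "count" not in payload:
        raise KeyError("expected a top-level 'count' field")
    return int(payload["count"])


def load_token_count(path: str) -> int:
    """Parse the JSON file written by `text2speech encode` and return its count."""
    with Path(path).open(encoding="utf-8") as handle:
        return token_count(json.load(handle))
```

Calling load_token_count("tokens.json") should then match the value printed by the jq command.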
decode
Decode tokens to audio.
text2speech decode <tokens_file> -o <output>
Example:
text2speech decode tokens.json -o output.wav
status
Check service status.
text2speech status
Shows:
- API health
- GPU availability
- Loaded models
- Speaker count
speakers
List available preset speakers.
text2speech speakers
languages
List supported languages.
text2speech languages
API Configuration
Default API: https://mc.agaii.org/TTS/api/v1
To use a local backend, edit API_BASE in text2speech_skill/cli.py:
API_BASE = "http://localhost:24536/api/v1"
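Editing the installed source works, but the change is lost on reinstall. One possible alternative is to resolve the base URL from an environment variable and fall back to the default. The sketch below is not something the CLI is documented to support: the variable name TEXT2SPEECH_API_BASE and the function resolve_api_base are hypothetical, and only the two URLs come from this document.

```python
import os
from typing import Mapping, Optional

DEFAULT_API_BASE = "https://mc.agaii.org/TTS/api/v1"
LOCAL_API_BASE = "http://localhost:24536/api/v1"

# Hypothetical variable name; the CLI itself may not read it.
ENV_VAR = "TEXT2SPEECH_API_BASE"


def resolve_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the override from env if set, otherwise the default API base."""
    source = os.environ if env is None else env
    value = source.get(ENV_VAR, "").strip()
    # Drop a trailing slash so endpoint paths can be appended consistently.
    return value.rstrip("/") if value else DEFAULT_API_BASE
```

With this in place, cli.py could set API_BASE = resolve_api_base(), and a local backend would be selected by exporting the variable before running the command.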
Voice Cloning Modes
ICL Mode (In-Context Learning)
- Requires reference transcript (--ref-text)
- Higher quality, follows reference prosody
- Default mode when transcript provided
X-Vector Mode
- Use the --x-vector-only flag
- Faster; uses only speaker characteristics
- No transcript needed
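The choice between the two modes, plus the preset-timbre alternative, can be expressed as a small decision function. This is a sketch based only on the clone options and the "ref_text required" troubleshooting entry in this document; build_clone_command is a hypothetical helper, and the real CLI may validate its arguments differently.

```python
from typing import List, Optional


def build_clone_command(
    text: str,
    output: str,
    audio: Optional[str] = None,
    timbre: Optional[str] = None,
    ref_text: Optional[str] = None,
    x_vector_only: bool = False,
) -> List[str]:
    """Return the argv list for `text2speech clone` (hypothetical helper)."""
    # --timbre is described as an alternative to --audio, so require exactly one.
    if (audio is None) == (timbre is None):
        raise ValueError("pass exactly one of audio or timbre")
    cmd = ["text2speech", "clone", text]
    if timbre is not None:
        cmd += ["-t", timbre]
    else:
        cmd += ["-a", audio]
        if x_vector_only:
            cmd.append("-x")          # X-Vector mode: no transcript needed
        elif ref_text:
            cmd += ["-r", ref_text]   # ICL mode: transcript supplied
        else:
            raise ValueError("reference audio needs ref_text or x_vector_only=True")
    cmd += ["-o", output]
    return cmd
```

Failing early in the helper mirrors the "ref_text required" job failure, but reports it before any request is sent.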
Tips
- Use @file.txt syntax to read text from file: text2speech speak @input.txt -o out.wav
- Reference audio should be clear and 5-30 seconds long for best cloning
- ICL mode produces better results than x-vector when transcript is accurate
- Batch operations save a batch_report.json with results
Troubleshooting
Job fails with "ref_text required"
→ Add --ref-text with transcript or use --x-vector-only
Audio quality is poor
→ Use clearer reference audio, or try different speaker/timbre
Timeout on long text
→ Break into smaller chunks, or use batch mode
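One way to break long text into smaller chunks is to split at sentence boundaries and pack sentences up to a size limit. The sketch below makes two assumptions: the 500-character default is arbitrary, since this document does not state a length limit, and sentence ends are detected with a simple punctuation rule that will not suit every language. The name split_into_chunks is hypothetical.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be saved as its own .txt file and processed with batch-speak, or passed to speak one at a time.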