piper-tts-training
Piper TTS Voice Training
Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.
Overview
Piper produces fast, offline TTS suitable for embedded devices. Training involves:
- Corpus preparation (text covering phonetic range)
- Audio generation or recording
- Quality validation via Whisper transcription
- Fine-tuning from existing checkpoint (recommended) or training from scratch
- ONNX export for deployment
Fine-tuning vs from-scratch:
- Fine-tuning: ~1,300 phrases + 1,000 epochs (days on modest GPU)
- From scratch: ~13,000+ phrases + 2,000+ epochs (weeks/months)
Workflow
1. Corpus Preparation
Gather 1,300-1,500+ phrases covering broad phonetic range:
- Use piper-recording-studio corpus as base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths
Critical for non-US English: Ensure corpus uses correct regional spelling. See Localisation.
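Corpus size and length variety can be checked before any audio is produced. The following is a minimal sketch: the function name `corpus_stats` and the word-count thresholds are illustrative assumptions, not part of Piper or piper-recording-studio.

```python
from collections import Counter


def corpus_stats(phrases):
    """Summarise a corpus: unique phrase count and a word-length histogram."""
    # Drop blank lines and exact duplicates while preserving order.
    unique = list(dict.fromkeys(p.strip() for p in phrases if p.strip()))
    # Bucket phrases by word count (hypothetical thresholds).
    buckets = Counter()
    for phrase in unique:
        n_words = len(phrase.split())
        if n_words <= 5:
            buckets["short"] += 1
        elif n_words <= 15:
            buckets["medium"] += 1
        else:
            buckets["long"] += 1
    return {"unique_phrases": len(unique), "length_buckets": dict(buckets)}


if __name__ == "__main__":
    sample = [
        "Turn on the kitchen lights.",
        "Turn on the kitchen lights.",  # duplicate, counted once
        "The colour of the harbour changes as the afternoon light fades over the water.",
        "",
    ]
    print(corpus_stats(sample))
```

A corpus that lands almost entirely in one bucket is a hint that more varied sentence lengths are worth adding.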
2. Audio Generation
Generate or record training audio at 22050Hz mono WAV.
If using voice cloning (e.g., Chatterbox TTS):
- Generate at source sample rate (often 24kHz)
- Convert to 22050Hz:
sox -v 0.95 input.wav -r 22050 -t wav output.wav
- The -v 0.95 option prevents clipping during resampling
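For a whole directory of generated clips, the same sox call can be wrapped in a short script. This is a sketch under assumptions: the helper names are hypothetical, `sox` must be on the PATH, and the arguments simply mirror the single-file command above.

```python
import subprocess
from pathlib import Path


def sox_resample_command(src, dst, rate=22050, volume=0.95):
    """Build the sox argument list for resampling one file."""
    return ["sox", "-v", str(volume), str(src), "-r", str(rate), "-t", "wav", str(dst)]


def resample_directory(src_dir, dst_dir):
    """Resample every .wav in src_dir into dst_dir (requires sox on PATH)."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        cmd = sox_resample_command(src, dst_dir / src.name)
        subprocess.run(cmd, check=True)  # raises CalledProcessError if sox fails
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.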
Recording requirements:
- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not reading voice)
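Whether the audio is recorded or generated, it can help to confirm that each file matches the 22050Hz mono target before moving on. The sketch below uses Python's standard `wave` module, which only reads PCM WAV files; the function name `check_wav_format` is an assumption for illustration.

```python
import wave


def check_wav_format(path, expected_rate=22050, expected_channels=1):
    """Return a list of problems found in a PCM WAV file (empty if it conforms)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate {wav.getframerate()} != {expected_rate}")
        if wav.getnchannels() != expected_channels:
            problems.append(f"{wav.getnchannels()} channels != {expected_channels}")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Files that report a sample-rate mismatch are candidates for the sox conversion described in this step.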
3. Quality Validation with Whisper
Automate quality checks rather than manual listening:
import whisper
from piper_phonemize import phonemize_espeak  # espeak-ng phonemiser from piper-phonemize

# Load the Whisper model once and reuse it for every sample
model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    """Return True if the audio transcribes to the same phonemes as expected_text."""
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_espeak(expected_text, "en-gb")
    transcribed_phonemes = phonemize_espeak(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes
Retry failed samples up to 3 times. Target 95%+ dataset coverage.
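The retry-and-coverage rule can be sketched as a small loop. Everything here is an assumption for illustration: `generate` and `validate` are caller-supplied callables (for example a voice-cloning call and the `validate_sample` function above), and retrying only makes sense when audio is regenerated rather than recorded by hand.

```python
def collect_valid_samples(items, generate, validate, max_attempts=3):
    """Generate audio for each (sample_id, text), retrying failed validations.

    `generate(sample_id, text)` returns an audio path; `validate(path, text)`
    returns True or False. Both are hypothetical, caller-supplied callables.
    """
    accepted, rejected = {}, []
    for sample_id, text in items:
        for _attempt in range(max_attempts):
            path = generate(sample_id, text)
            if validate(path, text):
                accepted[sample_id] = path
                break
        else:
            # No attempt passed validation within max_attempts.
            rejected.append(sample_id)
    coverage = len(accepted) / len(items) if items else 0.0
    return accepted, rejected, coverage
```

With the 95% target, a run would be accepted when the returned `coverage` is at least 0.95; the `rejected` list shows which phrases to drop or rewrite.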
4. Dataset Format (LJSpeech)
Structure your dataset:
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
metadata.csv format: {id}|{text} (pipe-separated, no headers)
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
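Before preprocessing, a quick consistency check can catch malformed rows and missing audio. This is a minimal sketch that assumes the single-speaker `id|text` layout described above; the function name `check_ljspeech_dataset` is hypothetical.

```python
from pathlib import Path


def check_ljspeech_dataset(dataset_dir):
    """Return a list of problems in an LJSpeech-style dataset directory."""
    dataset_dir = Path(dataset_dir)
    problems, seen_ids = [], set()
    lines = (dataset_dir / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        # Split on the first pipe: id on the left, text on the right.
        sample_id, sep, text = line.partition("|")
        if not sep or not sample_id.strip() or not text.strip():
            problems.append(f"line {line_no}: expected 'id|text'")
            continue
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {sample_id}")
        seen_ids.add(sample_id)
        if not (dataset_dir / "wavs" / f"{sample_id}.wav").is_file():
            problems.append(f"line {line_no}: missing wavs/{sample_id}.wav")
    return problems
```

An empty list means every row has an id, some text, and a matching file under wavs/.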
5. Preprocessing
Convert to PyTorch tensors:
python3 -m piper_train.preprocess \
  --language en-gb \
  --input-dir dataset/ \
  --output-dir piper_training_dir/ \
  --dataset-format ljspeech
Use en-gb for Australian/NZ/UK voices (espeak-ng phoneme set).
6. Training
Fine-tuning (recommended):
python3 -m piper_train \
  --dataset-dir piper_training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 12 \
  --max_epochs 3000 \
  --resume_from_checkpoint ljspeech-2000.ckpt \
  --checkpoint-epochs 100 \
  --quality high \
  --precision 32
Key parameters:
- --batch-size: Reduce if VRAM limited (12 works on 8GB)
- --resume_from_checkpoint: Start from LJSpeech high-quality checkpoint
- --precision 32: More stable than mixed precision
- --validation-split 0.0 --num-test-examples 0: Skip validation for small datasets
Monitor with TensorBoard: watch loss_disc_all for convergence.
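One detail in the fine-tuning command is easy to miss: PyTorch Lightning treats `--max_epochs` as a total, and a resumed run continues counting from the epoch stored in the checkpoint. The arithmetic below is a rough sketch. It assumes `ljspeech-2000.ckpt` was saved at epoch 2000, which is inferred from the file name and not confirmed by the checkpoint itself.

```python
def additional_epochs(max_epochs, checkpoint_epoch):
    """Epochs that will actually run when resuming, given a total epoch cap."""
    remaining = max_epochs - checkpoint_epoch
    if remaining <= 0:
        raise ValueError(
            f"max_epochs={max_epochs} is not above the checkpoint epoch "
            f"({checkpoint_epoch}); training would stop immediately"
        )
    return remaining


# Assumed: the checkpoint was saved at epoch 2000 (taken from its file name).
print(additional_epochs(max_epochs=3000, checkpoint_epoch=2000))  # 1000
```

Under that assumption, `--max_epochs 3000` corresponds to the roughly 1,000 fine-tuning epochs mentioned in the overview.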
7. ONNX Export
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
Create metadata file output.onnx.json from training config.json.
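A minimal sketch of the metadata step follows. It assumes `config.json` sits in the training directory (piper_training_dir/ in this guide) and can be used unchanged; the function name `write_onnx_metadata` is hypothetical.

```python
import json
import shutil
from pathlib import Path


def write_onnx_metadata(training_dir, onnx_path):
    """Copy the training config.json next to the model as <model>.onnx.json."""
    config_path = Path(training_dir) / "config.json"
    # Fail early if the config is missing or not valid JSON.
    with open(config_path, encoding="utf-8") as f:
        json.load(f)
    target = Path(str(onnx_path) + ".json")
    shutil.copyfile(config_path, target)
    return target
```

Calling `write_onnx_metadata("piper_training_dir", "output.onnx")` would produce `output.onnx.json` alongside the exported model.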
Localisation for Australian, New Zealand and UK English
Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.
Corpus preparation:
- Run scripts/convert_spelling.py on corpus text before training
- Use en-gb or en-au espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms
Common spelling conversions:
| American | Australian/UK |
|---|---|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |
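The suffix pairs in the table are patterns, not safe blanket substitutions: "doctor" and "actor" keep -or in every variety. A whole-word lookup is one way to apply them without such false positives. The sketch below is illustrative only; it is not the contents of scripts/convert_spelling.py, and the five-entry word list is a placeholder assumption.

```python
import re

# Illustrative entries only; a real list would be far longer.
US_TO_UK = {
    "color": "colour",
    "organize": "organise",
    "center": "centre",
    "catalog": "catalogue",
    "defense": "defence",
}

_WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, US_TO_UK)) + r")\b", re.IGNORECASE
)


def _match_case(replacement, original):
    """Carry over all-caps or leading-capital casing from the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def convert_spelling(text):
    """Replace whole words found in US_TO_UK, leaving everything else intact."""
    return _WORD_RE.sub(
        lambda m: _match_case(US_TO_UK[m.group(0).lower()], m.group(0)), text
    )


print(convert_spelling("Color the center of the doctor's catalog."))
# Colour the centre of the doctor's catalogue.
```

Inflected forms such as "colors" or "organized" are not covered by this lookup and would need their own entries.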
Phoneme considerations:
- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)
For complete word lists and phonetic details, see references/localisation.md.
Validation: Use Whisper with language="en" and verify transcriptions match expected regional forms.
Dependencies
Pin versions to avoid API breakage:
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
Docker containerisation recommended for reproducibility.
Hardware Requirements
Minimum (fine-tuning):
- 8GB VRAM GPU (Pascal or newer)
- 8GB system RAM
- ~5 days for 1,000 epochs on Tesla P4
From scratch: Multiply time by ~20x (roughly 10x the phrases and 2x the epochs).
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA OOM | Reduce batch-size (try 8 or 4) |
| Checkpoint won't load | Check pytorch-lightning version matches checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check espeak-ng language code and corpus spelling |