MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so — Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time, supports 20 languages, voice cloning, streaming inference, and runs on CPU without a GPU.

Installation

Conda (recommended)

conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .

Fix WeTextProcessing if it fails

conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

After pip install -e . the moss-tts-nano CLI command is available in the active environment.

Model Weights

Models are auto-downloaded from Hugging Face on first run:

TTS model: OpenMOSS-Team/MOSS-TTS-Nano
Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano

ModelScope mirrors are available at openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

CLI Commands

Generate speech (voice clone mode)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Output defaults to generated_audio/moss_tts_nano_output.wav.

Generate from a text file (long-form)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav

Launch local web demo

moss-tts-nano serve
# or directly:
python app.py

Opens at http://127.0.0.1:18083 — model stays loaded in memory for fast repeated requests.

Direct Python entrypoint

python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."

Output: generated_audio/infer_output.wav

Python API Usage

Basic voice clone inference

from infer import MossTTSNanoInference

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
import soundfile as sf
sf.write("output.wav", audio, samplerate=48000)

English voice clone

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()

chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)

Long-text synthesis with chunked voice cloning

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""

audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (moss-tts-nano serve or python app.py):

import requests
import base64
import soundfile as sf
import io
import numpy as np

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好，这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)

data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])

audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)

Streaming HTTP response (real-time web playback)

import requests

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = __import__("base64").b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例，适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)

Supported Languages

Code	Language	Code	Language	Code	Language
zh	Chinese	en	English	de	German
es	Spanish	fr	French	ja	Japanese
it	Italian	hu	Hungarian	ko	Korean
ru	Russian	fa	Persian	ar	Arabic
pl	Polish	pt	Portuguese	cs	Czech
da	Danish	sv	Swedish	el	Greek
tr	Turkish

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.

Architecture Overview

Pipeline: Audio Tokenizer + LLM (pure autoregressive)
Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
Output: 48 kHz, 2-channel (stereo)
Token rate: 12.5 Hz token stream
Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
LLM: ~0.1B parameters total

Key CLI Flags

Flag	Alias	Description
`--prompt-audio-path`	—	Path to reference WAV for voice cloning (`infer.py`)
`--prompt-speech`	—	Same purpose in `moss-tts-nano generate` CLI
`--text`	—	Input text string
`--text-file`	—	Path to plain text file for long-form synthesis
`--output`	—	Output WAV file path (default varies by entrypoint)

Common Patterns

Pattern: Batch synthesis with one reference voice

from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

sentences = [
    "第一句话，用于批量合成测试。",
    "第二句话，保持相同的音色。",
    "第三句话，输出独立的音频文件。",
]

for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

buffer = []
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()

Pattern: Gradio integration

import gradio as gr
import soundfile as sf
import numpy as np
import io
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)

demo.launch()

Troubleshooting

WeTextProcessing install fails

# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set HF_ENDPOINT to a mirror if Hugging Face is unreachable:

export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"

Or use ModelScope:

pip install modelscope

Then point model paths to openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

Out of memory on CPU

Use streaming inference (infer_stream) to reduce peak memory.
Reduce chunk size for long text inputs — the model handles chunked voice cloning automatically.
Close other applications; the model needs ~1–2 GB RAM.

Audio output is silent or corrupt

Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, any sample rate (it will be resampled).
Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
Avoid reference audio with heavy background noise.

`moss-tts-nano` command not found

# Re-run editable install inside the active conda env
pip install -e .
which moss-tts-nano   # should resolve now

Port conflict for web demo

# Default port is 18083; check what occupies it
lsof -i :18083
# Kill if needed, then relaunch
moss-tts-nano serve

Output Defaults

Entrypoint	Default output path
`python infer.py`	`generated_audio/infer_output.wav`
`moss-tts-nano generate`	`generated_audio/moss_tts_nano_output.wav`
`python app.py` / `moss-tts-nano serve`	returned via HTTP response

The generated_audio/ directory is created automatically if it does not exist.

moss-tts-nano-speech