Inworld AI

Text-to-Speech platform with voice cloning, audio markups, and timestamp alignment.

Quick Navigation

Topic	Reference
Installation	installation.md
Voice Cloning	cloning.md
Voice Control	voice-control.md
API Reference	api.md

When to Use

Text-to-speech audio generation
Voice cloning from 5-15 seconds of audio
Emotion-controlled speech ([happy], [sad], etc.)
Word/phoneme timestamps for lip sync
Custom pronunciation with IPA
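
Since cloning expects 5-15 seconds of audio, it can help to check a sample's length before uploading it. The following is a minimal sketch using only the standard library; it assumes the sample is an uncompressed PCM WAV file, and the function names are illustrative rather than part of any Inworld SDK.

```python
import wave

MIN_SECONDS = 5.0   # lower bound from the "5-15 seconds" guidance
MAX_SECONDS = 15.0  # upper bound from the same guidance


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_clone_sample(path: str) -> bool:
    """True if the sample length falls inside the 5-15 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

Other containers (MP3, OGG, and so on) would need a third-party decoder to measure; this sketch does not cover them.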

Models

Model	ID	Latency	Price
TTS 1.5 Max	`inworld-tts-1.5-max`	~200ms	$10/1M chars
TTS 1.5 Mini	`inworld-tts-1.5-mini`	~120ms	$5/1M chars
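
To compare the two models on price, the per-character rates in the table can be turned into a rough estimate. This sketch assumes billing is proportional to `len(text)`; whether markup tags or whitespace count toward billed characters is not stated here, so treat the result as approximate. The helper name is hypothetical.

```python
# Prices per 1M characters, copied from the Models table.
PRICE_PER_MILLION_CHARS_USD = {
    "inworld-tts-1.5-max": 10.0,
    "inworld-tts-1.5-mini": 5.0,
}


def estimate_cost_usd(text: str, model_id: str) -> float:
    """Estimate synthesis cost, assuming billing is by len(text)."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS_USD[model_id]
```

For example, 200,000 characters on `inworld-tts-1.5-mini` comes to about $1 under this assumption, and the same text on `inworld-tts-1.5-max` to about $2.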

Minimal Example

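import base64
import os

import requests

# os.environ raises KeyError if the key is unset, instead of sending "Basic None".
api_key = os.environ["INWORLD_API_KEY"]

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": f"Basic {api_key}"},
    json={"text": "Hello!", "voiceId": "Ashley", "modelId": "inworld-tts-1.5-max"},
    timeout=30,
)
response.raise_for_status()  # fail on HTTP errors before reading the body
audio = base64.b64decode(response.json()["audioContent"])  # raw audio bytes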
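
The example stops once the audio is decoded. A small sketch of the next step, writing the bytes to disk with a clearer error when the response has no `audioContent` field, might look like the following. The helper names are hypothetical, and the audio container format is not specified here, so the file extension is left to the caller.

```python
import base64
from pathlib import Path


def decode_audio(response_json: dict) -> bytes:
    """Extract and decode the base64 `audioContent` field of a response body."""
    try:
        encoded = response_json["audioContent"]
    except KeyError:
        raise ValueError(f"no audioContent in response: {response_json!r}") from None
    return base64.b64decode(encoded)


def save_audio(response_json: dict, path: str) -> int:
    """Write the decoded audio to `path` and return the number of bytes written."""
    return Path(path).write_bytes(decode_audio(response_json))
```

With the `response` object from the minimal example, this would be called as `save_audio(response.json(), "hello.bin")`.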

Key Features

15 languages — en, zh, ja, ko, ru, it, es, pt, fr, de, pl, nl, hi, he, ar
Instant cloning — 5-15 seconds audio, no training
Audio markups — [happy], [laughing], [sigh] (English only)
Timestamps — word, phoneme, viseme timing for lip sync
Streaming — /voice:stream endpoint
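
For the streaming endpoint, a client has to decode audio incrementally instead of reading one JSON body. The sketch below rests on an assumption this document does not confirm: that the stream is newline-delimited JSON where each line carries a base64 chunk under `result.audioContent`. Check the API reference before relying on that shape. The function name is illustrative.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_audio_chunks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decoded audio bytes from each non-empty JSON line.

    ASSUMPTION: each line looks like {"result": {"audioContent": "<base64>"}}.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        message = json.loads(line)
        encoded = message.get("result", {}).get("audioContent")
        if encoded:
            yield base64.b64decode(encoded)
```

With `requests`, the input could come from `response.iter_lines()` on a `requests.post(..., stream=True)` call to the `/voice:stream` path, using the same headers and JSON body as the minimal example.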

Prohibitions

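Do not use audio markups in non-English text; they work only in English
Use only ONE emotion markup, placed at the beginning of the text
Match the voice's language to the text's language
Do not rely on instant cloning for children's voices or unique accents; it may not work for them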
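
The markup rules in this list can be checked before a request is sent. The following is a rough sketch, not an official validator: it recognizes only the emotion tags named in this document ([happy], [sad]), treats any other bracketed tag as a non-emotion markup, and assumes the caller knows the text's language code. All names are hypothetical.

```python
import re

# Only the emotion tags named in this document; the full set is not listed here.
KNOWN_EMOTION_TAGS = {"happy", "sad"}
MARKUP_PATTERN = re.compile(r"\[([a-z_]+)\]")


def check_markups(text: str, language: str = "en") -> list[str]:
    """Return a list of rule violations (an empty list means the text passes)."""
    problems = []
    tags = MARKUP_PATTERN.findall(text)
    if tags and language != "en":
        problems.append("audio markups work only in English")
    emotions = [tag for tag in tags if tag in KNOWN_EMOTION_TAGS]
    if len(emotions) > 1:
        problems.append("use only one emotion markup")
    if emotions and not text.lstrip().startswith(f"[{emotions[0]}]"):
        problems.append("emotion markup must be at the beginning of the text")
    return problems
```

For instance, `check_markups("Hello [happy] there")` reports that the emotion markup is not at the beginning, while `check_markups("[happy] Hello there!")` returns an empty list.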
