gemini-tts
Gemini Text-to-Speech
Generate natural-sounding speech from text with Gemini's TTS models. The executable script supports multiple voices, multi-speaker conversations, and streaming for long content.
When to Use This Skill
Use this skill when you need to:
- Convert text to natural speech
- Create audio for podcasts, audiobooks, or videos
- Generate multi-speaker conversations
- Stream audio for long content
- Choose from multiple voice options
- Create accessible audio content
- Generate voiceovers for presentations
- Batch convert text to audio files
Available Scripts
scripts/tts.js
Purpose: Convert text to speech using Gemini TTS models
When to use:
- Any text-to-speech conversion
- Multi-speaker conversation generation
- Streaming audio for long texts
- Voiceovers for content creation
- Accessible audio generation
Key parameters:
| Parameter | Description | Example |
|---|---|---|
| `text` | Text to convert (required) | "Hello, world!" |
| `--voice`, `-v` | Voice name | Kore |
| `--output`, `-o` | Base name for output file | welcome |
| `--output-dir` | Output directory for audio | audio/ |
| `--no-timestamp` | Disable auto timestamp | Flag |
| `--model`, `-m` | TTS model | gemini-2.5-flash-preview-tts |
| `--stream`, `-s` | Enable streaming | Flag |
| `--speakers` | Multi-speaker mapping | "Joe:Kore,Jane:Puck" |
Output: WAV audio file path
Workflows
Workflow 1: Basic Text-to-Speech
node scripts/tts.js "Hello, world! Have a wonderful day."
- Best for: Quick audio generation, simple messages
- Voice: `Kore` (default, clear and professional)
- Output: `audio/tts_output_YYYYMMDD_HHMMSS.wav` (auto timestamp)
Workflow 2: Choose Different Voice
node scripts/tts.js "Welcome to our podcast about technology trends" --voice Puck --output welcome
- Best for: Friendly, conversational content
- Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Output: `audio/welcome_YYYYMMDD_HHMMSS.wav`
Workflow 3: Multi-Speaker Conversation
node scripts/tts.js "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
- Best for: Dialogues, interviews, role-playing content
- Format: Marked conversation with speaker names
- Script automatically routes text to appropriate voices
- Output: `audio/conversation_YYYYMMDD_HHMMSS.wav`
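Workflow 3 depends on every `Name:` label in the text having a matching entry in `--speakers`. A pre-flight check along the following lines could catch a missing mapping before calling the API. This is only a sketch: `findUnmappedSpeakers` is a hypothetical helper, not part of `scripts/tts.js`, and the label detection is a simple heuristic that treats any line of the form `Name: text` as a speaker turn.

```javascript
// Hypothetical helper: list speaker labels that appear in the transcript
// but have no entry in the "Speaker:Voice,Speaker2:Voice2" mapping.
function findUnmappedSpeakers(transcript, speakersSpec) {
  const mapped = new Set(
    speakersSpec.split(",").map((pair) => pair.split(":")[0].trim())
  );
  const unmapped = new Set();
  for (const line of transcript.split("\n")) {
    // A label is a name at the start of a line, a colon, then some text.
    // A line that ends with the colon (like the instruction line) is skipped.
    const match = line.match(/^\s*([A-Za-z][\w ]*):[ \t]*\S/);
    if (match && !mapped.has(match[1].trim())) {
      unmapped.add(match[1].trim());
    }
  }
  return [...unmapped];
}

const transcript = [
  "TTS the following conversation:",
  "Joe: How's it going today?",
  "Jane: Not too bad, how about you?",
].join("\n");

console.log(findUnmappedSpeakers(transcript, "Joe:Kore")); // [ 'Jane' ]
```

An empty result means every detected label has a voice assigned.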
Workflow 4: Long Content with Streaming
node scripts/tts.js "This is a very long text that would benefit from streaming..." --stream --output long-form
- Best for: Podcasts, audiobooks, long articles
- Streaming: Processes audio in chunks for long texts
- Output: `audio/long-form_YYYYMMDD_HHMMSS.wav`
Workflow 5: Professional Voiceover
node scripts/tts.js "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
- Best for: Corporate content, presentations, formal announcements
- Voice: `Charon` (deep, authoritative)
- Use when: Professional, serious tone required
Workflow 6: Custom Output Directory
node scripts/tts.js "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
- Best for: Organized project structures
- Directory created automatically if it doesn't exist
- Output: `./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav`
Workflow 7: Content Creation Pipeline (Text → Audio)
# 1. Generate script (gemini-text skill)
node skills/gemini-text/scripts/generate.js "Write a 2-minute podcast intro about sustainable energy"
# 2. Generate audio (this skill)
node scripts/tts.js "[Paste generated script]" --voice Fenrir --output podcast-intro
# 3. Use in video or podcast
- Best for: Podcasts, audiobooks, video narration
- Combines with: gemini-text for script generation
Workflow 8: Accessible Content
node scripts/tts.js "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
- Best for: Web accessibility, screen reader alternatives
- Voice: `Aoede` (melodic, pleasant)
- Use when: Making content accessible to visually impaired users
Workflow 9: Educational Content
node scripts/tts.js "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
- Best for: Educational materials, tutorials, e-learning
- Voice: `Zephyr` (light, airy)
- Combines well with: gemini-text for content generation
Workflow 10: Disable Timestamp
node scripts/tts.js "Fixed filename." --output my-audio --no-timestamp
- Best for: When you want complete control over filename
- Output: `audio/my-audio.wav` (no timestamp)
- Use when: Generating files for specific naming schemes
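The output names shown in the workflows above follow one pattern: base name, an optional `_YYYYMMDD_HHMMSS` suffix, then `.wav`, placed in the output directory. The sketch below reproduces that pattern for illustration. It is not the script's actual code: `formatTimestamp` and `buildOutputPath` are hypothetical names, and the use of local time rather than UTC is an assumption.

```javascript
const path = require("node:path");

// Format a Date as YYYYMMDD_HHMMSS (local time is an assumption).
function formatTimestamp(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}` +
    `_${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`
  );
}

// Hypothetical helper mirroring the documented naming pattern.
// Defaults follow the document: base "tts_output", directory "audio".
function buildOutputPath({
  base = "tts_output",
  dir = "audio",
  timestamp = true,
  now = new Date(),
} = {}) {
  const name = timestamp ? `${base}_${formatTimestamp(now)}` : base;
  return path.join(dir, `${name}.wav`);
}

console.log(buildOutputPath({ base: "welcome" }));
console.log(buildOutputPath({ base: "my-audio", timestamp: false }));
```

The second call corresponds to `--no-timestamp`, where the file name is fully determined by `--output` and repeated runs write to the same path.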
Parameters Reference
Model Selection
| Model | Quality | Speed | Best For |
|---|---|---|---|
| `gemini-2.5-flash-preview-tts` | Good | Fast | General use, high volume |
| `gemini-2.5-pro-preview-tts` | Higher | Slower | Premium content, voiceovers |
Voice Selection
| Voice | Characteristics | Best For |
|---|---|---|
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |
Audio Format
| Specification | Value |
|---|---|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
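To make the table concrete, the sketch below wraps raw PCM samples in a standard 44-byte WAV header using these values. It assumes the audio data is available as raw 16-bit little-endian PCM in a `Buffer`; `pcmToWav` is a hypothetical helper and may not match how `scripts/tts.js` writes its files.

```javascript
// Wrap raw PCM samples in a minimal RIFF/WAVE container.
// Defaults match the Audio Format table: 24000 Hz, mono, 16-bit.
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitDepth = 16 } = {}) {
  const blockAlign = channels * (bitDepth / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;     // bytes per second of audio
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4);     // size of everything after this field
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);                 // size of the fmt chunk
  header.writeUInt16LE(1, 20);                  // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40);         // size of the sample data
  return Buffer.concat([header, pcm]);
}

// One second of silence: 24000 samples x 2 bytes each.
const wav = pcmToWav(Buffer.alloc(24000 * 2));
console.log(wav.length); // 48044
```

At these settings one second of audio occupies 48,000 bytes, so a file's duration is roughly its size minus the header, divided by 48,000.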
Token Limits
| Limit | Type | Description |
|---|---|---|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |
Output Interpretation
Audio File
- Format: WAV (compatible with most players)
- Mono channel (single audio track)
- Sample rate: 24000 Hz (sufficient for clear speech; lower than the 44.1/48 kHz typical of music and broadcast audio)
- Can be converted to MP3/AAC if needed
Multi-Speaker Files
- Single WAV file with multiple voices
- Speakers are heard one after another in the same mono track, not on separate channels
- Use the `--speakers` parameter to map speakers to voices
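The `--speakers` value is a comma-separated list of `Speaker:Voice` pairs. As a rough illustration of how such a string can be turned into a lookup table, here is a minimal parser. `parseSpeakers` is a hypothetical function written for this document; the real script's parsing and error messages may differ.

```javascript
// Parse "Joe:Kore,Jane:Puck" into a Map of speaker name -> voice name.
// Throws when a pair is not exactly "Speaker:Voice".
function parseSpeakers(spec) {
  const mapping = new Map();
  for (const pair of spec.split(",")) {
    const parts = pair.split(":").map((part) => part.trim());
    if (parts.length !== 2 || !parts[0] || !parts[1]) {
      throw new Error(`Expected "Speaker:Voice", got "${pair}"`);
    }
    const [speaker, voice] = parts;
    mapping.set(speaker, voice);
  }
  return mapping;
}

const speakers = parseSpeakers("Joe:Kore,Jane:Puck");
console.log(speakers.get("Jane")); // Puck
```

Rejecting malformed pairs early gives a clearer error than letting an empty or misspelled mapping reach the API.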
Streaming Output
- Audio processed in chunks during generation
- Script shows "Streaming audio..." message
- Useful for very long texts or real-time applications
Common Issues
"google-genai not installed"
npm install @google/genai@latest dotenv@latest
"Voice name not found"
- Check voice name spelling
- Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Voice names are case-sensitive
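Because voice names are case-sensitive, a typo such as `kore` fails even though `Kore` is valid. The sketch below shows one way to detect a case-only mismatch and suggest the correct spelling. `checkVoice` is a hypothetical helper, and the list contains only the seven voices named in this document; `references/voices.md` may list others.

```javascript
// Voices named in this document (assumed not to be exhaustive).
const LISTED_VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat"];

// Exact match is required. When only the letter case differs,
// return the correctly cased name as a suggestion.
function checkVoice(name) {
  if (LISTED_VOICES.includes(name)) {
    return { ok: true, voice: name };
  }
  const match = LISTED_VOICES.find(
    (voice) => voice.toLowerCase() === name.toLowerCase()
  );
  return { ok: false, suggestion: match ?? null };
}

console.log(checkVoice("kore")); // { ok: false, suggestion: 'Kore' }
```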
"No audio generated"
- Check text is not empty
- Verify text doesn't exceed token limit (8,192)
- Try shorter text segments
- Check API quota limits
"Multi-speaker format error"
- Format: `SpeakerName:VoiceName,Speaker2:Voice2`
- Separate speakers with commas
- Use a colon between speaker and voice
- Example: `"Joe:Kore,Jane:Puck,Host:Charon"`
"Output file already exists"
- Script will overwrite existing files
- Change the `--output` filename to avoid conflicts
- Use unique names for batch generation
Audio quality issues
- Check input text for unusual characters
- Try different voice for better pronunciation
- Consider splitting long text into smaller segments
- Verify audio playback software compatibility
Best Practices
Voice Selection
- Kore: General purpose, clear articulation
- Puck: Conversational, engaging tone
- Charon: Professional, authoritative
- Fenrir: Emotional, storytelling
- Aoede: Soft, gentle for accessibility
- Zephyr: Educational, clear explanations
Text Preparation
- Use natural language and punctuation
- Include pauses with commas and periods
- Spell out difficult words if needed
- Break very long text into logical segments
- Add speaker labels for multi-speaker content
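Breaking long text into logical segments can be done at sentence boundaries so that no sentence is cut in half. The following sketch groups sentences into segments under a character budget. `splitIntoSegments` is a hypothetical helper, and the character limit is only a stand-in for the token limit, since the exact characters-per-token ratio is not stated here.

```javascript
// Group whole sentences into segments of at most maxChars characters.
// A single sentence longer than maxChars is kept intact as its own segment.
function splitIntoSegments(text, maxChars = 2000) {
  // Split after ., ! or ? when followed by whitespace.
  const sentences = text.trim().split(/(?<=[.!?])\s+/).filter(Boolean);
  const segments = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + 1 + sentence.length > maxChars) {
      segments.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) segments.push(current);
  return segments;
}

console.log(splitIntoSegments("One. Two. Three.", 9)); // [ 'One. Two.', 'Three.' ]
```

Each segment can then be passed to `scripts/tts.js` as a separate call with its own `--output` name.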
Performance Optimization
- Use streaming for very long texts
- Generate shorter segments for better control
- Use flash model for faster generation
- Batch process multiple files for efficiency
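Batch processing can be as simple as calling the script once per item. The sketch below builds the argument list for each job from the flags documented above and runs the jobs in sequence. `buildTtsArgs` and `runBatch` are hypothetical helpers; the example passes a dry-run runner that only prints each command, so nothing is sent to the API.

```javascript
const { execFileSync } = require("node:child_process");

// Build the argument list for one call to scripts/tts.js,
// using only flags documented in the parameter table.
function buildTtsArgs({ text, voice, output, outputDir }) {
  const args = ["scripts/tts.js", text];
  if (voice) args.push("--voice", voice);
  if (output) args.push("--output", output);
  if (outputDir) args.push("--output-dir", outputDir);
  return args;
}

// Run jobs one after another. The default runner invokes node for real;
// pass a custom runner to preview the commands or to test.
function runBatch(
  jobs,
  run = (args) => execFileSync("node", args, { stdio: "inherit" })
) {
  for (const job of jobs) {
    run(buildTtsArgs(job));
  }
}

const jobs = [
  { text: "Chapter one.", voice: "Fenrir", output: "chapter1" },
  { text: "Chapter two.", voice: "Fenrir", output: "chapter2" },
];

// Dry run: print each command instead of executing it.
runBatch(jobs, (args) => console.log("node", args.join(" ")));
```

Calling `runBatch(jobs)` without the second argument would execute the commands. Giving every job a distinct `output` value avoids overwriting files when `--no-timestamp` is in use.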
Quality Tips
- Test different voices for your content type
- Use appropriate pacing with punctuation
- Consider context when selecting voice
- Listen to output before final use
- Multi-speaker requires clear speaker labeling
Use Cases by Voice
| Voice | Ideal Use Cases |
|---|---|
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |
Related Skills
- gemini-text: Generate scripts and text for TTS
- gemini-image: Create visuals to accompany audio
- gemini-batch: Process multiple TTS requests efficiently
- gemini-files: Upload audio files for processing
Quick Reference
# Basic
node scripts/tts.js "Your text here"
# Custom voice
node scripts/tts.js "Your text" --voice Puck --output welcome
# Multi-speaker
node scripts/tts.js "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"
# Streaming
node scripts/tts.js "Long text..." --stream --output long-form
# Professional
node scripts/tts.js "Corporate announcement" --voice Charon
Reference
- See `references/voices.md` for complete voice documentation
- Get API key: https://aistudio.google.com/apikey
- Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
- Sample rate: all generated audio is 24000 Hz, mono, 16-bit PCM