text-to-speech
Text-to-Speech Skill ⭐ Flagship
Domain: AI Accessibility & Communication
Inheritance: inheritable (promote to Master Alex for all heirs)
Version: 2.5.0
Last Updated: 2026-02-09
Author: Alex (Master Alex)
Status: ⭐ Flagship Skill - Core Alex capability
Why This is a Flagship Skill
Text-to-Speech gives Alex a voice. This transforms Alex from a text-only assistant into a multimodal companion that can:
- Read documents aloud while you walk, drive, or rest your eyes
- Proofread by ear - catch errors your eyes miss
- Accessibility - full document access for vision-impaired users
- Rehearsal - practice presentations with natural-sounding narration
- Export knowledge - create MP3s for offline learning
Zero cost, no external runtime - uses Microsoft Edge TTS (free, no API key) from native TypeScript, so there is no Python or MCP server to install.
User Experience
🎯 Quick Start: Read Any Document
Keyboard shortcut (fastest):
- Open any document in VS Code
- (Optional) Select specific text to read only that portion
- Press Ctrl+Alt+R (Windows/Linux) or Cmd+Alt+R (macOS)
- Audio begins playing through the webview player
Command palette:
Ctrl+Shift+P → "Alex: Read Aloud"
Status Bar Feedback
The status bar shows real-time progress during TTS operations:
| State | Display | Click Action |
|---|---|---|
| Connecting | $(loading~spin) Connecting... | - |
| Synthesizing | $(loading~spin) Synthesizing... | - |
| Streaming | $(loading~spin) Receiving... 45KB | - |
| Playing | $(unmute) Playing 35% | Stop |
| Paused | $(unmute) Paused | Stop |
🎵 Webview Audio Player
A sleek panel opens with full playback controls:
┌─────────────────────────────────────────────────────────┐
│  Alex TTS Player                                    [×] │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ▶️  ⏹️  ━━━━━━━━━━━━━━━━━━━━━━  1:23 / 4:56            │
│                                                         │
│  🔊 ━━━━━━━━━━━━━━━━━                                   │
│                                                         │
└─────────────────────────────────────────────────────────┘
Features:
- Progress bar with scrubbing (click/drag to seek)
- Play/Pause button - toggle playback
- Stop button - ends playback and closes panel
- Volume slider - adjust playback volume
- Time display - current position / total duration
- Auto-close - panel closes when playback ends
🎤 Voice Selection
Choose Alex's voice before reading:
- Ctrl+Shift+P → "Alex: Read with Voice Selection"
- Quick pick appears:
| Voice | Character | Best For |
|---|---|---|
| Default (GuyNeural) | Professional, clear | Technical docs, code review |
| Warm (ChristopherNeural) | Friendly, conversational | Tutorials, READMEs |
| British (RyanNeural) | Authoritative | Formal documents, presentations |
| Friendly (DavisNeural) | Casual, approachable | Chat logs, informal content |
- Select voice → reading begins immediately
💾 Save as MP3
Export any document to audio file:
- Ctrl+Shift+P → "Alex: Save as Audio"
- Save dialog opens (default name based on document)
- Progress notification shows synthesis progress
- Success notification with options:
  - Open File - plays in default audio player
  - Open Folder - reveals in file explorer
Use cases:
- Create podcasts from documentation
- Generate audio for offline learning
- Archive presentations as audio
⏹️ Stop Reading
Multiple ways to stop playback:
- Click status bar (shows $(unmute) icon during playback)
- Press Escape when reading
- Click stop button in webview player
- Close webview panel
- Ctrl+Shift+P → "Alex: Stop Reading"
Smart Markdown Processing
Alex automatically strips markdown formatting for natural speech:
| You Write | Alex Reads |
|---|---|
| `# Heading` | "Heading." (pause) |
| `**bold text**` | "bold text" (slight emphasis) |
| `[link text](url)` | "link text" |
| `` `code` `` | "code" |
| `> blockquote` | "Quote: ..." |
| `---` | (long pause) |
Symbol conversion:
| Symbol | Spoken As |
|---|---|
| `~5 minutes` | "about 5 minutes" |
| `50%` | "50 percent" |
| `A → B` | "A leads to B" |
| `±5%` | "plus or minus 5 percent" |
For Master Alex (Promotion Notes)
This skill gives Alex a voice. Version 2.0 uses a native TypeScript WebSocket integration with Microsoft Edge TTS, which removes the earlier Python and MCP server dependencies, and reads documents aloud with natural-sounding neural voices.
Version 2.0 Changes:
- Native TypeScript implementation (no Python/MCP dependencies)
- Direct WebSocket connection to Edge TTS endpoint
- Webview-based audio player (cross-platform)
- Integrated as VS Code commands
- Status bar progress feedback
Why promote to Master:
- Universal utility across all projects
- Zero-cost implementation (uses free Edge TTS API)
- No external dependencies (Python, MCP server)
- Accessibility benefits for vision-impaired users
- Integrated into VS Code extension
Dependencies (v2.0):
- ws npm package (WebSocket client)
- VS Code webview API (for audio playback)
Overview
Alex's voice synthesis capability using Microsoft Edge TTS via native TypeScript. Enables reading markdown documents, code files, and text aloud with natural-sounding voices. Fully integrated into the VS Code extension.
Architecture (v2.0)
┌─────────────────────────────────────────────────────────────┐
│                   Alex VS Code Extension                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Commands:                                                  │
│  • Alex: Read Aloud (Ctrl+Alt+R)                            │
│  • Alex: Read with Voice Selection                          │
│  • Alex: Save as Audio                                      │
│  • Alex: Stop Reading                                       │
│                    │                                        │
│                    ▼                                        │
│  ┌─────────────────────────────────────────────┐            │
│  │                ttsService.ts                │            │
│  │    Native WebSocket to Edge TTS             │            │
│  │    • SSML generation                        │            │
│  │    • Markdown stripping                     │            │
│  │    • Progress callbacks                     │            │
│  └─────────────────┬───────────────────────────┘            │
│                    │                                        │
│                    ▼                                        │
│  ┌─────────────────────────────────────────────┐            │
│  │               audioPlayer.ts                │            │
│  │    Webview-based playback                   │            │
│  │    • Cross-platform HTML5 Audio             │            │
│  │    • Play/pause/stop controls               │            │
│  │    • Progress tracking                      │            │
│  └─────────────────────────────────────────────┘            │
│                                                             │
└──────────────────────┬──────────────────────────────────────┘
                       │ WebSocket (wss://)
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                 Microsoft Edge TTS Endpoint                 │
│     wss://speech.platform.bing.com/consumer/speech/...      │
├─────────────────────────────────────────────────────────────┤
│  • 400+ neural voices, 90+ languages                        │
│  • Free, no API key required                                │
│  • MP3 output (24kHz, 48kbps)                               │
│  • SSML support for prosody control                         │
└─────────────────────────────────────────────────────────────┘
Alex Voice Presets
| Preset | Voice ID | Character |
|---|---|---|
| Default | en-US-GuyNeural | Professional male, clear articulation |
| Warm | en-US-ChristopherNeural | Friendly, conversational |
| British | en-GB-RyanNeural | British accent, authoritative |
| Friendly | en-US-DavisNeural | Casual, approachable |
Voice Selection Rationale
Alex's default voice (GuyNeural) was chosen for:
- Clarity: Excellent pronunciation of technical terms
- Neutrality: Not too formal, not too casual
- Distinctiveness: Recognizable as "Alex's voice"
- Consistency: Same voice across all platforms
VS Code Commands
Alex: Read Aloud
Command: alex.readAloud
Keybinding: Ctrl+Alt+R (Windows/Linux), Cmd+Alt+R (macOS)
Reads the current selection or entire document aloud using Alex's default voice.
Behavior:
- If text is selected, reads only the selection
- If no selection, reads the entire document from top
- If no editor open (v2.1.0): offers "Paste from Clipboard" or "Type Text" options
- Markdown files are stripped of formatting for natural speech
- Pasted/typed text is auto-detected for markdown patterns
- Progress shown in status bar
- Click status bar to stop playback
Alex: Read with Voice Selection
Command: alex.readWithVoice
Quick pick to select a voice preset before reading.
Alex: Save as Audio
Command: alex.saveAsAudio
Generate and save speech to an MP3 file. Opens a save dialog for output location.
Alex: Stop Reading
Command: alex.stopReading
Keybinding: Escape (when reading)
Immediately stops current playback.
Implementation Details
Core Files (src/tts/)
| File | Purpose |
|---|---|
| `ttsService.ts` | WebSocket connection, SSML generation, synthesis |
| `audioPlayer.ts` | Webview panel, playback controls, system fallback |
| `index.ts` | Module exports |
Text Preprocessing
The prepareTextForSpeech() function strips markdown:
| Markdown | Speech Output |
|---|---|
| `# Heading` | "Heading." (pause) |
| `**bold**` | "bold" (emphasis via prosody) |
| `*italic*` | "italic" |
| `` `code` `` | "code" |
| `[link](url)` | "link" |
| `- item` | "Item." |
| `> quote` | "Quote: ..." |
| `---` | (long pause) |
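The rules in the table above can be sketched as a chain of regular-expression replacements. This is only an illustration: the real prepareTextForSpeech() is not shown in this document, and the function name `stripMarkdownForSpeech` and the rule order are assumptions.

```typescript
// Hypothetical sketch; not the extension's actual prepareTextForSpeech().
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .split("\n")
    .map((line: string): string => {
      // A horizontal rule carries no words; it becomes an empty line (a pause).
      if (/^\s*-{3,}\s*$/.test(line)) return "";
      return line
        .replace(/^#{1,6}\s+(.*)$/, "$1.") // "# Heading" -> "Heading."
        .replace(/^>\s?(.*)$/, "Quote: $1") // "> quote" -> "Quote: quote"
        .replace(/^\s*-\s+(.*)$/, (_m: string, item: string) =>
          item.charAt(0).toUpperCase() + item.slice(1) + ".") // "- item" -> "Item."
        .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // [link](url) -> link
        .replace(/\*\*([^*]+)\*\*/g, "$1") // **bold** -> bold
        .replace(/\*([^*]+)\*/g, "$1") // *italic* -> italic
        .replace(/`([^`]+)`/g, "$1"); // `code` -> code
    })
    .join("\n");
}
```

Bold is stripped before italic so that `**bold**` is not misread as two italic markers. Emphasis and pause handling via prosody is left out of this sketch.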
Code Block Handling
```python
def hello():
    print("Hello")
```
Becomes: "Python code block. Definition hello. Print hello. End code block."
### Symbol-to-Speech Transformations
Symbols are converted to natural speech equivalents:
| Symbol | Spoken As | Example |
|--------|-----------|--------|
| `~` | "approximately" or "about" | ~2 min → "about 2 minutes" |
| `&` | "and" | A & B → "A and B" |
| `@` | "at" | user@email → "user at email" |
| `%` | "percent" | 50% → "50 percent" |
| `+` | "plus" | +10% → "plus 10 percent" |
| `→` | "leads to" or "becomes" | A → B → "A becomes B" |
| `—` | (pause) | word—word → "word (pause) word" |
| `#` | (context-dependent) | #1 → "number 1"; ## → (heading marker) |
| `<` / `>` | "less than" / "greater than" | x > 5 → "x greater than 5" |
| `≥` / `≤` | "greater than or equal" / "less than or equal" | |
| `µ` | "micro" | µg → "microgram" |
| `°` | "degrees" | 37°C → "37 degrees Celsius" |
| `±` | "plus or minus" | ±5% → "plus or minus 5 percent" |
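A few of these transformations can be sketched as an ordered list of pattern/replacement pairs. The names `SYMBOL_RULES` and `speakSymbols`, the subset of symbols covered, and the rule order are all assumptions made for illustration.

```typescript
// Illustrative subset of the symbol table; names and ordering are assumptions.
const SYMBOL_RULES: Array<[RegExp, string]> = [
  [/±\s*/g, "plus or minus "], // handled before "%" so "±5%" reads naturally
  [/~(?=\d)/g, "about "], // only when "~" precedes a number
  [/(\d)\s*°C\b/g, "$1 degrees Celsius"],
  [/(\d)\s*%/g, "$1 percent"],
  [/\s*→\s*/g, " leads to "],
  [/\s*&\s*/g, " and "],
];

function speakSymbols(text: string): string {
  // Apply each rule in order to the output of the previous one.
  return SYMBOL_RULES.reduce(
    (spoken, [pattern, replacement]) => spoken.replace(pattern, replacement),
    text
  );
}
```

Context-dependent symbols such as `#` or `<` / `>` need more than a flat rule list, so they are left out here.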
### Time Duration Patterns (v2.1.0)
| Input | Spoken As |
|-------|----------|
| `4h` | "4 hours" |
| `30m` | "30 minutes" |
| `15s` | "15 seconds" |
| `2d` | "2 days" |
| `1w` | "1 week" |
| `90min` | "90 minutes" |
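One possible way to expand these duration shorthands is a single regular expression plus a unit lookup. `UNIT_NAMES` and `speakDurations` are hypothetical names, and the singular/plural handling is an assumption based on the "1 week" row.

```typescript
// Hypothetical sketch of duration expansion; not the extension's actual code.
const UNIT_NAMES: Record<string, string> = {
  s: "second",
  m: "minute",
  min: "minute",
  h: "hour",
  d: "day",
  w: "week",
};

function speakDurations(text: string): string {
  // "min" is listed before the single letters so "90min" is not read as "90m" + "in".
  return text.replace(
    /\b(\d+)(min|[smhdw])\b/g,
    (_match: string, count: string, unit: string): string => {
      const name = UNIT_NAMES[unit] ?? unit;
      return count + " " + name + (count === "1" ? "" : "s");
    }
  );
}
```

A real implementation would probably need extra context checks, since a bare `5m` could also mean five meters.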
### Emoji Pronunciation (v2.1.0)
| Emoji | Spoken As | Context |
|-------|-----------|--------|
| ✅ | "completed" | Status indicators |
| ❌ | "not done" | Status indicators |
| ⚠️ | "warning" | Alerts |
| 📋 | "planned" | Task status |
| 🔄 | "in progress" | Task status |
| ⏳ | "waiting" | Task status |
| 🔥 | "hot" or "high priority" | When followed by "High" |
| 🔓 | "unlocked" | Feature status |
| 💡 | "idea" | Suggestions |
| 🆕 | "new" | Version notes |
**Emoji-Text Deduplication**: When an emoji's meaning matches the text that follows it (e.g., `✅ Complete`), Alex says it only once ("completed", not "completed Complete").
### Table Reading (v2.1.0)
Markdown tables are converted to natural speech:
```markdown
| Name  | Status    |
|-------|-----------|
| Alice | ✅ Done   |
| Bob   | 🔄 Active |
```
Becomes: "Table with 2 columns: Name, Status. Row 1: Name is Alice. Status is completed. Row 2: Name is Bob. Status is in progress."
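The spoken form above can be produced by pairing each cell with its column header. The sketch below assumes emoji have already been converted to words by an earlier pass; `tableToSpeech` is a hypothetical name.

```typescript
// Hypothetical sketch: converts a simple markdown table into spoken sentences.
function tableToSpeech(markdownTable: string): string {
  const rows: string[][] = markdownTable
    .trim()
    .split("\n")
    .map((line) =>
      line
        .trim()
        .replace(/^\||\|$/g, "") // drop the outer pipes
        .split("|")
        .map((cell) => cell.trim())
    );
  const [header, , ...body] = rows; // second row is the |---| separator
  if (!header) return "";
  const sentences = body.map(
    (cells, i) =>
      "Row " + (i + 1) + ": " +
      header.map((name, col) => name + " is " + (cells[col] ?? "") + ".").join(" ")
  );
  const intro = "Table with " + header.length + " columns: " + header.join(", ") + ".";
  return [intro, ...sentences].join(" ");
}
```

Escaped pipes and inline code containing `|` are not handled in this sketch.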
Version Pattern Intelligence (v2.1.0)
Versions are spoken naturally with context awareness:
| Input | Spoken As | Why |
|---|---|---|
| `v4.2.9` | "version 4.2.9" | Standalone version |
| `Version: v4.2.9` | "Version: 4.2.9" | Already has "Version:" prefix |
The pattern uses a negative lookbehind so that text already prefixed with "Version:" is not read as "version version".
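A minimal sketch of the lookbehind idea follows. The exact pattern used by the extension is not shown in this document, so `VERSION_PATTERN` and `speakVersions` are assumptions. The pattern is built with the RegExp constructor and needs a runtime that supports lookbehind, such as the Node.js version bundled with current VS Code.

```typescript
// Hypothetical pattern: a "v" followed by a dotted number that is NOT
// already preceded by the word "version" (with optional colon and spaces).
const VERSION_PATTERN = new RegExp("(?<!version:?\\s*)\\bv(\\d+(?:\\.\\d+)+)", "gi");

function speakVersions(text: string): string {
  return text
    .replace(VERSION_PATTERN, "version $1") // "v4.2.9" -> "version 4.2.9"
    .replace(/\b(version:?\s*)v(?=\d)/gi, "$1"); // "Version: v4.2.9" -> "Version: 4.2.9"
}
```

The second replacement only drops the leading "v" when the prefix is already present, which keeps the original capitalization of "Version:".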
Design Principle: Would a human reading this aloud say the symbol name, or translate it to meaning? Almost always the latter.
Reliability & Long Content Handling (v2.1.0)
The Problem
Edge TTS has undocumented size limits per WebSocket request. Documents over ~3000 characters (roughly 3 to 4 minutes of audio at about 150 words per minute) can cause the connection to stall indefinitely, so the status bar appears to hang at "Synthesizing..." with no progress.
The Solution: Chunking with Retry
Chunking Strategy:
| Setting | Value | Rationale |
|---|---|---|
| `MAX_CHUNK_CHARS` | 3000 | Safe limit before Edge TTS stalls |
| `CHUNK_TIMEOUT_MS` | 60000 | 60 seconds per chunk |
| `MAX_RETRIES` | 3 | Retry failed chunks |
Chunk Splitting Logic:
- Split at paragraph boundaries (\n\n) first
- If still too long, split at sentence boundaries (. or ! or ?)
- Progress displayed as Synthesizing speech [n/N]...
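The splitting logic can be sketched as two passes: paragraphs first, then sentences for any paragraph that is still too long. `MAX_CHUNK_CHARS` comes from the table above. The helper names `pack` and `splitIntoChunks` and the greedy packing strategy are assumptions.

```typescript
const MAX_CHUNK_CHARS = 3000;

// Greedily joins pieces into chunks no longer than `limit`.
// A single piece longer than `limit` is kept whole (not split further here).
function pack(pieces: string[], limit: number, joiner: string): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + joiner + piece : piece;
    if (candidate.length <= limit || !current) {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

function splitIntoChunks(text: string, limit: number = MAX_CHUNK_CHARS): string[] {
  const pieces: string[] = [];
  for (const paragraph of text.split(/\n{2,}/)) {
    const trimmed = paragraph.trim();
    if (trimmed.length === 0) continue;
    if (trimmed.length <= limit) {
      pieces.push(trimmed);
    } else {
      // Paragraph too long: fall back to sentence boundaries (. ! ?).
      const sentences = trimmed.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [trimmed];
      pieces.push(...pack(sentences, limit, ""));
    }
  }
  return pack(pieces, limit, "\n\n");
}
```

The length of the returned array gives the N in the `[n/N]` progress display.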
Retry with Exponential Backoff:
| Attempt | Delay | Formula |
|---|---|---|
| 1 | ~1s | 1000 + jitter |
| 2 | ~2s | 2000 + jitter |
| 3 | ~4s | 4000 + jitter |
Jitter (0-500ms random) prevents thundering herd on concurrent requests.
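The delay schedule in the table corresponds to doubling a 1-second base and adding random jitter. The sketch below is one way to express that. `BASE_DELAY_MS`, `MAX_JITTER_MS`, `backoffDelayMs`, and `withRetry` are hypothetical names, and it assumes `MAX_RETRIES` counts retries after the initial attempt.

```typescript
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000; // assumed name; the table implies a 1 s base
const MAX_JITTER_MS = 500; // assumed name; jitter is 0-500 ms

// Delay before retry number `attempt` (1-based): 1000, 2000, 4000 ms plus jitter.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  return BASE_DELAY_MS * Math.pow(2, attempt - 1) + Math.floor(random() * MAX_JITTER_MS);
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs `task`, retrying up to MAX_RETRIES times after a failure.
async function withRetry<T>(task: () => Promise<T>): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt > MAX_RETRIES) throw error; // retries exhausted
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Passing `random` as a parameter keeps the delay calculation deterministic in tests.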
Long Content Summarization
For documents over 5 minutes (~750 words), Alex offers to summarize before reading:
This document is approximately 32 minutes long (~4800 words).
Would you like to:
- Read full content (~32 min)
- Summarize for speech (~3 min) (Recommended)
Summarization uses the VS Code Language Model API (GPT-4o preferred) with a target of ~450 words (~3 minutes).
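The figures above (750 words for 5 minutes, 4800 words for 32 minutes, 450 words for 3 minutes) all imply a speaking rate of about 150 words per minute. A sketch of the estimate under that assumption follows; the constant and function names are hypothetical.

```typescript
const WORDS_PER_MINUTE = 150; // implied by "5 minutes (~750 words)"
const SUMMARY_THRESHOLD_MINUTES = 5;

function countWords(text: string): number {
  const words = text.trim().split(/\s+/);
  return words[0] === "" ? 0 : words.length; // empty input yields [""]
}

function estimateMinutes(wordCount: number): number {
  return wordCount / WORDS_PER_MINUTE;
}

// True when the estimated reading time exceeds the 5-minute threshold.
function shouldOfferSummary(text: string): boolean {
  return estimateMinutes(countWords(text)) > SUMMARY_THRESHOLD_MINUTES;
}
```

For example, `estimateMinutes(4800)` gives 32, which matches the prompt shown above.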
Speaker Warmup Delay
Bluetooth and USB speakers often need time to "wake up" from power-saving mode. A 2-second delay before playback starts ensures the first words aren't clipped:
const SPEAKER_WARMUP_MS = 2000;
// Status shows "Preparing speakers..." during delay
Installation (v2.0)
TTS v2 is built into the Alex VS Code extension. No separate installation required.
Package Dependencies
The extension automatically includes:
- ws (WebSocket client for Edge TTS connection)
- fs-extra (file operations for audio saving)
Verification
After extension update, verify TTS works:
- Open any document
- Press Ctrl+Alt+R (Windows/Linux) or Cmd+Alt+R (macOS)
- Status bar should show "$(loading~spin) Synthesizing..."
- Audio should play through webview panel
Usage Patterns
Read Current Document
Press Ctrl+Alt+R to read document aloud
Select text first to read only selection
Generate Audio File
Command Palette → "Alex: Save as Audio"
Choose output location → MP3 saved
Voice Customization
Command Palette → "Alex: Read with Voice Selection"
Choose: Default | Warm | British | Friendly
Edge TTS Technical Reference
WebSocket Endpoint
wss://speech.platform.bing.com/consumer/speech/synthesize/readaloud/edge/v1
?TrustedClientToken=6A5AA1D4EAFF4E9FB37E23D68491D6F4
&ConnectionId=[UUID]
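Assembling the URL is a matter of appending the two query parameters shown above. In this sketch the caller supplies the connection ID (for example a freshly generated UUID). `buildEdgeTtsUrl` is a hypothetical helper, and the exact ID format the service expects is not specified in this document.

```typescript
const EDGE_TTS_ENDPOINT =
  "wss://speech.platform.bing.com/consumer/speech/synthesize/readaloud/edge/v1";
const TRUSTED_CLIENT_TOKEN = "6A5AA1D4EAFF4E9FB37E23D68491D6F4";

// Hypothetical helper: builds the WebSocket URL for one synthesis connection.
function buildEdgeTtsUrl(connectionId: string): string {
  return (
    EDGE_TTS_ENDPOINT +
    "?TrustedClientToken=" + encodeURIComponent(TRUSTED_CLIENT_TOKEN) +
    "&ConnectionId=" + encodeURIComponent(connectionId)
  );
}
```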
Audio Format
- Codec: MP3
- Sample Rate: 24kHz
- Bitrate: 48kbps mono
SSML Template
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-GuyNeural">
<prosody rate="+0%" pitch="+0Hz" volume="+0%">
Text content here
</prosody>
</voice>
</speak>
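Filling in this template programmatically mainly requires escaping the text so that characters such as `&` and `<` do not break the XML. The sketch below assumes the voice and language values come from trusted presets; `SsmlOptions`, `escapeXml`, and `buildSsml` are hypothetical names.

```typescript
interface SsmlOptions {
  voice: string; // e.g. "en-US-GuyNeural"
  lang: string; // e.g. "en-US"
  rate?: string;
  pitch?: string;
  volume?: string;
}

// Escapes the five XML special characters; "&" must be replaced first.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildSsml(text: string, options: SsmlOptions): string {
  const { voice, lang, rate = "+0%", pitch = "+0Hz", volume = "+0%" } = options;
  return [
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${lang}">`,
    `<voice name="${voice}">`,
    `<prosody rate="${rate}" pitch="${pitch}" volume="${volume}">`,
    escapeXml(text),
    "</prosody>",
    "</voice>",
    "</speak>",
  ].join("");
}
```

The defaults mirror the neutral prosody values in the template.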
Multi-Language Support (v2.1.0)
Alex automatically detects the language of your text and selects an appropriate voice.
Detection Strategy:
| Detection Type | Languages | Method |
|---|---|---|
| Character-based | Chinese, Japanese, Korean, Arabic, Hebrew, Thai, Hindi, Russian, Greek, Vietnamese | Script/Unicode ranges |
| Word-pattern | Spanish, French, German, Portuguese, Italian, Dutch, Polish, Swedish, Norwegian, Finnish, Danish, Turkish, Indonesian, Malay, Tagalog, Romanian, Czech, Hungarian | Common word markers |
User Prompt Fallback:
- If detection confidence < 15%, Alex prompts you to select the language
- Quick pick shows top language options plus "Other (English default)"
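Character-based detection can be sketched by counting how many characters fall into each script's Unicode range. The range list below covers only some of the scripts named above. It also assumes that confidence is the share of non-whitespace characters in the winning script, which this document does not define. `SCRIPT_RANGES` and `detectByScript` are hypothetical names.

```typescript
const SCRIPT_RANGES: Array<{ locale: string; pattern: RegExp }> = [
  { locale: "ja-JP", pattern: /[\u3040-\u30ff]/g }, // Hiragana and Katakana
  { locale: "zh-CN", pattern: /[\u4e00-\u9fff]/g }, // CJK Unified Ideographs
  { locale: "ko-KR", pattern: /[\uac00-\ud7af]/g }, // Hangul syllables
  { locale: "ru-RU", pattern: /[\u0400-\u04ff]/g }, // Cyrillic
  { locale: "ar-SA", pattern: /[\u0600-\u06ff]/g }, // Arabic
  { locale: "th-TH", pattern: /[\u0e00-\u0e7f]/g }, // Thai
  { locale: "hi-IN", pattern: /[\u0900-\u097f]/g }, // Devanagari
];
const MIN_CONFIDENCE = 0.15; // below this, the user is prompted instead

function detectByScript(text: string): { locale: string; confidence: number } | null {
  const total = text.replace(/\s/g, "").length;
  if (total === 0) return null;
  let best: { locale: string; confidence: number } | null = null;
  for (const { locale, pattern } of SCRIPT_RANGES) {
    const confidence = (text.match(pattern) ?? []).length / total;
    if (confidence > (best?.confidence ?? 0)) best = { locale, confidence };
  }
  return best !== null && best.confidence >= MIN_CONFIDENCE ? best : null;
}
```

A `null` result would hand over to word-pattern detection or the language quick pick. Japanese text usually mixes kana with Han characters, so the larger share wins in this sketch.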
32 Supported Languages:
| Language | Voice | Locale |
|---|---|---|
| English (US) | GuyNeural | en-US |
| English (UK) | RyanNeural | en-GB |
| English (AU) | WilliamNeural | en-AU |
| Spanish | AlvaroNeural | es-ES |
| French | HenriNeural | fr-FR |
| German | ConradNeural | de-DE |
| Portuguese (BR) | AntonioNeural | pt-BR |
| Italian | DiegoNeural | it-IT |
| Dutch | MaartenNeural | nl-NL |
| Polish | MarekNeural | pl-PL |
| Russian | DmitryNeural | ru-RU |
| Japanese | KeitaNeural | ja-JP |
| Korean | InJoonNeural | ko-KR |
| Chinese (Mandarin) | YunxiNeural | zh-CN |
| Chinese (Taiwan) | YunJheNeural | zh-TW |
| Arabic | HamedNeural | ar-SA |
| Hindi | MadhurNeural | hi-IN |
| Vietnamese | NamMinhNeural | vi-VN |
| Thai | NiwatNeural | th-TH |
| Turkish | AhmetNeural | tr-TR |
| Swedish | MattiasNeural | sv-SE |
| Norwegian | FinnNeural | nb-NO |
| Danish | JeppeNeural | da-DK |
| Finnish | HarriNeural | fi-FI |
Accessibility Benefits
| Use Case | Benefit |
|---|---|
| Vision impaired | Full document access via audio |
| Multitasking | Review code while walking/driving |
| Learning | Auditory reinforcement of reading |
| Proofreading | Catch errors by hearing text |
| Long documents | Listen during breaks |
Version History
v2.5.0 (2026-02-09)
- Speak Prompt command - Generate content via LLM then read aloud ("read me a poem")
- Voice Mode summarization - Responses >750 words auto-summarized before reading
- Keyboard shortcuts - Ctrl+Alt+R/V/P + Escape context-aware stop
- Rich markdown tooltips - Status bar shows all shortcuts
- Emoji notifications - User-friendly feedback (for example ⚠️ and 💾)
- CSP-compliant webview - data-cmd pattern replaces inline onclick
- Enhanced quick picks - Cross-command navigation when no document open
- 35 unit tests - Language detection, voice selection, markdown processing
v2.4.0 (2026-02-06)
- Multi-language auto-detection (32 languages)
- Character-based detection for non-Latin scripts (CJK, Cyrillic, Arabic, etc.)
- Word-pattern detection for Latin-script languages
- User prompt fallback when detection confidence < 15%
- Dynamic SSML xml:lang attribute for optimal pronunciation
v2.0.0 (2026-02-05)
- Native TypeScript implementation
- Removed Python/MCP server dependencies
- Webview-based cross-platform audio player
- VS Code command integration
- Status bar progress feedback
v1.1.0 (2026-02-05)
- Added Alex voice presets
- Enhanced markdown stripping
- Symbol to speech conversion
v1.0.0 (2026-02-04)
- Initial implementation via MCP server
- Python edge-tts integration
- Basic markdown support
Synapses
- accessibility: Primary use case enabler
- vscode-extension-patterns: Extension command patterns, CSP-compliant webview
- markdown-mermaid: Source content processing
- academic-research: Document reading for research projects
- gamma-presentations: Audio playback of pitch content for rehearsal and delivery
- project-management: Stakeholder pitch presentations generated as audio files
- ux-design: Emoji feedback patterns, keyboard shortcut UX
- testing-strategies: Unit test patterns for language detection
Future Enhancements
| Feature | Status | Notes |
|---|---|---|
| Real-time streaming | Done | Chunked synthesis with progress |
| SSML support | Planned | Fine-grained prosody control |
| Section navigation | Planned | "Skip to next heading" |
| Bookmark resume | Planned | Resume from last position |
| Speed presets | Planned | 1x, 1.5x, 2x reading speeds |