ai-voice-chat
Voice Loop
Hands-free voice conversation: speak → Whisper transcribes (local) → OpenClaw streams response (SSE) → Kokoro speaks sentence-by-sentence (local).
Architecture
Microphone → Whisper STT (local, ~2s) → OpenClaw API (cloud, ~4-10s) → Kokoro TTS (local, <1s) → Speakers
Streaming TTS speaks each sentence as it arrives — first audio in ~3s, not 13s.
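The sentence-by-sentence behaviour described above can be sketched as a small generator that buffers streamed text and emits each sentence as soon as it is complete. This is an illustrative sketch, not the code in scripts/voice_loop.py: the name iter_sentences and the punctuation rule are assumptions.

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a sentence ends at . ! ? followed by whitespace,
# or directly after a CJK full stop / exclamation / question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text chunks (hypothetical helper)."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be handed to the TTS engine immediately, so playback can begin while the rest of the response is still streaming.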
Setup
Run the setup script to install dependencies and download models:
bash scripts/setup.sh
This creates a .venv, installs Python packages (numpy, sounddevice, soundfile, kokoro-onnx), and downloads Kokoro models (~136MB total).
Prerequisites
- macOS on Apple Silicon (M1–M4)
- Python 3.11+
- Whisper CLI: brew install openai-whisper
- OpenClaw running: openclaw gateway status
Token Storage (Recommended: macOS Keychain)
Store your OpenClaw API token securely in macOS Keychain instead of plaintext:
security add-generic-password -a "$USER" -s "voice-loop-openclaw-token" -w "YOUR_TOKEN_HERE" -U
The voice loop reads from Keychain automatically. To also set the session target:
security add-generic-password -a "$USER" -s "voice-loop-session-to" -w "+1XXXXXXXXXX" -U
Alternatively, set environment variables (VL_OPENCLAW_API_TOKEN, VL_OPENCLAW_SESSION_TO) — these take precedence over Keychain if both exist.
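The lookup order described above (environment variable first, then Keychain) might look roughly like the sketch below. The helper names resolve_setting and read_keychain are hypothetical; only the security find-generic-password command and the service names come from this document.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional


def read_keychain(service: str) -> Optional[str]:
    """Return the secret stored under `service`, or None if it is missing."""
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", service, "-w"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # The `security` CLI is unavailable (for example, not on macOS).
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def resolve_setting(
    env_name: str,
    service: str,
    env: Mapping[str, str] = os.environ,
    keychain: Callable[[str], Optional[str]] = read_keychain,
) -> Optional[str]:
    """Prefer the environment variable; fall back to the Keychain entry."""
    value = env.get(env_name, "").strip()
    if value:
        return value
    return keychain(service)
```

For example, resolve_setting("VL_OPENCLAW_API_TOKEN", "voice-loop-openclaw-token") would return the exported value if one is set and otherwise query Keychain.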
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| VL_OPENCLAW_API_TOKEN | Yes* | Keychain | OpenClaw API token (from openclaw gateway status). *Falls back to Keychain voice-loop-openclaw-token |
| VL_OPENCLAW_SESSION_TO | Yes* | Keychain | Target phone number or user ID. *Falls back to Keychain voice-loop-session-to |
| VL_OPENCLAW_API_URL | No | http://127.0.0.1:18789/v1/chat/completions | OpenClaw API endpoint (localhost only; remote endpoints are blocked) |
| VL_SILENCE_THRESHOLD | No | 0.015 | RMS level for silence detection |
| VL_SILENCE_DURATION | No | 1.2 | Seconds of silence before sending |
| VL_KOKORO_SPEED | No | 1.15 | TTS playback speed (1.0 = natural) |
| VL_DEFAULT_LANG | No | en | Starting language |
| VL_DEFAULT_GENDER | No | female | Voice gender (female/male) |
| VL_WHISPER_MODEL_EN | No | base.en | English Whisper model |
| VL_WHISPER_MODEL_MULTI | No | small | Multilingual Whisper model |
| VL_HEADSET_NAME | No | AirPods | Bluetooth device name substring to watch |
| VL_POLL_INTERVAL | No | 5 | Headset poll interval in seconds |
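To make VL_SILENCE_THRESHOLD and VL_SILENCE_DURATION concrete, here is a minimal sketch of RMS-based end-of-utterance detection. It assumes audio arrives as blocks of float samples in the range -1.0 to 1.0; the SilenceDetector class and the block_seconds parameter are hypothetical and may differ from what the script actually does.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio block."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class SilenceDetector:
    """Signals end of utterance after enough trailing silence (illustrative)."""

    def __init__(self, threshold: float = 0.015, duration: float = 1.2) -> None:
        self.threshold = threshold  # VL_SILENCE_THRESHOLD
        self.duration = duration    # VL_SILENCE_DURATION, in seconds
        self.heard_speech = False
        self.silent_for = 0.0

    def feed(self, samples: Sequence[float], block_seconds: float) -> bool:
        """Process one block; return True once the utterance is complete."""
        if rms(samples) >= self.threshold:
            # Speech resets the silence timer.
            self.heard_speech = True
            self.silent_for = 0.0
        elif self.heard_speech:
            # Only count silence that follows some speech.
            self.silent_for += block_seconds
        return self.heard_speech and self.silent_for >= self.duration
```

In this sketch, raising the threshold makes background noise count as silence, and lowering the duration ends the utterance sooner at the risk of cutting off a pause.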
Running
Manual start
export VL_OPENCLAW_API_TOKEN="your-token"
export VL_OPENCLAW_SESSION_TO="+1XXXXXXXXXX"
.venv/bin/python scripts/voice_loop.py
Auto-start on headset connect
.venv/bin/python scripts/airpods_watcher.py
The watcher polls audio devices every VL_POLL_INTERVAL seconds (5 by default). When a device whose name contains VL_HEADSET_NAME appears as an input device, it starts the voice loop. On disconnect, it stops the loop; if the loop crashes, it restarts it.
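The watcher's behaviour can be summarised as a decision made on every poll. The sketch below is an assumption about how that logic could be structured, not the contents of scripts/airpods_watcher.py; headset_connected and next_action are hypothetical names, and the case-insensitive match is a choice made for this example.

```python
from typing import Iterable, Optional


def headset_connected(input_device_names: Iterable[str], headset_name: str = "AirPods") -> bool:
    """True if any input device name contains the configured substring."""
    needle = headset_name.lower()  # case-insensitive matching is an assumption
    return any(needle in name.lower() for name in input_device_names)


def next_action(connected: bool, loop_running: bool) -> Optional[str]:
    """Decide what the watcher should do on this poll."""
    if connected and not loop_running:
        # Covers both the first connect and a restart after a crash.
        return "start"
    if not connected and loop_running:
        return "stop"
    return None  # nothing to change
```

Because a crashed loop simply shows up as "connected but not running", the same rule handles restarts without a separate code path.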
Auto-start on boot (launchd)
Create ~/Library/LaunchAgents/com.voice-loop.watcher.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.voice-loop.watcher</string>
<key>ProgramArguments</key>
<array>
<string>VENV_PYTHON_PATH</string>
<string>WATCHER_SCRIPT_PATH</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/voice-loop-watcher.log</string>
<key>StandardErrorPath</key>
<string>/tmp/voice-loop-watcher.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin</string>
</dict>
</dict>
</plist>
Replace VENV_PYTHON_PATH and WATCHER_SCRIPT_PATH. No tokens in the plist — they're read from Keychain at runtime. Then:
launchctl load ~/Library/LaunchAgents/com.voice-loop.watcher.plist
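If you would rather not edit the XML by hand, the same plist can be generated with Python's standard plistlib module. This is a convenience sketch; build_watcher_plist is a hypothetical helper, and the two paths are placeholders you supply.

```python
import plistlib


def build_watcher_plist(venv_python: str, watcher_script: str) -> bytes:
    """Return XML plist bytes equivalent to the launchd job shown above."""
    job = {
        "Label": "com.voice-loop.watcher",
        "ProgramArguments": [venv_python, watcher_script],
        "RunAtLoad": True,
        "KeepAlive": True,
        "StandardOutPath": "/tmp/voice-loop-watcher.log",
        "StandardErrorPath": "/tmp/voice-loop-watcher.log",
        # No tokens here: they are read from Keychain at runtime.
        "EnvironmentVariables": {
            "PATH": "/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin",
        },
    }
    return plistlib.dumps(job, sort_keys=False)
```

Write the returned bytes to ~/Library/LaunchAgents/com.voice-loop.watcher.plist, then load it with launchctl as shown above.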
Language Switching
Say any of these mid-conversation:
- Spanish: "switch to Spanish", "Spanish mode", "habla en español"
- French: "switch to French", "French mode", "parle en français"
- Japanese: "switch to Japanese", "Japanese mode"
- Chinese: "switch to Chinese", "Chinese mode", "speak Mandarin"
- Back to English: "back to English", "English mode", "stop Spanish"
On switch: Whisper model changes (base.en → small multilingual), Kokoro voice and language change, LLM prompt includes language context. A spoken confirmation plays in the target language.
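Detecting a switch request can be as simple as matching the transcript against the phrases listed above. The sketch below assumes two-letter language codes (only en appears in this document, as the VL_DEFAULT_LANG default) and a hypothetical detect_language_switch helper.

```python
from typing import Optional

# Trigger phrases from the list above; the language codes are assumed.
SWITCH_PHRASES = {
    "es": ("switch to spanish", "spanish mode", "habla en español"),
    "fr": ("switch to french", "french mode", "parle en français"),
    "ja": ("switch to japanese", "japanese mode"),
    "zh": ("switch to chinese", "chinese mode", "speak mandarin"),
    "en": ("back to english", "english mode", "stop spanish"),
}


def detect_language_switch(transcript: str) -> Optional[str]:
    """Return the target language code if the transcript asks for a switch."""
    text = transcript.lower()
    for lang, phrases in SWITCH_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return lang
    return None
```

A caller would check each transcript with this function before sending it to the LLM, and on a match swap the Whisper model and Kokoro voice instead.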
Voice Options
- English: af_heart ⭐ (female), am_puck ⭐ (male), af_bella, af_nova, af_sarah, am_adam, am_eric
- Spanish: ef_dora ⭐ (female), em_alex (male)
- French: ff_siwis
- Japanese: jf_alpha (female), jm_beta (male)
- Chinese: zf_xiaobei (female), zm_yunjian (male)
Override default gender: export VL_DEFAULT_GENDER=male
Tuning
- Faster STT: VL_WHISPER_MODEL_EN=tiny.en (~1s vs ~2s, less accurate)
- Snappier response: VL_SILENCE_DURATION=0.8 (may cut off mid-pause)
- Faster LLM: Use Sonnet instead of Opus (~4s vs ~10s)
- Noise issues: Raise VL_SILENCE_THRESHOLD (try 0.02 or 0.03)
- Speed: VL_KOKORO_SPEED=1.2 for faster speech, 1.0 for natural
Troubleshooting
"Audio device issue" — Headphones not connected or not set as default. Check System Settings > Sound.
Empty transcriptions / hallucinations — Whisper generating phantom text from background noise. Script auto-filters utterances under 3 words and known hallucination phrases. Raise VL_SILENCE_THRESHOLD if persistent.
"Streaming error" — OpenClaw not running or token invalid. Check openclaw gateway status.
Kokoro model not found — Run bash scripts/setup.sh to download models to ~/.cache/kokoro-onnx/.
Multilingual transcription garbled — Whisper small model with real microphone input can struggle with non-English audio quality. Works well with clear audio. For best results: speak clearly, minimize background noise.
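The transcription filter mentioned under "Empty transcriptions / hallucinations" could be sketched as follows. The phrases in KNOWN_HALLUCINATIONS are illustrative placeholders, not the script's actual list, and is_usable_transcript is a hypothetical name.

```python
# Placeholder examples only; the script's real phrase list is not shown here.
KNOWN_HALLUCINATIONS = {
    "thank you for watching",
    "thanks for watching",
}


def is_usable_transcript(text: str, min_words: int = 3) -> bool:
    """Reject known phantom phrases and utterances shorter than min_words."""
    cleaned = text.strip()
    # Compare case-insensitively and ignore surrounding punctuation.
    normalized = cleaned.lower().strip(" .!?,")
    if normalized in KNOWN_HALLUCINATIONS:
        return False
    return len(cleaned.split()) >= min_words
```

Under this rule a short genuine reply such as "yes" is dropped too, which is the trade-off of a minimum word count.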
Cost
$0 for voice processing. Whisper and Kokoro run locally. Only cost is LLM API tokens (same as texting your agent).
Latency
| Component | Time | Local/Cloud |
|---|---|---|
| Whisper STT | ~2s | Local |
| LLM (Opus) | ~8-10s | Cloud |
| LLM (Sonnet) | ~3-5s | Cloud |
| Kokoro TTS | <1s | Local |
| First speech (streaming) | ~3s | — |
Security
- Tokens: Stored in macOS Keychain, not plaintext files or environment variables in plists. The script reads from Keychain at runtime via security find-generic-password.
- API endpoint locked to localhost: The script refuses to connect to non-local API endpoints. If VL_OPENCLAW_API_URL points to a remote address, the script exits immediately. There is no override; localhost only.
- Sandboxed subprocess environment: The watcher only passes VL_* env vars, PATH, and HOME to the voice loop process, not your full environment.
- No arbitrary executable paths: Whisper and Python paths are resolved from PATH or the skill's own venv, not user-configurable via env vars.
- Audio: All audio capture and TTS happen locally. No audio data leaves your machine. Only the text transcription is sent on, and the scripts send it to the OpenClaw gateway on localhost, which relays it to the LLM.
- No data exfiltration: The scripts make no network calls except to localhost. Remote endpoints cannot be enabled.
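Two of the guarantees above, the localhost-only endpoint and the filtered subprocess environment, can be illustrated with a short sketch. The function names are hypothetical, and the real scripts may perform these checks differently.

```python
import ipaddress
from typing import Dict, Mapping
from urllib.parse import urlsplit


def is_local_endpoint(url: str) -> bool:
    """True only if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False


def child_environment(env: Mapping[str, str]) -> Dict[str, str]:
    """Keep only VL_* variables plus PATH and HOME for the voice loop process."""
    return {
        key: value
        for key, value in env.items()
        if key.startswith("VL_") or key in ("PATH", "HOME")
    }
```

A script following this approach would exit when is_local_endpoint returns False for VL_OPENCLAW_API_URL, and the watcher would pass child_environment(os.environ) when launching the loop.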
Credits
- Kokoro-ONNX — local neural TTS
- OpenAI Whisper — local speech-to-text
- OpenClaw — AI agent framework