Voice Mode (Super-Skill)

Purpose

This skill unifies voice output and voice input in one place:

say — text-to-speech (TTS)
listen — speech-to-text (STT)
duplex mode — agent orchestration (say → listen) built on the atomic scripts

Use say and listen independently, or let the agent combine them into continuous duplex dialogue. This is an offline-first skill: STT runs locally via faster-whisper, and TTS uses local piper models after the initial voice download.

Atomic Commands

1) Speak

say "text to announce"
say --lang ru "<text in Russian>"
# short alias is also supported:
say -l ru "<text in Russian>"

2) Listen

listen

3) Duplex mode (agent orchestration)

say --lang ru "<spoken reply in the conversation language>"
listen -l ru -d 0 -s 1

Duplex mode is not a standalone shell script in this skill. The core protocol remains atomic: say, then listen. In duplex sessions, prefer listen -d 0 -s 1, which sets no hard timeout and stops recording when the user pauses.
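A single duplex turn can be sketched as a small shell function. This is for illustration only: the skill ships no such helper and leaves orchestration to the agent. The name duplex_turn is hypothetical, and the sketch assumes that say exits non-zero when speech fails.

```shell
# Hypothetical helper, not part of the skill: one duplex turn.
# Speaks the reply, then records the user's answer in the same language.
duplex_turn() {
  turn_lang="$1"
  turn_reply="$2"
  # Speak first; skip listening if speech failed (assumes a non-zero exit on error).
  say --lang "$turn_lang" "$turn_reply" || return 1
  # -d 0 -s 1: no hard timeout, stop when the user pauses.
  listen -l "$turn_lang" -d 0 -s 1
}
```

A caller would capture the transcript with heard="$(duplex_turn ru "<spoken reply>")", which keeps the say and listen languages aligned by construction.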

Operating Modes

Mode A: Selective Voice (default)

Use say only for short, high-value moments (greeting, warning, key conclusion).
Keep code, tables, and long technical details in text.

Mode B: Full Voice Output (screenless)

When explicitly requested by the user:

Use say for every response.
Speak the entire assistant reply through say, not just a short follow-up question.
Do not duplicate full spoken content in chat.
For code/tables: describe briefly by voice (language, purpose, size), avoid reading raw code line by line.

Mode C: Voice Input On-Demand

Call listen when the user wants to dictate the next prompt.
listen prints recognized text to stdout.
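Because listen writes the transcript to stdout, a caller can capture it with command substitution. The sketch below is a minimal example; dictate_prompt is a hypothetical wrapper, and treating a non-zero exit or a blank transcript as "nothing heard" is an assumption rather than documented behavior.

```shell
# Hypothetical wrapper: capture one dictated prompt from listen.
dictate_prompt() {
  # Assumption: listen exits non-zero when recording or recognition fails.
  dictated="$(listen)" || return 1
  # Treat an empty or whitespace-only transcript as "nothing heard".
  [ -n "$(printf '%s' "$dictated" | tr -d ' \t\n')" ] || return 1
  printf '%s\n' "$dictated"
}
```

Typical use would be next_prompt="$(dictate_prompt)" || echo "no speech recognized" >&2.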

Mode D: Duplex Continuous Dialogue (`say` → `listen`)

When the user enables duplex mode (e.g. "turn on duplex", "full voice mode"):

Generate the full assistant response first.
Speak the full response via say.
Immediately call listen -d 0 -s 1 in the same conversation language.
Treat recognized text as the next user prompt.
Normalize the recognized text and exit duplex mode when it expresses a stop intent, for example: стоп ("stop"), выключи прослушивание ("turn off listening"), выключи дуплекс ("turn off duplex"), stop listening.

Canonical agent loop:

repeat:
  answer = full assistant reply to the current prompt
  say --lang <lang> "<answer>"
  heard = listen -l <lang> -d 0 -s 1
  if heard matches a stop phrase intent:
    exit duplex mode
  next prompt = heard

This is a hands-free conversational flow owned by the agent, not by a dedicated shell helper. Never keep the substantive reply only in chat while sending a shorter handoff question to speech.
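The stop-phrase check is left to the agent, but a literal-match version can be sketched in shell. Both function names are hypothetical. The sketch only folds ASCII case, so capitalized Cyrillic variants are listed explicitly; a real agent would more likely judge stop intent in its own runtime than match exact strings.

```shell
# Hypothetical helper: normalize a transcript before comparing it to stop phrases.
# Lowercases ASCII letters, drops common ASCII punctuation, collapses whitespace.
normalize_heard() {
  printf '%s' "$1" \
    | tr 'A-Z' 'a-z' \
    | tr -d '.,!?;:"' \
    | tr '\t\n' '  ' \
    | tr -s ' ' \
    | sed -e 's/^ //' -e 's/ $//'
}

# Hypothetical helper: succeed (exit 0) when the transcript is a stop phrase.
is_stop_phrase() {
  case "$(normalize_heard "$1")" in
    "stop listening") return 0 ;;
    # tr above only folds ASCII, so both Cyrillic capitalizations are listed.
    "стоп"|"Стоп") return 0 ;;
    "выключи прослушивание"|"Выключи прослушивание") return 0 ;;
    "выключи дуплекс"|"Выключи дуплекс") return 0 ;;
    *) return 1 ;;
  esac
}
```

In a loop this reads as: if is_stop_phrase "$heard"; then break; fi. Exact matching is deliberately strict, so a sentence that merely contains the word "stop" does not end the session.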

Mode E: Autonomous Voice Alerts (optional)

Short proactive announcements are allowed for:

long-running operations,
critical blockers/security issues,
required confirmation to proceed safely.

Keep alerts brief and informative.
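For the long-running-operation case, one possible shape is a wrapper that speaks only when a command has run past a threshold. This is a sketch under assumptions: run_with_alert and the wording of the announcements are hypothetical, and the skill does not prescribe any particular threshold.

```shell
# Hypothetical wrapper: run a command and announce the result by voice
# only if it took at least <threshold> seconds.
# Usage: run_with_alert <threshold-seconds> <command> [args...]
run_with_alert() {
  alert_threshold="$1"
  shift
  alert_start="$(date +%s)"
  "$@"
  alert_rc=$?
  alert_elapsed=$(( $(date +%s) - alert_start ))
  if [ "$alert_elapsed" -ge "$alert_threshold" ]; then
    if [ "$alert_rc" -eq 0 ]; then
      say "Task finished after $alert_elapsed seconds."
    else
      say "Task failed with exit code $alert_rc."
    fi
  fi
  # Preserve the wrapped command's exit status for the caller.
  return "$alert_rc"
}
```

For example, run_with_alert 60 make test stays silent for quick runs and speaks one short sentence for slow ones.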

Voice Guard + Listen Guard

Before say: ask if silence would hide important information. If not, do not speak.

Before listen: ask if voice input is actually needed right now. Do not invoke speculatively.

Language Memory

Preferred language is stored in ~/.pi_voice_lang.
Use short language codes: ru, en, de, ... (not ru_RU, en_US).
In duplex mode, keep say and listen -l <lang> aligned.
say auto-downloads a missing Piper model on first use.
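The short-code rule can be illustrated with a small normalizer. The sketch assumes that ~/.pi_voice_lang holds a single language code on its first line, which the skill does not spell out; short_lang, preferred_lang, and the fallback to en are hypothetical.

```shell
# Hypothetical helper: reduce a locale-style tag (ru_RU, en-US, de_DE.UTF-8)
# to a short language code (ru, en, de).
short_lang() {
  printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed -e 's/[_.@-].*$//'
}

# Hypothetical helper: read the stored language, falling back to "en".
# Assumption: the file contains one language code on its first line.
preferred_lang() {
  lang_file="${1:-$HOME/.pi_voice_lang}"
  stored_lang=""
  if [ -r "$lang_file" ]; then
    stored_lang="$(head -n 1 "$lang_file" | tr -d ' \t\r')"
  fi
  short_lang "${stored_lang:-en}"
}
```

An agent could then keep both commands aligned with lang="$(preferred_lang)" followed by say --lang "$lang" "..." and listen -l "$lang" -d 0 -s 1.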

Initialization (Linux & macOS)

Run bootstrap once:

"${SKILL_DIR}/scripts/_bootstrap"

Bootstrap installs to ~/.local/bin:

say
listen
listen-server
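After bootstrap, a quick check can confirm that the three commands landed where expected. The function below is a hypothetical sketch, not something the skill provides; it only tests for executable files and whether the directory is on PATH.

```shell
# Hypothetical check: verify that bootstrap installed the three commands.
# Usage: check_voice_install [bin-dir]   (defaults to ~/.local/bin)
check_voice_install() {
  bin_dir="${1:-$HOME/.local/bin}"
  missing=0
  for cmd in say listen listen-server; do
    if [ ! -x "$bin_dir/$cmd" ]; then
      echo "missing: $bin_dir/$cmd" >&2
      missing=1
    fi
  done
  # Warn if the install directory is not on PATH.
  case ":$PATH:" in
    *":$bin_dir:"*) ;;
    *) echo "note: $bin_dir is not on PATH" >&2 ;;
  esac
  return "$missing"
}
```

On macOS, which ships its own /usr/bin/say, running command -v say shows which binary wins; ~/.local/bin would need to come earlier on PATH for the skill's say to take precedence.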

Platform Support

Linux: piper + aplay, faster-whisper, arecord/pyaudio
macOS: piper + afplay, faster-whisper, sox/pyaudio
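The playback tool per platform can be looked up from the kernel name. This is a minimal sketch; player_for_os is a hypothetical name, and how the skill's own scripts choose a player is not shown in this document.

```shell
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to the playback tool listed for that platform.
player_for_os() {
  case "$1" in
    Linux)  echo "aplay" ;;
    Darwin) echo "afplay" ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}
```

To confirm the player is installed: player="$(player_for_os "$(uname -s)")" && command -v "$player".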
