skills/noizai/skills/chat-with-anyone

chat-with-anyone

SKILL.md

Chat with Anyone

Clone a real person's voice from online video, or design a voice from a photo, then roleplay as that person with TTS.

Prerequisites

  • youtube-downloader skill installed (Workflow A)
  • tts skill installed
  • ffmpeg on PATH
  • Noiz API key configured: python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY

Mode Selection

  • User names a person (real or fictional) --> Workflow A
  • User provides an image, person is unrecognizable --> Workflow B
  • User provides an image, person is a recognizable public figure --> Workflow A (real voice is more authentic)
  • Multiple people in image --> Ask which person first

Workflow A: Name-based (voice from online video)

Track progress with this checklist:

- [ ] A1. Disambiguate character
- [ ] A2. Find reference video
- [ ] A3. Download audio + subtitles
- [ ] A4. Extract best reference segment
- [ ] A5. Generate speech

A1. Disambiguate Character

If ambiguous (e.g. "US President", "Spider-Man actor"), ask the user to specify the exact person before proceeding.

A2. Find a Reference Video

Use web search to find a YouTube video of the person speaking clearly. Best candidates: interviews, speeches, press conferences. Avoid videos with heavy background music.

Search queries to try:

  • {CHARACTER_NAME} interview / {CHARACTER_NAME} 采访
  • {CHARACTER_NAME} speech / {CHARACTER_NAME} 演讲
  • {CHARACTER_NAME} press conference

A3. Download Audio and Subtitles

python skills/youtube-downloader/scripts/download_video.py "{VIDEO_URL}" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}" --audio-only --subtitles

After download, list the output directory to identify the audio file and SRT subtitle file:

ls tmp/chat_with_anyone/{CHARACTER_NAME}/

Expected output: a .mp3 audio file and one or more .srt subtitle files.

If no subtitle files appear: try a different video that has auto-generated captions, or add --sub-lang en,zh-Hans to request specific languages.

A4. Extract Best Reference Segment

Use the automated extraction script — it parses the SRT, finds the densest 3-12 second speech window, and extracts it as a WAV:

python3 skills/chat-with-anyone/scripts/extract_ref_segment.py \
  --srt "tmp/chat_with_anyone/{CHARACTER_NAME}/{SRT_FILE}" \
  --audio "tmp/chat_with_anyone/{CHARACTER_NAME}/{AUDIO_FILE}" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav"

The script prints the selected time range and saves the reference WAV. Verify the output exists and is non-empty before proceeding.

If the script reports no suitable segment: try --min-duration 2 for shorter clips, or download a different video.

A5. Generate Speech and Roleplay

Write a response in character, then synthesize it:

python3 skills/tts/scripts/tts.py \
  -t "{RESPONSE_TEXT}" \
  --ref-audio "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}/reply.wav"

Present the generated audio file to the user along with the text. For subsequent messages, reuse the same --ref-audio path.


Workflow B: Image-based (voice from photo)

Track progress with this checklist:

- [ ] B1. Analyze image
- [ ] B2. Design voice
- [ ] B3. Preview (optional)
- [ ] B4. Generate speech

B1. Analyze the Image

Use your vision capability to examine the image:

  1. If the person is a recognizable public figure --> switch to Workflow A for authentic voice.
  2. If unrecognizable, produce a voice description covering:
    • Gender (male / female)
    • Approximate age (e.g. "around 30 years old")
    • Apparent demeanor (e.g. cheerful, authoritative, gentle)
    • Contextual cues (e.g. suit --> professional tone; athletic outfit --> energetic)

B2. Design the Voice

Pass both the image and the description to the voice-design script:

python3 skills/chat-with-anyone/scripts/voice_design.py \
  --picture "{IMAGE_PATH}" \
  --voice-description "{VOICE_DESCRIPTION}" \
  -o "tmp/chat_with_anyone/voice_design"

The script outputs:

  • Detected voice features (printed to stdout)
  • Preview audio files in the output directory
  • voice_id.txt containing the best voice ID

Read the voice ID:

cat tmp/chat_with_anyone/voice_design/voice_id.txt

B3. Preview (Optional)

Present the preview audio files from the output directory so the user can hear the voice. If unsatisfied, re-run B2 with adjusted --voice-description or --guidance-scale.

B4. Generate Speech and Roleplay

python3 skills/tts/scripts/tts.py \
  -t "{RESPONSE_TEXT}" \
  --voice-id "{VOICE_ID}" \
  -o "tmp/chat_with_anyone/voice_design/reply.wav"

For subsequent messages, keep using the same --voice-id for consistency.


Example: Name-based

User: 我想跟特朗普聊天,让他给我讲个睡前故事。

Agent steps:

  1. Character: Donald Trump. No disambiguation needed.
  2. Search Donald Trump speech youtube, find a clear speech video.
  3. Download: python skills/youtube-downloader/scripts/download_video.py "https://youtube.com/watch?v=..." -o tmp/chat_with_anyone/trump --audio-only --subtitles
  4. Extract reference: python3 skills/chat-with-anyone/scripts/extract_ref_segment.py --srt "tmp/chat_with_anyone/trump/....srt" --audio "tmp/chat_with_anyone/trump/....mp3" -o "tmp/chat_with_anyone/trump/ref.wav"
  5. Generate TTS in Trump's style: python3 skills/tts/scripts/tts.py -t "Let me tell you a tremendous bedtime story..." --ref-audio "tmp/chat_with_anyone/trump/ref.wav" -o "tmp/chat_with_anyone/trump/reply.wav"
  6. Present reply.wav and the story text to the user.

Example: Image-based

User: [uploads photo.jpg] 我想跟这张图片里的人聊天

Agent steps:

  1. Vision analysis: unrecognizable young woman, ~25, casual sweater, warm smile.
  2. Design voice: python3 skills/chat-with-anyone/scripts/voice_design.py --picture "photo.jpg" --voice-description "A young Chinese woman around 25, gentle and warm voice, friendly tone" -o "tmp/chat_with_anyone/voice_design"
  3. Read voice ID from tmp/chat_with_anyone/voice_design/voice_id.txt.
  4. Generate TTS: python3 skills/tts/scripts/tts.py -t "你好呀!很高兴认识你!" --voice-id "{VOICE_ID}" -o "tmp/chat_with_anyone/voice_design/reply.wav"
  5. Present audio and continue roleplay with same --voice-id.

Troubleshooting

Problem Solution
Download fails or video unavailable Try a different video URL; some regions/videos are restricted
No SRT subtitle files Re-download with --sub-lang en,zh-Hans; if still none, try a different video with auto-captions
extract_ref_segment.py finds no suitable window Use --min-duration 2 for shorter clips, or try a different video
Voice design returns error Check Noiz API key; ensure image is a clear photo of a person
TTS output sounds wrong For Workflow A, try a different reference video; for Workflow B, adjust --voice-description
Weekly Installs
1.2K
Repository
noizai/skills
GitHub Stars
346
First Seen
13 days ago
Installed on
opencode1.2K
cursor1.2K
gemini-cli1.2K
cline1.2K
kimi-cli1.2K
amp1.2K