vhscli
vhscli is a command-line tool for multimodal AI: chat about
text/images/video/pdfs, or generate images and videos from prompts. It's a thin
client — auth, uploads, and model execution all happen server-side, so users
don't store any provider API keys locally.
Run vhscli --help or vhscli <command> --help to see current help — the CLI
is the source of truth.
Invocation
Always run via npx @getvhs/vhscli@latest so you pick up the newest models,
flags, and fixes. Don't pin a version, and don't call a bare vhscli binary
even if one is on PATH — it may be stale.
npx @getvhs/vhscli@latest <command> ...
Throughout this doc, commands are written as vhscli ... for readability —
substitute npx @getvhs/vhscli@latest ... when running.
Requires Node.js ≥ 22. file is needed for MIME detection; sips (ships with
macOS) for image conversion; ffmpeg for cross-format video conversion.
Top-level
vhscli [-v|--version] [-h|--help]
vhscli <command> [options] ...
- -v, --version — print version (only when no command is given)
- -h, --help — show help (works on root and every subcommand)
Commands:
- login — log in with Google (opens browser; saves session to ~/.vhs/session.json)
- logout — log out and delete local access tokens
- whoami — print the logged-in user's email
- models — list available models
- generate <model> <prompt> [-o <path>] — generate an image or video
- chat <prompt> — chat with seed-2.0 (text, image, video, or pdf input)
- resume <task_id> [-o <output>] — finish a generation that was aborted, by task id
Auth
Assume auth is already configured. If a command fails with an auth error, run
vhscli login to open a browser for Google OAuth. Do NOT run vhscli login
preemptively — it requires interactive browser login.
Models
- Chat / understand (text / image / video / pdf): seed-2.0 — under vhscli chat
- Generate images: seedream-5 (default), seedream-4-5, nano-banana-2, nano-banana-pro, gpt-image-2 — under vhscli generate
- Generate video: seedance-2 — under vhscli generate
Prompt guides
Before you invoke vhscli generate (or do non-trivial understanding with
vhscli chat), Read the matching prompt guide first and shape the prompt
around it. The guides are concise, model-specific references distilled from
each provider's docs — formulas, what to lead with, what works, what fails.
Wording that's great for one model often underperforms on another, so don't
skip this.
| Model(s) | Guide file (Read before prompting) |
|---|---|
| seed-2.0 (used by vhscli chat) | prompt_guide/seed-2.txt |
| seedream-5, seedream-4-5 | prompt_guide/seedream.txt |
| nano-banana-2, nano-banana-pro | prompt_guide/nano-banana.txt |
| seedance-2 | prompt_guide/seedance-2.txt |
| gpt-image-2 | prompt_guide/gpt-image-2.txt |
Trigger: any time the user asks for output from one of these models, Read its
guide before building the prompt. For trivial chat (plain text Q&A with no
media) you can skip seed-2.txt.
Stdin prompts
Every command that takes a prompt also accepts - as the prompt, meaning
"read from stdin":
cat my_prompt.txt | vhscli generate nano-banana-pro -
echo "what is this?" | vhscli chat - -i photo.jpg
vhscli chat — chat about text, images, video, or pdfs
vhscli chat <prompt> [-i <image>...] [-f <pdf>...] [-v <video>] [--fps <n>]
Mode is picked from your flags:
- prompt only → text chat
- -i → ask about images (repeatable)
- -f → ask about pdf documents (repeatable)
- -v → ask about a single video
Options:
- -i, --image <path> — image to ask about (repeat -i for more)
- -f, --file <path> — pdf document to ask about (repeat -f for more)
- -v, --video <path> — single video to ask about
- --fps <n> — frames/sec sampled from the video, 0.2–5 (default: 1)
One-shot — each call is independent, no memory of previous calls. Output goes to stdout, nothing is saved to disk. Audio inside a video is not understood.
Examples:
vhscli chat "explain how to make sourdough in 5 steps"
vhscli chat "describe the scene. return json with objects, setting, mood." -i photo.jpg
vhscli chat "transcribe all visible text verbatim, preserving line breaks." -i receipt.jpg
vhscli chat "compare image 1 and image 2 in 3 bullets." -i a.jpg -i b.jpg
vhscli chat "summarize this paper in 5 bullets; include a page number per bullet." -f paper.pdf
vhscli chat "list key events with start_time and end_time in HH:mm:ss as json." -v clip.mp4 --fps 2
vhscli generate seedream-5 — generate an image (default choice)
vhscli generate seedream-5 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- -o, --output <path> — output file path (default: ./vhscli-seedream-5-<timestamp>.jpg)
- -i, --image <path> — reference image, max 14 (repeat -i for more)
- --size <size> — 2K, 3K, or WxH like 1024x1536 (default: 2K)
  - WxH pixel count must be in [3,686,400, 10,404,496]
  - WxH aspect ratio must be in [1:16, 16:1]
Output format is determined by the output path extension (.png,
.jpg/.jpeg, .webp). The provider returns png or jpeg; the cli transcodes
via sips if the extension differs.
Examples:
vhscli generate seedream-5 "a red fox in a snowy forest" -o fox.jpg
vhscli generate seedream-5 "swap the outfit" -o out.png -i person.jpg -i outfit.jpg --size 3K
vhscli generate seedream-4-5 — generate an image (larger size range)
vhscli generate seedream-4-5 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- -o, --output <path> — output file path (default: ./vhscli-seedream-4-5-<timestamp>.jpg)
- -i, --image <path> — reference image, max 14 (repeat -i for more)
- --size <size> — 2K, 4K, or WxH (default: 2K)
  - WxH pixel count must be in [3,686,400, 16,777,216]
  - WxH aspect ratio must be in [1:16, 16:1]
Example:
vhscli generate seedream-4-5 "a mountain at sunrise" -o mountain.jpg --size 4K
vhscli generate nano-banana-2 — generate an image (Google, with search grounding)
vhscli generate nano-banana-2 <prompt> [-o <path>] [-i <image>...]
[--ratio <r>] [--size <size>]
[--think <level>] [--search] [--image-search]
Options:
- -o, --output <path> — output file path (default: ./vhscli-nano-banana-2-<timestamp>.png)
- -i, --image <path> — reference image, max 14 (repeat -i for more)
- --ratio <r> — aspect ratio (default: 1:1). one of: 1:1, 1:4, 1:8, 2:3, 3:2, 3:4, 4:1, 4:3, 4:5, 5:4, 8:1, 9:16, 16:9, 21:9
- --size <size> — 512, 1K, 2K, or 4K (default: 1K)
- --think <level> — how hard the model thinks: minimal or high (default: minimal)
- --search — use Google search while generating
- --image-search — also use Google image search (implies --search)
Examples:
vhscli generate nano-banana-2 "90s skateboarder poster" -o poster.png --ratio 9:16 --size 2K
vhscli generate nano-banana-2 "diagram of the latest iphone" -o d.png --image-search
vhscli generate nano-banana-2 "a typographic poster spelling 'NEW YORK' over a skyline" --think high
vhscli generate nano-banana-pro — generate an image (Google, premium)
vhscli generate nano-banana-pro <prompt> [-o <path>] [-i <image>...] [--ratio <r>] [--size <size>]
Options:
- -o, --output <path> — output file path (default: ./vhscli-nano-banana-pro-<timestamp>.png)
- -i, --image <path> — reference image, max 14 (repeat -i for more)
- --ratio <r> — aspect ratio (default: 1:1). one of: 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
- --size <size> — 1K, 2K, or 4K (default: 1K)
Higher-quality sibling of nano-banana-2 — better text rendering, richer
textures. No --search or --think flags.
Example:
vhscli generate nano-banana-pro "studio portrait, cinematic lighting" -o portrait.jpg --ratio 3:4 --size 2K
vhscli generate gpt-image-2 — generate or edit an image (OpenAI)
vhscli generate gpt-image-2 <prompt> [-o <path>] [-i <image>...] [--mask <path>] [--size <size>]
Options:
- -o, --output <path> — output file path (default: ./vhscli-gpt-image-2-<timestamp>.png)
- -i, --image <path> — reference image for edits (repeat -i for more). switches to the edit endpoint
- --mask <path> — edit mask (png with transparent pixels marking edit regions); requires -i
- --size <size> — preset (auto, 1024x1024, 1536x1024, 1024x1536, 2048x2048, 2048x1152, 3840x2160) or WxH (default: auto)
  - both sides must be multiples of 16, max edge 3840
  - total pixels in [655,360, 8,294,400]
  - aspect ratio in [1:3, 3:1]
Output format is derived from the -o extension (.png, .jpg/.jpeg,
.webp) and sent to the provider — no local transcode. Use png or webp for
transparent backgrounds.
Examples:
vhscli generate gpt-image-2 "a children's book drawing of a veterinarian examining a cat"
vhscli generate gpt-image-2 "replace the background with a starry night, keep the subject unchanged" -i photo.jpg
vhscli generate gpt-image-2 "add a red balloon in the masked area" -i room.png --mask hole.png
vhscli generate gpt-image-2 "ultra-wide landscape of the swiss alps at golden hour" --size 3840x2160 -o alps.jpg
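A custom WxH can be checked against the documented limits before submitting. A sketch in shell arithmetic; the 1536x1024 values are just an example:

```shell
# Validate a custom gpt-image-2 WxH against the documented limits:
# sides multiples of 16, max edge 3840, total pixels in [655,360, 8,294,400],
# aspect ratio within [1:3, 3:1].
w=1536; h=1024
ok=1
[ $((w % 16)) -eq 0 ] && [ $((h % 16)) -eq 0 ] || ok=0      # multiples of 16
[ "$w" -le 3840 ] && [ "$h" -le 3840 ] || ok=0              # max edge
px=$((w * h))
[ "$px" -ge 655360 ] && [ "$px" -le 8294400 ] || ok=0       # pixel budget
[ $((w * 3)) -ge "$h" ] && [ $((h * 3)) -ge "$w" ] || ok=0  # ratio in [1:3, 3:1]
[ "$ok" -eq 1 ] && echo "${w}x${h}: valid" || echo "${w}x${h}: invalid"
```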
vhscli generate seedance-2 — generate a video
vhscli generate seedance-2 <prompt> [-o <path>]
[--first-frame <image>] [--last-frame <image>]
[-i <image>...] [-v <video>...] [-a <audio>...]
[--ratio <r>] [--resolution <res>] [--duration <n>]
[--silent] [--seed <n>]
Mode is picked from your flags:
- prompt only → text-to-video
- --first-frame → animate from that frame (optionally --last-frame too)
- -i / -v / -a → use as references
Options:
- -o, --output <path> — output file path (default: ./vhscli-seedance-2-<timestamp>.mp4)
- --first-frame <image> — use as the first frame
- --last-frame <image> — use as the last frame (requires --first-frame)
- -i, --image <path> — reference image, max 9 (repeat -i). conflicts with --first-frame
- -v, --video <path> — reference video, max 3 (repeat -v)
- -a, --audio <path> — reference audio, max 3 (repeat -a). requires -i or -v
- --ratio <r> — aspect ratio (default: adaptive). one of: 16:9, 4:3, 1:1, 3:4, 9:16, 21:9, adaptive
- --resolution <res> — 480p, 720p, or 1080p (default: 720p)
- --duration <n> — length in seconds, 4–15, or -1 for auto (default: 5)
- --silent — make a silent video (no audio track)
- --seed <n> — random seed for reproducible output
Defaults to 5s @ 720p with audio. Jobs run in the cloud and can take minutes —
the CLI polls automatically, but if it's interrupted, save the printed
task_id and use vhscli resume <task_id> later.
Examples:
# text-to-video
vhscli generate seedance-2 "a cat jumping off a couch" -o cat.mp4 --duration 6 --ratio 16:9
# animate a still image
vhscli generate seedance-2 "camera pans right" -o pan.mp4 --first-frame start.jpg
# with a first and last frame
vhscli generate seedance-2 "morph between these" -o morph.mp4 --first-frame a.jpg --last-frame b.jpg
# reference-based with audio
vhscli generate seedance-2 "lip sync the words" -o out.mp4 -i face.jpg -a voice.mp3
vhscli resume — finish an aborted generation
vhscli resume <task_id> [-o <output>]
Every vhscli generate command prints a line task_id: <uuid> to stdout
before kicking off the backend task. If the cli process is aborted
mid-generation (ctrl-c, crash, closed terminal, lost network), the backend
keeps running; save that id and later run vhscli resume <task_id> to wait
for it and download the result.
Behavior:
- Polls the task row until it has a result or an error, then saves the result.
- The cli dispatches on the task's endpoint and saves with the right model's logic.
- Output extension is decided by the model (mp4 for videos, png/jpg for images); pass -o to override the path/extension (transcoded if needed).
- Exit code is non-zero on task error, missing task, or save failure.
vhscli chat does not create a resumable task — chat is fast and streams to stdout.
Example:
# kick off a generation, note the printed task_id, then if it aborts:
vhscli resume 8f3a1b2c-9e0f-4a1b-9c8d-1e2f3a4b5c6d -o cat.mp4
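When scripting, the printed task_id can be captured for a later resume. A sketch; the log variable is a stand-in for real generate output, which per the docs prints task_id: <uuid> to stdout:

```shell
# Pull the task_id out of generate output. In practice you would capture the
# real stream, e.g.: vhscli generate seedance-2 "..." -o cat.mp4 | tee gen.log
log="task_id: 8f3a1b2c-9e0f-4a1b-9c8d-1e2f3a4b5c6d"   # stand-in for gen.log contents
tid=$(printf '%s\n' "$log" | sed -n 's/^task_id: //p')
echo "$tid"
# if the run aborted: vhscli resume "$tid" -o cat.mp4
```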
Understanding local images, video, and pdfs
Do NOT use the Read tool, or any built-in file-reading capability, to "look
at" images, video, or pdfs. That path either fails or gives you a garbled
snippet. The only correct way to understand local visual or document content
is vhscli chat with -i / -v / -f.
vhscli chat "what's happening?" -i photo.jpg
vhscli chat "transcribe the speech" -v clip.mp4 --fps 2
vhscli chat "summarize this paper" -f paper.pdf
Prompt patterns for visual / document understanding
vhscli chat understands images, pdfs, and video frames, but not audio
inside videos. Ask for structured JSON output when you'll parse the
answer, and name every field you want. Be explicit about formats
(timestamp style, units, language).
Image — describe / classify:
vhscli chat "describe the scene. return json {objects:[{label,bbox?}], setting, mood, dominant_colors:[]}." -i photo.jpg
vhscli chat "classify the image into one of: cat, dog, bird, other. return json {label, confidence_0_1, reasoning}." -i pic.jpg
Image — OCR / text extraction:
vhscli chat "transcribe all visible text verbatim, preserving line breaks and reading order. do not paraphrase." -i receipt.jpg
vhscli chat "extract the receipt as json {merchant, date_iso, items:[{name, qty, unit_price, line_total}], subtotal, tax, total, currency}." -i receipt.jpg
Image — comparison (number them in the prompt):
vhscli chat "compare image 1 and image 2. return json {same_subject:bool, differences:[], which_is_better, why}." -i a.jpg -i b.jpg
vhscli chat "image 1 is the original, image 2 is an edit. list every visible change as json {changes:[{region, before, after}]}." -i orig.png -i edit.png
PDF — summarize / outline (always ask for page anchors):
vhscli chat "summarize this paper in 5 bullets. each bullet must include the source page as {page:int, point:string}. return json {bullets:[...]}." -f paper.pdf
vhscli chat "extract the outline as json [{page, heading_level, heading, bullets:[]}]." -f doc.pdf
PDF — QA / extraction:
vhscli chat "answer using only this document. question: what is the experimental setup? return json {answer, citations:[{page, quote}]}." -f paper.pdf
vhscli chat "extract every table as json [{page, title?, headers:[], rows:[[...]]}]." -f report.pdf
Video — events / timeline (state the timestamp format):
vhscli chat "list key events. return json [{start_time, end_time, event}]. use HH:mm:ss." -v clip.mp4 --fps 2
vhscli chat "describe the movement sequence and any safety risks. return json [{start_time, end_time, event, danger:'none'|'low'|'med'|'high'}]. HH:mm:ss." -v clip.mp4 --fps 3
Video — temporal QA / counting:
vhscli chat "at what timestamp does the referee first appear? return json {timestamp_hms, evidence}." -v match.mp4 --fps 2
vhscli chat "count how many distinct people appear. return json {count, per_person:[{first_seen_hms, description}]}." -v scene.mp4 --fps 3
Choosing --fps for video (default 1, range 0.2–5):
- 3–5 — counting actions, sports, fast cuts, dense motion.
- 1 — general description, dialogue scenes.
- 0.2–0.5 — long static footage, headcount, slow surveillance.
Higher fps = more detail but more tokens and slower. Lower fps = cheaper but may miss brief events.
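The cost side of that tradeoff scales linearly with sampled frames. A rough budget sketch:

```shell
# Sampled frames ≈ clip length in seconds × --fps.
# A 2-minute clip at --fps 2 gives ~240 frames; at --fps 0.5, ~60.
awk -v dur=120 -v fps=2   'BEGIN { printf "%d frames\n", dur * fps }'
awk -v dur=120 -v fps=0.5 'BEGIN { printf "%d frames\n", dur * fps }'
```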
Tips
- Always quote prompts.
- -o is optional for vhscli generate — defaults to ./vhscli-<model>-<timestamp>.<ext> in the current folder. Output format is detected from the -o extension; mismatches are transcoded via sips (images) or ffmpeg (videos).
- Short options accept no-space form: -ofoo.jpg. Long options accept =: --size=2K.
- Use -- to pass a prompt starting with a dash: vhscli generate seedream-5 -o x.jpg -- "-weird prompt".
- Reference images (-i, --first-frame, --last-frame) can be any common format — the cli detects the real mime via file and auto-converts non-jpeg/png inputs (e.g. heic, webp, tiff, bmp) to jpeg via sips before upload.
- Uploads are deduplicated by content hash, so passing the same reference repeatedly is cheap.
- Unknown command? vhscli will suggest the closest match.