Video Generator

Generate professional short-form videos using Google VEO 3.1 (native audio), OpenAI Sora (visual quality, up to 12s), or Kling v3 Pro (image-to-video, up to 15s, native audio).

Prerequisites & Setup

API Keys

VEO (Google): Uses GEMINI_API_KEY (already in ~/.zshrc).

Sora (OpenAI): Uses OPENAI_API_KEY (already in OpenEd Vault/.env).

Kling (Fal.ai): Uses FAL_KEY (in ~/.zshrc and OpenEd Vault/.env).

export GEMINI_API_KEY=your_gemini_key_here
export OPENAI_API_KEY=your_openai_key_here
export FAL_KEY=your_fal_key_here
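
Before a run, it can help to confirm that the key for the chosen provider is actually exported. The following is a minimal sketch, not part of scripts/generate_video.py; the helper name `missing_key` is hypothetical, and the provider-to-variable mapping is taken from the list above.

```python
import os

# Provider-to-variable mapping taken from the API Keys list above.
REQUIRED_KEYS = {
    "veo": "GEMINI_API_KEY",
    "sora": "OPENAI_API_KEY",
    "kling": "FAL_KEY",
}


def missing_key(provider, env=None):
    """Return the name of the unset key for `provider`, or None if it is set.

    Hypothetical helper for illustration; an empty value counts as unset.
    """
    env = os.environ if env is None else env
    name = REQUIRED_KEYS[provider]
    return None if env.get(name) else name


if __name__ == "__main__":
    for provider in REQUIRED_KEYS:
        name = missing_key(provider)
        status = "ok" if name is None else f"missing {name}"
        print(f"{provider}: {status}")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.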

Install Dependencies

pip install google-genai requests

Available Models

Provider	Model	CLI `--model`	Best For
VEO	Veo 3.1 Standard	`standard` (default)	Quality, audio fidelity, final assets
VEO	Veo 3.1 Fast	`fast`	Drafts, iteration, quick previews
Sora	Sora 2	`sora-2` (default)	Visual quality, creative motion
Sora	Sora 2 Pro	`sora-2-pro`	Highest Sora quality, slower
Kling	v3 Pro	`kling-v3-pro` (default)	Image-to-video, native audio, up to 15s
Kling	v3 Standard	`kling-v3-std`	Budget-friendly Kling
Kling	v2 Master	`kling-v2`	Stable, proven model
Kling	v1.5 Pro	`kling-v1.5`	Legacy, cheapest

When to Use Which

Need	Use
Native synchronized audio (dialogue, SFX)	VEO or Kling v3
Longest clips (up to 15 seconds)	Kling v3 (VEO: 8s, Sora: 12s)
Image-to-video (animate a still)	Kling (best) or VEO
Higher visual fidelity / artistic styles	Sora - stronger on visual aesthetics
Fast iteration / drafts	VEO Fast - quickest turnaround
4K resolution	VEO - Sora/Kling use fixed sizes
Negative prompts (exclude elements)	VEO or Kling - Sora doesn't support them
Square format (1:1)	Kling only
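
The table above can be read as a set of filters. Here is a rough sketch of that reading, assuming a hypothetical `choose_provider` helper; the capability sets and maximum durations are copied from the table, and the function only narrows the options rather than ranking them by quality.

```python
# Maximum clip lengths from the table: VEO 8s, Sora 12s, Kling v3 15s.
MAX_SECONDS = {"veo": 8, "sora": 12, "kling": 15}


def choose_provider(native_audio=False, seconds=8, image_input=False,
                    negative_prompt=False, resolution_4k=False, square=False):
    """Return the providers (sorted by name) that satisfy every stated need."""
    candidates = {"veo", "sora", "kling"}
    if native_audio:
        candidates &= {"veo", "kling"}
    if image_input:
        candidates &= {"veo", "kling"}
    if negative_prompt:
        candidates &= {"veo", "kling"}
    if resolution_4k:
        candidates &= {"veo"}
    if square:
        candidates &= {"kling"}
    # Drop providers that cannot produce a clip of the requested length.
    candidates = {p for p in candidates if seconds <= MAX_SECONDS[p]}
    return sorted(candidates)


if __name__ == "__main__":
    print(choose_provider(native_audio=True, seconds=12))
```

An empty result means the needs conflict (for example 4K plus square format), so one of them has to give.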

Video Parameters

Parameter	VEO Options	Sora Options	Kling Options	Default
Duration	4, 6, 8s	4, 8, 12s	v3: 3-15s, v2: 5/10s	8s (VEO/Sora), 5s (Kling)
Resolution	720p, 1080p, 4K	Fixed	Fixed	720p
Aspect Ratio	16:9, 9:16	16:9, 9:16	16:9, 9:16, 1:1	16:9
Count	1-4	1-4	1-4	1
Negative Prompt	Yes	No	Yes	none
Image Input	Yes	No	Yes (best)	none
Audio Generation	Native	No	v3 only (`--audio`)	off
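
Catching an unsupported combination before submitting a job saves a wasted request. The sketch below encodes the duration, aspect ratio, and count limits from the table; `validate` and the family keys (`kling-v3`, `kling-v2`) are hypothetical names for illustration, and the real script may check these differently.

```python
# Allowed values copied from the Video Parameters table.
DURATIONS = {
    "veo": {4, 6, 8},
    "sora": {4, 8, 12},
    "kling-v3": set(range(3, 16)),  # 3 to 15 seconds inclusive
    "kling-v2": {5, 10},
}
ASPECTS = {
    "veo": {"16:9", "9:16"},
    "sora": {"16:9", "9:16"},
    "kling-v3": {"16:9", "9:16", "1:1"},
    "kling-v2": {"16:9", "9:16", "1:1"},
}


def validate(family, duration, aspect="16:9", count=1):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if duration not in DURATIONS[family]:
        allowed = sorted(DURATIONS[family])
        problems.append(f"duration {duration}s not in {allowed} for {family}")
    if aspect not in ASPECTS[family]:
        problems.append(f"aspect {aspect} not supported by {family}")
    if not 1 <= count <= 4:
        problems.append(f"count {count} outside 1-4")
    return problems


if __name__ == "__main__":
    for problem in validate("veo", 12, aspect="1:1"):
        print(problem)
```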

Kling Pricing (Fal.ai on-demand)

Model	Rate
v3 Pro (no audio)	~$0.11/second ($0.56 for 5s)
v3 Pro (with audio)	~$0.17/second ($0.84 for 5s)
v2 Master	~$1.40 per 5s
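
For budgeting a batch, the rates above can be turned into a quick estimate. This is a sketch under two assumptions: per-second rates are derived from the listed 5-second prices, and cost scales linearly with duration and count (the table only gives a 5-second figure for v2 Master). `estimate_cost` is a hypothetical helper, and actual Fal.ai billing may differ.

```python
# Approximate per-second rates derived from the 5-second prices in the table.
# Keys are (model, audio_enabled); v2 Master has no audio option listed.
RATE_PER_SECOND = {
    ("kling-v3-pro", False): 0.56 / 5,
    ("kling-v3-pro", True): 0.84 / 5,
    ("kling-v2", False): 1.40 / 5,
}


def estimate_cost(model, seconds, audio=False, count=1):
    """Estimate the USD cost of `count` clips, rounded to cents."""
    rate = RATE_PER_SECOND[(model, audio)]
    return round(rate * seconds * count, 2)


if __name__ == "__main__":
    # Two 10-second v3 Pro clips with audio.
    print(estimate_cost("kling-v3-pro", 10, audio=True, count=2))
```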

Latency

Video generation is async — expect 11 seconds to 6 minutes depending on server load and provider. The script polls automatically and saves when ready.
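
For readers wiring a provider into their own code, the polling pattern looks roughly like the sketch below. It is a generic illustration, not the script's actual implementation: `poll_until_done` and the `check_status` callable are hypothetical, and the 10-minute default timeout is an assumption chosen to sit above the 6-minute upper bound mentioned here.

```python
import time


def poll_until_done(check_status, interval=10.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status() until it returns a non-None result.

    check_status is a caller-supplied function (hypothetical here) that
    returns None while the job is still running. Raises TimeoutError if
    another wait would pass the deadline.
    """
    deadline = clock() + timeout
    while True:
        result = check_status()
        if result is not None:
            return result
        if clock() + interval > deadline:
            raise TimeoutError(f"job not finished after {timeout:.0f}s")
        sleep(interval)


if __name__ == "__main__":
    # Simulated job that finishes on the third status check.
    states = iter([None, None, "video.mp4"])
    print(poll_until_done(lambda: next(states), interval=0.01))
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.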

Workflow Overview

Define the Concept — What story does the video tell in 4-8 seconds?
Storyboard the Shot — Camera, motion, subject, environment
Add Audio Direction — Dialogue, sound effects, ambient sound
Generate Video — Run via API
Iterate — Adjust prompt based on results

Prompt Rules (From Research)

Before diving into the workflow, internalize these rules from extensive testing across Veo, Runway, and Sora:

150-300 characters is the sweet spot. Under 100 = generic. Over 400 = the model drops elements unpredictably.
One shot = one action. Don't pack multiple scene changes or style shifts into one prompt. One camera move + one subject action.
Describe what you want, not what you don't want. Use the --negative flag for exclusions, not the main prompt.
Treat audio as a separate layer. Write audio cues in their own sentences, not mixed into visual descriptions.
Use colon syntax for dialogue. A man says: "Hello!" prevents subtitle artifacts. Without the colon, text may appear on screen.
Keep dialogue under 7 words per line. Longer speech causes lip-sync drift or rushed garbling.
Start simple, then layer. Begin with a basic prompt, evaluate, then add one variable at a time.
Slow camera movements win. Fast pans and spins break output. Use tight framing for perceived speed.

See references/prompt-engineering-research.md for the complete research.
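
Several of these rules are mechanical enough to check automatically. The sketch below covers only the length thresholds, the colon-before-dialogue rule, and the 7-word dialogue limit; `lint_prompt` is a hypothetical helper, it only recognizes straight double quotes, and it cannot judge the softer rules such as "one shot = one action".

```python
import re


def lint_prompt(prompt):
    """Return warnings based on the length and dialogue rules listed above."""
    warnings = []
    n = len(prompt)
    if n < 100:
        warnings.append(f"{n} chars: under 100 tends to give generic results")
    elif n > 400:
        warnings.append(f"{n} chars: over 400 risks dropped elements")
    elif not 150 <= n <= 300:
        warnings.append(f"{n} chars: outside the 150-300 sweet spot")
    # Each quoted line should follow a colon and stay under 7 words.
    for match in re.finditer(r'(\S*)\s*"([^"]+)"', prompt):
        lead, line = match.group(1), match.group(2)
        if not lead.endswith(":"):
            warnings.append(f'dialogue "{line}" is not introduced with a colon')
        if len(line.split()) >= 7:
            warnings.append(f'dialogue "{line}" has 7 or more words')
    return warnings


if __name__ == "__main__":
    for warning in lint_prompt('A man says "Hello there"'):
        print(warning)
```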

Step 1: Define the Concept

A good video prompt answers three questions:

What's happening? (action/motion)
Where? (environment/setting)
What does it sound like? (audio landscape)

Unlike images, video is temporal. Think in terms of movement and change, not a static composition.

Good concepts for 8-second clips:

Use Case	Example Concept
Social teaser	A hand flipping through pages of a book, stopping on a highlighted passage
Ambient background	Rain falling on a window with city lights blurring behind it
Product reveal	Camera slowly orbits a product on a table, warm studio lighting
Podcast promo	A microphone in a cozy studio, coffee steam rising, morning light
Newsletter visual	A typewriter striking keys, with the sound of each keystroke

Step 2: Storyboard the Shot

Structure your prompt with cinematic language. Veo responds well to film terminology:

Camera Language

Term	Effect
Wide shot	Shows full environment, establishes context
Close-up	Tight on a subject, emphasizes detail
Tracking shot	Camera follows subject movement
Dolly in/out	Camera moves toward or away from subject
Static shot	Locked camera, subject moves within frame
Slow pan	Camera rotates horizontally across scene
Overhead / bird's eye	Looking straight down
Low angle	Looking up at subject, adds drama

Motion Description

Be explicit about what moves and how:

Don't	Do
"A dog in a park"	"A golden retriever runs toward camera through tall grass, ears bouncing"
"City at night"	"Camera slowly dollies through a neon-lit Tokyo alley as rain puddles reflect signs"
"Ocean"	"A single wave forms, curls, and crashes onto wet sand in slow motion"

Lighting & Atmosphere

Term	Mood
Golden hour	Warm, nostalgic, cinematic
Overcast	Soft, even, contemplative
Neon / artificial	Urban, energetic, modern
Candlelight	Intimate, quiet
Hard shadows	Dramatic, high contrast

Step 3: Add Audio Direction

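Veo 3.1 generates synchronized audio natively, and Kling v3 does so when --audio is passed. Sora output has no audio, so the cues below apply to VEO and Kling only. Native audio is a major differentiator, so use it.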

Three Types of Audio Cues

1. Dialogue — Use colon syntax before quotes (prevents subtitle artifacts):

A barista says: "Here you go!" as she slides a latte across the counter.

Keep lines under 7 words for clean lip-sync. One sentence max per 8-second clip.

2. Sound Effects — Describe specific sounds:

The sound of a match striking, then a candle flame flickering to life.

3. Ambient Sound — Set the sonic environment:

Birds chirping in the background, distant traffic hum, morning atmosphere.

Audio Tips

Be specific: "the crunch of gravel underfoot" beats "footstep sounds"
Layer audio: combine ambient + specific sounds for depth
Match audio to motion: "a door creaks open" timed with the visual action
Keep dialogue short: one line of under 7 words per 8-second clip

Step 4: Craft the Full Prompt

Prompt Structure (5-Element Priority)

Structure prompts in this order of priority — you don't need all five every time:

Shot Specification — camera work, framing, movement
Setting & Atmosphere — location, time, weather, lighting
Subject & Action — who/what, described in beats
Audio Layer — dialogue, SFX, ambient (separate sentences)
Style/Grade — artistic treatment, lens, color

[Shot type + camera movement]. [Setting and lighting]. [Subject doing action].
[Audio: what you hear]. [Style/grade].
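
The template above can be assembled programmatically when prompts are generated in bulk. A minimal sketch, assuming a hypothetical `build_prompt` helper: it joins whichever of the five elements are supplied, in priority order, and keeps audio in its own sentence.

```python
def build_prompt(shot="", setting="", action="", audio="", style=""):
    """Join the non-empty elements in priority order, one sentence each."""
    sentences = []
    for part in (shot, setting, action, audio, style):
        part = part.strip()
        if not part:
            continue  # all five elements are optional
        if part[-1] not in '.!?"':
            part += "."  # make sure each element ends as its own sentence
        sentences.append(part)
    return " ".join(sentences)


if __name__ == "__main__":
    print(build_prompt(
        shot="A static wide shot of rain falling on a large window",
        setting="Behind the glass, a blurred cityscape with warm lights",
        audio="The sound of steady rain and distant muffled city noise",
        style="Moody, contemplative, cozy",
    ))
```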

Example Prompts

Podcast promo (16:9):

A close-up tracking shot of a vintage microphone in a warmly lit podcast studio.
Steam rises slowly from a coffee mug beside it. Morning sunlight filters through
blinds, casting soft stripes across the desk. The sound of a quiet room — a clock
ticking, the faint hum of equipment. Cinematic, intimate, inviting.

Social teaser — vertical (9:16):

A hand reaches into frame and opens a leather-bound journal on a wooden desk.
The pages flutter briefly before settling on a page covered in handwritten notes.
A pen is set down beside the book. The sound of pages rustling, a pen clicking,
and soft ambient music. Warm overhead lighting, shallow depth of field.

Newsletter header — ambient loop (16:9):

A static wide shot of rain falling on a large window. Behind the glass, a blurred
cityscape with warm lights. Water droplets slide slowly down the pane. The sound of
steady rain and distant muffled city noise. Moody, contemplative, cozy.

Product reveal (16:9):

Camera slowly orbits a pair of wireless headphones placed on a dark marble surface.
Dramatic studio lighting with a single warm key light from the left. The headphones
cast a sharp shadow. Subtle electronic ambient music. Premium, minimal, modern.

Step 5: Generate via API

Running the Script

# VEO: Basic generation (8s, 720p, 16:9)
python scripts/generate_video.py "Your prompt here"

# VEO: Fast draft for iteration
python scripts/generate_video.py "Your prompt here" --model fast

# VEO: High quality vertical video for social
python scripts/generate_video.py "Your prompt" --aspect 9:16 --resolution 1080p

# VEO: Multiple variations to choose from
python scripts/generate_video.py "Your prompt" --count 2 --output ./videos

# VEO: Short clip with specific settings
python scripts/generate_video.py "Your prompt" --duration 4 --resolution 4k --name "hero-clip"

# VEO: Exclude unwanted elements
python scripts/generate_video.py "Your prompt" --negative "text overlays, watermarks, blurry"

# Sora: Basic generation
python scripts/generate_video.py "A cat on a windowsill, warm light" --provider sora

# Sora: 12-second clip (longer than VEO allows)
python scripts/generate_video.py "A dog running through a meadow" --provider sora --duration 12

# Sora: Pro model, vertical
python scripts/generate_video.py "Latte art being poured" --provider sora --model sora-2-pro --aspect 9:16

# Sora: Multiple variations
python scripts/generate_video.py "Ocean waves at sunset" --provider sora --count 2 --output ./videos

# Kling: Text-to-video (v3 Pro, 5s default)
python scripts/generate_video.py "A golden retriever running through a field at sunset" --provider kling

# Kling: Longer clip with audio
python scripts/generate_video.py "Ocean waves crashing on rocks" --provider kling --duration 10 --audio

# Kling: Image-to-video (animate a still image)
python scripts/generate_video.py "Character slowly turns to camera and smiles" --provider kling --input photo.jpg

# Kling: Image-to-video from URL
python scripts/generate_video.py "Zoom slowly into the scene" --provider kling --input https://example.com/image.jpg

# Kling: Budget model, square format
python scripts/generate_video.py "Abstract patterns flowing" --provider kling --model kling-v3-std --aspect 1:1

# Kling: Vertical for Reels/TikTok
python scripts/generate_video.py "Coffee being poured" --provider kling --aspect 9:16 --duration 5

Options:

Flag	Values	Default	Notes
`--provider`	`veo`, `sora`, `kling`	`veo`	VEO for audio, Sora for visuals, Kling for image-to-video
`--model`	VEO: `standard`/`fast`, Sora: `sora-2`/`sora-2-pro`, Kling: `kling-v3-pro`/`kling-v3-std`/`kling-v2`/`kling-v1.5`	provider default	Provider-specific models
`--aspect`	`16:9`, `9:16`, `1:1`	`16:9`	1:1 Kling only
`--resolution`	`720p`, `1080p`, `4k`	`720p`	VEO only
`--duration`	VEO: 4/6/8, Sora: 4/8/12, Kling v3: 3-15, Kling v2: 5/10	8 (VEO/Sora), 5 (Kling)	Seconds; valid values depend on provider and model
`--negative`	text	none	VEO + Kling (ignored by Sora)
`--input`	path or URL	none	Reference image for image-to-video (VEO, Kling)
`--audio`	flag	off	Enable native audio (Kling v3 only)
`--count`	`1`-`4`	`1`	Generate variations
`--output`	path	`.`	Save directory
`--name`	text	none	Filename prefix

Output: MP4 files with timestamp-based filenames.
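
When driving the script from another Python program, building the argument list explicitly avoids shell-quoting problems with long prompts. The sketch below uses only the flags listed in the options table; `build_command` is a hypothetical wrapper, and it assumes the script is invoked as `python scripts/generate_video.py` from the project root.

```python
def build_command(prompt, provider="veo", model=None, aspect=None,
                  resolution=None, duration=None, negative=None,
                  input_image=None, audio=False, count=None,
                  output=None, name=None):
    """Build an argv list for scripts/generate_video.py from keyword options."""
    argv = ["python", "scripts/generate_video.py", prompt]
    if provider != "veo":  # veo is the documented default
        argv += ["--provider", provider]
    options = {
        "--model": model,
        "--aspect": aspect,
        "--resolution": resolution,
        "--duration": duration,
        "--negative": negative,
        "--input": input_image,
        "--count": count,
        "--output": output,
        "--name": name,
    }
    for flag, value in options.items():
        if value is not None:
            argv += [flag, str(value)]
    if audio:  # boolean flag, Kling v3 only per the table
        argv.append("--audio")
    return argv


if __name__ == "__main__":
    cmd = build_command("Ocean waves crashing on rocks",
                        provider="kling", duration=10, audio=True)
    print(cmd)
    # To execute: subprocess.run(cmd, check=True)
```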

Step 6: Iterate

After reviewing generated video:

Motion wrong? Be more explicit about direction, speed, and sequence
Audio off? Add or refine audio cues — the model needs clear direction
Too much happening? Simplify. One clear action per clip works best
Style drift? Add a negative prompt to exclude unwanted aesthetics
Wrong mood? Adjust lighting and atmosphere descriptors

Iteration Strategy

Start with --model fast and --duration 4 for quick drafts
Refine the prompt through 2-3 fast iterations
Switch to --model standard with full duration/resolution for the final take
Generate 2 variations of the final prompt and pick the best

Negative Prompt Guide

Use --negative to steer away from common problems:

Problem	Negative Prompt
Text/watermarks appearing	"text, watermarks, logos, subtitles"
Uncanny faces	"distorted faces, morphing features"
Jittery motion	"jerky motion, flickering, stuttering"
Over-saturated look	"oversaturated, HDR, neon colors"
Stock footage feel	"generic, corporate, stock footage aesthetic"
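
When a clip shows more than one of these problems, the negative prompts can be merged into a single --negative value. A small sketch, assuming hypothetical short keys (`text`, `faces`, and so on) for the rows of the table above:

```python
# Negative prompt strings copied from the table; the short keys are
# hypothetical labels chosen for this example.
NEGATIVE_TERMS = {
    "text": "text, watermarks, logos, subtitles",
    "faces": "distorted faces, morphing features",
    "motion": "jerky motion, flickering, stuttering",
    "saturation": "oversaturated, HDR, neon colors",
    "stock": "generic, corporate, stock footage aesthetic",
}


def build_negative(*problems):
    """Combine the terms for each named problem, dropping duplicates."""
    terms = []
    for problem in problems:
        for term in NEGATIVE_TERMS[problem].split(", "):
            if term not in terms:
                terms.append(term)
    return ", ".join(terms)


if __name__ == "__main__":
    print(build_negative("text", "motion"))
```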

Prompting Principles

Think in Shots, Not Scenes

8 seconds is one shot. Don't try to cram a narrative arc — describe a single continuous moment.

Don't	Do
"A chef makes a meal from scratch and serves it"	"A chef's hands julienne carrots on a wooden cutting board, knife moving rhythmically"
"A day at the beach from sunrise to sunset"	"Waves gently lap at bare feet on sand, golden hour light, camera at ground level"

Be Specific About Motion

Vague motion descriptions produce vague results. Describe what moves, how fast, and in which direction.

Layer Your Audio

Don't just describe one sound — create a soundscape:

The crackling of a vinyl record playing soft jazz,
a distant car horn outside the window,
the quiet clink of an ice cube in a glass.

Use Negative Prompts Proactively

On VEO and Kling, always include --negative "text, watermarks" at minimum, since the models occasionally generate unwanted text overlays. Sora ignores the flag.

Use Cases by Content Type

Social Media (9:16, 4-8s)

Short, punchy, loop-friendly. Favor close-ups and strong motion.

python scripts/generate_video.py "Close-up of coffee being poured into a ceramic mug, steam rising, warm morning light. The sound of liquid pouring and a soft sigh." \
  --aspect 9:16 --duration 4 --resolution 1080p

Podcast/Newsletter Headers (16:9, 8s)

Ambient, atmospheric. Favor wide shots and subtle motion.

python scripts/generate_video.py "A vintage radio on a wooden shelf, dial slowly turning. Warm tungsten light. Soft static transitioning into faint music." \
  --resolution 1080p --name "podcast-header"

Product/Brand (16:9, 6-8s)

Clean, controlled, premium feel. Studio lighting, slow orbits.

python scripts/generate_video.py "Camera slowly orbits a leather notebook on a dark wood desk. Single warm key light. The sound of pages turning gently." \
  --resolution 4k --duration 6 --negative "text, watermarks, busy background"

Multi-Clip Consistency

When generating multiple clips for a project (e.g. a social series, product launch, or multi-shot sequence):

Lock Your Constants

Create a consistency block and repeat it verbatim across all prompts:

CHARACTER: A woman in her thirties with short silver hair and a black turtleneck
PALETTE: amber, cream, walnut brown, deep olive
LIGHTING: Soft key light from camera right, warm tungsten
STYLE: Cinematic, shallow depth of field, warm film grain
NEGATIVE: no subtitles, no on-screen text, no watermarks
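
Copy-pasting the block by hand invites small wording drift between shots. One way to guarantee it stays verbatim is to store it once and append it in code; the sketch below assumes hypothetical helpers `consistency_block` and `apply_constants` and reuses the example values above.

```python
# Values copied from the example consistency block above.
CONSTANTS = {
    "CHARACTER": "A woman in her thirties with short silver hair and a black turtleneck",
    "PALETTE": "amber, cream, walnut brown, deep olive",
    "LIGHTING": "Soft key light from camera right, warm tungsten",
    "STYLE": "Cinematic, shallow depth of field, warm film grain",
    "NEGATIVE": "no subtitles, no on-screen text, no watermarks",
}


def consistency_block(constants):
    """Render the constants as 'LABEL: value' lines, one per line."""
    return "\n".join(f"{label}: {value}" for label, value in constants.items())


def apply_constants(shot_prompts, constants):
    """Append the identical block to every shot prompt."""
    block = consistency_block(constants)
    return [f"{shot.strip()}\n\n{block}" for shot in shot_prompts]


if __name__ == "__main__":
    shots = ["She pours tea into a ceramic cup.", "She looks up and smiles."]
    for prompt in apply_constants(shots, CONSTANTS):
        print(prompt, end="\n---\n")
```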

Frame Chaining

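For sequential shots, use the last frame of clip N as the reference image for clip N+1 (passed with --input on VEO or Kling). This helps preserve subject position, orientation, and lighting across the cut. A still frame carries no motion information, so restate the direction and speed of movement in the next prompt.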
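
Extracting the last frame is not something this document's script is described as doing, so it needs a separate tool. The sketch below assumes ffmpeg is installed and on the PATH; `last_frame_command` is a hypothetical helper that builds the command, which seeks to the final second of the clip and keeps overwriting one image so that the last decoded frame is what remains.

```python
def last_frame_command(video_path, image_path, window=1.0):
    """Build an ffmpeg argv that leaves the clip's final frame in image_path."""
    return [
        "ffmpeg", "-y",            # overwrite the output image if it exists
        "-sseof", f"-{window}",    # seek relative to the end of the input
        "-i", video_path,
        "-update", "1",            # write every frame to the same single file
        "-q:v", "2",               # high JPEG quality
        image_path,
    ]


if __name__ == "__main__":
    cmd = last_frame_command("clip-01.mp4", "clip-01-last.jpg")
    print(" ".join(cmd))
    # To execute: subprocess.run(cmd, check=True)
    # Then pass clip-01-last.jpg to the next generation with --input.
```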

Consistency Checklist

Same character description, word for word — never paraphrase between shots
Same palette anchors (3-5 named colors)
Same lighting direction and quality
Same aspect ratio and resolution
Same style/grade language
"No subtitles, no on-screen text" included
Simple wardrobe — solid colors and notable anchors (red jacket, silver pendant) are more consistent than busy patterns

Related Skills

nano-banana-image-generator — Static AI images (Gemini). Generate stills to animate with Kling's image-to-video.
youtube-title-creator — Pair video content with optimized titles
text-on-broll — Combine AI video with on-screen text (Remotion)

Prompt Engineering References

Two references available:

references/ai-video-prompt-engineering-guide.md — Comprehensive synthesis from 7 sources (Captions.ai, LTX Studio, Google Veo, Adobe Firefly, getimg.ai, community threads). Covers prompt anatomy, camera/lighting/audio language, reusable templates, common mistakes, use-case playbooks, and a quick-reference cheat sheet.
references/prompt-engineering-research.md — Original VEO-focused research with testing results.

Load the guide when you need to craft a prompt and want the full reference. The 30-second checklist from the guide:

WHO/WHAT is the subject? (appearance details)
What are they DOING? (dynamic verb + pacing)
WHERE? (setting + time + weather)
CAMERA? (shot type + movement + lens)
LIGHT? (source + quality + color)
MOOD/STYLE? (genre + color grade)
What do we HEAR? (music + ambience + dialogue + SFX)
FORMAT? (aspect ratio + duration + platform)
