# Grok Media

Use this skill when working with xAI media models in OpenMontage.
## Models

- `grok-imagine-image` for image generation and image editing
- `grok-imagine-video` for text-to-video, image-to-video, and reference-image video
## Authentication

- Env var: `XAI_API_KEY`
- Base URL: `https://api.x.ai/v1`
- Header: `Authorization: Bearer $XAI_API_KEY`
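A minimal sketch of the auth setup in Python, reading the key from the environment as described above (the `auth_headers` helper name is ours, not part of any SDK):

```python
import os

# Base URL for all xAI media requests.
XAI_BASE_URL = "https://api.x.ai/v1"

def auth_headers(api_key: str) -> dict:
    """Build the Authorization header xAI expects on every request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

# Falls back to a placeholder so the snippet runs without a real key.
headers = auth_headers(os.environ.get("XAI_API_KEY", "demo-key"))
```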
## Image API

### Text-to-image

- Endpoint: `POST /images/generations`
- Core fields: `model`, `prompt`, `n`, `aspect_ratio`, `resolution`
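A sketch of the request body using the core fields above. The exact accepted values for `aspect_ratio` and `resolution` are assumptions here; check the provider docs before relying on them:

```python
import json

def image_generation_payload(prompt: str, n: int = 1,
                             aspect_ratio: str = "16:9",
                             resolution: str = "1024x768") -> dict:
    """Body for POST /images/generations. Default values are illustrative."""
    return {
        "model": "grok-imagine-image",
        "prompt": prompt,
        "n": n,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
    }

body = json.dumps(image_generation_payload("a lighthouse at dusk"))
```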
### Image edit

- Endpoint: `POST /images/edits`
- Use `image` for one source image
- Use `images` for multi-image compositing
- Each source image can be:
  - a public HTTPS URL
  - a base64 data URI
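A sketch of an edit request body. The single-source/multi-source field split follows the notes above; the exact JSON shape of each source entry (plain string vs. wrapped object) is an assumption:

```python
import base64

def to_data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")

def image_edit_payload(prompt: str, sources: list[str]) -> dict:
    """Body for POST /images/edits: `image` for one source, `images` for many."""
    body = {"model": "grok-imagine-image", "prompt": prompt}
    if len(sources) == 1:
        body["image"] = sources[0]   # single-source edit
    else:
        body["images"] = sources     # multi-image compositing
    return body

edit = image_edit_payload("replace the sky with an aurora",
                          [to_data_uri(b"\x89PNG fake bytes")])
```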
### Image prompting
- Grok responds well to direct natural language
- For edits, describe only the intended change and preserve everything else implicitly
- For multi-image merges, explicitly name how each source contributes
- Prefer one strong scene description over long style-stacking
## Video API

### Generation

- Endpoint: `POST /videos/generations`
- Polling endpoint: `GET /videos/{request_id}`
- Success state: `status == "done"`
- Failure states to handle explicitly: `failed`, `expired`
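The submit-then-poll flow can be sketched as follows. `fetch_status` stands in for the real `GET /videos/{request_id}` call so the control flow is self-contained; the terminal states match the list above:

```python
import time

def poll_video(fetch_status, request_id: str,
               interval: float = 2.0, max_attempts: int = 60) -> dict:
    """Poll until the job reaches "done"; raise on "failed"/"expired"."""
    for _ in range(max_attempts):
        job = fetch_status(request_id)
        status = job.get("status")
        if status == "done":
            return job
        if status in ("failed", "expired"):
            # Surface failure states explicitly instead of spinning forever.
            raise RuntimeError(f"video {request_id} ended with status={status}")
        time.sleep(interval)
    raise TimeoutError(f"video {request_id} still pending after {max_attempts} polls")
```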
### Modes

- Text-to-video: prompt-only generation
- Image-to-video: use `image: {"url": ...}`; this anchors the starting frame
- Reference-to-video: use `reference_images: [{"url": ...}, ...]`; this influences who/what appears in the video without locking the first frame
  - prompts can reference inputs with placeholders like `<IMAGE_1>`, `<IMAGE_2>`
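A sketch of a reference-to-video body under the field shapes above, with the placeholder convention mapping each reference image to a role in the prompt (the helper name and example URLs are illustrative):

```python
def reference_video_payload(prompt: str, reference_urls: list[str]) -> dict:
    """Body for POST /videos/generations in reference-to-video mode."""
    return {
        "model": "grok-imagine-video",
        "prompt": prompt,
        # Order matters: <IMAGE_1> refers to the first entry, and so on.
        "reference_images": [{"url": u} for u in reference_urls],
    }

req = reference_video_payload(
    "person from <IMAGE_1> wearing the jacket from <IMAGE_2>, slow push-in",
    ["https://example.com/person.jpg", "https://example.com/jacket.jpg"],
)
```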
### Video constraints

- Grok video is best treated as short-form generation
- Current output resolutions are `480p` and `720p`
- Reference-image video supports multiple images and is useful for product placement, wardrobe transfer, and identity consistency
- Download outputs promptly; provider URLs may be temporary
## Pricing

- `grok-imagine-image`: $0.02 per generated image
- `grok-imagine-image` edits/composites: add $0.002 per input image
- `grok-imagine-video`: $0.05 per second at 480p; $0.07 per second at 720p
- `grok-imagine-video` image-conditioned requests: add $0.002 per input image
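The video pricing above can be turned into a small cost estimator (the function is ours, purely for budgeting):

```python
# Per-second rates and per-input-image surcharge from the pricing table.
VIDEO_RATE_PER_SECOND = {"480p": 0.05, "720p": 0.07}
INPUT_IMAGE_SURCHARGE = 0.002

def video_cost(seconds: float, resolution: str, input_images: int = 0) -> float:
    """Estimated USD cost of one grok-imagine-video request."""
    return round(seconds * VIDEO_RATE_PER_SECOND[resolution]
                 + input_images * INPUT_IMAGE_SURCHARGE, 4)
```

For example, a 10-second 720p clip conditioned on two input images costs 10 × $0.07 + 2 × $0.002 = $0.704.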
## Grok-Specific Prompt Guidance

### Images
- Start with subject, action, setting
- Add one style anchor, not five
- For edits:
- describe the desired modification
- keep the rest of the image stable by omission, not by writing a giant preservation list
### Video
- Keep prompts scene-local: one shot, one main motion idea, one emotional beat
- For reference-conditioned video, explicitly map source images to roles:
  - person from `<IMAGE_1>`
  - jacket from `<IMAGE_2>`
  - product from `<IMAGE_3>`
- Camera and pacing language helps:
- slow push-in
- handheld follow
- locked-off medium shot
- high-energy whip pan transition
## Good Fits
- Image style transfer
- Image compositing from multiple sources
- Reference-conditioned short video
- Product-led motion clips
- Character-consistent scenes without hard first-frame lock
## Weak Fits
- Long-form clip generation
- Heavy reliance on deterministic seeds
- Overloaded prompts with multiple scene changes
## Failure Handling
- If generation submission succeeds but polling expires, surface it as a provider/runtime issue
- If a request fails, preserve the endpoint, mode, and prompt summary in the error
- Do not silently substitute a different provider after xAI was selected without user approval
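One way to keep the endpoint, mode, and prompt summary attached to a failure, as suggested above (the exception class is a sketch of ours, not part of any xAI SDK):

```python
class GrokMediaError(Exception):
    """Carries request context so failures are debuggable, not silent."""

    def __init__(self, endpoint: str, mode: str, prompt: str, status: str):
        self.endpoint = endpoint
        self.mode = mode
        self.prompt_summary = prompt[:80]  # truncate: context, not a transcript
        self.status = status
        super().__init__(
            f"{mode} via {endpoint} failed ({status}): {self.prompt_summary!r}"
        )
```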