venice-audio-transcription
Venice Transcription (/audio/transcriptions)
POST /api/v1/audio/transcriptions takes an audio file and returns text. It's OpenAI-compatible with multipart/form-data — the OpenAI SDK's audio.transcriptions.create() works unchanged.
Use when
- You need STT (speech-to-text) for voice notes, meetings, podcasts, short audio.
- You need timestamps for subtitles / chapters.
- You want to pick between fast local-style models (Parakeet) and large multilingual ones (Whisper, Wizper, Scribe).
For long video / YouTube transcription, see venice-video's /video/transcriptions (takes a public video URL directly).
Minimal request
```bash
curl https://api.venice.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -F "file=@./meeting.m4a" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "response_format=json" \
  -F "timestamps=false"
```

```json
{ "text": "Alright everyone, let's kick off the meeting..." }
```
With timestamps=true, json format also returns segment/word timings (schema is model-specific).
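For illustration only, a timestamps=true response might look like the sketch below; the exact segment/word fields differ per model, so inspect the raw JSON for the model you use before parsing.

```json
{
  "text": "Alright everyone, let's kick off the meeting...",
  "segments": [
    { "start": 0.0, "end": 3.2, "text": "Alright everyone," },
    { "start": 3.2, "end": 5.8, "text": "let's kick off the meeting..." }
  ]
}
```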
Request (multipart/form-data)
| Field | Type | Default | Notes |
|---|---|---|---|
| file | binary | — | Required. Audio file. Supported: wav, wave, flac, m4a, aac, mp4, mp3, ogg, webm. Base64 is not accepted; upload as a real file. |
| model | enum | nvidia/parakeet-tdt-0.6b-v3 | See models below. |
| response_format | json / text | json | text returns a text/plain body. |
| timestamps | bool | false | Include segment/word timestamps (JSON only). |
| language | string | — | ISO 639-1 hint (e.g. en, ja). Only Whisper-family models honor it; others auto-detect. |
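Putting the optional fields together: a minimal sketch of a multilingual request with Whisper and timestamps (the file name is illustrative; field names are from the table above).

```bash
curl https://api.venice.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -F "file=@./interview.m4a" \
  -F "model=openai/whisper-large-v3" \
  -F "language=ja" \
  -F "response_format=json" \
  -F "timestamps=true"
```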
Models
| Model ID | Notes |
|---|---|
| nvidia/parakeet-tdt-0.6b-v3 | Default. Fast, English-first, great for near-real-time flows. |
| openai/whisper-large-v3 | Large multilingual model; honors the language hint. |
| fal-ai/wizper | Whisper variant, competitive quality/latency tradeoff. |
| elevenlabs/scribe-v2 | ElevenLabs Scribe, strong on noisy audio. |
| stt-xai-v1 | xAI Speech-to-Text. |
GET /models?type=asr returns the current catalog. ASR pricing is pricing.per_audio_second.usd — cost scales with audio duration.
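To inspect the catalog and pricing in one go, a quick sketch (assuming the OpenAI-style data array in the /models response; the jq filter is just one way to slice it):

```bash
curl -s "https://api.venice.ai/api/v1/models?type=asr" \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  | jq '.data[] | {id, usd_per_second: .pricing.per_audio_second.usd}'
```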
OpenAI SDK
```ts
import OpenAI from 'openai'
import fs from 'node:fs'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const out = await client.audio.transcriptions.create({
  file: fs.createReadStream('meeting.m4a'),
  model: 'openai/whisper-large-v3',
  response_format: 'json',
  language: 'en',
  // @ts-expect-error — Venice-specific extra, passes through multipart
  timestamps: true,
})

console.log(out.text)
```
Batch / long files
Venice doesn't expose native chunking. For files > ~30 min, split client-side on silence with ffmpeg or pydub, transcribe each chunk, then concatenate with offset timestamps.
```bash
ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```
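A minimal sketch of the stitch-back step, assuming 600-second chunks from the ffmpeg command above and the illustrative segment shape sketched earlier (timestamp schemas are model-specific, so adapt the field names):

```ts
import OpenAI from 'openai'
import fs from 'node:fs'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const CHUNK_SECONDS = 600 // must match ffmpeg's -segment_time

async function transcribeChunks(paths: string[]) {
  const segments: { start: number; end: number; text: string }[] = []
  for (const [i, path] of paths.entries()) {
    const out = await client.audio.transcriptions.create({
      file: fs.createReadStream(path),
      model: 'openai/whisper-large-v3',
      response_format: 'json',
      // @ts-expect-error — Venice-specific extra, passes through multipart
      timestamps: true,
    })
    // Shift this chunk's timings by its offset in the original file.
    const offset = i * CHUNK_SECONDS
    for (const s of (out as any).segments ?? []) {
      segments.push({ start: s.start + offset, end: s.end + offset, text: s.text })
    }
  }
  return segments
}
```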
Errors
| Code | Meaning |
|---|---|
| 400 | Bad params, unsupported audio format, empty file, or file larger than 25 MB (this endpoint returns 400 with "Maximum size is 25MB", not 413). |
| 401 | Auth failure / Pro-only. |
| 402 | Insufficient balance. |
| 415 | Wrong Content-Type; must be multipart/form-data. |
| 422 | Validation or upstream ASR error (e.g. zero-length audio, upstream provider 422). Unlike chat endpoints, the content-policy suggested_prompt shape does not appear on this path. |
| 429 | Rate limited. |
| 500 / 503 | Transient; retry with jitter. |
Gotchas
- file must be uploaded as a real multipart file part. JSON + base64 is not supported here.
- Timestamps are only surfaced in the JSON response shapes (json, verbose_json, srt, vtt). With response_format: text the handler returns a plain text/plain body containing just the transcript; you'll lose any timestamp data, so pick verbose_json / srt / vtt when you need timings.
- language is Whisper-specific. Parakeet / Scribe ignore it and auto-detect.
- Peak concurrency limits apply: on 429, back off, and throttle big batches to ~5 parallel requests (see the sketch after this list).
- Content-policy rejection on the transcript is returned as 422 with an error string; it does not surface suggested_prompt on this path.
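A tiny throttle sketch for the ~5-parallel guideline: a hand-rolled worker pool (libraries like p-limit do the same job; the transcribe call in the usage comment stands in for whatever wrapper you've built).

```ts
async function mapLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // Spawn `limit` workers; each pulls the next unclaimed index until done.
  const workers = Array.from({ length: limit }, async () => {
    while (next < items.length) {
      const i = next++
      results[i] = await fn(items[i])
    }
  })
  await Promise.all(workers)
  return results
}

// Usage sketch: await mapLimit(chunkPaths, 5, (p) => withRetry(() => transcribe(p)))
```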
More from veniceai/skills
venice-video
Generate and transcribe videos via Venice. Covers the async /video/quote + /video/queue + /video/retrieve + /video/complete loop, text-to-video, image-to-video, video-to-video (upscale), audio input, reference images, scene and element support, plus /video/transcriptions for YouTube URLs.
venice-audio-speech
Generate speech from text via POST /audio/speech. Covers TTS models (Kokoro, Qwen 3, xAI, Inworld, Chatterbox, Orpheus, ElevenLabs Turbo, MiniMax, Gemini Flash), voices per family, output formats (mp3/opus/aac/flac/wav/pcm), streaming, prompt/emotion styling, temperature/top_p, and language hints.
venice-image-generate
Generate images with Venice. Covers POST /image/generate (Venice-native), POST /images/generations (OpenAI-compatible), GET /image/styles (style presets), request fields (prompt, dimensions, cfg_scale, seed, variants, style_preset, aspect_ratio, resolution, safe_mode, watermark), and response formats.
venice-embeddings
Call POST /embeddings on Venice. Covers request shape (input, model, encoding_format, dimensions, user), OpenAI compatibility, response compression (gzip/br), and practical usage for retrieval, clustering, and RAG.
venice-errors
Handle Venice API errors correctly. Covers the StandardError / DetailedError / ContentViolationError / X402InferencePaymentRequired body shapes, every meaningful status code (400, 401, 402, 403, 415, 422, 429, 500, 503, 504), the 402 PAYMENT-REQUIRED header used by x402 inference, the 422 content-policy suggested_prompt retry pattern, 429 rate-limit headers, and an exponential-backoff retry strategy with idempotency.
venice-audio-music
Async music / audio-track generation via Venice. Covers the /audio/quote + /audio/queue + /audio/retrieve + /audio/complete lifecycle, lyrics vs instrumental, voice selection, duration, language, speed, model capability probing, and webhook-free polling.