gemini-3-multimodal
Gemini 3 Pro Multimodal Input Processing
Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media) - see gemini-3-image-generation for OUTPUT (generating images).
Overview
Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.
Key Capabilities
- Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
- Video Processing: Up to 1 hour of video, frame analysis, OCR
- Audio Processing: Up to 9.5 hours of audio, speech understanding
- PDF Documents: Native PDF support, multi-page analysis, text extraction
- Media Resolution Control: Low/medium/high resolution for token optimization
- Token Optimization: Granular control over processing costs
When to Use This Skill
- Analyzing images, photos, or screenshots
- Processing video content for insights
- Transcribing or understanding audio/speech
- Extracting information from PDF documents
- Building multimodal applications
- Optimizing media processing costs
Quick Start
Prerequisites
- Gemini API setup (see the gemini-3-pro-api skill)
- Media files in supported formats
Python Quick Start
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
    "What's in this image?",
    image_file
])
print(response.text)
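For small local images you can also pass the image inline instead of going through the File API; a minimal sketch, assuming Pillow is installed and the same gemini-3-pro-preview model name used throughout this guide:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Pass a PIL image directly as a content part (no upload step needed)
image = Image.open("photo.jpg")
response = model.generate_content(["What's in this image?", image])
print(response.text)
```

Inline parts are convenient for one-off requests; the File API remains the better choice when the same media is reused across several prompts.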
Node.js Quick Start
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
import fs from "fs";
const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");
// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
mimeType: "image/jpeg"
});
const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
"What's in this image?",
{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
Core Tasks
Task 1: Analyze Image Content
Goal: Extract information, objects, text, or insights from images.
Use Cases:
- Object detection and recognition
- OCR (text extraction from images)
- Visual Q&A
- Code generation from UI screenshots
- Chart/diagram analysis
- Product identification
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure model with high resolution for best quality
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high"  # 1,120 tokens per image
    }
)
# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)
# Analyze with specific prompt
response = model.generate_content([
    """Analyze this image and provide:
    1. Main objects and their locations
    2. Any visible text (OCR)
    3. Overall context and purpose
    4. If code/UI: describe the functionality
    """,
    image_file
])
print(response.text)
# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
mimeType: "image/png"
});
// Configure model with high resolution
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
thinking_level: "high",
media_resolution: "high" // Best quality for OCR
}
});
const result = await model.generateContent([
`Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose`,
{ fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
Resolution Options:
| Resolution | Tokens per Image | Best For |
|---|---|---|
| low | 280 tokens | Quick analysis, low detail |
| medium | 560 tokens | Balanced quality/cost |
| high | 1,120 tokens | OCR, fine details, small text |
Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF
See: references/image-understanding.md for advanced patterns
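Building on the object-detection use case above, the model can also be prompted to localize objects. The sketch below asks for JSON and parses it; the 0-1000 normalized coordinate convention and the response_mime_type option are assumptions carried over from general Gemini guidance rather than something this skill defines:

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Assumption: response_mime_type is honored by the gemini-3-pro-preview model used in this guide
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"response_mime_type": "application/json"}
)

image_file = genai.upload_file("screenshot.png")
response = model.generate_content([
    "Detect the prominent objects in this image. Return a JSON list of objects, "
    "each with a 'label' and a 'box_2d' of [ymin, xmin, ymax, xmax] normalized to 0-1000.",
    image_file
])

# Parse the structured response into labels and boxes
detections = json.loads(response.text)
for det in detections:
    print(det["label"], det["box_2d"])
```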
Task 2: Process Video Content
Goal: Analyze video content, extract insights, perform frame-by-frame analysis.
Use Cases:
- Video summarization
- Object tracking
- Scene detection
- Video OCR
- Content moderation
- Educational video analysis
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure for video processing
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 70 tokens/frame (balanced)
    }
)
# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)
# Wait for processing
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")
# Analyze video
response = model.generate_content([
    """Analyze this video and provide:
    1. Overall summary of content
    2. Key scenes and timestamps
    3. Main topics covered
    4. Any visible text throughout the video
    """,
    video_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload video
const uploadResult = await fileManager.uploadFile("tutorial.mp4", {
mimeType: "video/mp4"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
if (file.state === FileState.FAILED) {
throw new Error("Video processing failed");
}
// Analyze video
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
media_resolution: "medium"
}
});
const result = await model.generateContent([
`Analyze this video and provide:
1. Overall summary
2. Key scenes and timestamps
3. Main topics covered`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
Video Specs:
- Max Duration: 1 hour
- Formats: MP4, MOV, AVI, etc.
- Resolution Options: Low (70 tokens/frame), Medium (70 tokens/frame), High (280 tokens/frame)
- OCR: Available with high resolution
See: references/video-processing.md for advanced patterns
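Because the prompts above can ask for "key scenes and timestamps", the reverse also tends to work: questions anchored to specific MM:SS positions in the video. A short self-contained sketch (tutorial.mp4 is a placeholder filename, and timestamp-referenced prompting is assumed from general Gemini video guidance):

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload the video and wait until the File API finishes processing it
video_file = genai.upload_file("tutorial.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Ask a question anchored to an MM:SS range
response = model.generate_content([
    "What is shown between 01:00 and 02:30? Quote any on-screen text you can read in that range.",
    video_file
])
print(response.text)
```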
Task 3: Process Audio/Speech
Goal: Transcribe and understand audio content, process speech.
Use Cases:
- Audio transcription
- Speech analysis
- Podcast summarization
- Meeting notes
- Language understanding
- Audio classification
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)
# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)
# Process audio
response = model.generate_content([
    """Process this audio and provide:
    1. Full transcription
    2. Summary of main points
    3. Key speakers (if multiple)
    4. Important timestamps
    5. Action items or conclusions
    """,
    audio_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload audio
const uploadResult = await fileManager.uploadFile("podcast.mp3", {
mimeType: "audio/mp3"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
`Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key timestamps`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
Audio Specs:
- Max Duration: 9.5 hours
- Formats: WAV, MP3, FLAC, AAC, etc.
- Languages: Supports multiple languages
See: references/audio-processing.md for advanced patterns
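For short clips, audio can also be sent inline as raw bytes instead of through the File API; a sketch assuming the clip is small enough to fit within the request size limit for inline data (commonly cited as around 20 MB) and using the SDK's dict-style part with mime_type and data:

```python
from pathlib import Path
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Inline audio part: a dict with the MIME type and the raw bytes of the clip
clip = {"mime_type": "audio/mp3", "data": Path("voicemail.mp3").read_bytes()}
response = model.generate_content([
    "Transcribe this clip and add [MM:SS] markers at the start of each sentence.",
    clip
])
print(response.text)
```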
Task 4: Process PDF Documents
Goal: Extract and analyze content from PDF documents.
Use Cases:
- Document analysis
- Information extraction
- Form processing
- Research paper analysis
- Contract review
- Multi-page document understanding
Python Example:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="YOUR_API_KEY")
# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 560 tokens/page (saturation point)
    }
)
# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)
# Wait for processing
import time
while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)
# Analyze PDF
response = model.generate_content([
    """Analyze this PDF document and provide:
    1. Document type and purpose
    2. Main sections and structure
    3. Key findings or arguments
    4. Important data or statistics
    5. Conclusions or recommendations
    """,
    pdf_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
Node.js Example:
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);
// Upload PDF
const uploadResult = await fileManager.uploadFile("research_paper.pdf", {
mimeType: "application/pdf"
});
// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
await new Promise(resolve => setTimeout(resolve, 5000));
file = await fileManager.getFile(uploadResult.file.name);
}
// Analyze with medium resolution (recommended)
const model = genAI.getGenerativeModel({
model: "gemini-3-pro-preview",
generationConfig: {
media_resolution: "medium"
}
});
const result = await model.generateContent([
`Analyze this PDF and extract:
1. Main sections
2. Key findings
3. Important data`,
{ fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
PDF Processing Tips:
- Recommended Resolution: medium (560 tokens/page) - saturation point for quality
- Multi-page: Automatically processes all pages
- Native Support: No conversion to images needed
- Text Extraction: High-quality text extraction built-in
See: references/document-processing.md for advanced patterns
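For form processing and contract review, it often helps to get fields back as JSON rather than prose. A hedged sketch that combines the PDF flow above with JSON output (the field names are illustrative, and response_mime_type is assumed to behave on gemini-3-pro-preview as it does for other Gemini models):

```python
import json
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "media_resolution": "medium",            # recommended setting for PDFs per this guide
        "response_mime_type": "application/json"
    }
)

# Upload the PDF and wait for processing to finish
pdf_file = genai.upload_file("contract.pdf")
while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)

response = model.generate_content([
    "Extract a JSON object with keys: 'parties', 'effective_date', 'termination_date', 'key_obligations'.",
    pdf_file
])
data = json.loads(response.text)
print(data["parties"], data["effective_date"])
```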
Task 5: Optimize Media Processing Costs
Goal: Balance quality and token consumption based on use case.
Strategy:
| Media Type | Resolution | Tokens | Use Case |
|---|---|---|---|
| Images | low | 280 | Quick scan, thumbnails |
| Images | medium | 560 | General analysis |
| Images | high | 1,120 | OCR, fine details, code |
| PDFs | medium | 560/page | Recommended (saturation point) |
| PDFs | high | 1,120/page | Diminishing returns |
| Video | low/medium | 70/frame | Most use cases |
| Video | high | 280/frame | OCR from video |
Python Optimization Example:
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
    """Analyze an image with a resolution appropriate to the task."""
    resolution = "high" if need_ocr else "medium"
    model = genai.GenerativeModel(
        "gemini-3-pro-preview",
        generation_config={
            "media_resolution": resolution
        }
    )
    image_file = genai.upload_file(image_path)
    response = model.generate_content([
        "Extract all text from this image" if need_ocr else "Describe this image",
        image_file
    ])
    # Log token usage for cost tracking
    tokens = response.usage_metadata.total_token_count
    cost = (tokens / 1_000_000) * 2.00  # Input pricing
    print(f"Resolution: {resolution}, Tokens: {tokens}, Cost: ${cost:.6f}")
    return response.text
# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False) # medium
analyze_image_optimized("document.png", need_ocr=True) # high
Per-Item Resolution Control:
# Set different resolutions for different media in the same request
response = model.generate_content([
    "Compare these images",
    {"file": image1, "media_resolution": "high"},  # High detail
    {"file": image2, "media_resolution": "low"},   # Low detail OK
])
Cost Monitoring:
def log_media_costs(response):
    """Log media processing costs."""
    usage = response.usage_metadata
    # Pricing for ≤200k context
    input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
    output_cost = (usage.candidates_token_count / 1_000_000) * 12.00
    print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
    print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
    print(f"Total cost: ${input_cost + output_cost:.6f}")
See: references/token-optimization.md for comprehensive strategies
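To estimate cost before sending a request, the SDK's count_tokens method can be run on the same content list. The sketch below assumes count_tokens accepts uploaded files the way generate_content does, and reuses the $2.00 per million input-token price quoted above:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

image_file = genai.upload_file("photo.jpg")
contents = ["Describe this image", image_file]

# Pre-flight token count: nothing is generated, so no output cost is incurred
count = model.count_tokens(contents)
estimated_input_cost = (count.total_tokens / 1_000_000) * 2.00
print(f"Estimated input tokens: {count.total_tokens} (~${estimated_input_cost:.6f})")

# Only generate if the estimate is within budget (threshold is an arbitrary example)
if count.total_tokens < 50_000:
    response = model.generate_content(contents)
    print(response.text)
```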
Media Resolution Control
Resolution Options
| Setting | Images | PDFs | Video (per frame) | Recommendation |
|---|---|---|---|---|
| low | 280 tokens | 280 tokens | 70 tokens | Quick analysis, low detail |
| medium | 560 tokens | 560 tokens | 70 tokens | Balanced quality/cost |
| high | 1,120 tokens | 1,120 tokens | 280 tokens | OCR, fine text, details |
Configuration
Global Setting (all media):
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "media_resolution": "high"  # Applies to all media
    }
)
Per-Item Setting (mixed resolutions):
response = model.generate_content([
    "Analyze these files",
    {"file": high_detail_image, "media_resolution": "high"},
    {"file": low_detail_image, "media_resolution": "low"}
])
Best Practices
- Images: Use high for OCR/text extraction, medium for general analysis
- PDFs: Use medium (saturation point - higher resolutions show diminishing returns)
- Video: Use low or medium unless OCR is needed
- Cost Control: Start with low, increase only if quality is insufficient
See: references/media-resolution.md for detailed guide
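The per-item token figures in the table above can be turned into a quick budget check before choosing a resolution; a minimal sketch using only the numbers quoted in this guide (treat them as approximations):

```python
# Approximate per-item token costs from the resolution table above
IMAGE_TOKENS = {"low": 280, "medium": 560, "high": 1120}
PDF_PAGE_TOKENS = {"low": 280, "medium": 560, "high": 1120}
VIDEO_FRAME_TOKENS = {"low": 70, "medium": 70, "high": 280}

def estimate_media_tokens(images=0, pdf_pages=0, video_frames=0, resolution="medium"):
    """Rough input-token estimate for planned media (excludes prompt text)."""
    return (
        images * IMAGE_TOKENS[resolution]
        + pdf_pages * PDF_PAGE_TOKENS[resolution]
        + video_frames * VIDEO_FRAME_TOKENS[resolution]
    )

# Example: 3 screenshots at high resolution plus a 12-page PDF at medium
tokens = estimate_media_tokens(images=3, resolution="high") + estimate_media_tokens(pdf_pages=12, resolution="medium")
print(f"~{tokens} media tokens, ~${tokens / 1_000_000 * 2.00:.4f} input cost")
```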
File Management
Upload Files
import google.generativeai as genai
import time

# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")
# Check processing status
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)
print(f"Status: {file.state.name}")
List Uploaded Files
# List all files
for file in genai.list_files():
    print(f"{file.name} - {file.display_name}")
Delete Files
# Delete specific file
genai.delete_file(file.name)
# Delete all files
for file in genai.list_files():
    genai.delete_file(file.name)
    print(f"Deleted: {file.name}")
File Lifecycle
- Upload: Immediate
- Processing: Async (especially for video/audio)
- Storage: Files persist until deleted
- Expiration: Files may expire after a retention period (check the current documentation)
Multi-File Processing
Process Multiple Images
# Upload multiple images
images = [
    genai.upload_file("photo1.jpg"),
    genai.upload_file("photo2.jpg"),
    genai.upload_file("photo3.jpg")
]
# Analyze together
response = model.generate_content([
    "Compare these images and identify common elements",
    *images
])
print(response.text)
Mixed Media Types
# Combine different media types
image = genai.upload_file("chart.png")
pdf = genai.upload_file("report.pdf")
response = model.generate_content([
    "Does the chart match the data in the report?",
    image,
    pdf
])
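When several files are combined in one request, it also helps to guarantee cleanup even if the call fails. A sketch of an upload-and-delete context manager built only on the upload_file, get_file, and delete_file calls shown above (failure handling is kept minimal):

```python
import time
from contextlib import contextmanager
import google.generativeai as genai

@contextmanager
def uploaded_files(*paths, poll_seconds=5):
    """Upload files, wait until they finish processing, and delete them afterwards."""
    files = [genai.upload_file(p) for p in paths]
    try:
        for i, f in enumerate(files):
            # Wait until this file leaves the PROCESSING state
            while f.state.name == "PROCESSING":
                time.sleep(poll_seconds)
                f = genai.get_file(f.name)
            files[i] = f
        yield files
    finally:
        # Always clean up uploaded files, even if generation raised an error
        for f in files:
            genai.delete_file(f.name)

# Usage
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
with uploaded_files("chart.png", "report.pdf") as (chart, report):
    response = model.generate_content(["Does the chart match the data in the report?", chart, report])
    print(response.text)
```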
References
Core Guides
- Image Understanding - Complete image analysis patterns
- Video Processing - Video analysis and frame extraction
- Audio Processing - Audio transcription and analysis
- Document Processing - PDF and document extraction
Optimization
- Media Resolution - Resolution control and quality tuning
- OCR Extraction - Text extraction best practices
- Token Optimization - Cost control and efficiency
Scripts
- Analyze Image Script - Production-ready image analysis
- Process Video Script - Video processing automation
- Process Audio Script - Audio transcription
- Process PDF Script - PDF extraction
Related Skills
- gemini-3-pro-api - Basic setup, authentication, text generation
- gemini-3-image-generation - Image OUTPUT (generating images)
- gemini-3-advanced - Function calling, tools, caching, batch processing
Common Use Cases
Visual Q&A Application
Combine image understanding with chat:
model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()
# Upload image
image = genai.upload_file("product.jpg")
# Ask questions about it
response1 = chat.send_message(["What product is this?", image])
response2 = chat.send_message("What are its main features?")
response3 = chat.send_message("What's the price range for similar products?")
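Later turns in the same chat can also include new media, since send_message accepts the same content parts as generate_content. A self-contained sketch (the filenames are hypothetical placeholders):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()

# First turn: ground the chat on an image
product_photo = genai.upload_file("product.jpg")
print(chat.send_message(["What product is this?", product_photo]).text)

# Later turn: new media can be added mid-conversation
competitor_photo = genai.upload_file("competitor.jpg")
print(chat.send_message(["How does this competitor product compare to the first one?", competitor_photo]).text)
```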
Document Analysis Pipeline
Process multiple PDFs and extract insights:
import google.generativeai as genai
import json
import time
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"media_resolution": "medium"}
)
# Process all PDFs in directory
pdf_dir = Path("documents/")
results = {}
for pdf_path in pdf_dir.glob("*.pdf"):
    pdf_file = genai.upload_file(pdf_path)
    # Wait for processing
    while pdf_file.state.name == "PROCESSING":
        time.sleep(5)
        pdf_file = genai.get_file(pdf_file.name)
    # Extract key information
    response = model.generate_content([
        "Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
        pdf_file
    ])
    results[pdf_path.name] = response.text
    # Clean up
    genai.delete_file(pdf_file.name)
# Save results
with open("analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)
Video Content Moderation
Analyze video for specific content:
video = genai.upload_file("user_upload.mp4")
# Wait for processing
while video.state.name == "PROCESSING":
time.sleep(10)
video = genai.get_file(video.name)
response = model.generate_content([
"""Analyze this video for:
1. Inappropriate content (yes/no)
2. Violence or harmful content (yes/no)
3. Overall content rating (G/PG/PG-13/R)
4. Brief justification
Provide structured response.
""",
video
])
print(response.text)
Troubleshooting
Issue: File processing stuck at "PROCESSING"
Solution: Large files (especially video) can take several minutes to process. Wait 30-60 seconds between status checks. If a file is still in PROCESSING after more than 5 minutes, the upload has likely failed; check its state with genai.get_file() or re-upload.
Issue: Low quality OCR results
Solution: Use media_resolution: "high" for images with text. Ensure image is clear and high resolution.
Issue: High token costs
Solution: Use appropriate media resolution. Start with low, increase only if needed. For PDFs, medium is usually sufficient.
Issue: Video analysis missing details
Solution: Use media_resolution: "high" for better frame analysis, or provide more specific prompts about what to look for.
Issue: Audio transcription inaccurate
Solution: Ensure audio quality is good (no excessive background noise). Provide context in prompt about accent, language, or domain.
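For the stuck-at-PROCESSING and failure cases above, a small retry wrapper around the File API keeps pipelines from hanging; a sketch (the single-retry policy and the timeout values are arbitrary choices, not API behavior):

```python
import time
import google.generativeai as genai

def upload_with_retry(path, poll_seconds=10, timeout_seconds=600, retries=1):
    """Upload a file and wait for it to become ACTIVE, retrying once on failure or timeout."""
    for attempt in range(retries + 1):
        file = genai.upload_file(path)
        deadline = time.time() + timeout_seconds
        while file.state.name == "PROCESSING" and time.time() < deadline:
            time.sleep(poll_seconds)
            file = genai.get_file(file.name)
        if file.state.name == "ACTIVE":
            return file
        # Clean up the failed or stuck upload before retrying
        genai.delete_file(file.name)
    raise RuntimeError(f"Upload of {path} did not become ACTIVE after {retries + 1} attempts")

# Usage
genai.configure(api_key="YOUR_API_KEY")
video = upload_with_retry("large_video.mp4")
```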
Summary
This skill provides comprehensive multimodal input processing capabilities:
✅ Image analysis with OCR and object detection
✅ Video processing up to 1 hour
✅ Audio transcription up to 9.5 hours
✅ Native PDF document processing
✅ Granular media resolution control
✅ Token optimization strategies
✅ Multi-file processing
✅ Production-ready examples
Ready to analyze multimodal content? Start with the task that matches your use case above!