gemini-api-2026
Gemini API Development Skill
Source: Official Gemini API documentation scraped 2026-02-27
Coverage: All 81 documentation files
Quick Start
from google import genai # CRITICAL: NOT google.generativeai
client = genai.Client() # Uses GEMINI_API_KEY env var automatically
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="Hello world"
)
print(response.text)
JavaScript:
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({}); // Uses GEMINI_API_KEY env var
const response = await ai.models.generateContent({
model: "gemini-3-flash-preview",
contents: "Hello world"
});
console.log(response.text);
CRITICAL GOTCHAS (Read First)
- SDK import: from google import genai — NOT google.generativeai (legacy)
- Temperature: Default is 1.0 for Gemini 3 — do NOT lower it; causes loops/degraded performance
- Thinking params: Gemini 3 uses thinking_level ("low"/"medium"/"high"); Gemini 2.5 uses thinking_budget (integer tokens)
- Thought signatures: Gemini 3 REQUIRES thought signatures echoed back during function calling or you get a 400 error. SDKs handle this automatically in chat mode.
- API default: SDK defaults to v1beta. Use http_options={'api_version': 'v1alpha'} for experimental features.
- REST auth header: x-goog-api-key: $GEMINI_API_KEY (not Authorization: Bearer)
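For raw REST calls, the same header in a Python sketch (the endpoint path follows the standard v1beta generateContent pattern):
import os
import requests

resp = requests.post(
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent",
    headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"parts": [{"text": "Hello world"}]}]},
)
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])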
Models Reference
Gemini 3 Series (Current)
| Model | String | Notes |
|---|---|---|
| Gemini 3.1 Pro Preview | gemini-3.1-pro-preview | Latest; also gemini-3.1-pro-preview-customtools variant |
| Gemini 3 Flash Preview | gemini-3-flash-preview | Default workhorse; no shutdown date |
| Gemini 3 Pro Image Preview | gemini-3-pro-image-preview | "Nano Banana Pro" — native image gen |
| Gemini 3.1 Flash Image Preview | gemini-3.1-flash-image-preview | "Nano Banana 2" — fast image gen |
⚠️ Gemini 3 Pro Preview (gemini-3-pro-preview) shuts down March 9, 2026 → migrate to gemini-3.1-pro-preview
Gemini 2.5 Series (Stable)
| Model | String | Shutdown |
|---|---|---|
| Gemini 2.5 Pro | gemini-2.5-pro | June 17, 2026 |
| Gemini 2.5 Flash | gemini-2.5-flash | June 17, 2026 |
| Gemini 2.5 Flash Lite | gemini-2.5-flash-lite | July 22, 2026 |
| Gemini 2.5 Flash Image | gemini-2.5-flash-image | Oct 2, 2026 |
Gemini 2.0 Series (Deprecating)
gemini-2.0-flash, gemini-2.0-flash-lite → shutdown June 1, 2026
Specialized Models
- TTS: gemini-2.5-flash-preview-tts
- Live API: gemini-2.5-flash-native-audio-preview-12-2025
- Computer Use: gemini-2.5-computer-use-preview-10-2025, gemini-3-flash-preview
- Deep Research: deep-research-pro-preview-12-2025 (via Interactions API only)
- Embeddings: gemini-embedding-001
- Video (Veo): veo-3.1-generate-preview
- Images (Imagen): imagen-4.0-generate-001
- Music (Lyria): models/lyria-realtime-exp
- Robotics: gemini-robotics-er-1.5-preview
- LearnLM: experimental tutor model
Latest Aliases
- gemini-pro-latest → gemini-3-pro-preview
- gemini-flash-latest → gemini-3-flash-preview
Libraries & Installation
pip install google-genai # Python
npm install @google/genai # JavaScript
go get google.golang.org/genai # Go
# Java: com.google.genai:google-genai:1.0.0
# C#: dotnet add package Google.GenAI
OpenAI compatibility (3 line change):
from openai import OpenAI
client = OpenAI(
api_key="GEMINI_API_KEY",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
response = client.chat.completions.create(model="gemini-3-flash-preview", messages=[...])
Core Generation
System Instructions
from google.genai import types
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="User message",
config=types.GenerateContentConfig(
system_instruction="You are a helpful assistant.",
temperature=1.0,
max_output_tokens=1024,
)
)
Multi-turn Chat
chat = client.chats.create(model="gemini-3-flash-preview")
response = chat.send_message("Hello")
print(response.text)
response2 = chat.send_message("Tell me more")
print(response2.text)
Streaming
for chunk in client.models.generate_content_stream(
model="gemini-3-flash-preview",
contents="Write a long story"
):
print(chunk.text, end="")
Token Counting
# Before sending:
count = client.models.count_tokens(model="gemini-3-flash-preview", contents=prompt)
print(count.total_tokens)
# After generating:
print(response.usage_metadata)
# Fields: prompt_token_count, candidates_token_count, thoughts_token_count, total_token_count
1 token ≈ 4 characters; 100 tokens ≈ 60-80 English words.
Thinking (Reasoning)
Gemini 3 — thinking_level
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_level="low") # "low", "medium", "high"
)
Gemini 2.5 — thinking_budget
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=1024) # token budget; 0=disabled
)
Thinking is enabled by default on 2.5 and 3 models — causes higher latency/tokens. Disable if optimizing for speed.
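For latency-sensitive 2.5 workloads, a minimal sketch of the two ends of the dial (the -1 dynamic budget is per the 2.5 docs; note 2.5 Pro does not allow disabling thinking):
# Gemini 2.5: disable thinking entirely for lowest latency (not supported on 2.5 Pro)
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=0)
)
# Or let the model decide how much to think (dynamic budget)
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=-1)
)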
Multimodal Input
Images (Inline — under 20MB)
with open('image.jpg', 'rb') as f:
image_bytes = f.read()
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents=[
types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'),
"Caption this image."
]
)
Images (URL fetch)
import requests
image_bytes = requests.get("https://example.com/image.jpg").content
image = types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg")
PDF Documents (Inline — under 50MB)
import pathlib
filepath = pathlib.Path('file.pdf')
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents=[
types.Part.from_bytes(data=filepath.read_bytes(), mime_type='application/pdf'),
"Summarize this document"
]
)
Audio (via Files API)
myfile = client.files.upload(file="sample.mp3")
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents=["Describe this audio", myfile]
)
Audio capabilities: transcription, translation, speaker diarization, emotion detection, timestamps.
For real-time audio → use Live API.
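Building on the upload pattern above, a sketch of a timestamped-transcription prompt (the prompt wording is illustrative, not a fixed API format):
myfile = client.files.upload(file="sample.mp3")
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        "Transcribe this audio with speaker labels and MM:SS timestamps.",
        myfile,
    ],
)
print(response.text)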
Video (via Files API)
myfile = client.files.upload(file="video.mp4")
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents=[myfile, "Summarize this video"]
)
YouTube URLs
# Pass the YouTube URL as file_data, not as plain prompt text
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=types.Content(parts=[
        types.Part(file_data=types.FileData(file_uri="https://www.youtube.com/watch?v=XXXXX")),
        types.Part(text="Summarize this video"),
    ])
)
File Input Methods Comparison
| Method | Max Size | Best For | Persistence |
|---|---|---|---|
| Inline data | 100MB (50MB PDF) | Small files, one-off | None |
| Files API | 2GB/file, 20GB/project | Large files, reuse | 48 hours |
| GCS URI | 2GB/file, unlimited storage | GCS files | 30 days (registration) |
| External URLs | 100MB | Public URLs, AWS/Azure/GCS | None (fetched per request) |
Files API
# Upload
myfile = client.files.upload(file="path/to/file.pdf")
print(myfile.uri) # use in requests
# List
for file in client.files.list():
print(file.name)
# Delete
client.files.delete(name=myfile.name)
Files API limits: 2GB per file, 20GB per project, 48-hour TTL.
Structured Output
from pydantic import BaseModel
class Recipe(BaseModel):
name: str
ingredients: list[str]
steps: list[str]
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="Give me a chocolate cake recipe",
config=types.GenerateContentConfig(
response_mime_type="application/json",
response_schema=Recipe,
)
)
import json
recipe = json.loads(response.text)
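When response_schema is a Pydantic model, the SDK also exposes a parsed object, so the manual json.loads step can usually be skipped (a convenience sketch):
# response.parsed is already a Recipe instance when a Pydantic schema is supplied
recipe_obj: Recipe = response.parsed
print(recipe_obj.name, len(recipe_obj.ingredients))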
Tools: Built-in
Google Search (Grounding)
grounding_tool = types.Tool(google_search=types.GoogleSearch())
config = types.GenerateContentConfig(tools=[grounding_tool])
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="Who won Euro 2024?",
config=config
)
# Check response.candidates[0].grounding_metadata for citations
Response includes groundingMetadata with webSearchQueries, groundingChunks, groundingSupports.
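A sketch for pulling source URLs out of that metadata (snake_case SDK field names assumed from the camelCase REST names above):
metadata = response.candidates[0].grounding_metadata
if metadata and metadata.grounding_chunks:
    for chunk in metadata.grounding_chunks:
        # Each web chunk carries the source URI and page title
        print(chunk.web.uri, chunk.web.title)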
URL Context
config = types.GenerateContentConfig(tools=[{"url_context": {}}])
# Include URLs in the prompt text
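A minimal sketch using that config (the URLs are illustrative):
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Summarize the key differences between https://example.com/a and https://example.com/b",
    config=config,
)
print(response.text)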
Google Maps (NOT available with Gemini 3)
# Only for Gemini 2.5 models
config = types.GenerateContentConfig(
tools=[types.Tool(google_maps=types.GoogleMaps())],
tool_config=types.ToolConfig(retrieval_config=types.RetrievalConfig(
lat_lng=types.LatLng(latitude=34.05, longitude=-118.25)
))
)
Code Execution
config = types.GenerateContentConfig(
tools=[types.Tool(code_execution=types.CodeExecution())]
)
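The model returns the code it wrote and the sandbox output as separate parts; a sketch of reading them back (part attribute names per the google-genai SDK):
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="What is the sum of the first 50 primes? Generate and run code.",
    config=config,
)
for part in response.candidates[0].content.parts:
    if part.executable_code:          # the Python the model wrote
        print(part.executable_code.code)
    if part.code_execution_result:    # stdout from the sandbox run
        print(part.code_execution_result.output)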
Function Calling
def get_weather(location: str) -> dict:
return {"temp": 72, "condition": "sunny"}
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="What's the weather in NYC?",
config=types.GenerateContentConfig(
tools=[get_weather],
tool_config=types.ToolConfig(
function_calling_config=types.FunctionCallingConfig(mode="AUTO")
)
)
)
# Check for function calls
for part in response.candidates[0].content.parts:
if part.function_call:
result = get_weather(**part.function_call.args)
# Send result back...
⚠️ Gemini 3 Thought Signatures in Function Calling:
When Gemini 3 returns function calls, each step includes a thoughtSignature. You MUST echo it back exactly — omitting it causes a 400 error. The SDK handles this automatically if you use the chat API or append the full response object to history.
Manual handling pattern:
# After getting FC response, include the full model turn (with signatures) in next request
contents = [
{"role": "user", "parts": [{"text": "original request"}]},
model_response.candidates[0].content, # includes thoughtSignature
{"role": "user", "parts": [{"function_response": {"name": "fn", "response": result}}]}
]
Embeddings
result = client.models.embed_content(
model="gemini-embedding-001",
contents="What is the meaning of life?"
)
print(result.embeddings)
# Batch
result = client.models.embed_content(
model="gemini-embedding-001",
contents=["text 1", "text 2", "text 3"]
)
Model: gemini-embedding-001 (GA until July 14, 2026)
Use case: semantic search, RAG, classification, clustering.
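A common follow-up for semantic search: cosine similarity over the returned vectors (a sketch; each item in result.embeddings holds its floats in .values):
import numpy as np

result = client.models.embed_content(
    model="gemini-embedding-001",
    contents=["How do I bake bread?", "Bread baking instructions", "Tax law overview"],
)
vecs = np.array([e.values for e in result.embeddings])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize rows
print(vecs @ vecs[0])  # similarity of each text to the first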
File Search (RAG)
Managed RAG — free file storage and free embedding generation at query time. Pay only for initial indexing + model tokens.
import time

# Create store
file_search_store = client.file_search_stores.create(
config={'display_name': 'my-store'}
)
# Upload directly
operation = client.file_search_stores.upload_to_file_search_store(
file='document.pdf',
file_search_store_name=file_search_store.name,
config={'display_name': 'My Doc'}
)
while not operation.done:
time.sleep(5)
operation = client.operations.get(operation)
# Query
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="What does the document say about X?",
config=types.GenerateContentConfig(
tools=[types.Tool(
file_search=types.FileSearch(
file_search_store_names=[file_search_store.name]
)
)]
)
)
Context Caching
Reduces cost by caching repeated large contexts. Paid tier only.
from google.genai import types
# Create cache
cache = client.caches.create(
model="gemini-3-flash-preview",
config=types.CreateCachedContentConfig(
contents=[large_document_content],
system_instruction="You are an expert analyst.",
ttl="3600s", # 1 hour
display_name="my-cache"
)
)
# Use cache
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="What are the key findings?",
config=types.GenerateContentConfig(cached_content=cache.name)
)
print(response.usage_metadata.cached_content_token_count)
Implicit caching: 2048+ token prefix is automatically cached at 75% discount.
Explicit caching: manual TTL control.
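Housekeeping sketch for explicit caches (list/update/delete all hang off the same client.caches surface):
# List active caches
for c in client.caches.list():
    print(c.name, c.display_name)
# Extend the TTL of an existing cache
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)
# Delete when done to stop storage billing
client.caches.delete(name=cache.name)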
Batch API
50% cost reduction for non-urgent workloads. 24-hour SLO.
import time

# Create batch job (see batch-api.md for full syntax)
batch_job = client.batches.create(
model="gemini-3-flash-preview",
src="gs://bucket/requests.jsonl",
config=types.CreateBatchJobConfig(dest="gs://bucket/responses/")
)
# Poll
while batch_job.state not in ["JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED"]:
time.sleep(30)
batch_job = client.batches.get(name=batch_job.name)
Interactions API
Used for agents (Deep Research). Not accessible via generate_content.
import time
interaction = client.interactions.create(
input="Research the history of quantum computing",
agent='deep-research-pro-preview-12-2025',
background=True # REQUIRED for long tasks
)
while True:
interaction = client.interactions.get(interaction.id)
if interaction.status == "completed":
print(interaction.outputs[-1].text)
break
elif interaction.status == "failed":
break
time.sleep(10)
For combining with your own data:
interaction = client.interactions.create(
input="Compare our Q4 report to public benchmarks",
agent="deep-research-pro-preview-12-2025",
background=True,
tools=[{"type": "file_search", "file_search_store_names": ["fileSearchStores/my-store"]}]
)
Live API (Real-time Voice/Video)
For interactive, streaming audio/video sessions via WebSocket.
import asyncio
from google import genai
client = genai.Client()
model = "gemini-2.5-flash-native-audio-preview-12-2025"
async def main():
async with client.aio.live.connect(
model=model,
config={"response_modalities": ["AUDIO"]}
) as session:
await session.send_client_content(
turns="Hello, how are you?",
turn_complete=True
)
async for response in session.receive():
if response.data:
# raw PCM audio bytes (24kHz, 16-bit, little-endian)
pass
if response.server_content and response.server_content.turn_complete:
break
asyncio.run(main())
Audio format: raw PCM, little-endian, 16-bit. Output: 24kHz. Input: natively 16kHz (resampled if different).
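To make the raw output playable, collect the PCM bytes from the receive loop and wrap them in a WAV header (mirrors the format stated above; mono is assumed):
import wave

pcm_chunks = []  # append response.data inside the receive loop
with wave.open("live_out.wav", "wb") as wf:
    wf.setnchannels(1)      # mono (assumed)
    wf.setsampwidth(2)      # 16-bit
    wf.setframerate(24000)  # 24kHz output
    wf.writeframes(b"".join(pcm_chunks))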
Session limits:
- Audio-only: 15 min (without compression)
- Audio+video: 2 min
- Connection: ~10 min → use Session Resumption
Session Resumption:
config=types.LiveConnectConfig(
session_resumption=types.SessionResumptionConfig(handle=previous_handle)
)
# Save new handle from session_resumption_update messages
Context window compression (for long sessions):
config=types.LiveConnectConfig(
context_window_compression=types.ContextWindowCompressionConfig(
sliding_window=types.SlidingWindow()
)
)
Tools in Live API: Google Search ✅, Function calling ✅, Google Maps ❌, Code execution ❌, URL context ❌
Live API function calling (manual tool response required):
# After receiving tool_call in response:
await session.send_tool_response(function_responses=[
types.FunctionResponse(id=fc.id, name=fc.name, response={"result": "ok"})
for fc in response.tool_call.function_calls
])
Ephemeral Tokens (Live API Security)
For client-side Live API connections. Short-lived tokens that expire, reducing risk vs. exposing API keys.
import datetime
now = datetime.datetime.now(tz=datetime.timezone.utc)
client = genai.Client(http_options={'api_version': 'v1alpha'})
token = client.auth_tokens.create(config={
'uses': 1,
'expire_time': now + datetime.timedelta(minutes=30),
'new_session_expire_time': now + datetime.timedelta(minutes=1),
'http_options': {'api_version': 'v1alpha'},
})
# Send token.name to client; use as API key for Live API only
Can lock token to specific config:
'live_connect_constraints': {
'model': 'gemini-2.5-flash-native-audio-preview-12-2025',
'config': {'response_modalities': ['AUDIO']}
}
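On the client side, the token name stands in for the API key (a sketch; valid for Live API connections only, per the constraints above):
# Client-side: use the ephemeral token instead of a real API key
client = genai.Client(
    api_key=token.name,
    http_options={'api_version': 'v1alpha'},
)
# ...then connect with client.aio.live.connect(...) as usual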
Image Generation (Nano Banana)
- Nano Banana 2 = gemini-3.1-flash-image-preview — fast/high-volume
- Nano Banana Pro = gemini-3-pro-image-preview — pro quality, thinking
- Nano Banana = gemini-2.5-flash-image — speed/efficiency
from PIL import Image
response = client.models.generate_content(
model="gemini-3.1-flash-image-preview",
contents="Create a picture of a tropical beach at sunset"
)
for part in response.parts:
if part.text:
print(part.text)
elif part.inline_data:
image = part.as_image()
image.save("output.png")
All generated images include SynthID watermark.
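Editing works the same way — pass the source image alongside the instruction (a sketch; the file name is illustrative):
with open("photo.png", "rb") as f:
    source = types.Part.from_bytes(data=f.read(), mime_type="image/png")
response = client.models.generate_content(
    model="gemini-3.1-flash-image-preview",
    contents=[source, "Add a llama next to the person; keep the lighting unchanged"],
)
for part in response.parts:
    if part.inline_data:
        part.as_image().save("edited.png")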
Video Generation (Veo 3.1)
import time

operation = client.models.generate_videos(
model="veo-3.1-generate-preview",
prompt="A serene mountain lake at dawn"
)
while not operation.done:
time.sleep(10)
operation = client.operations.get(operation)
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("output.mp4")
Capabilities: 8-second 720p/1080p/4K, portrait (9:16) or landscape (16:9), audio, video extension, first/last frame specification, up to 3 reference images.
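Those capabilities map onto a config object; a sketch, assuming GenerateVideosConfig field names (aspect_ratio, resolution) match the capability list:
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt="A serene mountain lake at dawn",
    config=types.GenerateVideosConfig(  # field names assumed from the capability list
        aspect_ratio="9:16",            # portrait
        resolution="1080p",
    ),
)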
Image Generation (Imagen — Standalone)
response = client.models.generate_images(
model='imagen-4.0-generate-001',
prompt='Robot holding a red skateboard',
config=types.GenerateImagesConfig(number_of_images=4)
)
for gen_image in response.generated_images:
gen_image.image.show()
TTS (Text-to-Speech)
import wave
response = client.models.generate_content(
model="gemini-2.5-flash-preview-tts",
contents="Say cheerfully: Have a wonderful day!",
config=types.GenerateContentConfig(
response_modalities=["AUDIO"],
speech_config=types.SpeechConfig(
voice_config=types.VoiceConfig(
prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
)
)
)
)
data = response.candidates[0].content.parts[0].inline_data.data
with wave.open("out.wav", "wb") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(24000)
wf.writeframes(data)
Multi-speaker TTS (up to 2 speakers):
config=types.GenerateContentConfig(
response_modalities=["AUDIO"],
speech_config=types.SpeechConfig(
multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
speaker_voice_configs=[
types.SpeakerVoiceConfig(
speaker='Joe',
voice_config=types.VoiceConfig(
prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
)
),
types.SpeakerVoiceConfig(
speaker='Jane',
voice_config=types.VoiceConfig(
prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Aoede')
)
),
]
)
)
)
Prompt must name speakers matching the config: "TTS the conversation between Joe and Jane: Joe: ... Jane: ..."
Music Generation (Lyria RealTime)
Experimental. Real-time streaming music via WebSocket.
import asyncio
from google import genai
from google.genai import types

client = genai.Client(http_options={'api_version': 'v1alpha'})

async def main():
    async with client.aio.live.music.connect(model='models/lyria-realtime-exp') as session:
        await session.set_weighted_prompts(prompts=[
            types.WeightedPrompt(text='minimal techno', weight=1.0)
        ])
        await session.set_music_generation_config(
            config=types.LiveMusicGenerationConfig(bpm=90, temperature=1.0)
        )
        await session.play()
        # Receive audio chunks
        async for message in session.receive():
            audio_data = message.server_content.audio_chunks[0].data
            # process PCM audio...

asyncio.run(main())
Control: session.play(), session.pause(), session.stop(), session.reset_context()
Steer by sending new weighted prompts mid-stream. Reset context after BPM/scale changes.
Computer Use
Browser automation agent. Requires Playwright or similar for action execution.
config = types.GenerateContentConfig(
tools=[types.Tool(
computer_use=types.ComputerUse(
environment=types.Environment.ENVIRONMENT_BROWSER,
excluded_predefined_functions=["drag_and_drop"] # optional
)
)]
)
response = client.models.generate_content(
model='gemini-2.5-computer-use-preview-10-2025',
contents=[{"role": "user", "parts": [{"text": "Search Amazon for wireless headphones"}]}],
config=config
)
# Model returns function_calls with actions like type_text_at, click_at
# Check response.candidates[0].content.parts for function_call items
# Coordinates are normalized 0-999; convert to actual pixels
# Recommended screen: 1440x900
Safety: Check safety_decision in response — require_confirmation means pause before executing.
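Denormalizing model coordinates to screen pixels — a helper sketch (the divide-by-1000 convention is an assumption; scale to your actual viewport):
def to_pixels(x_norm: int, y_norm: int, width: int = 1440, height: int = 900):
    # 0-999 scale; dividing by 1000 is the usual convention (assumption)
    return round(x_norm / 1000 * width), round(y_norm / 1000 * height)

x, y = to_pixels(500, 500)  # roughly screen center at 1440x900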
Safety Settings
response = client.models.generate_content(
model="gemini-3-flash-preview",
contents="Your prompt",
config=types.GenerateContentConfig(
safety_settings=[
types.SafetySetting(
category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
)
]
)
)
Threshold options: OFF, BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE
Default for Gemini 2.5/3: all safety filters are OFF.
Categories: Harassment, Hate speech, Sexually explicit, Dangerous
Built-in protections (cannot be disabled): Child safety, etc.
Check blocked response: response.candidates[0].finish_reason == "SAFETY" → inspect safety_ratings.
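A sketch of that check (enum comparison per the google-genai types module):
candidate = response.candidates[0]
if candidate.finish_reason == types.FinishReason.SAFETY:
    for rating in candidate.safety_ratings:
        print(rating.category, rating.probability)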
Media Resolution
Control token usage for images/videos/PDFs:
Global (all models):
config = types.GenerateContentConfig(
media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH # LOW, MEDIUM, HIGH
)
Per-part (Gemini 3 only, experimental):
client = genai.Client(http_options={'api_version': 'v1alpha'})
image_part = types.Part.from_bytes(
data=image_bytes, mime_type='image/jpeg',
media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH
)
Long Context
Most Gemini models support 1M+ token context windows.
1M tokens ≈ 50K lines of code, 8 novels, 200 podcast transcripts.
Optimization: Use context caching when reusing large contexts — 4x cheaper (Flash) + lower latency.
Best practice: Put your query at the END of the prompt (after all context material).
Multi-needle limitation: the model scores ~99% on single-needle retrieval but degrades when many facts must be retrieved simultaneously.
API Versions
| Version | Use | Default? |
|---|---|---|
| v1 | Stable, production | No |
| v1beta | New features, may change | Yes (SDK default) |
| v1alpha | Experimental only | No |
client = genai.Client(http_options={'api_version': 'v1'}) # force stable
Authentication
API Key (default):
export GEMINI_API_KEY=your_key_here
REST header: x-goog-api-key: $GEMINI_API_KEY
OAuth (for production with stricter controls):
- Enable Generative Language API in Cloud console
- Configure OAuth consent screen
- Create OAuth 2.0 Client ID
- Use Application Default Credentials (gcloud auth application-default login)
Rate Limits & Billing
Tiers: Free Tier → Paid Tier (pay-as-you-go)
Upgrade: AI Studio → API Keys → Set up Billing
Paid tier benefits: Higher rate limits, advanced models, data not used for training.
Rate limit headers: Check x-goog-quota-* headers in responses.
Error 429: Rate limit exceeded → implement exponential backoff or request quota increase.
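A minimal retry sketch for 429s (generic exponential backoff; the errors module and APIError.code attribute are from the google-genai SDK — treat the class name as an assumption):
import time
from google.genai import errors

def generate_with_backoff(client, **kwargs):
    for attempt in range(5):
        try:
            return client.models.generate_content(**kwargs)
        except errors.APIError as e:  # SDK error class (assumption)
            if e.code != 429:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("rate limited after 5 attempts")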
Common Error Codes
| HTTP | Status | Cause | Fix |
|---|---|---|---|
| 400 | INVALID_ARGUMENT | Malformed request | Check API reference |
| 400 | Missing thought_signature | Gemini 3 FC without signature | Use SDK chat or echo signatures |
| 403 | PERMISSION_DENIED | Wrong API key | Check key permissions |
| 429 | RESOURCE_EXHAUSTED | Rate limit hit | Backoff, upgrade tier |
| 500 | INTERNAL | Context too long / server error | Reduce context, retry |
| 503 | UNAVAILABLE | Overloaded | Retry or switch model |
| 504 | DEADLINE_EXCEEDED | Request too large | Increase timeout |
Framework Integrations
CrewAI
from crewai import LLM
gemini_llm = LLM(model='gemini/gemini-3-flash-preview', api_key=api_key, temperature=1.0)
LangGraph
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-3-flash-preview")
LlamaIndex
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-3-flash-preview")
Vercel AI SDK
npm install ai @ai-sdk/google
See Also
For detailed reference on specific topics:
- Function calling deep-dive → references/tools-and-agents.md
- Files API & multimodal → references/files-and-media.md
- Caching, Batch, Live deep-dive → references/advanced-features.md
- Embeddings & RAG → references/embeddings-and-rag.md
- Image/Video generation → references/generation.md