gemini-ocr-cli

When to use

Use this skill when a task needs promptable Gemini-based file analysis from a local terminal workflow and the host model’s own multimodal behavior is not reliable or controllable enough.

Use it especially when an agent must:

analyze a local image or PDF with a custom prompt,
OCR a local image into faithful Markdown,
OCR a local PDF into faithful Markdown,
preserve page structure, tables, and labels as well as possible,
work through a deterministic CLI instead of ad-hoc multimodal prompting.

Preconditions

Ensure the CLI is globally available on PATH:
- preferred: npm install -g @codecell-germany/gemini-ocr-agent-skill
- verify: gemini-ocr --help
Install the skill payload if needed:
- gemini-ocr-skill install --force
Required secret env:
- GEMINI_API_KEY=<api-key>
Supported alias secret env:
- GOOGLE_GENERATIVE_AI_API_KEY=<api-key>
Supported fallback secret env:
- GOOGLE_API_KEY=<api-key>
Optional model override:
- GEMINI_OCR_MODEL=gemini-3.1-flash-lite-preview
Optional reasoning-level override:
- GEMINI_OCR_REASONING_LEVEL=medium
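
A minimal pre-flight sketch for the key variables listed above, assuming a POSIX shell. It reports which of the three documented variables are set without printing their values. The function name key_status is hypothetical and not part of the CLI, and the sketch makes no claim about which variable the CLI prefers when several are set.

```shell
#!/bin/sh
# key_status is a hypothetical helper for this sketch, not a gemini-ocr command.
# It prints "NAME: set" or "NAME: unset" for each documented key variable
# and succeeds only if at least one of them is non-empty.
key_status() {
  found=0
  for name in GEMINI_API_KEY GOOGLE_GENERATIVE_AI_API_KEY GOOGLE_API_KEY; do
    # Indirect lookup; the names come only from the fixed list above.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name: set"
      found=1
    else
      echo "$name: unset"
    fi
  done
  # The secret itself is never echoed, only its presence.
  [ "$found" -eq 1 ]
}
```

For example, key_status || gemini-ocr setup --language en prints the setup guide only when none of the three variables is exported.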

Core workflow

Verify the public CLI surface:

gemini-ocr --help

Validate the environment:

gemini-ocr doctor --json

If the environment is incomplete, print the setup guide in English (en) or German (de):

gemini-ocr setup --language en
gemini-ocr setup --language de

Export one of the supported API key env vars and rerun:

gemini-ocr doctor --json

Run a general file analysis with a custom prompt:

gemini-ocr analyze-file /absolute/path/to/file.png --prompt "Describe this image precisely." --reasoning-level low

OCR an image with the document preset:

gemini-ocr scan-image /absolute/path/to/image.png

OCR a PDF with the document preset:

gemini-ocr scan-pdf /absolute/path/to/document.pdf

Use JSON output only when the calling workflow explicitly needs the legacy structured OCR object:

gemini-ocr scan-pdf /absolute/path/to/document.pdf --format json
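
The workflow above can be condensed into one helper, sketched here for a POSIX shell. It assumes that gemini-ocr doctor exits non-zero when the environment is incomplete, which the text above does not state, and it treats every input that is not a PDF as an image. OCR_CMD and ocr_to_markdown are names local to this sketch, not settings or commands of the CLI.

```shell
#!/bin/sh
# OCR_CMD is a script-local variable for this sketch; the CLI does not read it.
OCR_CMD="${OCR_CMD:-gemini-ocr}"

# Hypothetical helper: validate the environment, then OCR one file to Markdown.
# Usage: ocr_to_markdown /absolute/path/to/input /absolute/path/to/output.md
ocr_to_markdown() {
  input=$1
  output=$2

  # Assumption: doctor signals an incomplete environment via its exit status.
  if ! "$OCR_CMD" doctor --json >/dev/null; then
    # Show the setup guide on stderr so stdout stays free of non-result text.
    "$OCR_CMD" setup --language en >&2
    return 1
  fi

  # Pick the document preset by file extension (assumption: non-PDF means image).
  case $input in
    *.pdf|*.PDF) subcommand=scan-pdf ;;
    *)           subcommand=scan-image ;;
  esac

  # The default Markdown output on stdout is redirected into the target file.
  "$OCR_CMD" "$subcommand" "$input" >"$output"
}
```

A call such as ocr_to_markdown /absolute/path/to/document.pdf /absolute/path/to/document.md leaves the Markdown in the second path and returns the exit status of the scan.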

Guardrails

Use the public CLI names gemini-ocr and gemini-ocr-skill.
Do not bypass the product surface with repo-local entrypoints such as node dist/index.js.
Do not call hidden installed runtime paths such as ~/.codex/tools/gemini-ocr-cli/dist/index.js.
Inputs are sent to Gemini for remote processing. Use the tool only on documents for which remote processing is acceptable.
Default output in analyze-file is free text on stdout.
Default output in scan-image / scan-pdf is Markdown on stdout.
Use --reasoning-level minimal|low|medium|high to control explicitly how Gemini 3.1 Flash-Lite trades latency against reasoning depth.
Diagnostics and warnings belong on stderr.
--format json is a legacy escape hatch for the document presets (scan-image, scan-pdf), not the main workflow.
--pdf-mode auto can retry a failed native PDF request as raster OCR when pdftoppm is available.
--pdf-mode raster requires pdftoppm.
Keep API keys in the shell environment. Do not paste them into prompts, tickets, screenshots, or chats.
Run gemini-ocr doctor before the first real OCR run in a new shell or environment.
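
The pdftoppm requirement from the guardrails above can be checked before a PDF run. The following is a minimal sketch for a POSIX shell; check_pdf_mode is a hypothetical helper name, and only the two --pdf-mode values named above (auto and raster) are handled specially.

```shell
#!/bin/sh
# check_pdf_mode is a hypothetical guard for this sketch, not a gemini-ocr command.
# It fails early when raster mode is requested but pdftoppm is not on PATH,
# and only warns for auto mode, where pdftoppm is needed just for the retry.
check_pdf_mode() {
  mode=$1
  case $mode in
    raster)
      if ! command -v pdftoppm >/dev/null 2>&1; then
        echo "pdftoppm not found on PATH; --pdf-mode raster cannot run" >&2
        return 1
      fi
      ;;
    auto)
      if ! command -v pdftoppm >/dev/null 2>&1; then
        echo "pdftoppm not found on PATH; auto mode has no raster retry" >&2
      fi
      ;;
  esac
  # Any other value is passed through unchecked.
  return 0
}
```

For example, check_pdf_mode raster && gemini-ocr scan-pdf /absolute/path/to/document.pdf --pdf-mode raster starts the scan only when the raster path can work.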

References

Main overview: references/overview.md
Agent onboarding: references/agent-onboarding.md
OCR first run: references/ocr-first-run.md
Command cheat sheet: references/command-cheatsheet.md
Architecture: knowledge/ARCHITECTURE.md
Release checklist: knowledge/RELEASE_CHECKLIST.md