gemini-ocr-cli
When to use
Use this skill when a task needs promptable Gemini-based file analysis from a local terminal workflow and the host model’s own multimodal behavior is not reliable or controllable enough.
Use it especially when an agent must:
- analyze a local image or PDF with a custom prompt,
- OCR a local image into faithful Markdown,
- OCR a local PDF into faithful Markdown,
- preserve page structure, tables, and labels as well as possible,
- work through a deterministic CLI instead of ad-hoc multimodal prompting.
Preconditions
- Ensure the CLI is globally available on PATH:
  - preferred: npm install -g @codecell-germany/gemini-ocr-agent-skill
  - verify: gemini-ocr --help
- Install the skill payload if needed:
gemini-ocr-skill install --force
- Required secret env:
GEMINI_API_KEY=<api-key>
- Supported alias secret env:
GOOGLE_GENERATIVE_AI_API_KEY=<api-key>
- Supported fallback secret env:
GOOGLE_API_KEY=<api-key>
- Optional model override:
GEMINI_OCR_MODEL=gemini-3.1-flash-lite-preview
- Optional reasoning-level override:
GEMINI_OCR_REASONING_LEVEL=medium
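The three key variables above can be probed before anything is sent to the CLI. The following is a minimal sketch, assuming the variables are consulted in the order listed (required, alias, fallback); the source does not state the CLI's real precedence, and gemini_key_source is a hypothetical helper name, not part of the tool. It prints only the variable name, never the secret.

```shell
# Sketch: report which supported API key variable is set.
# Assumption: the order below mirrors the list above (required, alias, fallback).
# gemini_key_source is a hypothetical helper, not a gemini-ocr command.
gemini_key_source() {
  if [ -n "${GEMINI_API_KEY:-}" ]; then
    echo GEMINI_API_KEY
  elif [ -n "${GOOGLE_GENERATIVE_AI_API_KEY:-}" ]; then
    echo GOOGLE_GENERATIVE_AI_API_KEY
  elif [ -n "${GOOGLE_API_KEY:-}" ]; then
    echo GOOGLE_API_KEY
  else
    # Diagnostics go to stderr so stdout stays clean for callers.
    echo "no supported API key variable is set" >&2
    return 1
  fi
}
```

Calling gemini_key_source in a new shell before gemini-ocr doctor --json shows which variable the environment would supply without echoing the key into logs.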
Core workflow
- Verify the public CLI surface:
gemini-ocr --help
- Validate the environment:
gemini-ocr doctor --json
- If the environment is incomplete, print the setup guide:
gemini-ocr setup --language en
gemini-ocr setup --language de
- Export one of the supported API key env vars and rerun:
gemini-ocr doctor --json
- For general promptable file analysis:
gemini-ocr analyze-file /absolute/path/to/file.png --prompt "Describe this image precisely." --reasoning-level low
- OCR an image with the document preset:
gemini-ocr scan-image /absolute/path/to/image.png
- OCR a PDF with the document preset:
gemini-ocr scan-pdf /absolute/path/to/document.pdf
- Use JSON output only when the calling workflow explicitly needs the legacy structured OCR object:
gemini-ocr scan-pdf /absolute/path/to/document.pdf --format json
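The workflow above can be wrapped in a small function. This is a sketch under stated assumptions, not the tool's own wrapper: ocr_command_for and ocr_to_markdown are hypothetical names, doctor is assumed to exit non-zero when the environment is incomplete, and absolute paths are enforced only because every example above uses one. Only .pdf and .png are mapped, since those are the extensions the source shows.

```shell
# Sketch: choose the documented OCR subcommand from the file extension.
# Hypothetical helper; other image extensions are left out on purpose.
ocr_command_for() {
  case "$1" in
    *.pdf|*.PDF) echo scan-pdf ;;
    *.png|*.PNG) echo scan-image ;;
    *) echo "unsupported file type: $1" >&2; return 1 ;;
  esac
}

# Sketch: run doctor, then OCR one file.
# $1: absolute input path, $2: Markdown output path.
# Markdown (stdout) goes to $2, diagnostics (stderr) go to $2.log.
ocr_to_markdown() {
  case "$1" in
    /*) ;;
    *) echo "expected an absolute path: $1" >&2; return 2 ;;
  esac
  ocr_subcommand=$(ocr_command_for "$1") || return 2
  # Assumption: doctor signals an incomplete environment via its exit status.
  gemini-ocr doctor --json >/dev/null || return 3
  gemini-ocr "$ocr_subcommand" "$1" >"$2" 2>"$2.log"
}
```

A call such as ocr_to_markdown /absolute/path/to/document.pdf /absolute/path/to/document.md keeps warnings out of the Markdown file, which matches the stdout and stderr split the CLI describes.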
Guardrails
- Use the public CLI names gemini-ocr and gemini-ocr-skill.
- Do not bypass the product surface with repo-local entrypoints such as node dist/index.js.
- Do not call hidden installed runtime paths such as ~/.codex/tools/gemini-ocr-cli/dist/index.js.
- Inputs are sent to Gemini for remote processing. Do not use the tool on documents that must stay fully local unless that remote-processing policy is acceptable.
- Default output in analyze-file is free text on stdout.
- Default output in scan-image/scan-pdf is Markdown on stdout.
- Use --reasoning-level minimal|low|medium|high when Gemini 3.1 Flash-Lite should trade latency against deeper reasoning more explicitly.
- Diagnostics and warnings belong on stderr.
- --format json is now a document-mode legacy escape hatch, not the main workflow.
- --pdf-mode auto can retry a failed native PDF request as raster OCR when pdftoppm is available.
- --pdf-mode raster requires pdftoppm.
- API keys remain in shell env. Do not paste them into prompts, tickets, screenshots, or chats.
- Prefer doctor before the first real OCR run in a new shell or environment.
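The pdftoppm dependency named in the guardrails can be checked before a PDF run. The following is a hedged sketch: check_pdf_mode is a hypothetical helper name, and only the auto and raster values and their relation to pdftoppm come from the text above. Any other value is passed through untouched, because the source does not list the full set of modes.

```shell
# Sketch: fail early when --pdf-mode raster cannot work.
# check_pdf_mode is a hypothetical helper, not a gemini-ocr command.
check_pdf_mode() {
  case "$1" in
    raster)
      # raster mode requires pdftoppm, so a missing binary is an error.
      if ! command -v pdftoppm >/dev/null 2>&1; then
        echo "--pdf-mode raster needs pdftoppm on PATH" >&2
        return 1
      fi
      ;;
    auto)
      # auto mode still runs, but loses its raster fallback without pdftoppm.
      if ! command -v pdftoppm >/dev/null 2>&1; then
        echo "note: pdftoppm not found, auto mode cannot retry as raster OCR" >&2
      fi
      ;;
    *)
      # Unknown values are not judged here.
      ;;
  esac
}
```

One possible use, assuming the flag is passed to scan-pdf as the guardrails imply: check_pdf_mode raster && gemini-ocr scan-pdf /absolute/path/to/document.pdf --pdf-mode raster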
References
- Main overview: references/overview.md
- Agent onboarding: references/agent-onboarding.md
- OCR first run: references/ocr-first-run.md
- Command cheat sheet: references/command-cheatsheet.md
- Architecture: knowledge/ARCHITECTURE.md
- Release checklist: knowledge/RELEASE_CHECKLIST.md