gemini-ocr-cli

Installation
SKILL.md

gemini-ocr-cli

When to use

Use this skill when a task needs promptable Gemini-based file analysis from a local terminal workflow and the host model’s own multimodal behavior is not reliable or controllable enough.

Use it especially when an agent must:

  • analyze a local image or PDF with a custom prompt,
  • OCR a local image into faithful Markdown,
  • OCR a local PDF into faithful Markdown,
  • preserve page structure, tables, and labels as well as possible,
  • work through a deterministic CLI instead of ad-hoc multimodal prompting.

Preconditions

  • Ensure the CLI is globally available on PATH:
    • preferred: npm install -g @codecell-germany/gemini-ocr-agent-skill
    • verify: gemini-ocr --help
  • Install the skill payload if needed:
    • gemini-ocr-skill install --force
  • Required secret env:
    • GEMINI_API_KEY=<api-key>
  • Supported alias secret env:
    • GOOGLE_GENERATIVE_AI_API_KEY=<api-key>
  • Supported fallback secret env:
    • GOOGLE_API_KEY=<api-key>
  • Optional model override:
    • GEMINI_OCR_MODEL=gemini-3.1-flash-lite-preview
  • Optional reasoning-level override:
    • GEMINI_OCR_REASONING_LEVEL=medium

Core workflow

  1. Verify the public CLI surface:
  • gemini-ocr --help
  1. Validate the environment:
  • gemini-ocr doctor --json
  1. If the environment is incomplete, print the setup guide:
  • gemini-ocr setup --language en
  • gemini-ocr setup --language de
  1. Export one of the supported API key env vars and rerun:
  • gemini-ocr doctor --json
  1. For general promptable file analysis:
  • gemini-ocr analyze-file /absolute/path/to/file.png --prompt "Describe this image precisely." --reasoning-level low
  1. OCR an image with the document preset:
  • gemini-ocr scan-image /absolute/path/to/image.png
  1. OCR a PDF with the document preset:
  • gemini-ocr scan-pdf /absolute/path/to/document.pdf
  1. Use JSON output only when the calling workflow explicitly needs the legacy structured OCR object:
  • gemini-ocr scan-pdf /absolute/path/to/document.pdf --format json

Guardrails

  • Use the public CLI names gemini-ocr and gemini-ocr-skill.
  • Do not bypass the product surface with repo-local entrypoints such as node dist/index.js.
  • Do not call hidden installed runtime paths such as ~/.codex/tools/gemini-ocr-cli/dist/index.js.
  • Inputs are sent to Gemini for remote processing. Do not use the tool on documents that must stay fully local unless that remote-processing policy is acceptable.
  • Default output in analyze-file is free text on stdout.
  • Default output in scan-image / scan-pdf is Markdown on stdout.
  • Use --reasoning-level minimal|low|medium|high when Gemini 3.1 Flash-Lite should trade latency against deeper reasoning more explicitly.
  • Diagnostics and warnings belong on stderr.
  • --format json is now a document-mode legacy escape hatch, not the main workflow.
  • --pdf-mode auto can retry a failed native PDF request as raster OCR when pdftoppm is available.
  • --pdf-mode raster requires pdftoppm.
  • API keys remain in shell env. Do not paste them into prompts, tickets, screenshots, or chats.
  • Prefer doctor before the first real OCR run in a new shell or environment.

References

  • Main overview: references/overview.md
  • Agent onboarding: references/agent-onboarding.md
  • OCR first run: references/ocr-first-run.md
  • Command cheat sheet: references/command-cheatsheet.md
  • Architecture: knowledge/ARCHITECTURE.md
  • Release checklist: knowledge/RELEASE_CHECKLIST.md
Installs
3
First Seen
Apr 11, 2026