local-ocr

SKILL.md

Local OCR Pipeline Skill

Robust Optical Character Recognition (OCR) pipeline driven by ocrmypdf and tesseract. Handles scanned PDFs, rotated image inputs, and raw text extraction securely and locally without external APIs.

Why not GPU via PyTorch/EasyOCR? The ocrmypdf tool is the industry standard for producing searchable PDFs. It leverages tesseract for pixel-accurate text placement. A pure-CPU pipeline is leaner (avoids a 1.5GB PyTorch payload) and reliably embeds text exactly where it appears in the scanned image.

Capabilities

  1. Searchable PDF Generation: Converts rasterized/scanned PDFs or raw images (.jpg, .png, etc.) into PDFs with a selectable, searchable text layer.
  2. Auto-Rotation & Deskew: Automatically detects incorrectly rotated text and straightens crooked scans.
  3. Idempotent In-Place Processing: Safely processes files in-place using --skip-text, preventing double-processing of a PDF that already has embedded text.
  4. Structured JSON Output: All commands output structured JSON, making failure states (like missing dependencies) parseable by agents.
  5. Raw Text Extraction: Raw string extraction fallback for when agents need text directly in-memory instead of a PDF file.

Setup

# Installs system dependencies (tesseract, ocrmypdf, ghostscript) and sets up isolated venv
bash skills/ocr/scripts/setup.sh

Usage

uv run --project ~/.local-ocr scripts/ocr.py <command>

1. Generate a Searchable PDF (pdf)

Produces a standard, layered PDF. If you give it an image, it wraps it in a PDF. If you give it a scanned PDF, it adds the invisible text layer.

# Overwrites the file in-place, skipping it safely if it already contains text
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf

# Output to a different file
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scan_001.png -o ./contract.pdf

# Force reprocessing (ignore existing text layer)
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf --force

Note: By default, auto-rotate and deskew are enabled. Disable with --no-rotate or --no-deskew.

2. Batch Process a Directory (batch)

Recursively scans a directory for images and PDFs, applying OCR.

# Process all files. Skips already-OCRed PDFs.
uv run --project ~/.local-ocr scripts/ocr.py batch ./archives/

3. Extract Raw Text (text)

Does not create a PDF. Just reads the words off the page and returns them as a JSON string. Good for agents reading documents on the fly.

uv run --project ~/.local-ocr scripts/ocr.py text ./han_solo_invoice.png

Franchise Examples (Star Wars)

  • Process the Death Star blueprints: uv run --project ~/.local-ocr scripts/ocr.py pdf ./ds-1_schematics.pdf
  • Extract raw orders: uv run --project ~/.local-ocr scripts/ocr.py text ./order_66_memo.jpg
  • Archive run: uv run --project ~/.local-ocr scripts/ocr.py batch /archives/jedi_temple

Troubleshooting

  • File already contains text: This is the most common "error", but it isn't an error. ocrmypdf returns exit code 6 when it skips a file that already has text. The wrapper script catches this and reports a JSON "status": "success" with a message noting the side-step.
  • Dependencies Missing: Run the setup.sh script again if the agent complains about missing tesseract or Python modules.
Weekly Installs
23
First Seen
Feb 28, 2026
Installed on
gemini-cli23
github-copilot23
codex23
amp23
cline23
kimi-cli23