local-ocr
Local OCR Pipeline Skill
Robust Optical Character Recognition (OCR) pipeline driven by ocrmypdf and tesseract.
Handles scanned PDFs, rotated image inputs, and raw text extraction securely and locally without external APIs.
Why not GPU via PyTorch/EasyOCR? The ocrmypdf tool is the industry standard for producing searchable PDFs. It leverages tesseract for pixel-accurate text placement. A pure-CPU pipeline is leaner (avoids a 1.5 GB PyTorch payload) and reliably embeds text exactly where it appears in the scanned image.
Capabilities
- Searchable PDF Generation: Converts rasterized/scanned PDFs or raw images (.jpg, .png, etc.) into PDFs with a selectable, searchable text layer.
- Auto-Rotation & Deskew: Automatically detects incorrectly rotated text and straightens crooked scans.
- Idempotent In-Place Processing: Safely processes files in-place using --skip-text, preventing double-processing of a PDF that already has embedded text.
- Structured JSON Output: All commands output structured JSON, making failure states (like missing dependencies) parseable by agents.
- Raw Text Extraction: Raw string extraction fallback for when agents need text directly in-memory instead of a PDF file.
Setup
# Installs system dependencies (tesseract, ocrmypdf, ghostscript) and sets up isolated venv
bash skills/ocr/scripts/setup.sh
Usage
uv run --project ~/.local-ocr scripts/ocr.py <command>
1. Generate a Searchable PDF (pdf)
Produces a standard, layered PDF. If you give it an image, it wraps it in a PDF. If you give it a scanned PDF, it adds the invisible text layer.
# Overwrites the file in-place, skipping it safely if it already contains text
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf
# Output to a different file
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scan_001.png -o ./contract.pdf
# Force reprocessing (ignore existing text layer)
uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf --force
Note: By default, auto-rotate and deskew are enabled. Disable with --no-rotate or --no-deskew.
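The defaults described in this section could map onto ocrmypdf options roughly as sketched below. This is not the wrapper's actual code: the helper name `build_ocrmypdf_args` and its parameters are hypothetical. The options `--skip-text`, `--force-ocr`, `--rotate-pages`, and `--deskew` are real ocrmypdf flags, but mapping the wrapper's `--force` to `--force-ocr` (rather than `--redo-ocr`) is an assumption.

```python
def build_ocrmypdf_args(src, dest=None, force=False, rotate=True, deskew=True):
    """Build an ocrmypdf argument list (hypothetical helper, for illustration)."""
    # In-place processing: reuse the input path when no output path is given.
    dest = dest or src
    args = ["ocrmypdf"]
    # --skip-text leaves pages that already carry text untouched;
    # --force-ocr re-OCRs everything (assumed equivalent of the wrapper's --force).
    args.append("--force-ocr" if force else "--skip-text")
    if rotate:
        args.append("--rotate-pages")  # fix pages rotated by 90/180/270 degrees
    if deskew:
        args.append("--deskew")  # straighten slightly crooked scans
    args += [src, dest]
    return args
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems with file names that contain spaces.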
2. Batch Process a Directory (batch)
Recursively scans a directory for images and PDFs, applying OCR.
# Process all files. Skips already-OCRed PDFs.
uv run --project ~/.local-ocr scripts/ocr.py batch ./archives/
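The recursive discovery step might look like the following minimal sketch. The function name `find_ocr_candidates` and the extension set are assumptions; the wrapper's actual list of supported formats is not documented here.

```python
from pathlib import Path

# Assumed extension set; adjust to whatever the wrapper actually accepts.
OCR_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_ocr_candidates(root):
    """Return all PDFs and images below root, searching subdirectories too."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        # Compare suffixes case-insensitively so SCAN.PDF is picked up as well.
        if p.is_file() and p.suffix.lower() in OCR_SUFFIXES
    )
```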
3. Extract Raw Text (text)
Does not create a PDF. Just reads the words off the page and returns them as a JSON string. Good for agents reading documents on the fly.
uv run --project ~/.local-ocr scripts/ocr.py text ./han_solo_invoice.png
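An agent consuming this output could parse it along these lines. This is a sketch under stated assumptions: only the "status" field is mentioned in this document, so the "text" and "message" key names, and the function name `extract_text`, are hypothetical.

```python
import json


def extract_text(stdout):
    """Parse the wrapper's JSON output and return the recognized text.

    Assumes a payload shaped like {"status": "success", "text": "..."};
    the "text" and "message" keys are illustrative guesses.
    """
    payload = json.loads(stdout)
    if payload.get("status") != "success":
        # Surface the wrapper's own message when one is present.
        raise RuntimeError(payload.get("message", "OCR failed"))
    return payload.get("text", "")
```

The `stdout` argument would typically come from `subprocess.run(..., capture_output=True, text=True).stdout`.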
Franchise Examples (Star Wars)
- Process the Death Star blueprints:
uv run --project ~/.local-ocr scripts/ocr.py pdf ./ds-1_schematics.pdf
- Extract raw orders:
uv run --project ~/.local-ocr scripts/ocr.py text ./order_66_memo.jpg
- Archive run:
uv run --project ~/.local-ocr scripts/ocr.py batch /archives/jedi_temple
Troubleshooting
- File already contains text: This is the most common "error", but it isn't an error. ocrmypdf returns exit code 6 when it finds that a file already has text and leaves it unprocessed. The wrapper script catches this and reports a JSON "status": "success" with a message noting the skip.
- Dependencies Missing: Run the setup.sh script again if the agent complains about missing tesseract or Python modules.
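The exit-code handling described in the first item could be sketched as follows. The function name `to_result` and the exact message strings are hypothetical; only the "status": "success" convention and exit code 6 come from this document.

```python
# ocrmypdf exit code meaning "the file already appears to contain text".
ALREADY_DONE_OCR = 6


def to_result(returncode, stderr=""):
    """Translate an ocrmypdf exit code into a JSON-ready dict (assumed shape)."""
    if returncode == 0:
        return {"status": "success", "message": "OCR completed"}
    if returncode == ALREADY_DONE_OCR:
        # Existing text is not a failure: report success and note the skip.
        return {"status": "success", "message": "Skipped: file already contains text"}
    # Any other non-zero code is treated as a real error.
    message = stderr.strip() or f"ocrmypdf exited with code {returncode}"
    return {"status": "error", "message": message}
```

Passing the returned dict through `json.dumps` yields the kind of parseable output the wrapper promises to agents.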