PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
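The routing rule above can be sketched as a small function. This is only an illustration: choose_route and its parameters are hypothetical names, and web_extract itself is an agent tool, not a Python function.

```python
from urllib.parse import urlparse


def choose_route(source: str, batch: bool = False, web_extract_failed: bool = False) -> str:
    """Return 'web_extract' or 'local' following the Step 1 rule.

    Hypothetical helper: the name, parameters and return values are
    illustrative and not part of this skill's scripts.
    """
    is_url = urlparse(source).scheme in ("http", "https")
    if is_url and not batch and not web_extract_failed:
        return "web_extract"
    # Local files, failed remote extraction and batch jobs are handled locally.
    return "local"
```

For example, choose_route("document.pdf") falls through to local extraction because the path has no http(s) scheme.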

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

If the user needs marker capabilities but the system lacks ~5GB free disk:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
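A minimal sketch of how the free-space check behind that message might look, using only the standard library. It is not the implementation of scripts/extract_marker.py --check. The function names are hypothetical, and the 5GB threshold is the figure quoted in this skill.

```python
import shutil
from typing import Optional

REQUIRED_GB = 5.0  # figure quoted in this skill for PyTorch plus models


def free_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def marker_space_message(available_gb: float, required_gb: float = REQUIRED_GB) -> Optional[str]:
    """Return the warning to show the user, or None when there is enough room."""
    if available_gb >= required_gb:
        return None
    return (
        f"marker-pdf needs ~{required_gb:g}GB for PyTorch and models; "
        f"only {available_gb:.1f}GB is free. Options: free up space, "
        "provide a URL for web_extract, or fall back to pymupdf "
        "(text-based PDFs only)."
    )
```

A caller would pass marker_space_message(free_gb()) and relay any non-None result to the user before attempting the install.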


pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
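One way to apply the decision rule is to check whether a PDF has a usable text layer before installing marker-pdf. The sketch below assumes that a page yielding almost no text from get_text() is a scanned image. The 20-character threshold, the 50% cutoff and both function names are illustrative assumptions, not part of the helper scripts.

```python
from typing import Iterable


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20,
                  max_empty_ratio: float = 0.5) -> bool:
    """Heuristic: True when most pages have (almost) no extractable text."""
    texts = list(page_texts)
    if not texts:
        return False
    empty = sum(1 for t in texts if len(t.strip()) < min_chars)
    return empty / len(texts) > max_empty_ratio


def pdf_looks_scanned(path: str) -> bool:
    """Apply the heuristic to a real file (requires `pip install pymupdf`)."""
    import pymupdf  # imported lazily so the heuristic itself has no dependency

    with pymupdf.open(path) as doc:
        return looks_scanned(page.get_text() for page in doc)
```

If pdf_looks_scanned returns True, pymupdf alone is unlikely to produce useful text, and marker-pdf or web_extract is the better route.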

marker-pdf (high-quality OCR)

# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf

Via helper script:

python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
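To drive the CLI from Python, a thin subprocess wrapper is usually enough. The sketch below uses only the marker_single invocation and the --output_dir flag shown above. The function names are hypothetical, and it assumes marker-pdf is already installed and on PATH.

```python
import subprocess
from typing import List


def marker_single_cmd(pdf: str, output_dir: str = "./output") -> List[str]:
    """Build the argv for marker_single using only the flag shown above."""
    return ["marker_single", pdf, "--output_dir", output_dir]


def run_marker_single(pdf: str, output_dir: str = "./output") -> int:
    """Run the CLI and return its exit code.

    Raises FileNotFoundError if marker_single is not installed.
    """
    return subprocess.run(marker_single_cmd(pdf, output_dir), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.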

Arxiv Papers

# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
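The two URL forms differ only in the path segment, so they can be derived from a paper ID. This is a minimal sketch with a hypothetical helper name. It assumes new-style arXiv identifiers such as 2402.03300 and rejects anything else, including old-style IDs.

```python
import re

# New-style arXiv identifiers such as 2402.03300 or 2402.03300v2.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_urls(paper_id: str) -> dict:
    """Return the abstract and full-PDF URLs for a new-style arXiv ID."""
    if not _ARXIV_ID.match(paper_id):
        raise ValueError(f"not a new-style arXiv ID: {paper_id!r}")
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",  # abstract page (fast)
        "pdf": f"https://arxiv.org/pdf/{paper_id}",  # full paper
    }
```

The resulting URLs are the ones to pass to web_extract, as in the calls above.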

Notes

  • web_extract is always the first choice for URLs
  • pymupdf is the safe default — instant, no models, works everywhere
  • marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
  • Both helper scripts accept --help for full usage
  • marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use
  • For Word docs: pip install python-docx (better than OCR — parses actual structure)
  • For PowerPoint: see the powerpoint skill (uses python-pptx)