literature-pdf-ocr-library
Installation
SKILL.md
Literature PDF OCR Library
Overview
Use this skill to build a real, traceable literature corpus instead of fabricating references or scraping arbitrary publisher pages. The default workflow is: narrow the topic, search official or stable APIs, download only legally accessible PDFs, run OCR or layout parsing, then emit a clean Markdown library with machine-readable metadata.
Canonical Directory Layout
In Oh My Paper projects, the corpus always lives under .pipeline/literature/<corpus-name>/.
In standalone projects, use research/literature/<corpus-name>/.
Never dump papers into the root or a flat directory without a corpus name.
.pipeline/
literature/
<corpus-name>/ ← one folder per topic/session, e.g. "humanoid-locomotion"
search_results.json ← raw search/ID-lookup results
library_index.json ← consolidated index for the whole corpus
library_index.jsonl
papers/
<arxiv-id>-<title-slug>/ ← one folder per paper
metadata.json
paper.pdf
ocr/ ← OCR output lives here, next to the PDF
paper/
doc_0.md ← main OCR markdown (PaddleOCR: multiple pages)
manifest.json
doc_0.md ← pdfminer fallback: single flat file
Rules:
--out-diralways points to.pipeline/literature/<corpus-name>/— never to.pipeline/literature/directly.- OCR output lives inside the paper's own folder (
papers/<slug>/ocr/), not in a top-levelocr/directory. - After OCR, record each paper's
ocr/path inliterature_bank.mdso agents can read the actual content.
Commands
# Download by arXiv IDs (recommended when IDs are known from web search)
python .claude/skills/literature-pdf-ocr-library/scripts/search_and_download_papers.py \
--arxiv-ids 2502.13817 2501.14459 \
--out-dir .pipeline/literature/my-corpus \
--download-pdfs
# Download by query
python .claude/skills/literature-pdf-ocr-library/scripts/search_and_download_papers.py \
--query "humanoid locomotion reinforcement learning" \
--out-dir .pipeline/literature/my-corpus \
--limit 20 --sources arxiv semanticscholar openalex hf_daily \
--download-pdfs
# OCR: PaddleOCR API (best quality)
export PADDLEOCR_TOKEN="<token>" # ask user, never hardcode
python .claude/skills/literature-pdf-ocr-library/scripts/paddleocr_layout_to_markdown.py \
.pipeline/literature/my-corpus/papers/*/paper.pdf \
--output-dir .pipeline/literature/my-corpus/papers \
--skip-existing
# OCR: pdfminer fallback (text-only, no layout — confirm with user first)
python .claude/skills/literature-pdf-ocr-library/scripts/paddleocr_layout_to_markdown.py \
.pipeline/literature/my-corpus/papers/*/paper.pdf \
--output-dir .pipeline/literature/my-corpus/papers \
--fallback-pdfminer
# Build index
python .claude/skills/literature-pdf-ocr-library/scripts/build_library_index.py \
--library-root .pipeline/literature/my-corpus
Resources
- Read source-strategy.md when you need source-specific behavior, file layout conventions, or legal constraints.
- Use
scripts/search_and_download_papers.pyfor traceable search and PDF download (supports--queryand--arxiv-ids). - Use
scripts/paddleocr_layout_to_markdown.pyfor single-file or batch OCR conversion (supports--fallback-pdfminer). - Use
scripts/build_library_index.pyto generatelibrary_index.jsonandlibrary_index.jsonl. - Use
scripts/ingest_literature_library.pywhen the user wants the full ingestion workflow in one go.
Related skills