Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

PDF Type	Best Approach	Script
Simple text PDF	PyMuPDF	`scripts/extract_pymupdf.py`
PDF with tables	pdfplumber	`scripts/extract_pdfplumber.py`
Scanned/image PDF (local)	pytesseract	`scripts/extract_with_ocr.py`
Complex layout, highest accuracy	Mistral OCR API	`scripts/extract_mistral_ocr.py`
End-to-end RAG pipeline	marker-pdf	Install: `pip install marker-pdf`

Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
If tables come out mangled, switch to pdfplumber.
If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR with pytesseract (free but slower).
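
The decision steps above can be sketched as a small helper. This is an illustrative sketch, not one of the bundled scripts: the function names (`looks_scanned`, `choose_approach`, `extract_page_texts`), the `MIN_CHARS_PER_PAGE` threshold, and the "more than half the pages are nearly empty" heuristic are all assumptions made for the example. Only `fitz.open` and `page.get_text()` are real PyMuPDF calls.

```python
from typing import List

# Hypothetical threshold: a page yielding fewer characters than this is
# treated as having no usable text layer (i.e. probably scanned).
MIN_CHARS_PER_PAGE = 20


def looks_scanned(page_texts: List[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when more than half of the pages yield almost no text."""
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse > len(page_texts) / 2


def choose_approach(page_texts: List[str],
                    tables_mangled: bool = False,
                    allow_api: bool = True) -> str:
    """Map the quick decision guide onto a single label.

    tables_mangled is something the caller decides after inspecting the
    PyMuPDF output; this sketch does not try to detect it automatically.
    """
    if looks_scanned(page_texts):
        # Scanned/image PDF: hosted OCR if allowed, otherwise local OCR.
        return "mistral_ocr" if allow_api else "pytesseract"
    if tables_mangled:
        return "pdfplumber"
    return "pymupdf"


def extract_page_texts(pdf_path: str) -> List[str]:
    """First pass with PyMuPDF: one plain-text string per page."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A typical use would be `choose_approach(extract_page_texts("report.pdf"))`, where `report.pdf` is a placeholder path; if the result is `"pymupdf"` the text from the first pass can be used directly.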