PDF Processing Guide
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
CRITICAL: Smart PDF Reading — Avoid Context Overflow
Claude's Read tool converts each PDF page into an image. The API has a hard limit of 100 images per conversation. A 90+ page PDF will fail outright, and even smaller PDFs can consume enormous context budget (each page-image costs far more tokens than equivalent plain text).
This is the #1 cause of failures when processing PDFs. Always think before you read.
Step 0: Probe First, Read Later
For any PDF the user uploads or asks you to read, run the probe script first to understand what you're dealing with:
python scripts/probe_pdf.py <file.pdf>
This reads none of the PDF into context (only the short probe report) and gives you: page count, content type (text-dense / slides / scanned), estimated token cost, whether a TOC exists, and a recommended reading strategy. Follow the recommended strategy.
If probe_pdf.py is not available (e.g. running outside the skill directory), do the manual equivalent:
pdfinfo <file.pdf> | grep Pages
pdftotext -f 1 -l 3 <file.pdf> - | head -100
This tells you the page count and whether pdftotext produces real text (vs. empty output for scanned PDFs).
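The manual probe can also be scripted. The sketch below is a hypothetical helper, not the contents of probe_pdf.py: the function names are illustrative, it assumes pdfinfo and pdftotext (poppler-utils) are on PATH, and it assumes "chars/page" means non-whitespace characters. The 50 chars/page cutoff is the one this guide uses for scanned PDFs.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Pull the page count out of pdfinfo's 'Pages: N' line."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def chars_per_page(sample_text: str, pages_sampled: int) -> float:
    """Average non-whitespace characters per sampled page (assumed metric)."""
    visible = sum(1 for ch in sample_text if not ch.isspace())
    return visible / max(pages_sampled, 1)


def manual_probe(pdf_path: str, sample_pages: int = 3) -> dict:
    """Run pdfinfo and pdftotext, then summarize what they report."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    pages = parse_page_count(info)
    sampled = max(1, min(sample_pages, pages))
    text = subprocess.run(
        ["pdftotext", "-f", "1", "-l", str(sampled), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    density = chars_per_page(text, sampled)
    # Threshold taken from this guide: under 50 chars/page suggests a scan.
    return {"pages": pages, "chars_per_page": density, "likely_scanned": density < 50}
```

Only the small returned dictionary needs to be read; the extracted sample text stays inside the script.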
Decision Tree
Is page count known?
    No → Run: pdfinfo <file.pdf> | grep Pages
    Yes ↓
Is the PDF likely scanned (pdftotext produces < 50 chars/page)?
    Yes → Go to "Scanned PDF Path" below
    No ↓
Page count?
    ≤ 10 pages
        → Read directly with Read tool. Safe.
    11-50 pages AND content is sparse (slides, < 500 chars/page)
        → pdftotext full extraction, read the .txt
    11-50 pages AND content is dense (≥ 500 chars/page)
        → pdftotext full extraction
        → Check: if .txt file > 40k tokens (~120k chars), read in chunks
        → Otherwise read the .txt in full
    51-150 pages
        → NEVER read the PDF directly
        → pdftotext -f 1 -l 5 → read overview/TOC first
        → Then extract specific sections by page range as needed
        → If user needs full coverage: python scripts/smart_read.py <file.pdf> --output-dir <dir>
    > 150 pages
        → NEVER read the PDF directly
        → MUST use chunked smart reading
        → python scripts/smart_read.py <file.pdf> --output-dir <dir>
        → Read index.json first (small), then read specific chunks on demand
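The decision tree can be condensed into a small function. This is a minimal sketch under stated assumptions, not the logic of probe_pdf.py: the function names and strategy labels are hypothetical, and the 3 chars/token ratio is derived from the "40k tokens (~120k chars)" figure in the tree.

```python
def choose_strategy(pages: int, chars_per_page: float) -> str:
    """Map page count and text density to a reading strategy label."""
    if chars_per_page < 50:
        return "scanned"                  # follow the Scanned PDF Path
    if pages <= 10:
        return "read_directly"            # Read tool is safe
    if pages <= 50:
        return "pdftotext_full"           # extract everything, read the .txt
    if pages <= 150:
        return "overview_then_sections"   # pages 1-5 first, then ranges
    return "chunked_smart_read"           # smart_read.py plus index.json


def needs_chunking(txt_chars: int, token_budget: int = 40_000,
                   chars_per_token: int = 3) -> bool:
    """True when the extracted .txt is estimated to exceed the token budget."""
    return txt_chars / chars_per_token > token_budget


print(choose_strategy(30, 800))   # pdftotext_full
print(needs_chunking(130_000))    # True
```

The scanned check runs first, as in the tree, so a short scanned PDF is still routed to OCR rather than read as text.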
Scanned PDF Path
When pdftotext produces empty or garbled output (< 50 chars/page average), the PDF is likely scanned or image-based:
- Try OCR first (if pytesseract and pdf2image are available):
    from pdf2image import convert_from_path
    import pytesseract
    # Assumes pdf_path and total_pages are already set (e.g. from the probe step)
    # Process in small batches to control memory
    for start in range(0, total_pages, 5):
        images = convert_from_path(pdf_path, first_page=start+1, last_page=min(start+5, total_pages))
        for i, img in enumerate(images):
            text = pytesseract.image_to_string(img)
            # Save text to file...
- If OCR is not available or quality is poor, fall back to reading pages as images with the Read tool, but strictly limit to 10 pages per batch. Ask the user which pages matter most.
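Both the OCR loop and the image fallback depend on walking the document in fixed-size page batches. The helper below is an illustrative sketch (the name page_batches is hypothetical) that yields inclusive 1-based page ranges. These can be passed as first_page/last_page to convert_from_path or used to plan Read-tool batches of at most 10 pages.

```python
def page_batches(total_pages: int, batch_size: int = 10):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(23)))
# [(1, 10), (11, 20), (21, 23)]
```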
Chunked Smart Reading (for large PDFs)
For PDFs over ~50 dense pages where the user needs comprehensive understanding:
python scripts/smart_read.py <file.pdf> --output-dir <dir> --chunk-size 15
This produces:
- index.json: A manifest listing every chunk with page ranges, character counts, estimated tokens, and a preview of the first line
- chunk_001.txt, chunk_002.txt, ...: The actual text, split by page range
Workflow:
- Read index.json (small, fits easily in context)
- Identify which chunks are relevant based on the user's question
- Read only those specific chunk files
- If the user needs a full summary, process chunks one at a time and build up notes incrementally; do NOT try to read all chunks into context simultaneously
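To make the chunk-and-index idea concrete, here is a minimal sketch of how page-based chunking could work. It is not the implementation of smart_read.py, and the field names (file, first_page, est_tokens, and so on) are hypothetical. It relies on pdftotext ending each page with a form feed character, and reuses the rough 3 chars/token estimate from the decision tree.

```python
def build_chunks(full_text: str, chunk_size: int = 15) -> list[dict]:
    """Split pdftotext output into page-range chunks with manifest fields."""
    pages = full_text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty piece after the final form feed
    chunks = []
    for start in range(0, len(pages), chunk_size):
        body = "\n".join(pages[start:start + chunk_size])
        first_line = next((ln.strip() for ln in body.splitlines() if ln.strip()), "")
        chunks.append({
            "file": f"chunk_{len(chunks) + 1:03d}.txt",
            "first_page": start + 1,
            "last_page": min(start + chunk_size, len(pages)),
            "chars": len(body),
            "est_tokens": len(body) // 3,
            "preview": first_line[:80],
            "text": body,
        })
    return chunks


def relevant_chunks(chunks: list[dict], keyword: str) -> list[str]:
    """Shortlist chunk files by keyword; runs in a script, outside context."""
    return [c["file"] for c in chunks if keyword.lower() in c["text"].lower()]
```

A keyword shortlist like this is one way to decide which chunk files to open. Everything in the manifest except the text field is small enough to read in full.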
Delegating to Subagents
When spawning Task agents to process PDF content, always extract text BEFORE spawning the agent and pass the .txt path instead. Use strong prohibitions in the prompt:
MANDATORY: Do NOT read any .pdf file with the Read tool. ONLY read the .txt files provided.
Text files: {list of .txt paths}
Agents tend to ignore soft preferences ("prefer reading the text file") but will obey strong prohibitions. This single pattern prevents subagents from accidentally consuming the image budget.
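One way to apply this pattern consistently is to generate the subagent prompt from code. The sketch below is illustrative only: build_subagent_prompt is a hypothetical helper, and the "Text files" and "Task" labels are assumptions. It refuses anything that is not a .txt path, so a PDF cannot slip into the prompt.

```python
def build_subagent_prompt(task: str, txt_paths: list[str]) -> str:
    """Compose a prompt that only ever points the agent at extracted text."""
    bad = [p for p in txt_paths if not p.lower().endswith(".txt")]
    if bad:
        raise ValueError(f"pass extracted .txt files only, got: {bad}")
    listing = "\n".join(f"- {p}" for p in txt_paths)
    return (
        "MANDATORY: Do NOT read any .pdf file with the Read tool. "
        "ONLY read the .txt files provided.\n"
        f"Text files:\n{listing}\n\n"
        f"Task: {task}"
    )


print(build_subagent_prompt("Summarize section 3", ["out/report.txt"]))
```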
Quick Start
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
Python Libraries
pypdf - Basic Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
Split PDF
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
# Write each page to its own single-page file
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
Extract Metadata
from pypdf import PdfReader
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
Rotate Pages
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
    writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Extract Tables
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Advanced Table Extraction
import pandas as pd
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Skip empty tables
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)
# Combine all tables into one sheet
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
reportlab - Create PDFs
Basic PDF Creation
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
c.line(100, height - 140, 400, height - 140)
c.save()
Create PDF with Multiple Pages
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
Command-Line Tools
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
pdftk (if available)
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
Common Tasks
Extract Text from Scanned PDFs
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path('scanned.pdf')
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
print(text)
Add Watermark
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
    writer.write(output)
Extract Images
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
Password Protection
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Probe PDF | probe_pdf.py | python scripts/probe_pdf.py <file.pdf> |
| Smart chunked read | smart_read.py | python scripts/smart_read.py <file.pdf> --output-dir <dir> |
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | See FORMS.md | See FORMS.md |
Next Steps
- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md