PDF Processing Guide
Reference Files
Detailed guides for specific tasks and libraries:
- python-libraries.md - Comprehensive Python library examples (pypdf, pdfplumber, reportlab)
- cli-tools.md - Command-line tools reference (pdftotext, qpdf, pdftk)
- reference.md - Advanced features (pypdfium2, pdf-lib JavaScript, OCR)
- forms.md - Complete workflow for filling PDF forms
Quick Start
Extract Text
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
Extract Tables
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for table in tables:
            print(table)
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
Split PDF into Pages
from pypdf import PdfWriter, PdfReader
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
Create PDF
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World!")
c.save()
Common Workflows
Fill Out a Form
PDF forms can be fillable (with form fields) or non-fillable (requiring manual positioning). For complete step-by-step instructions, see forms.md.
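One way to tell which case applies is to ask pypdf for the document's form fields. The following is a minimal sketch, assuming pypdf is installed; `classify_form` and `inspect_pdf` are hypothetical helper names, and `form.pdf` is a placeholder path. It is not the workflow from forms.md.

```python
def classify_form(fields):
    """Return 'fillable' if form fields were found, else 'non-fillable'.

    `fields` is the value returned by pypdf's PdfReader.get_fields():
    a dict mapping field names to field objects, or None when the
    document has no interactive form.
    """
    return "fillable" if fields else "non-fillable"


def inspect_pdf(path):
    # Imported lazily so classify_form works even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return classify_form(reader.get_fields())


# Example call (placeholder path, not a file from this guide):
# print(inspect_pdf("form.pdf"))
```

A "non-fillable" result suggests the text has to be positioned manually on the page.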
Extract and Analyze Data
Combine text and table extraction, then export the tables as JSON for downstream processing:
import pdfplumber
import json
# Extract all text
with pdfplumber.open("document.pdf") as pdf:
    full_text = "\n".join(
        page.extract_text() or "" for page in pdf.pages
    )
# Extract tables as structured data
data = []
with pdfplumber.open("document.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        for table in (page.extract_tables() or []):
            data.append({"page": page_num, "data": table})
with open("output.json", "w") as f:
    json.dump(data, f, indent=2)
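Extracted tables are lists of rows, which can be awkward to analyze downstream. As a hedged sketch, the hypothetical helper `table_to_records` below converts one table into a list of dicts. It assumes the first row holds the column names, which will not be true for every PDF.

```python
def table_to_records(table):
    """Turn one extracted table into a list of dicts keyed by the header row.

    Assumption: the first row contains column names. Empty cells may be
    None in extracted tables, so they are normalized to empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        # zip stops at the shorter sequence, so ragged rows do not raise.
        records.append(dict(zip(header, cells)))
    return records


sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_records(sample))
```

Each record can then be stored in place of the raw table in the `data` list before calling `json.dump`.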
Process Scanned PDFs (OCR)
Extract text from image-based PDFs using OCR. See reference.md for detailed OCR examples.
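A rough sketch of the idea follows. It assumes the third-party packages pdf2image and pytesseract are installed, along with the Poppler and Tesseract binaries they wrap. The helper names `needs_ocr` and `ocr_pdf` are hypothetical, and the 20-character threshold is an arbitrary illustration rather than a recommended value.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page as scanned if its text layer is nearly empty.

    `extracted_text` is what a text extractor returned for the page
    (possibly None or whitespace for image-only pages).
    """
    return len((extracted_text or "").strip()) < min_chars


def ocr_pdf(path, dpi=300):
    """Render each page to an image and run Tesseract on it."""
    # Lazy imports: both packages are optional third-party dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(image) for image in images]


# Example call (placeholder path, not a file from this guide):
# page_texts = ocr_pdf("scanned.pdf")
```

Checking `needs_ocr` on ordinary extraction output first avoids running the much slower OCR pass on pages that already have a text layer.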
Add Password Protection
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("password")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)
Tool Selection Guide
| Task | Recommended Tool | See Also |
|---|---|---|
| Extract text/tables | pdfplumber | python-libraries.md |
| Merge/split/rotate | pypdf | python-libraries.md |
| Create PDFs | reportlab | python-libraries.md |
| Command-line operations | qpdf/pdftotext | cli-tools.md |
| Fill forms | pypdf/pdf-lib | forms.md |
| Scanned PDFs/OCR | pytesseract | reference.md |
| Advanced rendering | pypdfium2 | reference.md |
| JavaScript context | pdf-lib | reference.md |
Next Steps
- Filling a form? → forms.md
- Need Python library details? → python-libraries.md
- Using command line? → cli-tools.md
- Advanced features (OCR, rendering, JS)? → reference.md