pdf-processing
PDF Processing
Overview
Generate, manipulate, and extract data from PDF documents. This skill covers the Python PDF ecosystem: pypdf for merging/splitting/metadata, pdfplumber for text and table extraction, reportlab for generation, pytesseract for OCR, and strategies for form filling, watermarking, and complex document assembly.
Apply this skill whenever PDFs need to be created, parsed, transformed, or combined through code.
Multi-Phase Process
Phase 1: Requirements
- Determine operation type (generate, extract, manipulate)
- Identify input PDF characteristics (scanned, digital, forms)
- Define output requirements (format, quality, size)
- Plan data pipeline (source data to PDF or PDF to data)
- Assess volume and performance requirements
STOP — Do NOT select a library until the operation type and input characteristics are clear.
Phase 2: Implementation
- Select appropriate library for the task (see decision table)
- Implement core processing logic
- Handle edge cases (corrupted files, encrypted PDFs, mixed content)
- Add error handling and validation
- Optimize for file size and processing speed
STOP — Do NOT skip edge case handling for encrypted, rotated, or scanned PDFs.
Phase 3: Validation
- Verify output renders correctly in multiple PDF viewers
- Check text is selectable (not rasterized) when applicable
- Validate extracted data accuracy
- Test with edge case PDFs (large, encrypted, scanned)
- Verify accessibility (tagged PDF where needed)
Library Selection Decision Table
| Task | Library | Why | Alternative |
|---|---|---|---|
| Text extraction | pdfplumber | Best accuracy, handles layouts | pypdf (simpler, less accurate) |
| Table extraction | pdfplumber | Structured table parsing | camelot (dedicated table tool) |
| PDF generation | reportlab | Full control, professional quality | weasyprint (HTML-to-PDF) |
| Merge / split | pypdf | Simple, reliable, fast | — |
| Form filling | pypdf | Reads and fills AcroForms | pdfrw (alternative API) |
| Metadata read/write | pypdf | Read/write PDF properties | — |
| OCR (scanned docs) | pytesseract + pdf2image | Scanned document text extraction | EasyOCR (deep learning) |
| Watermarking | pypdf + reportlab | Overlay pages | — |
| HTML to PDF | weasyprint | CSS-based layout, server-friendly | playwright (browser rendering) |
PDF Generation with ReportLab
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import cm, mm
from reportlab.lib.colors import HexColor
from reportlab.platypus import (
SimpleDocTemplate, Paragraph, Spacer, Table,
TableStyle, Image, PageBreak
)
from reportlab.lib import colors
def generate_report(output_path, data):
doc = SimpleDocTemplate(
output_path,
pagesize=A4,
topMargin=2.5*cm,
bottomMargin=2.5*cm,
leftMargin=2.5*cm,
rightMargin=2.5*cm,
)
styles = getSampleStyleSheet()
styles.add(ParagraphStyle(
name='CustomTitle',
parent=styles['Title'],
fontSize=24,
textColor=HexColor('#2F5496'),
spaceAfter=20,
))
story = []
# Title
story.append(Paragraph(data['title'], styles['CustomTitle']))
story.append(Spacer(1, 12))
# Body text
story.append(Paragraph(data['body'], styles['Normal']))
story.append(Spacer(1, 20))
# Table
table_data = [['Name', 'Value', 'Status']]
for row in data['rows']:
table_data.append([row['name'], row['value'], row['status']])
table = Table(table_data, colWidths=[6*cm, 4*cm, 4*cm])
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), HexColor('#2F5496')),
('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('FONTSIZE', (0, 0), (-1, 0), 11),
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, HexColor('#F0F4FA')]),
('TOPPADDING', (0, 0), (-1, -1), 8),
('BOTTOMPADDING', (0, 0), (-1, -1), 8),
]))
story.append(table)
doc.build(story)
Custom Page Template (Headers/Footers)
from reportlab.platypus import BaseDocTemplate, Frame, PageTemplate
from datetime import datetime
def add_header_footer(canvas, doc):
canvas.saveState()
# Header
canvas.setFont('Helvetica', 9)
canvas.setFillColor(HexColor('#888888'))
canvas.drawString(2.5*cm, A4[1] - 1.5*cm, 'Company Name — Confidential')
canvas.drawRightString(A4[0] - 2.5*cm, A4[1] - 1.5*cm, f'Page {doc.page}')
# Footer
canvas.drawCentredString(A4[0]/2, 1.5*cm, f'Generated on {datetime.now():%Y-%m-%d}')
canvas.restoreState()
doc = BaseDocTemplate(output_path, pagesize=A4)
frame = Frame(2.5*cm, 2.5*cm, A4[0]-5*cm, A4[1]-5*cm)
doc.addPageTemplates([PageTemplate(id='main', frames=[frame], onPage=add_header_footer)])
Text and Table Extraction
pdfplumber
import pdfplumber
with pdfplumber.open('document.pdf') as pdf:
# Extract text from all pages
full_text = ''
for page in pdf.pages:
full_text += page.extract_text() + '\n'
# Extract tables
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
for row in table:
print(row)
# Extract text from specific area
page = pdf.pages[0]
bbox = (50, 100, 400, 300) # (x0, top, x1, bottom)
cropped = page.within_bbox(bbox)
text = cropped.extract_text()
Table Extraction Settings
table_settings = {
"vertical_strategy": "lines", # or "text", "explicit"
"horizontal_strategy": "lines",
"snap_tolerance": 3,
"join_tolerance": 3,
"edge_min_length": 3,
"min_words_vertical": 3,
"min_words_horizontal": 1,
}
tables = page.extract_tables(table_settings)
Form Filling
from pypdf import PdfReader, PdfWriter
reader = PdfReader('form.pdf')
writer = PdfWriter()
writer.append(reader)
# Fill form fields
writer.update_page_form_field_values(
writer.pages[0],
{
'full_name': 'Alice Johnson',
'email': 'alice@example.com',
'date': '2025-03-15',
'agree_terms': '/Yes', # Checkbox
},
auto_regenerate=False,
)
with open('filled_form.pdf', 'wb') as f:
writer.write(f)
OCR (Scanned PDFs)
from pdf2image import convert_from_path
import pytesseract
def ocr_pdf(pdf_path, language='eng'):
images = convert_from_path(pdf_path, dpi=300)
full_text = ''
for i, image in enumerate(images):
text = pytesseract.image_to_string(image, lang=language)
full_text += f'\n--- Page {i+1} ---\n{text}'
return full_text
# For better accuracy with specific layouts:
def ocr_with_config(image):
custom_config = r'--oem 3 --psm 6' # LSTM engine, assume uniform block
return pytesseract.image_to_string(image, config=custom_config)
Merge and Split
from pypdf import PdfReader, PdfWriter
# Merge multiple PDFs
def merge_pdfs(input_paths, output_path):
writer = PdfWriter()
for path in input_paths:
reader = PdfReader(path)
for page in reader.pages:
writer.add_page(page)
with open(output_path, 'wb') as f:
writer.write(f)
# Split PDF by page ranges
def split_pdf(input_path, ranges, output_dir):
reader = PdfReader(input_path)
for i, (start, end) in enumerate(ranges):
writer = PdfWriter()
for page_num in range(start - 1, min(end, len(reader.pages))):
writer.add_page(reader.pages[page_num])
with open(f'{output_dir}/part_{i+1}.pdf', 'wb') as f:
writer.write(f)
# Extract specific pages
def extract_pages(input_path, page_numbers, output_path):
reader = PdfReader(input_path)
writer = PdfWriter()
for num in page_numbers:
writer.add_page(reader.pages[num - 1])
with open(output_path, 'wb') as f:
writer.write(f)
Watermarking
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas as rl_canvas
from reportlab.lib.pagesizes import A4
from io import BytesIO
def create_watermark(text, opacity=0.1):
buffer = BytesIO()
c = rl_canvas.Canvas(buffer, pagesize=A4)
c.setFillAlpha(opacity)
c.setFont('Helvetica-Bold', 60)
c.setFillColorRGB(0.5, 0.5, 0.5)
c.translate(A4[0]/2, A4[1]/2)
c.rotate(45)
c.drawCentredString(0, 0, text)
c.save()
buffer.seek(0)
return PdfReader(buffer)
def apply_watermark(input_path, output_path, watermark_text):
watermark = create_watermark(watermark_text)
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark.pages[0])
writer.add_page(page)
with open(output_path, 'wb') as f:
writer.write(f)
Metadata Handling
from pypdf import PdfReader, PdfWriter
# Read metadata
reader = PdfReader('document.pdf')
info = reader.metadata
print(f'Title: {info.title}')
print(f'Author: {info.author}')
print(f'Pages: {len(reader.pages)}')
# Write metadata
writer = PdfWriter()
writer.append(reader)
writer.add_metadata({
'/Title': 'Updated Title',
'/Author': 'Author Name',
'/Subject': 'Document Subject',
'/Creator': 'My Application',
})
with open('updated.pdf', 'wb') as f:
writer.write(f)
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Fails | What To Do Instead |
|---|---|---|
| OCR on digital (text-based) PDFs | Slow and inaccurate when text is already extractable | Check if text extracts first, OCR only if empty |
| Not handling encrypted PDFs | Crashes or silent failures | Detect encryption, prompt for password or skip gracefully |
| Loading entire large PDFs into memory | Memory exhaustion on server | Stream pages or process in chunks |
| Ignoring page rotation metadata | Text extraction returns garbled results | Read and apply rotation before extraction |
| Hardcoding page dimensions | Breaks on non-A4 documents | Read dimensions from source PDF |
| Not closing file handles | Resource leaks in long-running processes | Use context managers (with statements) |
| Generating without multi-viewer testing | Rendering differences across viewers | Test in Adobe Reader, Preview, and Chrome |
| Extracting tables without tuning settings | Poor column alignment, merged cells | Adjust table_settings per document type |
Anti-Rationalization Guards
- Do NOT use OCR without first attempting direct text extraction -- check the PDF type.
- Do NOT skip encryption detection -- handle it explicitly even if "most PDFs aren't encrypted."
- Do NOT assume A4 page size -- read dimensions from the source document.
- Do NOT test in only one PDF viewer -- rendering varies across Adobe, Preview, and Chrome.
- Do NOT process large PDFs without memory-conscious patterns (streaming, chunking).
Integration Points
| Skill | How It Connects |
|---|---|
docx-processing |
DOCX-to-PDF conversion pipeline, or choosing between formats |
xlsx-processing |
Data from Excel populates PDF report tables |
email-composer |
Generated PDFs attach to professional emails |
content-research-writer |
Research output formatted as PDF whitepapers |
file-organizer |
Output file naming and directory structure conventions |
deployment |
PDF generation pipelines in server/CI environments |
Skill Type
FLEXIBLE — Select the appropriate library and approach based on the specific PDF task. ReportLab for generation, pdfplumber for extraction, pypdf for manipulation. Combine as needed.