PDF Processing

This skill provides tools and guidance for extracting text and tables from PDF documents, with pointers to reference material for form filling.

Quick Start

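Use pdfplumber to extract the text of the first page of a PDF:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    # pdf.pages is a list of pages; index 0 is the first page
    text = pdf.pages[0].extract_text()

print(text)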

Installation

Install pdfplumber with pip:

pip install pdfplumber

Basic Text Extraction

To extract the text of every page in a PDF, skipping pages that yield no text:

import pdfplumber

def extract_text(pdf_path):
    """Extract all text from a PDF file."""
    text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text.append(page_text)
    return "\n\n".join(text)
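The function above silently skips pages that yield no text, which is common for scanned or image-only pages. As a minimal sketch, the helper below reports which pages came back empty so they can be handled separately. It assumes page_texts is a list built by calling page.extract_text() on each page; the name find_empty_pages and the sample data are illustrative, not part of pdfplumber.

```python
def find_empty_pages(page_texts):
    """Return 1-based page numbers whose extracted text is missing or blank."""
    empty = []
    for number, page_text in enumerate(page_texts, start=1):
        # Treat None, "", and whitespace-only strings as "no text".
        if not page_text or not page_text.strip():
            empty.append(number)
    return empty


# Hypothetical per-page results: page 2 has no text, page 3 only whitespace.
sample = ["Introduction", None, "   ", "Conclusion"]
print(find_empty_pages(sample))  # prints [2, 3]
```

Pages flagged this way may need OCR or manual review, since text extraction alone cannot recover their content.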

Table Extraction

To extract every table from every page of a PDF. Each table is returned as a list of rows, and each row is a list of cell values:

import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from a PDF file."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables
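Because each extracted table is a plain list of rows, it can be passed directly to standard-library tools. The sketch below serializes one table to CSV text using the csv module. It assumes empty cells arrive as None, as pdfplumber typically returns them; the name table_to_csv and the sample table are illustrative.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Replace None (an empty cell) with an empty string.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical table in the shape that extract_tables() returns.
sample_table = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_csv(sample_table))
```

To export every table in a document, call table_to_csv on each element of the list returned by extract_tables(pdf_path).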

Form Filling

For filling PDF forms, see references/FORMS.md.

Advanced Table Extraction

For complex tables with merged cells, see references/TABLES.md and run scripts/extract.py.