PDF Processing Skill

Overview

This skill enables comprehensive PDF operations through Python libraries and command-line tools. Use it for reading, creating, modifying, and analyzing PDF documents.

Quick Start

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

When to Use

Converting PDFs to Markdown - Use OpenAI Codex for intelligent conversion (RECOMMENDED FIRST STEP)
Extracting text and metadata from PDF files
Merging multiple PDFs into a single document
Splitting large PDFs into individual pages
Adding watermarks or annotations to PDFs
Password-protecting or decrypting PDFs
Extracting images from PDF documents
OCR processing for scanned documents
Creating new PDFs with reportlab
Extracting tables from structured PDFs

PDF to Markdown Conversion (OpenAI Codex)

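IMPORTANT: For all PDF documents, use OpenAI Codex to convert the contents to a .md file first, then use the markdown for further work. The examples in this section call the OpenAI Chat Completions API (client.chat.completions.create) with the model passed in the model argument.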

Why Convert to Markdown First?

Better structure preservation - Maintains headings, lists, tables
Easier text processing - Standard markdown format
Improved AI understanding - Codex understands document structure
Format flexibility - Markdown can be converted to any format
Version control friendly - Plain text, diff-friendly

OpenAI Codex Conversion

Prerequisites:

pip install openai pypdf
export OPENAI_API_KEY="your-api-key-here"

Basic Conversion:

import openai
from pypdf import PdfReader
from pathlib import Path

def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4, gpt-3.5-turbo, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""

    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
                to well-structured markdown format. Preserve:
                - Document structure (headings, sections)
                - Lists and bullet points
                - Tables (convert to markdown tables)
                - Code blocks and technical content
                - Links and references

                Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,  # Lower temperature for more consistent formatting
    )

    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')

    return output_md_path

# Usage
md_file = pdf_to_markdown_codex("document.pdf")
print(f"Markdown saved to: {md_file}")

Batch Conversion:

from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)

    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)

    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)

    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"

            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")

        except Exception as e:
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")

# Usage
batch_pdf_to_markdown("/path/to/pdfs", model="gpt-4")

Chunked Conversion for Large PDFs:

import openai
from pathlib import Path
from pypdf import PdfReader

def pdf_to_markdown_chunked(pdf_path, output_md_path=None,
                            chunk_pages=10, model="gpt-4"):
    """
    Convert large PDF by processing in chunks.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional output path
        chunk_pages: Number of pages per chunk
        model: OpenAI model to use
    """
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    # Create the client once and reuse it for every chunk
    client = openai.OpenAI()
    markdown_sections = []

    for start_page in range(0, total_pages, chunk_pages):
        end_page = min(start_page + chunk_pages, total_pages)

        # Extract chunk
        chunk_text = ""
        for page_num in range(start_page, end_page):
            text = reader.pages[page_num].extract_text()
            chunk_text += f"\n\n--- Page {page_num + 1} ---\n\n{text}"

        # Convert chunk
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Convert PDF text to markdown. Maintain structure and formatting."
                },
                {
                    "role": "user",
                    "content": f"Convert pages {start_page + 1}-{end_page} to markdown:\n\n{chunk_text}"
                }
            ],
            temperature=0.3,
        )

        markdown_sections.append(response.choices[0].message.content)
        print(f"Processed pages {start_page + 1}-{end_page}/{total_pages}")

    # Combine sections
    full_markdown = "\n\n---\n\n".join(markdown_sections)

    # Save
    if output_md_path is None:
        output_md_path = Path(pdf_path).with_suffix('.md')

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(full_markdown, encoding='utf-8')

    return output_md_path

# Usage
md_file = pdf_to_markdown_chunked("large_document.pdf", chunk_pages=20)
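
The page arithmetic in the chunked converter can be checked on its own, without any API call. The sketch below isolates it; page_chunks is a hypothetical helper name, not part of the skill, and it yields zero-based start and exclusive end indices, as the loop above does.

```python
def page_chunks(total_pages, chunk_pages):
    """Yield (start, end) zero-based page ranges; end is exclusive."""
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    for start in range(0, total_pages, chunk_pages):
        # The last chunk is clipped to the end of the document
        yield start, min(start + chunk_pages, total_pages)


# 45 pages in chunks of 20: the final chunk is shorter than the others
for start, end in page_chunks(45, 20):
    print(f"pages {start + 1}-{end}")
```

A smaller chunk_pages value means more requests but less text per request, which is the trade-off to tune when a chunk exceeds the model's context limit.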

Workflow: PDF → Markdown → Further Processing:

from pathlib import Path

def pdf_workflow(pdf_path):
    """
    Complete workflow: PDF → Markdown → Process markdown.

    Returns:
        dict with paths to original PDF, markdown, and processed content
    """
    # Step 1: Convert PDF to markdown using Codex
    print("Step 1: Converting PDF to markdown...")
    md_path = pdf_to_markdown_codex(pdf_path)

    # Step 2: Read markdown for further processing
    print("Step 2: Reading markdown content...")
    markdown_content = Path(md_path).read_text(encoding='utf-8')

    # Step 3: Further processing (example: extract headings)
    print("Step 3: Processing markdown...")
    headings = [line for line in markdown_content.split('\n') if line.startswith('#')]

    # Step 4: Additional analysis
    word_count = len(markdown_content.split())

    return {
        'pdf_path': pdf_path,
        'markdown_path': md_path,
        'markdown_content': markdown_content,
        'headings': headings,
        'word_count': word_count,
    }

# Usage
result = pdf_workflow("technical_document.pdf")
print(f"Markdown saved: {result['markdown_path']}")
print(f"Found {len(result['headings'])} headings")
print(f"Word count: {result['word_count']}")

# Now work with the markdown
with open(result['markdown_path'], encoding='utf-8') as f:
    markdown = f.read()
    # Do further processing with clean markdown
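
One caveat with the heading extraction in Step 3: a plain startswith('#') test also matches comment lines inside fenced code blocks and words such as #hashtag. A slightly stricter sketch follows; extract_headings is a hypothetical helper, and its fence handling is a simplification (any line starting with ``` or ~~~ toggles the fenced state), not a full Markdown parser.

```python
import re

FENCE = re.compile(r"^\s*(```|~~~)")
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_headings(markdown_text):
    """Return (level, title) pairs for ATX headings outside fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        if in_fence:
            continue  # ignore everything inside a code block
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


sample = "# Title\n\n```python\n# just a comment\n```\n\n## Section\n#hashtag\n"
print(extract_headings(sample))
```

Returning the heading level alongside the title makes it straightforward to build an outline or table of contents from the result.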

Cost-Effective Options:

# Use GPT-3.5 for cost savings
md_file = pdf_to_markdown_codex("document.pdf", model="gpt-3.5-turbo")

# Or use local extraction + Codex for formatting only
import openai
from pathlib import Path
from pypdf import PdfReader

def hybrid_conversion(pdf_path):
    """Extract text locally, use Codex only for formatting."""
    # Extract text (free)
    reader = PdfReader(pdf_path)
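    # Join pages with a blank line so the last word of one page
    # does not run into the first word of the next
    raw_text = "\n\n".join(page.extract_text() for page in reader.pages)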

    # Use Codex just for markdown formatting (lower cost)
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Format the text as markdown. Add appropriate headings, lists, and structure."
            },
            {
                "role": "user",
                "content": raw_text
            }
        ],
        temperature=0.3,
    )

    markdown = response.choices[0].message.content
    output_path = Path(pdf_path).with_suffix('.md')
    output_path.write_text(markdown, encoding='utf-8')

    return output_path

Best Practices:

Always convert to markdown first - Makes downstream processing easier
Use chunking for large PDFs - Avoids token limits and API timeouts
Cache conversions - Store markdown files to avoid re-conversion
Choose model based on complexity - GPT-4 for complex docs, GPT-3.5 for simple ones
Validate output - Check that markdown structure makes sense
Handle errors gracefully - Log failures, continue batch processing
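
Two of the practices above, caching and output validation, can be sketched with the standard library alone. Both helpers below are hypothetical (they are not part of the skill), the modification-time comparison is only one possible cache policy, and the min_chars threshold is an arbitrary assumption.

```python
from pathlib import Path


def needs_conversion(pdf_path, md_path):
    """True if the markdown file is missing or older than the PDF."""
    pdf_path, md_path = Path(pdf_path), Path(md_path)
    if not md_path.exists():
        return True
    return md_path.stat().st_mtime < pdf_path.stat().st_mtime


def looks_valid(markdown_text, min_chars=20):
    """Cheap sanity check: non-trivial length and an even number of ``` fences."""
    stripped = markdown_text.strip()
    if len(stripped) < min_chars:
        return False
    fence_lines = [
        line for line in stripped.splitlines() if line.lstrip().startswith("```")
    ]
    # An odd count means a code block was opened but never closed
    return len(fence_lines) % 2 == 0
```

In a batch run, needs_conversion would be checked before calling pdf_to_markdown_codex, and looks_valid on the returned text before it is written to disk.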

CLI Tool:

#!/usr/bin/env python3
"""PDF to Markdown converter using OpenAI Codex."""

import argparse
import logging
from pathlib import Path
import openai
from pypdf import PdfReader

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def pdf_to_markdown_codex(pdf_path, output_md_path=None, model="gpt-4"):
    """
    Convert PDF to markdown using OpenAI Codex.

    Args:
        pdf_path: Path to PDF file
        output_md_path: Optional path for output .md file (auto-generated if None)
        model: OpenAI model to use (gpt-4, gpt-3.5-turbo, etc.)

    Returns:
        Path to generated markdown file
    """
    # Extract text from PDF
    reader = PdfReader(pdf_path)
    pdf_text = ""

    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text()
        pdf_text += f"\n\n--- Page {page_num} ---\n\n{text}"

    # Generate markdown using OpenAI Codex
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """You are an expert document converter. Convert the provided PDF text
                to well-structured markdown format. Preserve:
                - Document structure (headings, sections)
                - Lists and bullet points
                - Tables (convert to markdown tables)
                - Code blocks and technical content
                - Links and references

                Format the output as clean, readable markdown."""
            },
            {
                "role": "user",
                "content": f"Convert this PDF text to markdown:\n\n{pdf_text}"
            }
        ],
        temperature=0.3,
    )

    markdown_content = response.choices[0].message.content

    # Save to file
    if output_md_path is None:
        pdf_stem = Path(pdf_path).stem
        output_md_path = Path(pdf_path).parent / f"{pdf_stem}.md"

    # Ensure parent directory exists
    Path(output_md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_md_path).write_text(markdown_content, encoding='utf-8')

    return output_md_path


def batch_pdf_to_markdown(pdf_directory, output_directory=None, model="gpt-4"):
    """
    Convert all PDFs in a directory to markdown.

    Args:
        pdf_directory: Directory containing PDF files
        output_directory: Optional output directory (defaults to pdf_directory/markdown)
        model: OpenAI model to use
    """
    pdf_dir = Path(pdf_directory)

    if output_directory is None:
        output_dir = pdf_dir / "markdown"
    else:
        output_dir = Path(output_directory)

    output_dir.mkdir(parents=True, exist_ok=True)

    pdf_files = list(pdf_dir.glob("*.pdf"))
    total = len(pdf_files)

    logger.info(f"Found {total} PDF files to convert")

    for i, pdf_file in enumerate(pdf_files, 1):
        try:
            output_md = output_dir / f"{pdf_file.stem}.md"

            logger.info(f"[{i}/{total}] Converting {pdf_file.name}...")
            pdf_to_markdown_codex(pdf_file, output_md, model=model)
            logger.info(f"✓ Saved to {output_md.name}")

        except Exception as e:
            logger.error(f"✗ Failed to convert {pdf_file.name}: {e}")

    logger.info(f"\nConversion complete! Files in: {output_dir}")


def main():
    parser = argparse.ArgumentParser(description='Convert PDF to Markdown using OpenAI')
    parser.add_argument('input', help='PDF file or directory')
    parser.add_argument('-o', '--output', help='Output directory or file')
    parser.add_argument('-m', '--model', default='gpt-4', help='OpenAI model (gpt-4, gpt-3.5-turbo)')
    parser.add_argument('--chunk-pages', type=int, default=10, help='Pages per chunk (accepted but not used by this script)')

    args = parser.parse_args()

    input_path = Path(args.input)

    if input_path.is_file():
        # Single file
        output = args.output or input_path.with_suffix('.md')
        md_path = pdf_to_markdown_codex(input_path, output, model=args.model)
        print(f"✓ Converted: {md_path}")
    elif input_path.is_dir():
        # Directory
        batch_pdf_to_markdown(input_path, args.output, model=args.model)
    else:
        parser.error(f"Input path not found: {input_path}")


if __name__ == '__main__':
    main()

Save as pdf2md.py and use:

# Single file
python pdf2md.py document.pdf

# Directory
python pdf2md.py /path/to/pdfs -o /path/to/markdown

# With GPT-3.5 (cheaper)
python pdf2md.py document.pdf --model gpt-3.5-turbo

Python Libraries

pypdf - Core PDF Operations

Merging PDFs:

# PdfMerger was deprecated in pypdf 3.x and removed in pypdf 5;
# PdfWriter.append() replaces it
from pypdf import PdfWriter

writer = PdfWriter()
writer.append("file1.pdf")
writer.append("file2.pdf")
writer.write("merged.pdf")
writer.close()

Splitting PDFs:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i+1}.pdf")

Extracting Metadata:

from pypdf import PdfReader

reader = PdfReader("document.pdf")
info = reader.metadata  # None when the PDF has no document information
if info is not None:
    print(f"Author: {info.author}")
    print(f"Title: {info.title}")
print(f"Pages: {len(reader.pages)}")

pdfplumber - Advanced Text Extraction

Text with Layout Preservation:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True)  # layout=True approximates the page layout with whitespace
        print(text)

Table Extraction:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

reportlab - Creating PDFs

Create PDF from Scratch:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, World!")
c.showPage()
c.save()

Multi-page Documents:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Title", styles['Heading1']))
story.append(Paragraph("Body text here.", styles['Normal']))

doc.build(story)

Command-Line Tools

pdftotext (Poppler)

pdftotext document.pdf output.txt
pdftotext -layout document.pdf output.txt  # Preserve layout

qpdf

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract pages 1-5 into a new file
qpdf document.pdf --pages . 1-5 -- first_five.pdf

# Decrypt (add --password=PASSWORD before --decrypt if the file needs one to open)
qpdf --decrypt encrypted.pdf decrypted.pdf

pdftk

# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk document.pdf burst output page_%02d.pdf

# Rotate
pdftk document.pdf cat 1-endeast output rotated.pdf
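
When these tools are driven from a Python workflow, building the argument list separately from running it keeps the command easy to test. The sketch below wraps the qpdf merge command shown above; qpdf_merge_command and merge_with_qpdf are hypothetical helper names, not part of the skill.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for: qpdf --empty --pages <inputs> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *map(str, inputs), "--", str(output)]


def merge_with_qpdf(inputs, output):
    """Run the merge; raises CalledProcessError on a non-zero exit status."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    subprocess.run(qpdf_merge_command(inputs, output), check=True)


print(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.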

PDF-Large-Reader - Robust Extraction for Large Files

For large PDFs (100MB+, 1000+ pages), use the pdf-large-reader library with robust table extraction.

Why Use PDF-Large-Reader?

Memory-efficient - Handles 100MB+ PDFs without memory issues
Robust table extraction - Handles irregular tables with column count normalization
Multiple output formats - Generator (streaming), List, or Plain Text
Automatic strategy selection - Intelligent chunk size calculation
Complete extraction - Text, images, tables, and metadata in one pass
High test coverage - 93.58% coverage with 215 tests

Installation

# From the pdf-large-reader repository
cd /mnt/github/workspace-hub/pdf-large-reader
pip install -e .

# Or with extras
pip install -e ".[dev,progress]"

Quick Start

from pdf_large_reader import process_large_pdf, extract_text_only, extract_everything

# Simple text extraction
text = extract_text_only("large_document.pdf")
print(text)

# Process with automatic strategy selection
pages = process_large_pdf(
    "large_document.pdf",
    output_format="list",
    extract_images=True,
    extract_tables=True
)

# Memory-efficient streaming for very large files
for page in process_large_pdf("huge_file.pdf", output_format="generator"):
    print(f"Page {page.page_number}: {len(page.text)} characters")

Robust Table Extraction

NEW: Column Count Normalization (v1.3.0+)

The table extraction now handles irregular tables with different column counts:

from pdf_large_reader import extract_everything

# Extract everything including tables with robust error handling
pages = extract_everything("technical_standard.pdf")

for page in pages:
    if 'tables' in page.metadata:
        tables = page.metadata['tables']
        print(f"Page {page.page_number}: Found {len(tables)} tables")

        for i, table_df in enumerate(tables):
            print(f"  Table {i+1}: {table_df.shape[0]} rows x {table_df.shape[1]} cols")
            print(table_df.head())

How It Works:

Detects table-like structures from text positioning
Normalizes column counts across all rows
Pads short rows with empty strings
Gracefully handles malformed tables with try-except
Logs warnings instead of crashing
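
The padding step can be illustrated in a few lines of plain Python. This is a minimal sketch of the idea only, not the pdf-large-reader implementation; normalize_rows is a hypothetical name and the sample rows are made up.

```python
def normalize_rows(rows, fill=""):
    """Pad every row to the width of the widest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


ragged = [["Item", "Qty", "Unit"], ["Bolt", "12"], ["Nut"]]
for row in normalize_rows(ragged):
    print(row)
```

Once every row has the same length, the rows can be loaded into a rectangular structure such as a DataFrame without a shape error.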

Typical Performance:

API Std 650 (28 MB, 461 pages): 14,648 chars/sec, 5.18 pages/sec
API RP 579 (41 MB, 966 pages): 2,090 chars/sec, 8.48 pages/sec

Command Line Usage

# Extract text from PDF
pdf-large-reader document.pdf

# Save to file
pdf-large-reader document.pdf --output result.txt

# Extract with images and tables
pdf-large-reader document.pdf --extract-images --extract-tables

# Use generator format for large files
pdf-large-reader huge.pdf --output-format generator

# Verbose output
pdf-large-reader document.pdf --verbose

API Reference

# Main entry point with automatic strategy
process_large_pdf(
    pdf_path,
    output_format="generator",    # "generator" (default), "list", or "text"
    extract_images=False,         # Extract images
    extract_tables=False,         # Extract tables with normalization
    chunk_size=None,              # Auto-calculated if None
    fallback_api_key=None,        # OpenAI API key for complex pages
    fallback_model="gpt-4o",      # Model for fallback extraction
    progress_callback=None,       # Progress tracking function
    auto_strategy=True            # Enable automatic strategy selection
)

# Quick text extraction
extract_text_only(pdf_path) -> str

# Extract with images
extract_pages_with_images(pdf_path) -> List[PDFPage]

# Extract with tables
extract_pages_with_tables(pdf_path) -> List[PDFPage]

# Extract everything
extract_everything(pdf_path) -> List[PDFPage]

PDFPage Data Structure

@dataclass
class PDFPage:
    page_number: int          # Page number (1-indexed)
    text: str                 # Extracted text from page
    images: List[dict]        # Extracted images with metadata
    metadata: dict            # Page metadata including tables

Performance Benchmarks

Tested on Ubuntu 22.04, Python 3.11, 16GB RAM:

File Size	Pages	Time	Memory	Strategy
5 MB	10	< 5s	~50 MB	batch_all
50 MB	100	< 30s	~150 MB	chunked
100 MB	500	< 60s	~200 MB	stream_pages
200 MB	1000	< 2min	~250 MB	stream_pages

Real-World Validation

Tested with actual API standards:

✅ API RP 579 (2000) - 41 MB, 966 pages
✅ API Std 650 (2001) - 28 MB, 461 pages
✅ All extraction methods working (text, auto strategy, generator, complete)
✅ Table extraction with column normalization
✅ Image extraction (461-966 images per document)

Common Tasks

OCR for Scanned Documents

import pytesseract
from pdf2image import convert_from_path

images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")

Add Watermark

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark.pages[0])
    writer.add_page(page)

writer.write("watermarked.pdf")

Extract Images

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    for image in page.images:
        # image.name carries the extension that matches the image's actual format
        with open(f"page{page_num + 1}_{image.name}", "wb") as f:
            f.write(image.data)

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("user_password", "owner_password")
writer.write("protected.pdf")

Execution Checklist

Verify input PDF exists and is readable
Check if PDF is encrypted/DRM-protected
Choose appropriate library for task (pypdf vs pdfplumber)
Handle multi-page documents correctly
Validate output file was created
Clean up temporary files
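
The first two checklist items can be approximated before any PDF library is loaded. The sketch below is a heuristic under stated assumptions: preflight is a hypothetical helper, it reads the whole file into memory (acceptable for a sketch, not for very large files), and the /Encrypt byte search can give false positives. pypdf's PdfReader.is_encrypted is the more reliable check once the file opens.

```python
from pathlib import Path


def preflight(pdf_path):
    """Return a list of problems found; an empty list means the file looks usable."""
    path = Path(pdf_path)
    if not path.is_file():
        return [f"not a file: {path}"]
    data = path.read_bytes()
    problems = []
    # A PDF starts with "%PDF-" (some producers put a few junk bytes before it)
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    # Heuristic only: encrypted files reference an /Encrypt dictionary
    if b"/Encrypt" in data:
        problems.append("appears to be encrypted")
    return problems


print(preflight("missing.pdf"))
```

Returning a list instead of raising lets a batch job log every problem for a file and move on to the next one.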

Error Handling

Common Errors

Error: FileNotFoundError

Cause: PDF file path is incorrect
Solution: Verify file path and ensure file exists

Error: PdfReadError (encrypted)

Cause: PDF is password-protected or DRM-encrypted
Solution: Provide password or use qpdf to decrypt

Error: Empty text extraction

Cause: PDF contains scanned images, not text
Solution: Use OCR with pytesseract and pdf2image

Error: TesseractNotFoundError (pytesseract)

Cause: Tesseract OCR is not installed or not on PATH
Solution: sudo apt-get install tesseract-ocr or brew install tesseract
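
The empty-extraction case can be detected programmatically so that OCR runs only on the pages that need it. This is a minimal sketch: pages_needing_ocr is a hypothetical helper, and the min_chars threshold of 25 is an arbitrary assumption to tune per document set.

```python
def pages_needing_ocr(page_texts, min_chars=25):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty."""
    return [
        number
        for number, text in enumerate(page_texts, 1)
        # "text or ''" tolerates extractors that return None for empty pages
        if len((text or "").strip()) < min_chars
    ]


# Texts as an extractor might return them for a three-page document
texts = ["Chapter 1. Scope and general requirements of the standard.", "", "  \n"]
print(pages_needing_ocr(texts))
```

The returned page numbers can then be used to pick the matching images from pdf2image's convert_from_path output and pass only those to pytesseract.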

Metrics

Metric	Typical Value
Text extraction speed	~50 pages/second
OCR processing speed	~2-5 pages/minute
Memory usage (pypdf)	~10MB per 100 pages
Merge operation	~100 PDFs/second
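
These figures are enough for a rough time budget before a large job starts. The sketch below only does the arithmetic on the typical values from the table; the function names are hypothetical, and real throughput depends on the hardware and the documents.

```python
TEXT_PAGES_PER_SECOND = 50     # "~50 pages/second" from the table
OCR_PAGES_PER_MINUTE = (2, 5)  # "~2-5 pages/minute" from the table


def estimate_text_seconds(pages):
    """Seconds to extract text from the given number of pages."""
    return pages / TEXT_PAGES_PER_SECOND


def estimate_ocr_minutes(pages):
    """Return (best_case, worst_case) minutes for OCR."""
    slow, fast = OCR_PAGES_PER_MINUTE
    return pages / fast, pages / slow


pages = 1000
best, worst = estimate_ocr_minutes(pages)
print(f"text extraction: about {estimate_text_seconds(pages):.0f} s")
print(f"OCR: about {best:.0f} to {worst:.0f} minutes")
```

The gap between the two results is the practical reason to attempt plain text extraction first and fall back to OCR only for pages that come back empty.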

Quick Reference

Task	Tool
Read text	pypdf, pdfplumber
Extract tables	pdfplumber
Create PDFs	reportlab
Merge/split	pypdf, qpdf, pdftk
OCR	pytesseract + pdf2image
Fill forms	pypdf, pdfrw
Watermark	pypdf
Encrypt/decrypt	pypdf, qpdf

Dependencies

# Core PDF libraries
pip install pypdf pdfplumber reportlab pytesseract pdf2image

# OpenAI Codex for PDF to Markdown conversion
pip install openai

System tools:

Poppler (pdftotext, pdftoppm)
qpdf
pdftk
Tesseract OCR

Environment variables:

export OPENAI_API_KEY="your-api-key-here"

Version History

1.2.2 (2026-01-04): Fixed P2 issue - added parents=True to all mkdir() calls to handle nested output paths; prevents FileNotFoundError when creating directories with non-existent parent paths
1.2.1 (2026-01-04): Fixed CLI tool missing imports - added complete standalone script with all required imports (openai, pypdf, logging) and function definitions; resolved P1 issue from Codex review
1.2.0 (2026-01-04): MAJOR UPDATE - Added OpenAI Codex integration for PDF-to-Markdown conversion as recommended first step for all PDF processing; includes batch conversion, chunking for large files, cost-effective options, and complete CLI tool
1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
1.0.0 (2024-10-15): Initial release with pypdf, pdfplumber, reportlab, CLI tools