PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

Quick start

Extract text from PDF

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Analyze PDF form (using included script)

python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions

Fill PDF form with validation

python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting

Extract tables from PDF

python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection

Features

✅ Production-ready scripts

All scripts include:

  • Error handling: Graceful failures with detailed error messages
  • Validation: Input validation and type checking
  • Logging: Configurable logging with timestamps
  • Type hints: Full type annotations for IDE support
  • CLI interface: --help flag for all scripts
  • Exit codes: Proper exit codes for automation

✅ Comprehensive workflows

  • PDF Forms: Complete form processing pipeline
  • Table Extraction: Advanced table detection and extraction
  • OCR Processing: Scanned PDF text extraction
  • Batch Operations: Process multiple PDFs efficiently
  • Validation: Pre- and post-processing validation

Advanced topics

PDF Form Processing

For complete form workflows including:

  • Field analysis and detection
  • Dynamic form filling
  • Validation rules
  • Multi-page forms
  • Checkbox and radio button handling

See FORMS.md

Table Extraction

For complex table extraction:

  • Multi-page tables
  • Merged cells
  • Nested tables
  • Custom table detection
  • Export to CSV/Excel

See TABLES.md

OCR Processing

For scanned PDFs and image-based documents:

  • Tesseract integration
  • Language support
  • Image preprocessing
  • Confidence scoring
  • Batch OCR

See OCR.md

Included scripts

Form processing

analyze_form.py - Extract form field information

python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]

fill_form.py - Fill PDF forms with data

python scripts/fill_form.py input.pdf data.json output.pdf [--validate]

validate_form.py - Validate form data before filling

python scripts/validate_form.py data.json schema.json
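The schema format that validate_form.py expects is not documented here. Purely as an illustration of the idea, the sketch below assumes a hypothetical schema that maps each field name to a dict with `type` and `required` keys. The function name `validate_submission`, the `TYPE_CHECKS` table, and the schema layout are all assumptions, not the script's actual behavior.

```python
from typing import Any

# Hypothetical schema layout (an assumption, not the script's documented format):
# {"field_name": {"type": "text" | "checkbox", "required": bool}}
TYPE_CHECKS = {
    "text": lambda value: isinstance(value, str),
    "checkbox": lambda value: isinstance(value, bool),
}


def validate_submission(data: dict[str, Any], schema: dict[str, dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, rules in schema.items():
        if name not in data:
            if rules.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        check = TYPE_CHECKS.get(rules.get("type", "text"))
        if check is not None and not check(data[name]):
            errors.append(f"wrong type for field: {name}")
    # Fields the form does not define are usually a sign of a typo.
    for name in data:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors


schema = {
    "full_name": {"type": "text", "required": True},
    "subscribe": {"type": "checkbox", "required": False},
}
print(validate_submission({"full_name": "Ada", "subscribe": "yes"}, schema))
```

Collecting every problem in a list, rather than stopping at the first one, lets a caller report all issues with a submission in a single pass.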

Table extraction

extract_tables.py - Extract tables to CSV/Excel

python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
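For custom scripts, a rough equivalent of table-to-CSV export might look like the sketch below. It relies on pdfplumber's `page.extract_tables()`, which returns each table as a list of rows and uses `None` for empty cells. The helper names `write_tables_csv` and `extract_tables` are hypothetical, and this is not necessarily how extract_tables.py is implemented.

```python
import csv
from pathlib import Path


def write_tables_csv(tables, output_path):
    """Write tables (lists of rows of cells) to one CSV file.

    A blank row separates consecutive tables. Cells that are None,
    which pdfplumber returns for empty cells, are written as "".
    """
    with Path(output_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for index, table in enumerate(tables):
            if index > 0:
                writer.writerow([])  # separator between tables
            for row in table:
                writer.writerow(["" if cell is None else cell for cell in row])


def extract_tables(pdf_path):
    """Collect every table pdfplumber detects, page by page."""
    import pdfplumber  # imported lazily so the CSV helper works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

A call such as `write_tables_csv(extract_tables("report.pdf"), "tables.csv")` would chain the two steps.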

Text extraction

extract_text.py - Extract text with formatting preservation

python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]

Utilities

merge_pdfs.py - Merge multiple PDFs

python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf

split_pdf.py - Split PDF into individual pages

python scripts/split_pdf.py input.pdf --output-dir pages/

validate_pdf.py - Validate PDF integrity

python scripts/validate_pdf.py input.pdf
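What validate_pdf.py checks is not described here. As a cheap first-pass filter in custom code, one option is to look for the `%PDF-` header and an `%%EOF` marker near the end of the file, as in the sketch below. The function name `looks_like_pdf` and the 1024-byte tail window are assumptions, and passing this check does not mean the file will parse or render.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Return True if the file has a '%PDF-' header and '%%EOF' near its end.

    Only the first few bytes and the last tail_bytes bytes are read, so the
    check stays fast for large files. It can rule out obviously truncated or
    mislabeled files, nothing more.
    """
    with open(path, "rb") as handle:
        header = handle.read(5)
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```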

Common workflows

Workflow 1: Process form submissions

# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json

# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json

# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf

# 4. Validate output
python scripts/validate_pdf.py completed.pdf
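The four steps above can also be driven from Python so that the pipeline stops at the first failing step. This is a minimal sketch: `run_pipeline` and `STEPS` are hypothetical names, and the `runner` parameter exists only so the function can be exercised without the real scripts.

```python
import subprocess
import sys

# Script names and file names follow the workflow steps listed above.
STEPS = [
    ["scripts/analyze_form.py", "template.pdf", "--output", "schema.json"],
    ["scripts/validate_form.py", "submission.json", "schema.json"],
    ["scripts/fill_form.py", "template.pdf", "submission.json", "completed.pdf"],
    ["scripts/validate_pdf.py", "completed.pdf"],
]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; return 0, or the first non-zero exit code."""
    for args in steps:
        # sys.executable keeps every step on the current interpreter.
        result = runner([sys.executable, *args])
        if result.returncode != 0:
            print(f"Step failed ({result.returncode}): {' '.join(args)}",
                  file=sys.stderr)
            return result.returncode
    return 0
```

Calling `sys.exit(run_pipeline(STEPS))` from a wrapper script would pass the failing step's exit code on to whatever invoked it.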

Workflow 2: Extract data from reports

# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv

# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt

Workflow 3: Batch processing

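import subprocess
import sys
from pathlib import Path

output_dir = Path("processed")
output_dir.mkdir(exist_ok=True)  # make sure the output directory exists

# Process all PDFs in the invoices directory
for pdf_file in sorted(Path("invoices").glob("*.pdf")):
    # Extracted text gets a .txt extension instead of reusing .pdf
    output_file = output_dir / pdf_file.with_suffix(".txt").name

    result = subprocess.run(
        [
            sys.executable, "scripts/extract_text.py",
            str(pdf_file),
            "--output", str(output_file),
        ],
        capture_output=True,
        text=True,  # decode stderr so it prints as text, not bytes
    )

    if result.returncode == 0:
        print(f"✓ Processed: {pdf_file}")
    else:
        print(f"✗ Failed: {pdf_file} - {result.stderr.strip()}")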

Error handling

All scripts follow consistent error patterns:

# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error

# Example usage in automation
import subprocess

# Arguments follow the fill_form.py usage: input PDF, data JSON, output PDF
result = subprocess.run(
    ["python", "scripts/fill_form.py", "input.pdf", "data.json", "output.pdf"]
)

if result.returncode == 0:
    print("Success")
elif result.returncode == 4:
    print("Validation failed - check input data")
else:
    print(f"Error occurred: {result.returncode}")
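In automation code, the exit codes listed above can be given names instead of being compared as bare integers. The sketch below is one way to do that; the `ExitCode` enum and the `describe` helper are hypothetical and are not part of the included scripts.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the error handling section."""
    SUCCESS = 0
    FILE_NOT_FOUND = 1
    INVALID_INPUT = 2
    PROCESSING_ERROR = 3
    VALIDATION_ERROR = 4


def describe(returncode):
    """Map a process return code to a readable label."""
    try:
        return ExitCode(returncode).name.replace("_", " ").lower()
    except ValueError:
        # Covers codes outside the documented range, including the negative
        # values subprocess reports when a process is killed by a signal.
        return f"unknown exit code {returncode}"
```

Because `IntEnum` members compare equal to plain integers, a check such as `result.returncode == ExitCode.VALIDATION_ERROR` works directly on the value returned by `subprocess.run`.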

Dependencies

All scripts require:

pip install pdfplumber pypdf pillow pytesseract pandas

Optional for OCR:

# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases

Performance tips

  • Use batch processing for multiple PDFs
  • Enable multiprocessing with the --parallel flag (where supported)
  • Cache extracted data to avoid re-processing
  • Validate inputs early to fail fast
  • Use streaming for large PDFs (>50MB)
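The caching tip above could be implemented by keying cached results on a hash of the file contents, so that a renamed but unchanged PDF still hits the cache. This is a minimal sketch: `file_digest`, `cached_extract`, and the `.pdf_cache` directory are hypothetical names, and `extract` stands for any function that takes a path and returns text.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file contents, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_extract(pdf_path, extract, cache_dir=".pdf_cache"):
    """Return cached text for pdf_path, calling extract(pdf_path) only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_digest(pdf_path)}.txt"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    text = extract(pdf_path)
    entry.write_text(text, encoding="utf-8")
    return text
```

Hashing still reads the whole file once, so the saving comes from skipping the extraction step, which is typically far slower than hashing.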

Best practices

  1. Always validate inputs before processing
  2. Use try-except in custom scripts
  3. Log all operations for debugging
  4. Test with sample PDFs before production
  5. Set timeouts for long-running operations
  6. Check exit codes in automation
  7. Backup originals before modification
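Several of these practices (logging, timeouts, exit-code checks) can be combined in one small wrapper around `subprocess.run`. The sketch below is illustrative only: `run_script`, the logger name `pdf_batch`, and the 60-second default are assumptions.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pdf_batch")  # logger name is arbitrary


def run_script(args, timeout=60):
    """Run a Python script with a timeout.

    Returns the script's exit code, or None if it was killed for
    exceeding the timeout.
    """
    cmd = [sys.executable, *args]
    log.info("running %s", " ".join(args))
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        log.error("timed out after %ss: %s", timeout, " ".join(args))
        return None
    if result.returncode != 0:
        log.error("exit code %s: %s", result.returncode, result.stderr.strip())
    return result.returncode
```

For example, `run_script(["scripts/validate_pdf.py", "completed.pdf"], timeout=30)` would log the invocation and return the script's exit code.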

Troubleshooting

Common issues

"Module not found" errors:

pip install -r requirements.txt
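To see which packages are actually missing before reinstalling, a quick check with `importlib` can help. The package list below is taken from the pip command in the Dependencies section; note that Pillow is imported as `PIL`. The names `REQUIRED` and `missing_packages` are hypothetical.

```python
from importlib.util import find_spec

# Maps import name -> pip package name (they differ for Pillow).
REQUIRED = {
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
    "PIL": "pillow",
    "pytesseract": "pytesseract",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    return [pkg for module, pkg in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All required packages are importable")
```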

Tesseract not found:

# Install tesseract system package (see Dependencies)
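Before installing anything, it can be worth confirming whether the Tesseract binary is already on the PATH. This is a small sketch using only the standard library; the helper name `find_tesseract` is hypothetical.

```python
import shutil


def find_tesseract(binary="tesseract"):
    """Return the full path of the Tesseract executable, or None if not on PATH."""
    return shutil.which(binary)


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install the system package first")
else:
    print(f"tesseract found at {path}")
```

If the binary is installed somewhere outside the PATH, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.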

Memory errors with large PDFs:

# Process page by page instead of keeping every page's data in memory
import pdfplumber

with pdfplumber.open("large.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # Process the page immediately, then drop its cached objects
        page.flush_cache()

Permission errors:

chmod +x scripts/*.py

Getting help

All scripts support --help:

python scripts/analyze_form.py --help
python scripts/extract_tables.py --help

For detailed documentation on specific topics, see FORMS.md, TABLES.md, and OCR.md.
