NYC

pdf

SKILL.md

PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

Reading PDFs

Option 1: Quick text extraction (preferred)

# Using pdftotext (poppler-utils)
pdftotext input.pdf -  # Output to stdout
pdftotext input.pdf output.txt  # Output to file

# If pdftotext not available, try:
python3 -c "
import fitz  # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"

Option 2: Page-by-page with metadata

import fitz  # pip install pymupdf

doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")

for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)

Creating PDFs

Option 1: From Markdown (recommended)

# Using pandoc
pandoc input.md -o output.pdf

# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in

Option 2: Programmatically

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, PDF!")
c.save()

Option 3: From HTML

# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf

# Or with Python
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"

Merging PDFs

import fitz

result = fitz.open()
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    doc = fitz.open(pdf_path)
    result.insert_pdf(doc)
result.save("merged.pdf")

Splitting PDFs

import fitz

doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()
    single.insert_pdf(doc, from_page=i, to_page=i)
    single.save(f"page_{i+1}.pdf")

Key Libraries

Task Library Install
Read/Write/Merge PyMuPDF pip install pymupdf
Create from scratch ReportLab pip install reportlab
HTML to PDF pdfkit pip install pdfkit + wkhtmltopdf
Text extraction pdftotext brew install poppler / apt install poppler-utils

Best Practices

  1. Always check if tools are installed before using them
  2. Handle encoding issues - PDFs may contain various character encodings
  3. Large PDFs: Process page by page to avoid memory issues
  4. OCR for scanned PDFs: Use pytesseract if text extraction returns empty
Weekly Installs
2
Repository
smithery/ai
First Seen
6 days ago
Installed on
amp1
opencode1
kimi-cli1
codex1
github-copilot1
claude-code1