pdf-processing
PDF Processing Skill
Overview
This skill processes PDFs using Claude's native PDF support together with the bundled helper scripts. It can:
- Extract and parse text content
- Identify and extract tables with structure preservation
- Analyze charts, graphs, and visual elements
- Understand document layout and formatting
- Generate summaries and insights
- Convert PDFs to various formats (Markdown, JSON, CSV, plain text)
When to Use
Claude should automatically activate this skill when:
- User provides a PDF file path or wants to process a PDF
- User asks to extract text, tables, or data from PDFs
- User requests PDF analysis, summarization, or conversion
- User needs to understand charts, diagrams, or visual content in PDFs
- User wants to transform PDF content to another format
Key Capabilities
1. Text Extraction
- Full document text extraction with formatting preservation
- Page-by-page text extraction
- Section and paragraph identification
- Header and footer detection
2. Table Extraction
- Automatic table detection and extraction
- Structure-preserving conversion to CSV, JSON, or Markdown
- Multi-page table handling
- Merged cells and complex table support
3. Visual Content Analysis
- Chart and graph interpretation
- Diagram and flowchart understanding
- Image and figure description
- Infographic analysis
4. Document Understanding
- Layout analysis and structure detection
- Multi-column text handling
- Form field identification
- Metadata extraction (title, author, creation date, page count)
5. Format Conversion
- PDF to Markdown (preserving headings, lists, tables)
- PDF to JSON (structured data extraction)
- PDF to CSV (table extraction)
- PDF to plain text
Instructions
Step 1: Validate PDF Input
First, determine how the PDF is provided:
Option A: File Path
# Verify the file exists and is a PDF
ls -lh /path/to/document.pdf
file /path/to/document.pdf
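The same check can be done from Python when a shell is not convenient. This is a minimal sketch, not part of the skill's scripts: the function name `looks_like_pdf` is hypothetical, and the check is deliberately strict (it expects the `%PDF-` header at byte 0, although some readers tolerate leading bytes before it).

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and begins with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        # PDF files start with "%PDF-" followed by a version, e.g. "%PDF-1.7".
        return fh.read(5) == b"%PDF-"
```

A `False` result covers both a missing file and a file that is not a PDF, so report the path back to the user before continuing.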
Option B: Base64-Encoded PDF
If the user provides base64-encoded content, save it first:
# Decode and save base64 PDF
python .claude/skills/pdf-processing/scripts/decode_pdf.py --input base64_string.txt --output document.pdf
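The internals of `decode_pdf.py` are not shown here, so the following is only a sketch of what such a decoding step might look like, using the standard library. The function name `decode_base64_pdf` and the data-URI handling are assumptions for illustration.

```python
import base64
import binascii
from pathlib import Path


def decode_base64_pdf(encoded: str, output_path: str) -> int:
    """Decode base64 text to a PDF file and return the number of bytes written."""
    # Assumption: input may carry a data-URI prefix such as
    # "data:application/pdf;base64,"; keep only the part after the comma.
    if encoded.startswith("data:"):
        _, _, encoded = encoded.partition(",")
    try:
        # Drop whitespace and newlines, then decode strictly.
        raw = base64.b64decode("".join(encoded.split()), validate=True)
    except binascii.Error as exc:
        raise ValueError(f"input is not valid base64: {exc}") from exc
    if not raw.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with a PDF header")
    Path(output_path).write_bytes(raw)
    return len(raw)
```

Checking the header before writing avoids leaving a non-PDF file on disk when the user pastes the wrong content.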
Option C: URL
If the user provides a URL, download it:
# Download PDF from URL
python .claude/skills/pdf-processing/scripts/download_pdf.py --url "https://example.com/doc.pdf" --output document.pdf
Step 2: Read and Analyze PDF
Use the Read tool to access PDF files. Claude's native PDF support will:
- Display the PDF content visually
- Extract text and structure automatically
- Identify tables, charts, and images
# Example: Read PDF using the Read tool
# The Read tool handles PDFs natively and extracts text + visual content
Read(file_path="/absolute/path/to/document.pdf")
Step 3: Process Based on User Request
A. Text Extraction
For simple text extraction:
python .claude/skills/pdf-processing/scripts/extract_text.py \
--input document.pdf \
--output document.txt \
--preserve-formatting true
For page-specific extraction:
python .claude/skills/pdf-processing/scripts/extract_text.py \
--input document.pdf \
--pages 1,3,5-10 \
--output selected_pages.txt
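The `--pages` value mixes single pages and ranges. How `extract_text.py` parses it is not documented here; a sketch of one plausible interpretation, assuming 1-based page numbers and inclusive ranges (the function name `parse_page_spec` is hypothetical):

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec like "1,3,5-10" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-10" is treated as inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Under these assumptions, `parse_page_spec("1,3,5-10")` yields pages 1, 3, and 5 through 10.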
B. Table Extraction
Extract all tables to CSV:
python .claude/skills/pdf-processing/scripts/extract_tables.py \
--input document.pdf \
--format csv \
--output-dir ./extracted_tables/
Extract tables to JSON with structure:
python .claude/skills/pdf-processing/scripts/extract_tables.py \
--input document.pdf \
--format json \
--output tables.json
C. Document Summarization
Generate a summary of the PDF:
python .claude/skills/pdf-processing/scripts/summarize_pdf.py \
--input document.pdf \
--output summary.md \
--style concise # Options: concise, detailed, executive
D. Format Conversion
Convert PDF to Markdown:
python .claude/skills/pdf-processing/scripts/convert_pdf.py \
--input document.pdf \
--output document.md \
--format markdown \
--preserve-images true
Convert PDF to structured JSON:
python .claude/skills/pdf-processing/scripts/convert_pdf.py \
--input document.pdf \
--output document.json \
--format json \
--extract-metadata true
E. Visual Content Analysis
Analyze charts and graphs:
python .claude/skills/pdf-processing/scripts/analyze_visuals.py \
--input document.pdf \
--output analysis.json \
--elements charts,graphs,diagrams
Step 4: Post-Processing and Output
After extraction/conversion:
- Validate Output: Check that the output file was created successfully
ls -lh output_file.{txt,md,json,csv}
- Preview Results: Show the user a preview of the extracted content
head -n 20 output_file.txt # For text files
python -m json.tool output_file.json | head -n 50 # For JSON
- Provide Summary: Summarize what was extracted and offer next steps
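The validate-and-preview steps above can also be combined in Python. This is a sketch under assumptions: the helper name `preview_output` is hypothetical, output files are assumed to be UTF-8, and an empty file is treated as a failed extraction.

```python
import json
from pathlib import Path


def preview_output(path: str, max_lines: int = 20) -> str:
    """Return the first lines of an output file, pretty-printing JSON."""
    p = Path(path)
    # Treat a missing or zero-byte file as a failed run.
    if not p.is_file() or p.stat().st_size == 0:
        raise FileNotFoundError(f"output missing or empty: {path}")
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".json":
        # Re-indent so the preview is readable, as json.tool would.
        text = json.dumps(json.loads(text), indent=2, ensure_ascii=False)
    return "\n".join(text.splitlines()[:max_lines])
```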
Step 5: Handle Edge Cases
Password-Protected PDFs
python .claude/skills/pdf-processing/scripts/extract_text.py \
--input document.pdf \
--password "user_provided_password" \
--output document.txt
Scanned PDFs (OCR Required)
# Use OCR for scanned PDFs
python .claude/skills/pdf-processing/scripts/ocr_pdf.py \
--input scanned_document.pdf \
--output document.txt \
--language eng # Language code: eng, fra, deu, etc.
Large PDFs (Memory Optimization)
# Process large PDFs in chunks
python .claude/skills/pdf-processing/scripts/extract_text.py \
--input large_document.pdf \
--output document.txt \
--chunk-size 10 # Process 10 pages at a time
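To illustrate what processing "10 pages at a time" means, here is a small sketch of splitting a document into page ranges. It is not taken from `extract_text.py`; the function name `page_chunks` and the inclusive, 1-based ranges are assumptions.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield inclusive 1-based (first, last) page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield first, min(first + chunk_size - 1, total_pages)
```

Each range can then be processed and written out before the next one is loaded, which keeps memory use bounded by the chunk size rather than the document size.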
Error Handling
Common Issues
- File Not Found
  - Verify the path with ls or Glob
  - Check for typos in the filename
  - Ensure absolute paths are used
- Corrupted PDF
  - Try reading with the Read tool first
  - Use repair mode:
  python scripts/repair_pdf.py --input corrupted.pdf --output repaired.pdf
- Unsupported PDF Features
  - Some PDFs with complex DRM or encryption may fail
  - Inform the user and suggest alternatives
- OCR Failures
  - Check if tesseract is installed: which tesseract
  - Verify image quality is sufficient
  - Try different language settings
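The tesseract check can be done up front in Python, before any OCR is attempted. A minimal sketch using the standard library; the function name `missing_executables` is hypothetical.

```python
import shutil
from typing import Iterable, List


def missing_executables(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on PATH (like `which` failing)."""
    return [name for name in names if shutil.which(name) is None]
```

For example, `missing_executables(["tesseract"])` returns an empty list when tesseract is installed, and `["tesseract"]` otherwise, which gives a clear message to relay to the user.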
Best Practices
- Always use the Read tool first - This leverages Claude's native PDF support for best results
- Preserve structure - When extracting tables or converting formats, maintain the original structure
- Validate outputs - Always check that output files were created successfully
- Provide context - Tell the user what was extracted and what they can do next
- Handle errors gracefully - If processing fails, explain why and suggest alternatives
- Respect privacy - Remind users not to upload sensitive documents without proper authorization
Output Formats
Text Output
Plain text with optional formatting preservation
Line breaks and paragraphs maintained
Special characters preserved
Markdown Output
# Document Title
## Section Heading
Paragraph text with **bold** and *italic* formatting.
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data 1 | Data 2 | Data 3 |

JSON Output
{
"metadata": {
"title": "Document Title",
"author": "Author Name",
"pages": 42,
"creation_date": "2024-01-15"
},
"content": [
{
"page": 1,
"type": "text",
"content": "Page 1 text content..."
},
{
"page": 2,
"type": "table",
"headers": ["Col1", "Col2", "Col3"],
"rows": [["A", "B", "C"], ["D", "E", "F"]]
}
]
}
CSV Output (for tables)
Column1,Column2,Column3
Value1,Value2,Value3
Value4,Value5,Value6
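The JSON and CSV formats above are closely related: each JSON entry of type "table" carries the headers and rows that a CSV file needs. A sketch of that mapping, assuming the JSON structure shown in this section (the function name `tables_to_csv` is hypothetical):

```python
import csv
import io
from typing import Any, Dict, List


def tables_to_csv(document: Dict[str, Any]) -> List[str]:
    """Return one CSV string per entry of type "table" in the document."""
    results = []
    for item in document.get("content", []):
        if item.get("type") != "table":
            continue
        buffer = io.StringIO()
        # The csv module quotes cells that contain commas or quotes.
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(item["headers"])
        writer.writerows(item["rows"])
        results.append(buffer.getvalue())
    return results
```

Using the `csv` module rather than joining cells with commas keeps values such as `1,234.50` intact.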
Advanced Features
Batch Processing
# Process multiple PDFs
python .claude/skills/pdf-processing/scripts/batch_process.py \
--input-dir ./pdfs/ \
--output-dir ./extracted/ \
--format markdown
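How `batch_process.py` pairs inputs with outputs is not specified here. One plausible approach, sketched below, is to keep each PDF's base name and swap the extension. The function name `plan_batch` and the format-to-extension mapping (including the `text` key) are assumptions.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed mapping from --format values to output file extensions.
EXTENSIONS = {"markdown": ".md", "json": ".json", "csv": ".csv", "text": ".txt"}


def plan_batch(input_dir: str, output_dir: str, fmt: str = "markdown") -> List[Tuple[Path, Path]]:
    """Pair every PDF directly inside input_dir with its output path."""
    suffix = EXTENSIONS[fmt]
    out_root = Path(output_dir)
    return [
        (pdf, out_root / (pdf.stem + suffix))
        # glob("*.pdf") is case-sensitive on most Linux filesystems.
        for pdf in sorted(Path(input_dir).glob("*.pdf"))
    ]
```

Building the plan first makes it easy to show the user which files will be written before any processing starts.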
Custom Templates
# Use custom conversion templates
python .claude/skills/pdf-processing/scripts/convert_pdf.py \
--input document.pdf \
--output document.md \
--template .claude/skills/pdf-processing/assets/custom_template.md
Selective Extraction
# Extract only specific sections
python .claude/skills/pdf-processing/scripts/extract_sections.py \
--input document.pdf \
--sections "Introduction,Methods,Results" \
--output extracted_sections.md
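Section selection is easiest to picture on already-converted Markdown. The sketch below is not the logic of `extract_sections.py`, which is not shown here. It assumes sections are marked by `#` headings, matches titles case-insensitively, and treats any heading (including a subsection) as the end of the previous section. The function name `select_sections` is hypothetical.

```python
import re
from typing import Dict, Iterable


def select_sections(markdown: str, wanted: Iterable[str]) -> Dict[str, str]:
    """Map each wanted heading title found in the text to the body under it."""
    wanted_lower = {name.strip().lower(): name.strip() for name in wanted}
    sections: Dict[str, str] = {}
    current = None  # title of the wanted section being collected, if any
    lines = []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*?)\s*$", line)
        if match:
            # A new heading closes whatever section was open.
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current = wanted_lower.get(match.group(1).lower())
            lines = []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections
```

Requested titles that do not appear in the document are simply absent from the result, so the caller can report which sections were not found.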
Integration with Other Tools
This skill works well with:
- Data analysis tools - Extract tables and feed to pandas/numpy
- Documentation generators - Convert PDFs to Markdown for wikis
- Search systems - Extract text for indexing
- Automation workflows - Batch process invoices, reports, forms
Examples
Example 1: Extract and Summarize
# User: "Please extract the key points from this research paper"
1. Read(file_path="/path/to/paper.pdf")
2. python scripts/summarize_pdf.py --input paper.pdf --output summary.md --style executive
3. Show the user the summary with key findings highlighted
Example 2: Extract Tables to CSV
# User: "Get all tables from this financial report"
1. Read(file_path="/path/to/report.pdf")
2. python scripts/extract_tables.py --input report.pdf --format csv --output-dir ./tables/
3. List the extracted CSV files and preview the first table
Example 3: Convert to Markdown
# User: "Convert this PDF to markdown"
1. Read(file_path="/path/to/document.pdf")
2. python scripts/convert_pdf.py --input document.pdf --output document.md --format markdown
3. Show preview of the markdown and confirm successful conversion
Dependencies
The scripts in this skill require:
- Python 3.8+
- pypdf, or its deprecated predecessor PyPDF2 (PDF parsing)
- pdfplumber (table extraction)
- pdf2image (renders PDF pages to images; requires Poppler)
- pytesseract (OCR for scanned PDFs; requires the tesseract binary)
- Pillow (image processing)
- requests (URL downloads)
Install them with pip install -r .claude/skills/pdf-processing/assets/requirements.txt.
References
See the references/ directory for:
- pdf_capabilities.md - Detailed breakdown of Claude's PDF support
- api_reference.md - Complete API documentation for all scripts
- examples.md - More usage examples and use cases
- troubleshooting.md - Common issues and solutions
Notes for Claude
- Always read PDFs with the Read tool first - This is the most reliable method
- After reading, analyze what the user needs - Text, tables, summary, conversion?
- Use the appropriate script - Don't try to do everything manually
- Validate outputs - Always check that files were created successfully
- Provide helpful context - Explain what was extracted and suggest next steps
- Handle errors gracefully - If something fails, explain why and offer alternatives
- Be efficient - Use batch processing for multiple PDFs
- Preserve structure - Maintain document formatting when converting
Version History
- 1.0.0 (2025-11-18) - Initial release with core PDF processing capabilities