docs-pdf

PDF Document Parsing

Parse PDF documents into markdown, text, and structured JSON using multi-method extraction.

Usage

Run the parsing script directly:

./scripts/parse_pdf.py <path_to_file.pdf> <output_dir>

Example:

./scripts/parse_pdf.py ~/documents/manual.pdf ./parsed/
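When calling the script from another Python program rather than a shell, the same invocation can be sketched as below. The helper names `build_command` and `parse` are hypothetical, and the script's exit-code behaviour is not documented here, so the return code is passed through uninterpreted. Note that `~` is expanded by the shell, not by `subprocess`, which is why the sketch expands it explicitly.

```python
import subprocess
from pathlib import Path

# Path taken from the usage line above; adjust if the skill lives elsewhere.
SCRIPT = "./scripts/parse_pdf.py"


def build_command(pdf_path, output_dir, script=SCRIPT):
    """Return the argument list: script, PDF path, output directory."""
    # subprocess does not expand "~", so do it here.
    return [script, str(Path(pdf_path).expanduser()), str(output_dir)]


def parse(pdf_path, output_dir):
    """Run the script and return its exit code (meaning not documented)."""
    completed = subprocess.run(build_command(pdf_path, output_dir), check=False)
    return completed.returncode
```

For example, `parse("~/documents/manual.pdf", "./parsed/")` runs the same command as the shell example.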

The script uses four extraction methods:

  • pypdf - Basic text extraction with page markers
  • pdfminer - Detailed layout preservation
  • pdfplumber - Table extraction and structure
  • markitdown - Microsoft's markdown converter

Output Structure

output_dir/
├── file.pdf/
│   ├── parsing_summary.json
│   ├── pypdf/
│   │   └── content.md
│   ├── pdfminer/
│   │   └── content.txt
│   ├── pdfplumber/
│   │   ├── content.md
│   │   └── tables.json
│   └── markitdown/
│       └── content.md
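To locate the files after a run, the tree above can be turned into paths. This is a minimal sketch that assumes the per-document directory is named after the PDF's file name including its extension, as `file.pdf/` in the tree suggests; `LAYOUT` and `expected_outputs` are hypothetical names, not part of the script.

```python
from pathlib import Path

# Files each method is shown to write in the output tree.
LAYOUT = {
    "pypdf": ["content.md"],
    "pdfminer": ["content.txt"],
    "pdfplumber": ["content.md", "tables.json"],
    "markitdown": ["content.md"],
}


def expected_outputs(output_dir, pdf_path):
    """Map each method (plus 'summary') to the paths it should produce."""
    # Assumption: the directory name equals the PDF file name, e.g. "manual.pdf".
    root = Path(output_dir) / Path(pdf_path).name
    paths = {"summary": [root / "parsing_summary.json"]}
    for method, files in LAYOUT.items():
        paths[method] = [root / method / name for name in files]
    return paths
```

Checking `path.is_file()` on each entry is a quick way to see which methods produced output for a given document.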

Script Features

  • Handles text-heavy and table-heavy PDFs
  • Preserves layout information where possible
  • Extracts tables as structured JSON
  • Provides multiple format options (md, txt, json)
  • Continues on errors (a failure in one method doesn't stop the others)
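The continue-on-error behaviour can be illustrated with a small loop that isolates each method in its own `try` block. This is a sketch of the general pattern, not the script's actual code: `run_all`, the stand-in extractors, and the summary fields are all hypothetical and do not reflect the real `parsing_summary.json` schema.

```python
from typing import Callable, Dict


def run_all(extractors: Dict[str, Callable[[str], str]], pdf_path: str) -> Dict[str, dict]:
    """Run every extractor and record failures instead of raising."""
    summary = {}
    for name, extract in extractors.items():
        try:
            text = extract(pdf_path)
            summary[name] = {"ok": True, "chars": len(text)}
        except Exception as exc:
            # Deliberately broad: one failing backend must not abort the rest.
            summary[name] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return summary


# Stand-in extractors for illustration only; the real ones would call
# pypdf, pdfminer, pdfplumber, and markitdown.
def fake_pypdf(path: str) -> str:
    return "--- Page 1 ---\nhello"


def fake_broken(path: str) -> str:
    raise ValueError("cannot parse")


if __name__ == "__main__":
    print(run_all({"pypdf": fake_pypdf, "pdfminer": fake_broken}, "manual.pdf"))
```

Recording the exception type and message per method keeps the failure visible in the summary while the remaining methods still run.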

Method Selection

  • markitdown - Best for AI understanding (continuous markdown, no page breaks)
  • pdfplumber - Best for documents with complex tables
  • pypdf - Fast fallback for simple text extraction
  • pdfminer - Best when layout preservation is critical
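The guidance above can be encoded as a preference order per goal, falling back to the next candidate when a method produced no output. This is a sketch only: the goal names, the fallback orders, and `pick_output` are assumptions made for illustration, while the relative paths follow the output structure shown earlier.

```python
from pathlib import Path
from typing import Optional

# First entry follows the guidance above; later entries are assumed fallbacks.
PREFERENCE = {
    "ai": ["markitdown/content.md", "pypdf/content.md"],
    "tables": ["pdfplumber/tables.json", "pdfplumber/content.md"],
    "layout": ["pdfminer/content.txt", "pypdf/content.md"],
    "quick": ["pypdf/content.md", "markitdown/content.md"],
}


def pick_output(doc_dir, goal: str) -> Optional[Path]:
    """Return the first non-empty output file for the goal, or None."""
    for relative in PREFERENCE[goal]:
        candidate = Path(doc_dir) / relative
        # Skip methods that failed or wrote an empty file.
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate
    return None
```

Here `doc_dir` is the per-document directory, for example `./parsed/manual.pdf`.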