docs-pdf
PDF Document Parsing
Parse PDF documents into markdown, text, and structured JSON using multi-method extraction.
Usage
Run the parsing script directly:
./scripts/parse_pdf.py <path_to_file.pdf> <output_dir>
Example:
./scripts/parse_pdf.py ~/documents/manual.pdf ./parsed/
The script runs four extraction methods on each PDF:
- pypdf - Basic text extraction with page markers
- pdfminer - Detailed layout preservation
- pdfplumber - Table extraction and structure
- markitdown - Microsoft's markdown converter
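As a rough illustration of what "basic text extraction with page markers" could look like, the sketch below joins per-page text under a marker line. The `text_with_page_markers` helper and the `## Page N` marker format are assumptions for illustration, not the script's actual code. The function accepts any objects with an `extract_text()` method, which is the method that pypdf page objects provide.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def text_with_page_markers(pages: Iterable[PageLike]) -> str:
    """Join per-page text, prefixing each page with a marker line.

    The "## Page N" marker format is an assumption for illustration.
    """
    parts = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing or empty extraction result as an empty page.
        text = (page.extract_text() or "").strip()
        parts.append(f"## Page {number}\n\n{text}")
    return "\n\n".join(parts) + "\n"


# With pypdf installed, real pages would come from:
#   from pypdf import PdfReader
#   markdown = text_with_page_markers(PdfReader("manual.pdf").pages)
```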
Output Structure
output_dir/
├── file.pdf/
│   ├── parsing_summary.json
│   ├── pypdf/
│   │   └── content.md
│   ├── pdfminer/
│   │   └── content.txt
│   ├── pdfplumber/
│   │   ├── content.md
│   │   └── tables.json
│   └── markitdown/
│       └── content.md
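A downstream consumer might want to check which of these files exist before reading them. The following is a minimal sketch that assumes the layout shown above. The helper names `find_outputs` and `load_tables` are hypothetical, and the schema of `tables.json` is not documented here, so it is returned unmodified.

```python
import json
from pathlib import Path

# Relative paths taken from the output tree above.
EXPECTED_OUTPUTS = {
    "pypdf": ["content.md"],
    "pdfminer": ["content.txt"],
    "pdfplumber": ["content.md", "tables.json"],
    "markitdown": ["content.md"],
}


def find_outputs(output_dir, pdf_name):
    """Return {method: [paths that exist]} for one parsed PDF."""
    root = Path(output_dir) / pdf_name
    found = {}
    for method, names in EXPECTED_OUTPUTS.items():
        present = [root / method / n for n in names if (root / method / n).is_file()]
        if present:
            found[method] = present
    return found


def load_tables(output_dir, pdf_name):
    """Load pdfplumber's tables.json if present, else return None.

    The JSON schema is an unknown here, so the parsed value is returned as-is.
    """
    path = Path(output_dir) / pdf_name / "pdfplumber" / "tables.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

For example, `find_outputs("./parsed", "manual.pdf")` would list only the methods whose files were actually written, which is useful because a method can fail without stopping the others.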
Script Features
- Handles text-heavy and table-heavy PDFs
- Preserves layout information where possible
- Extracts tables as structured JSON
- Provides multiple format options (md, txt, json)
- Continues on errors (one method failure doesn't stop others)
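The "continues on errors" behavior can be sketched as a loop that catches each method's exception and records it in a summary. This is an illustrative pattern only: `run_methods`, the extractor callables, and the fields written to `parsing_summary.json` are assumptions, not the script's actual implementation or schema.

```python
import json
from pathlib import Path


def run_methods(pdf_path, output_dir, methods):
    """Run each extractor in turn; record failures instead of raising.

    `methods` maps a method name to a callable(pdf_path, method_dir).
    The summary layout is hypothetical.
    """
    target = Path(output_dir) / Path(pdf_path).name
    summary = {"source": str(pdf_path), "methods": {}}
    for name, extract in methods.items():
        method_dir = target / name
        method_dir.mkdir(parents=True, exist_ok=True)
        try:
            extract(pdf_path, method_dir)
            summary["methods"][name] = {"status": "ok"}
        except Exception as exc:  # one failure must not stop the others
            summary["methods"][name] = {"status": "error", "error": str(exc)}
    (target / "parsing_summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8"
    )
    return summary
```

Catching a broad `Exception` per method is deliberate in this sketch: PDF libraries raise many different error types on malformed files, and the goal is to keep whatever output the remaining methods can produce.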
Method Selection
- markitdown - Best as input for AI models (continuous markdown, no page breaks)
- pdfplumber - Best for documents with complex tables
- pypdf - Fast fallback for simple text extraction
- pdfminer - Best when layout preservation is critical
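The guidance above can be expressed as a small decision helper. This is a sketch only: the `choose_method` function, its flag names, and the priority order when several flags are set are assumptions, since the list does not say which criterion wins in a conflict.

```python
def choose_method(has_complex_tables=False, layout_critical=False, for_ai=False):
    """Pick which method's output to read, following the list above.

    The priority order (tables, then layout, then AI use) is an assumption.
    """
    if has_complex_tables:
        return "pdfplumber"
    if layout_critical:
        return "pdfminer"
    if for_ai:
        return "markitdown"
    # Fast fallback for simple text extraction.
    return "pypdf"
```

Because the script writes the output of every method, this choice only decides which result to read, not which extraction to run.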