docs-docx
Word Document Parsing
Parse Word documents (.docx) into markdown, JSON, and image artifacts using multi-method extraction.
Usage
Run the parsing script directly:
./scripts/parse_docx.py <path_to_file.docx> <output_dir>
Example:
./scripts/parse_docx.py ~/documents/report.docx ./parsed/
The script runs four extraction methods:
- python-docx (basic) - Fast text extraction
- python-docx (detailed) - Full structure with tables
- docx2txt - Simple text-only fallback
- markitdown - Microsoft's markdown converter
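As a rough illustration of what each method involves, the sketch below wraps the usual entry point of each library. The function names are hypothetical and are not taken from parse_docx.py; the sketch assumes python-docx, docx2txt, and markitdown are installed, and imports each one lazily so a missing package only affects its own method.

```python
from pathlib import Path


def extract_basic(path: Path) -> str:
    """python-docx (basic): join non-empty paragraph text, ignoring tables."""
    from docx import Document  # provided by the python-docx package

    doc = Document(str(path))
    return "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())


def table_to_rows(table) -> list[list[str]]:
    """Flatten one python-docx table into rows of stripped cell text."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def extract_detailed(path: Path) -> dict:
    """python-docx (detailed): paragraphs with style names, plus tables."""
    from docx import Document

    doc = Document(str(path))
    return {
        "paragraphs": [{"style": p.style.name, "text": p.text} for p in doc.paragraphs],
        "tables": [table_to_rows(t) for t in doc.tables],
    }


def extract_docx2txt(path: Path) -> str:
    """docx2txt: plain text only."""
    import docx2txt

    return docx2txt.process(str(path))


def extract_markitdown(path: Path) -> str:
    """markitdown: convert the document to markdown text."""
    from markitdown import MarkItDown

    return MarkItDown().convert(str(path)).text_content
```

The dictionary returned by extract_detailed is only one plausible shape for tables.json; the real script may store more or different fields.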
Output Structure
output_dir/
├── file.docx/
│   ├── parsing_summary.json
│   ├── python_docx_basic/
│   │   └── content.md
│   ├── python_docx_detailed/
│   │   ├── content.md
│   │   ├── tables.json
│   │   └── images/
│   ├── docx2txt/
│   │   └── content.txt
│   └── markitdown/
│       └── content.md
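The layout above can be expressed as a small path-planning helper. This is a sketch under one assumption: that the per-document directory is named after the input file, as "file.docx/" suggests. The names METHOD_OUTPUTS and plan_outputs are hypothetical and do not come from the script.

```python
from pathlib import Path

# Main artifact written by each method, mirroring the tree above.
METHOD_OUTPUTS = {
    "python_docx_basic": "content.md",
    "python_docx_detailed": "content.md",
    "docx2txt": "content.txt",
    "markitdown": "content.md",
}


def plan_outputs(docx_path: str, output_dir: str) -> dict[str, Path]:
    """Map each method name to the main file it would write for docx_path."""
    # Assumed convention: one directory per input file, named after that file.
    root = Path(output_dir) / Path(docx_path).name
    return {
        method: root / method / filename
        for method, filename in METHOD_OUTPUTS.items()
    }


for method, target in plan_outputs("~/documents/report.docx", "./parsed").items():
    print(f"{method:22} -> {target}")
```

Knowing the paths up front makes it straightforward to check which methods produced output after a run, for example by testing target.exists() for each entry.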
Script Features
- Self-contained Python script with inline uv metadata
- Handles multiple extraction methods for redundancy
- Creates JSON metadata for tables and document structure
- Extracts images with dimensions and metadata
- Continues on errors (one method failure doesn't stop others)
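Two of these features are easy to picture in code: the inline uv metadata header and the continue-on-error loop. The sketch below is not the script's actual source. The dependency list, the run_all name, and the summary fields are assumptions chosen for illustration.

```python
#!/usr/bin/env -S uv run --script
# Inline script metadata (PEP 723); the dependency list below is assumed.
# /// script
# requires-python = ">=3.10"
# dependencies = ["python-docx", "docx2txt", "markitdown"]
# ///
import json
from pathlib import Path
from typing import Callable


def run_all(methods: dict[str, Callable[[], object]]) -> dict:
    """Run every method in order, recording failures instead of aborting."""
    summary = {}
    for name, method in methods.items():
        try:
            method()
            summary[name] = {"status": "ok"}
        except Exception as exc:  # broad on purpose: one failure must not stop the rest
            summary[name] = {
                "status": "failed",
                "error": f"{type(exc).__name__}: {exc}",
            }
    return summary


def write_summary(summary: dict, path: Path) -> None:
    """Persist the per-method results, e.g. as parsing_summary.json."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
```

Catching Exception per method keeps the remaining extractors running when one library is missing or chokes on a malformed file, and the recorded error string tells the reader which output directories to trust.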