Word Document Parsing

Parse Word documents (.docx) into markdown, JSON, and image artifacts using multi-method extraction.

Usage

Run the parsing script directly:

./scripts/parse_docx.py <path_to_file.docx> <output_dir>

Example:

./scripts/parse_docx.py ~/documents/report.docx ./parsed/

The script uses 4 extraction methods:

python-docx (basic) - Fast text extraction
python-docx (detailed) - Full structure with tables
docx2txt - Simple text-only fallback
markitdown - Microsoft's markdown converter

Output Structure

output_dir/
├── file.docx/
│   ├── parsing_summary.json
│   ├── python_docx_basic/
│   │   └── content.md
│   ├── python_docx_detailed/
│   │   ├── content.md
│   │   ├── tables.json
│   │   └── images/
│   ├── docx2txt/
│   │   └── content.txt
│   └── markitdown/
│       └── content.md

Script Features

Self-contained Python script with inline uv metadata
Handles multiple extraction methods for redundancy
Creates JSON metadata for tables and document structure
Extracts images with dimensions and metadata
Continues on errors (one method failure doesn't stop others)

docs-docx

Word Document Parsing

Usage

Output Structure

Script Features

More from nikhilmaddirala/gtd-cc

tools-catppuccin

obsidian-gtd

web-search

tools-diagnostics

web-content-extraction

docs-pdf