docx

Word Document (DOCX) Processing Skill

When working with Word documents, follow these guidelines:

1. Reading & Extracting from DOCX

Extract text content:

# Using pandoc
pandoc document.docx -t plain -o output.txt

# Using python-docx
python3 -c "
import docx
doc = docx.Document('document.docx')
for para in doc.paragraphs:
    print(para.text)
"

# Using antiword (for .doc files)
antiword document.doc > output.txt

Extract with formatting:

# Convert to markdown (preserves structure)
pandoc document.docx -t markdown -o output.md

# Convert to HTML
pandoc document.docx -t html -o output.html

Extract images:

# DOCX files are ZIP archives
unzip -j document.docx 'word/media/*' -d extracted_images/

# Or using python
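python3 << 'EOF'
import docx
import os

doc = docx.Document('document.docx')
os.makedirs('images', exist_ok=True)

for rel in doc.part.rels.values():
    # Skip external links; only embedded image parts carry a blob
    if rel.is_external or "image" not in rel.reltype:
        continue
    # Keep the original file name so the extension matches the real format
    name = os.path.basename(rel.target_part.partname)
    with open(os.path.join('images', name), 'wb') as f:
        f.write(rel.target_part.blob)
EOF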
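
Because a DOCX file is a ZIP archive, paragraph text can also be read with only the Python standard library when neither pandoc nor python-docx is installed. The following is a minimal fallback sketch, assuming the main body lives in word/document.xml (the usual location). The name extract_text_stdlib is illustrative, and headers, footers, and footnotes are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text_stdlib(docx_path):
    """Return the text of each paragraph in the document body (hypothetical helper)."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Every <w:p> is a paragraph (including those inside table cells);
    # its visible text sits in <w:t> elements
    for para in root.iter(f"{W_NS}p"):
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```

Calling extract_text_stdlib('document.docx') returns a list of strings, one per paragraph, which can be joined with newlines to approximate the plain-text output of the tools above.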

2. Creating DOCX Files

From plain text:

# Using pandoc
pandoc input.txt -o output.docx

From markdown:

# Basic conversion
pandoc input.md -o output.docx

# With custom styling
pandoc input.md --reference-doc=template.docx -o output.docx

From HTML:

pandoc input.html -o output.docx

Using Python:

from docx import Document
from docx.shared import Inches

# Create new document
doc = Document()

# Add heading
doc.add_heading('Document Title', 0)

# Add paragraph
p = doc.add_paragraph('This is a paragraph with ')
p.add_run('bold').bold = True
p.add_run(' and ')
p.add_run('italic').italic = True
p.add_run(' text.')

# Add table
table = doc.add_table(rows=3, cols=3)
table.style = 'Light Grid Accent 1'

# Add image
doc.add_picture('image.png', width=Inches(4))

# Save
doc.save('output.docx')

3. Converting DOCX

DOCX to PDF:

# Using LibreOffice (headless)
libreoffice --headless --convert-to pdf document.docx

# With output directory
libreoffice --headless --convert-to pdf --outdir ./output document.docx

# Using pandoc (requires a PDF engine; pdflatex by default)
pandoc document.docx -o output.pdf

DOCX to HTML:

# Using pandoc
pandoc document.docx -o output.html

# With standalone HTML
pandoc document.docx -s -o output.html

DOCX to Markdown:

# Clean markdown output
pandoc document.docx -t markdown -o output.md

# GitHub-flavored markdown
pandoc document.docx -t gfm -o output.md

DOC to DOCX:

# Using LibreOffice
libreoffice --headless --convert-to docx document.doc

4. Template Processing

Mail merge / Variable substitution:

from docx import Document

def fill_template(template_path, output_path, data):
    doc = Document(template_path)

    # Replace {key} placeholders in paragraphs (f'{{{key}}}' renders as '{key}')
    for paragraph in doc.paragraphs:
        for key, value in data.items():
            if f'{{{key}}}' in paragraph.text:
                paragraph.text = paragraph.text.replace(f'{{{key}}}', str(value))

    # Replace in tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for key, value in data.items():
                    if f'{{{key}}}' in cell.text:
                        cell.text = cell.text.replace(f'{{{key}}}', str(value))

    doc.save(output_path)

# Usage
data = {
    'name': 'John Doe',
    'date': '2026-01-22',
    'company': 'Acme Corp'
}
fill_template('template.docx', 'output.docx', data)
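
Assigning to paragraph.text rebuilds the paragraph as a single run, so bold, italic, and other run-level formatting inside it is lost. The sketch below is one possible variant that edits each run in place instead. The names replace_in_runs and fill_template_keep_formatting are hypothetical, and the approach assumes each placeholder sits entirely inside one run; Word sometimes splits text across runs, in which case a placeholder will not match.

```python
def replace_in_runs(paragraph, data):
    """Replace {key} placeholders run by run so each run keeps its formatting.

    Hypothetical helper: `paragraph` is any object exposing `.runs`, where each
    run has a `.text` attribute (as python-docx paragraphs do).
    """
    for run in paragraph.runs:
        for key, value in data.items():
            placeholder = "{" + key + "}"
            if placeholder in run.text:
                run.text = run.text.replace(placeholder, str(value))


def fill_template_keep_formatting(template_path, output_path, data):
    """Variant of fill_template that edits runs instead of paragraph.text."""
    # Imported here so replace_in_runs has no hard dependency on python-docx
    from docx import Document

    doc = Document(template_path)
    for paragraph in doc.paragraphs:
        replace_in_runs(paragraph, data)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    replace_in_runs(paragraph, data)
    doc.save(output_path)
```

It is called the same way as the original, for example fill_template_keep_formatting('template.docx', 'output.docx', data).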

5. Analyzing DOCX Structure

Get document stats:

from docx import Document

doc = Document('document.docx')

print(f"Paragraphs: {len(doc.paragraphs)}")
print(f"Tables: {len(doc.tables)}")
print(f"Sections: {len(doc.sections)}")

# Word count (body paragraphs only; text inside tables is not included)
text = ' '.join([p.text for p in doc.paragraphs])
print(f"Words: {len(text.split())}")

Extract all headings:

from docx import Document

doc = Document('document.docx')

for para in doc.paragraphs:
    if para.style.name.startswith('Heading'):
        print(f"{para.style.name}: {para.text}")

List styles used:

from docx import Document

doc = Document('document.docx')
styles = set(p.style.name for p in doc.paragraphs)
print("Styles used:", styles)

6. Modifying Existing DOCX

Add content to existing document:

from docx import Document

doc = Document('existing.docx')

# Add new paragraph
doc.add_paragraph('New paragraph added')

# Add page break
doc.add_page_break()

# Save
doc.save('modified.docx')

Replace text globally:

from docx import Document

def replace_text(doc_path, search, replace):
    doc = Document(doc_path)

    for paragraph in doc.paragraphs:
        if search in paragraph.text:
            paragraph.text = paragraph.text.replace(search, replace)

    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                if search in cell.text:
                    cell.text = cell.text.replace(search, replace)

    # Note: this overwrites the input file; save to a new path to keep the original
    doc.save(doc_path)

7. Working with Tables

Extract table data:

from docx import Document

doc = Document('document.docx')

for i, table in enumerate(doc.tables):
    print(f"\nTable {i+1}:")
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print('\t'.join(cells))
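
When the extracted rows need to go into a spreadsheet or another tool, writing them to CSV is often more convenient than printing tab-separated text. The following is a small sketch with hypothetical helper names (table_to_rows, write_table_csv); it assumes a python-docx table object, and merged cells may appear more than once in a row because the same cell is reported for each grid position it spans.

```python
import csv


def table_to_rows(table):
    """Return a table as a list of rows, each a list of stripped cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def write_table_csv(table, csv_path):
    """Write one table to a CSV file (hypothetical helper)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(table_to_rows(table))
```

For example, write_table_csv(doc.tables[0], 'table_1.csv') would export the first table of the document loaded above.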

Create formatted table:

from docx import Document

doc = Document()

# Create table
table = doc.add_table(rows=1, cols=3)
table.style = 'Light List Accent 1'

# Header row
header_cells = table.rows[0].cells
header_cells[0].text = 'Name'
header_cells[1].text = 'Age'
header_cells[2].text = 'City'

# Data rows
data = [
    ('John', '30', 'New York'),
    ('Jane', '25', 'London')
]

for name, age, city in data:
    row = table.add_row().cells
    row[0].text = name
    row[1].text = age
    row[2].text = city

doc.save('table.docx')

8. Common Workflows

Create report from data

# Generate from JSON data
python3 << 'EOF'
import json
from docx import Document

# Load data
with open('data.json') as f:
    data = json.load(f)

# Create document
doc = Document()
doc.add_heading(data['title'], 0)

for section in data['sections']:
    doc.add_heading(section['heading'], 1)
    doc.add_paragraph(section['content'])

doc.save('report.docx')
EOF

Batch convert DOC to DOCX

# Convert all .doc files in directory
for file in *.doc; do
    libreoffice --headless --convert-to docx "$file"
done

Extract all links

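from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

doc = Document('document.docx')

# Hyperlink targets are stored as relationships on the document part;
# this covers links in the body, not those in headers or footers
for rel in doc.part.rels.values():
    if rel.reltype == RT.HYPERLINK:
        print(rel.target_ref)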

Tools Required

Install necessary tools:

Linux (Ubuntu/Debian):

sudo apt-get install pandoc libreoffice python3-pip antiword
pip3 install python-docx

macOS:

brew install pandoc libreoffice
pip3 install python-docx

Windows:

# Install Pandoc and LibreOffice manually
pip install python-docx

Security Notes

  • ✅ Validate file paths before processing
  • ✅ Check file sizes to prevent memory issues
  • ✅ Sanitize user input in templates
  • ✅ Be cautious with macro-enabled documents (.docm)
  • ✅ Scan documents from untrusted sources
  • ✅ Don't execute embedded scripts automatically
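
Several of the checks above can be automated before a file is handed to any of the tools in this skill. The sketch below is one possible pre-flight check using only the standard library. The name check_docx and the 50 MB limit are assumptions to adjust for your environment, and it does not replace a malware scanner.

```python
import os
import zipfile

MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumed limit; tune for your environment


def check_docx(path, max_size=MAX_SIZE_BYTES):
    """Return a list of problems found with `path`; an empty list means it passed.

    Hypothetical pre-flight check, not a substitute for a malware scanner.
    """
    if not os.path.isfile(path):
        return [f"not a regular file: {path}"]

    problems = []
    if os.path.getsize(path) > max_size:
        problems.append("file exceeds size limit")
    if not zipfile.is_zipfile(path):
        problems.append("not a ZIP archive, so not a valid DOCX")
        return problems

    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    if "word/document.xml" not in names:
        problems.append("missing word/document.xml")
    # Macro-enabled files carry a VBA project part
    if any(name.endswith("vbaProject.bin") for name in names):
        problems.append("contains VBA macros")
    return problems
```

A caller might refuse to continue when the returned list is non-empty and show the problems to the user instead.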

When to Use This Skill

Use /docx when the user:

  • Wants to read or extract text from Word documents
  • Needs to create DOCX files from templates or data
  • Wants to convert DOCX to other formats (PDF, HTML, Markdown)
  • Asks to process mail merge or template variables
  • Needs to extract tables, images, or structure from documents
  • Wants to batch process multiple Word files
  • Asks to modify existing Word documents

Always confirm before overwriting existing files.
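
One way to apply this rule in scripts is to wrap the check in a small helper and call it before every save. The following is a minimal sketch; confirm_overwrite is a hypothetical name, and the ask parameter is an assumption that lets the prompt be swapped out in non-interactive contexts.

```python
import os


def confirm_overwrite(path, ask=input):
    """Return True if it is OK to write to `path`.

    Hypothetical helper: `ask` is any callable that takes a prompt and returns
    the user's reply, which makes the check easy to test or replace.
    """
    if not os.path.exists(path):
        return True
    reply = ask(f"{path} already exists. Overwrite? [y/N] ")
    return reply.strip().lower() in ("y", "yes")
```

A typical use would be: if confirm_overwrite('output.docx'): doc.save('output.docx').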

First Seen
Mar 1, 2026