docx
SKILL.md
Word Document (DOCX) Processing Skill
When working with Word documents, follow these guidelines:
1. Reading & Extracting from DOCX
Extract text content:
# Using pandoc
pandoc document.docx -t plain -o output.txt
# Using python-docx
python3 -c "
import docx
doc = docx.Document('document.docx')
for para in doc.paragraphs:
print(para.text)
"
# Using antiword (for .doc files)
antiword document.doc > output.txt
Extract with formatting:
# Convert to markdown (preserves structure)
pandoc document.docx -t markdown -o output.md
# Convert to HTML
pandoc document.docx -t html -o output.html
Extract images:
# DOCX files are ZIP archives
unzip -j document.docx 'word/media/*' -d extracted_images/
# Or using python
python3 << 'EOF'
import docx
import os
doc = docx.Document('document.docx')
os.makedirs('images', exist_ok=True)
for i, rel in enumerate(doc.part.rels.values()):
if "image" in rel.target_ref:
img = rel.target_part.blob
with open(f'images/image_{i}.png', 'wb') as f:
f.write(img)
EOF
2. Creating DOCX Files
From plain text:
# Using pandoc
pandoc input.txt -o output.docx
From markdown:
# Basic conversion
pandoc input.md -o output.docx
# With custom styling
pandoc input.md --reference-doc=template.docx -o output.docx
From HTML:
pandoc input.html -o output.docx
Using Python:
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Create new document
doc = Document()
# Add heading
doc.add_heading('Document Title', 0)
# Add paragraph
p = doc.add_paragraph('This is a paragraph with ')
p.add_run('bold').bold = True
p.add_run(' and ')
p.add_run('italic').italic = True
p.add_run(' text.')
# Add table
table = doc.add_table(rows=3, cols=3)
table.style = 'Light Grid Accent 1'
# Add image
doc.add_picture('image.png', width=Inches(4))
# Save
doc.save('output.docx')
3. Converting DOCX
DOCX to PDF:
# Using LibreOffice (headless)
libreoffice --headless --convert-to pdf document.docx
# With output directory
libreoffice --headless --convert-to pdf --outdir ./output document.docx
# Using pandoc (requires LaTeX)
pandoc document.docx -o output.pdf
DOCX to HTML:
# Using pandoc
pandoc document.docx -o output.html
# With standalone HTML
pandoc document.docx -s -o output.html
DOCX to Markdown:
# Clean markdown output
pandoc document.docx -t markdown -o output.md
# GitHub-flavored markdown
pandoc document.docx -t gfm -o output.md
DOC to DOCX:
# Using LibreOffice
libreoffice --headless --convert-to docx document.doc
4. Template Processing
Mail merge / Variable substitution:
from docx import Document
def fill_template(template_path, output_path, data):
doc = Document(template_path)
# Replace in paragraphs
for paragraph in doc.paragraphs:
for key, value in data.items():
if f'{{{key}}}' in paragraph.text:
paragraph.text = paragraph.text.replace(f'{{{key}}}', str(value))
# Replace in tables
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
for key, value in data.items:
if f'{{{key}}}' in cell.text:
cell.text = cell.text.replace(f'{{{key}}}', str(value))
doc.save(output_path)
# Usage
data = {
'name': 'John Doe',
'date': '2026-01-22',
'company': 'Acme Corp'
}
fill_template('template.docx', 'output.docx', data)
5. Analyzing DOCX Structure
Get document stats:
from docx import Document
doc = Document('document.docx')
print(f"Paragraphs: {len(doc.paragraphs)}")
print(f"Tables: {len(doc.tables)}")
print(f"Sections: {len(doc.sections)}")
# Word count
text = ' '.join([p.text for p in doc.paragraphs])
print(f"Words: {len(text.split())}")
Extract all headings:
from docx import Document
doc = Document('document.docx')
for para in doc.paragraphs:
if para.style.name.startswith('Heading'):
print(f"{para.style.name}: {para.text}")
List styles used:
from docx import Document
doc = Document('document.docx')
styles = set(p.style.name for p in doc.paragraphs)
print("Styles used:", styles)
6. Modifying Existing DOCX
Add content to existing document:
from docx import Document
doc = Document('existing.docx')
# Add new paragraph
doc.add_paragraph('New paragraph added')
# Add page break
doc.add_page_break()
# Save
doc.save('modified.docx')
Replace text globally:
from docx import Document
def replace_text(doc_path, search, replace):
doc = Document(doc_path)
for paragraph in doc.paragraphs:
if search in paragraph.text:
paragraph.text = paragraph.text.replace(search, replace)
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
if search in cell.text:
cell.text = cell.text.replace(search, replace)
doc.save(doc_path)
7. Working with Tables
Extract table data:
from docx import Document
doc = Document('document.docx')
for i, table in enumerate(doc.tables):
print(f"\nTable {i+1}:")
for row in table.rows:
cells = [cell.text for cell in row.cells]
print('\t'.join(cells))
Create formatted table:
from docx import Document
doc = Document()
# Create table
table = doc.add_table(rows=1, cols=3)
table.style = 'Light List Accent 1'
# Header row
header_cells = table.rows[0].cells
header_cells[0].text = 'Name'
header_cells[1].text = 'Age'
header_cells[2].text = 'City'
# Data rows
data = [
('John', '30', 'New York'),
('Jane', '25', 'London')
]
for name, age, city in data:
row = table.add_row().cells
row[0].text = name
row[1].text = age
row[2].text = city
doc.save('table.docx')
8. Common Workflows
Create report from data
# Generate from JSON data
python3 << 'EOF'
import json
from docx import Document
from docx.shared import Inches
# Load data
with open('data.json') as f:
data = json.load(f)
# Create document
doc = Document()
doc.add_heading(data['title'], 0)
for section in data['sections']:
doc.add_heading(section['heading'], 1)
doc.add_paragraph(section['content'])
doc.save('report.docx')
EOF
Batch convert DOC to DOCX
# Convert all .doc files in directory
for file in *.doc; do
libreoffice --headless --convert-to docx "$file"
done
Extract all links
from docx import Document
doc = Document('document.docx')
for paragraph in doc.paragraphs:
for run in paragraph.runs:
if run.font.underline and 'http' in run.text:
print(run.text)
Tools Required
Install necessary tools:
Linux (Ubuntu/Debian):
sudo apt-get install pandoc libreoffice python3-pip antiword
pip3 install python-docx
macOS:
brew install pandoc libreoffice
pip3 install python-docx
Windows:
# Install Pandoc and LibreOffice manually
pip install python-docx
Security Notes
- ✅ Validate file paths before processing
- ✅ Check file sizes to prevent memory issues
- ✅ Sanitize user input in templates
- ✅ Be cautious with macro-enabled documents (.docm)
- ✅ Scan documents from untrusted sources
- ✅ Don't execute embedded scripts automatically
When to Use This Skill
Use /docx when the user:
- Wants to read or extract text from Word documents
- Needs to create DOCX files from templates or data
- Wants to convert DOCX to other formats (PDF, HTML, Markdown)
- Asks to process mail merge or template variables
- Needs to extract tables, images, or structure from documents
- Wants to batch process multiple Word files
- Asks to modify existing Word documents
Always confirm before overwriting existing files.
Weekly Installs
2
Repository
thechandanbhaga…e-skillsGitHub Stars
2
First Seen
Mar 1, 2026
Installed on
opencode2
gemini-cli2
codebuddy2
github-copilot2
codex2
kimi-cli2