DOCX Processing

Overview

Generate, manipulate, and template Word documents programmatically. This skill covers python-docx for direct document creation, docxtpl for Jinja2-based template filling, formatting control (headings, tables, images, headers/footers), mail merge operations, style management, and conversion strategies.

Apply this skill whenever Word documents need to be created, populated, or transformed through code rather than manual editing.

Multi-Phase Process

Phase 1: Requirements

Decide whether to create the document from scratch or fill a template
Identify document structure (sections, headers, tables, images)
Define data sources (JSON, CSV, database, API)
Plan styling requirements (fonts, colors, margins)
Determine output format (DOCX only, or DOCX plus PDF conversion)

STOP — Do NOT begin implementation until the approach (scratch vs template) is decided and data sources are confirmed.
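
The data-source step can be prototyped before any document code exists. The following is a minimal sketch, assuming records arrive either as a JSON array of objects or as a CSV file with a header row. `load_records` is a hypothetical helper name, not part of python-docx or docxtpl.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Return a list of dicts from a .json or .csv file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of objects")
        return data
    if suffix == ".csv":
        # newline="" lets the csv module handle embedded line breaks itself
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported data source: {suffix or path.name}")
```

Each returned dict can then serve as (or feed into) a template context. Note that CSV values always arrive as strings, so numeric formatting has to happen afterwards.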

Phase 2: Implementation

Set up document template or create from scratch
Implement data binding and content generation
Apply formatting and styles
Add headers, footers, and page numbers
Handle images and embedded objects

STOP — Do NOT skip to validation until all document sections are implemented.

Phase 3: Validation

Verify document renders correctly in Word/LibreOffice
Check formatting consistency across pages
Validate data accuracy in generated documents
Test with edge cases (long text, missing data, special characters)
Verify PDF conversion if required
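
Part of this validation can be automated. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`, so a rendered file can be scanned for template tags that were never substituted. This is a rough sketch using only the standard library. `find_leftover_tags` is a hypothetical name, and the check covers only the main document part, not headers or footers.

```python
import re
import zipfile

# Matches Jinja2 delimiters that should not survive rendering
LEFTOVER_TAG = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)


def find_leftover_tags(docx_path):
    """Return Jinja2-looking tags still present in the main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Drop XML markup so tags that Word split across runs are still detected
    text = re.sub(r"<[^>]+>", "", xml)
    return LEFTOVER_TAG.findall(text)
```

An empty list suggests every tag was rendered. It does not replace opening the file in Word or LibreOffice, since layout problems are invisible at this level.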

Approach Decision Table

Scenario	Approach	Library	Why
One-off report generation	From scratch	python-docx	Full programmatic control
Recurring reports with fixed layout	Template	docxtpl	Design layout in Word, fill with data
Bulk letter generation (mail merge)	Template	docxtpl	One template, many outputs
Complex formatting, custom styles	From scratch	python-docx	Direct access to document model
Non-technical users design template	Template	docxtpl	Users edit in Word, developers bind data
PDF output required	Either + conversion	libreoffice / docx2pdf	Post-processing step

python-docx Patterns

Document Creation

from docx import Document
from docx.shared import Inches, Pt, Cm, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT

doc = Document()

# Set default font
style = doc.styles['Normal']
font = style.font
font.name = 'Calibri'
font.size = Pt(11)

# Add heading
doc.add_heading('Monthly Report', level=0)

# Add paragraph with formatting
para = doc.add_paragraph()
run = para.add_run('Important: ')
run.bold = True
run.font.color.rgb = RGBColor(0xCC, 0x00, 0x00)
para.add_run('This section requires attention.')

# Add table
table = doc.add_table(rows=1, cols=3, style='Light Grid Accent 1')
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Name'
hdr_cells[1].text = 'Department'
hdr_cells[2].text = 'Revenue'

# `data` is assumed to be an iterable of (name, department, revenue) tuples
for name, dept, rev in data:
    row_cells = table.add_row().cells
    row_cells[0].text = name
    row_cells[1].text = dept
    row_cells[2].text = f'${rev:,.2f}'

# Add image
doc.add_picture('chart.png', width=Inches(5.5))

# Save
doc.save('report.docx')

Headers and Footers

from docx.enum.section import WD_ORIENT
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

section = doc.sections[0]

# Page setup
section.page_width = Cm(21)
section.page_height = Cm(29.7)
section.left_margin = Cm(2.5)
section.right_margin = Cm(2.5)
section.top_margin = Cm(2.5)
section.bottom_margin = Cm(2.5)

# Header
header = section.header
header_para = header.paragraphs[0]
header_para.text = 'Company Name — Confidential'
header_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT
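# Note: this edits the paragraph's shared style (normally 'Header'),
# so every paragraph that uses that style picks up the change
header_para.style.font.size = Pt(9)
header_para.style.font.color.rgb = RGBColor(0x88, 0x88, 0x88)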

# Footer with page numbers
footer = section.footer
footer_para = footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Add page number field
run = footer_para.add_run()
fldChar = OxmlElement('w:fldChar')
fldChar.set(qn('w:fldCharType'), 'begin')
run._r.append(fldChar)

run2 = footer_para.add_run()
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')
instrText.text = ' PAGE '
run2._r.append(instrText)

run3 = footer_para.add_run()
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'end')
run3._r.append(fldChar2)

Table Formatting

from docx.shared import Cm, Pt
from docx.oxml.ns import nsdecls
from docx.oxml import parse_xml

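# Set column widths; Word tends to honor cell widths over column widths,
# so set the width on every cell in each column
table.autofit = False
for column, width in zip(table.columns, (Cm(4), Cm(6), Cm(3))):
    for cell in column.cells:
        cell.width = width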

# Cell shading
for cell in table.rows[0].cells:
    shading = parse_xml(f'<w:shd {nsdecls("w")} w:fill="2F5496"/>')
    cell._tc.get_or_add_tcPr().append(shading)
    for paragraph in cell.paragraphs:
        for run in paragraph.runs:
            run.font.color.rgb = RGBColor(0xFF, 0xFF, 0xFF)
            run.font.bold = True

# Cell alignment
for row in table.rows:
    for cell in row.cells:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER

docxtpl Template Patterns

Template Syntax (Jinja2)

Template file (template.docx) contains:

{{ company_name }}
Date: {{ report_date }}

Dear {{ recipient_name }},

{% for item in items %}
- {{ item.name }}: ${{ item.price }}
{% endfor %}

Total: ${{ total }}

{% if urgent %}
URGENT: This requires immediate attention.
{% endif %}

Template Rendering

from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

tpl = DocxTemplate('template.docx')

context = {
    'company_name': 'Acme Corp',
    'report_date': '2025-03-15',
    'recipient_name': 'Alice Johnson',
    'items': [
        {'name': 'Widget A', 'price': '29.99'},
        {'name': 'Widget B', 'price': '49.99'},
    ],
    'total': '79.98',
    'urgent': True,
    'chart': InlineImage(tpl, 'chart.png', width=Mm(120)),
}

tpl.render(context)
tpl.save('output.docx')
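
The context above hardcodes `total` next to the item prices, which lets the two drift apart. One way to avoid that is to derive both from the same data. This is a sketch under assumptions: `build_invoice_context` is a hypothetical helper, and `items` is assumed to be a list of dicts with `name` and `price` keys.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def build_invoice_context(items):
    """Format prices and derive the total from the same data."""
    # Go through str() so float inputs such as 49.99 do not carry binary noise
    priced = [(item["name"], Decimal(str(item["price"]))) for item in items]
    total = sum((price for _, price in priced), Decimal("0"))
    return {
        "items": [
            {"name": name, "price": f"{price.quantize(CENT, ROUND_HALF_UP):,.2f}"}
            for name, price in priced
        ],
        "total": f"{total.quantize(CENT, ROUND_HALF_UP):,.2f}",
    }
```

The result can be merged into a larger context, for example `context.update(build_invoice_context(raw_items))`, before calling `tpl.render(context)`.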

Rich Text in Templates

from docxtpl import RichText

rt = RichText()
rt.add('Normal text ')
rt.add('bold text', bold=True)
rt.add(' and ')
rt.add('red text', color='FF0000')
rt.add(' with ')
rt.add('a link', url_id=tpl.build_url_id('https://example.com'))

context = {'formatted_text': rt}

Tables in Templates

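Template table with a row loop (each `{%tr ... %}` tag sits in its own table row, which docxtpl removes when rendering; note there is no space between `{%` and `tr`):
{%tr for row in table_data %}
{{ row.name }} | {{ row.value }} | {{ row.status }}
{%tr endfor %}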
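
The row loop above expects `table_data` to be a list of objects exposing `name`, `value`, and `status`. Plain dicts work because Jinja2 falls back to key lookup for dotted access. Here is a small sketch of shaping raw records for it. `build_table_context` and `status_labels` are hypothetical names, and the number format is an assumption.

```python
def build_table_context(records, status_labels=None):
    """Map raw records to the name/value/status fields used by the row loop."""
    status_labels = status_labels or {}
    rows = []
    for record in records:
        status = record.get("status") or "unknown"
        rows.append({
            "name": str(record.get("name", "")),
            # Missing or empty values are shown as 0.00 rather than left blank
            "value": f"{float(record.get('value') or 0):,.2f}",
            "status": status_labels.get(status, status),
        })
    return {"table_data": rows}
```

Formatting values here, rather than in the template, keeps the Word file free of logic that non-technical template authors would have to maintain.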

Mail Merge

import csv
import os

from docxtpl import DocxTemplate

os.makedirs('letters', exist_ok=True)  # save() does not create missing directories

with open('recipients.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        # Load a fresh template for every letter
        template = DocxTemplate('letter_template.docx')
        context = {
            'name': row['name'],
            'address': row['address'],
            'amount': row['amount'],
            'due_date': row['due_date'],
        }
        template.render(context)
        # Names containing path separators or other unsafe characters
        # should be sanitized before they are used in a file name
        template.save(f'letters/letter_{i:04d}_{row["name"]}.docx')
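
The output file name in the mail merge loop embeds the recipient name directly, so a name containing `/`, `..`, or other special characters can produce an invalid or unexpected path. Below is a conservative sketch of a sanitizer. `safe_filename` is a hypothetical helper, and the ASCII-only policy is an assumption that may be too strict for some deployments.

```python
import re
import unicodedata


def safe_filename(name, fallback="recipient", max_length=60):
    """Reduce an arbitrary string to a conservative file-name fragment."""
    # Decompose accented characters so the base letters survive the ASCII step
    normalized = unicodedata.normalize("NFKD", name)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace every run of other characters (spaces, slashes, dots) with "_"
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "_", ascii_only).strip("_")
    return cleaned[:max_length] or fallback
```

In the loop this would be used as `f'letters/letter_{i:04d}_{safe_filename(row["name"])}.docx'`. Names written entirely in non-Latin scripts collapse to the fallback, which is why keeping the numeric index in the file name still matters.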

Style Management

Custom Styles

from docx.enum.style import WD_STYLE_TYPE

# Create custom paragraph style
style = doc.styles.add_style('CustomHeading', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = 'Arial'
style.font.size = Pt(16)
style.font.bold = True
style.font.color.rgb = RGBColor(0x2F, 0x54, 0x96)
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(6)

# Apply custom style
doc.add_paragraph('Section Title', style='CustomHeading')

Style Inheritance

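Normal → Heading 1, Heading 2, ... (in Word's default template each heading is based on Normal, not on the previous heading level)
Normal → Body Text
Normal → List Paragraph
Table Normal → Table Grid (table styles inherit separately from paragraph styles)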
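
Inheritance depends on the template a document was built from, so it can be worth checking rather than assuming. python-docx exposes the parent of a paragraph, character, or table style as `base_style`. The helper below is a sketch that walks that attribute. `inheritance_chain` is a hypothetical name, and it works on any object with `name` and `base_style` attributes.

```python
def inheritance_chain(style):
    """Return style names from the given style up to its root ancestor."""
    names = []
    # Track names (not object identity) to guard against a malformed cycle
    while style is not None and style.name not in names:
        names.append(style.name)
        style = style.base_style
    return names


# Usage with python-docx (not executed here):
#   from docx import Document
#   print(inheritance_chain(Document().styles["Heading 2"]))
```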

Conversion Strategies

DOCX to PDF

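# Option 1: LibreOffice (most reliable, server-friendly)
# `input_file` and `output_dir` are paths supplied by the caller;
# the executable is named 'soffice' on some installations
import subprocess
subprocess.run([
    'libreoffice', '--headless', '--convert-to', 'pdf',
    '--outdir', output_dir, input_file
], check=True)  # check=True raises CalledProcessError on a non-zero exit code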

# Option 2: docx2pdf (Windows/macOS with Word installed)
from docx2pdf import convert
convert('input.docx', 'output.pdf')

# Option 3: Generate PDF directly with reportlab for full control
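
For server use, Option 1 usually needs a little more care: locating the executable, bounding the run time, and confirming that a PDF was produced. The following is a sketch under assumptions. `build_convert_command` and `convert_to_pdf` are hypothetical helpers, and the 120-second timeout is an arbitrary default.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(binary, input_file, output_dir):
    """Assemble the headless LibreOffice command for a DOCX-to-PDF conversion."""
    return [
        binary, "--headless", "--convert-to", "pdf",
        "--outdir", str(output_dir), str(input_file),
    ]


def convert_to_pdf(input_file, output_dir, timeout=120):
    """Convert one file and return the expected PDF path."""
    # The executable name varies by platform and package
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise RuntimeError("LibreOffice executable not found on PATH")
    command = build_convert_command(binary, input_file, output_dir)
    subprocess.run(command, check=True, timeout=timeout)
    pdf_path = Path(output_dir) / (Path(input_file).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion finished but {pdf_path} is missing")
    return pdf_path
```

Keeping the command construction separate from the subprocess call makes the former easy to unit test on machines without LibreOffice installed.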

Error Handling

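import jinja2
from docx.opc.exceptions import PackageNotFoundError
from docxtpl import DocxTemplate

# StrictUndefined makes Jinja2 raise on missing variables; the default
# Undefined renders them as empty strings without raising anything
STRICT_ENV = jinja2.Environment(undefined=jinja2.StrictUndefined)

def safe_generate_document(template_path, context, output_path):
    try:
        tpl = DocxTemplate(template_path)
        tpl.render(context, jinja_env=STRICT_ENV)
        tpl.save(output_path)
        return True
    except jinja2.UndefinedError as e:
        print(f"Missing template variable: {e}")
        return False
    except (FileNotFoundError, PackageNotFoundError) as e:
        # python-docx reports a missing or unreadable file as PackageNotFoundError
        print(f"Template not found: {e}")
        return False
    except Exception as e:
        print(f"Document generation failed: {e}")
        return False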
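
In a batch pipeline it is often more useful to keep going after a failure and report all failures at the end than to print and return a flag per document. Here is a sketch of that pattern. `generate_batch` is a hypothetical helper, `jobs` is assumed to be a list of keyword-argument dicts, and `generate` is assumed to raise on failure rather than return `False`.

```python
import logging

logger = logging.getLogger("docgen")


def generate_batch(jobs, generate):
    """Run generate(**job) for each job and collect the outcomes."""
    succeeded, failed = [], []
    for index, job in enumerate(jobs):
        try:
            generate(**job)
        except Exception as exc:
            # Keep going so one bad record does not stop the whole batch
            logger.error("job %d failed: %s", index, exc)
            failed.append((index, str(exc)))
        else:
            succeeded.append(index)
    return succeeded, failed
```

The returned `failed` list pairs each job index with its error message, which can be written to a report or used to retry only the affected records.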

Anti-Patterns / Common Mistakes

Anti-Pattern	Why It Fails	What To Do Instead
Hardcoding font sizes instead of styles	Inconsistent formatting, hard to maintain	Define styles once, apply everywhere
Not handling missing template variables	Missing values render as blanks by default, so incomplete data goes unnoticed	Render with `jinja2.StrictUndefined` to fail fast, or use `default` filters
Huge tables without pagination	Unreadable output, broken layouts	Break tables across pages or summarize
Absolute image paths	Breaks portability across environments	Use relative paths or embed images
Not testing with different Word versions	Formatting breaks silently	Test in Word, LibreOffice, and Google Docs
Modifying XML directly when API exists	Fragile, version-dependent code	Use python-docx API methods first
All direct formatting, no styles	Impossible to maintain consistency	Create and apply named styles
Ignoring Unicode characters	Mojibake in generated documents	Test with accented characters, CJK, symbols
Not re-loading template in mail merge	Corrupted output after first render	Re-instantiate DocxTemplate per iteration

Anti-Rationalization Guards

Do NOT skip the approach decision (scratch vs template) -- it determines your entire implementation.
Do NOT generate documents without testing in at least Word and one alternative viewer.
Do NOT ignore missing data -- handle empty/null fields with defaults or conditional sections.
Do NOT skip error handling in production document generation pipelines.
Do NOT hardcode formatting when styles can be used instead.

Integration Points

Skill	How It Connects
`pdf-processing`	DOCX-to-PDF conversion, or choosing PDF generation directly
`xlsx-processing`	Data from Excel feeds into document generation contexts
`email-composer`	Generated documents attach to professional emails
`content-research-writer`	Research content formatted into whitepapers and reports
`file-organizer`	Output file naming and directory structure conventions
`deployment`	Document generation pipelines in CI/CD or server environments

Skill Type

FLEXIBLE — Choose between python-docx (programmatic) and docxtpl (template-based) based on document complexity. Simple reports may not need templates; complex recurring documents benefit from templates.
