docx
DOCX Creation, Editing, and Analysis
Overview
A .docx file is a ZIP archive containing XML files.
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | Use docx-js -- see Creating New Documents below |
| Edit existing document | Unpack -> edit XML -> repack -- see Editing Existing Documents below |
| Visual review | Convert DOCX -> PDF -> PNGs via soffice + pdftoppm or scripts/render_docx.py |
Converting .doc to .docx
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content
pandoc --track-changes=all document.docx -o output.md
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Or use the bundled render script:
python scripts/render_docx.py /path/to/file.docx --output_dir /tmp/docx_pages
Accepting Tracked Changes
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak } = require('docx');
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
python scripts/office/validate.py doc.docx
Page Size
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly
sections: [{
properties: {
page: {
size: {
width: 12240, // 8.5 inches in DXA
height: 15840 // 11 inches in DXA
},
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }
}
},
children: [/* content */]
}]
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
Landscape: pass portrait dimensions and let docx-js swap:
size: { width: 12240, height: 15840, orientation: PageOrientation.LANDSCAPE }
Styles (Override Built-in Headings)
Use Arial as default font. Keep titles black for readability.
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } },
paragraphStyles: [
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } },
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
]
},
sections: [{ children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
]}]
});
Lists (NEVER use unicode bullets)
// WRONG
new Paragraph({ children: [new TextRun("\u2022 Item")] })
// CORRECT - use numbering config
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "\u2022", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{ children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 }, children: [new TextRun("Item")] }),
]}]
});
Tables
CRITICAL: Set both columnWidths on the table AND width on each cell.
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
width: { size: 9360, type: WidthType.DXA },
columnWidths: [4680, 4680],
rows: [
new TableRow({
children: [
new TableCell({
borders,
width: { size: 4680, type: WidthType.DXA },
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
margins: { top: 80, bottom: 80, left: 120, right: 120 },
children: [new Paragraph({ children: [new TextRun("Cell")] })]
})
]
})
]
})
Always use WidthType.DXA -- never WidthType.PERCENTAGE (breaks in Google Docs).
Images
new Paragraph({
children: [new ImageRun({
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" }
})]
})
Page Breaks
new Paragraph({ children: [new PageBreak()] })
// Or: new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
Table of Contents
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
Headers/Footers
sections: [{
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: [/* content */]
}]
Critical Rules for docx-js
- Set page size explicitly (defaults to A4)
- Landscape: pass portrait dimensions, set
orientation: PageOrientation.LANDSCAPE - Never use
\n-- use separate Paragraph elements - Never use unicode bullets -- use
LevelFormat.BULLET - PageBreak must be in Paragraph
- ImageRun requires
type - Always use
WidthType.DXAfor tables - Tables need dual widths:
columnWidths+ cellwidth - Use
ShadingType.CLEAR, never SOLID - TOC requires HeadingLevel only
- Override built-in styles with exact IDs: "Heading1", "Heading2", etc.
- Include
outlineLevelfor TOC (0 for H1, 1 for H2)
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Step 2: Edit XML
Edit files in unpacked/word/. See XML Reference below.
Use the Edit tool directly for string replacement. Do not write Python scripts.
Use smart quotes for new content:
<w:t>Here’s a quote: “Hello”</w:t>
| Entity | Character |
|---|---|
‘ |
' (left single) |
’ |
' (right single / apostrophe) |
“ |
" (left double) |
” |
" (right double) |
Adding comments: Use comment.py:
python scripts/comment.py unpacked/ 0 "Comment text with & and ’"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0
python scripts/comment.py unpacked/ 0 "Text" --author "Kortix Agent"
Then add markers to document.xml (see Comments in XML Reference).
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Common Pitfalls
- Replace entire
<w:r>elements when adding tracked changes - Preserve
<w:rPr>formatting -- copy original run's formatting into tracked change runs
XML Reference
Schema Compliance
- Element order in
<w:pPr>:<w:pStyle>,<w:numPr>,<w:spacing>,<w:ind>,<w:jc>,<w:rPr>last - Whitespace: Add
xml:space="preserve"to<w:t>with leading/trailing spaces - RSIDs: Must be 8-digit hex (e.g.,
00AB1234)
Tracked Changes
Insertion:
<w:ins w:id="1" w:author="Kortix Agent" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
<w:del w:id="2" w:author="Kortix Agent" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside <w:del>: use <w:delText> instead of <w:t>.
Minimal edits -- only mark what changes:
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Kortix Agent" w:date="...">
<w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Kortix Agent" w:date="...">
<w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
Deleting entire paragraphs -- also mark paragraph mark as deleted:
<w:p>
<w:pPr>
<w:rPr>
<w:del w:id="1" w:author="Kortix Agent" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Kortix Agent" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content...</w:delText></w:r>
</w:del>
</w:p>
Rejecting another author's insertion:
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Kortix Agent" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
Restoring another author's deletion:
<w:del w:author="Jane" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Kortix Agent" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
</w:ins>
Comments
Markers are direct children of w:p, never inside w:r:
<w:commentRangeStart w:id="0"/>
<w:r><w:t>commented text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
Images
- Add image file to
word/media/ - Add relationship to
word/_rels/document.xml.rels - Add content type to
[Content_Types].xml - Reference in document.xml with
<w:drawing>
Visual Review Workflow
- Convert DOCX -> PDF -> PNGs
- Use
scripts/render_docx.py(requirespdf2imageand Poppler) - After each meaningful change, re-render and inspect
- If visual review is not possible, extract text with
python-docxas fallback
Quality Expectations
- Deliver client-ready documents: consistent typography, spacing, margins, clear hierarchy
- Avoid formatting defects: clipped/overlapping text, broken tables, unreadable characters
- Use ASCII hyphens only. Avoid U+2011 and other Unicode dashes
- Re-render and inspect every page at 100% zoom before final delivery
Dependencies
- pandoc: Text extraction
- docx:
npm install -g docx(new documents) - LibreOffice: PDF conversion (via
scripts/office/soffice.py) - Poppler:
pdftoppmfor images - python-docx: Reading/analysis (
pip install python-docx) - pdf2image: Rendering (
pip install pdf2image)