docx
DOCX creation, editing, and analysis
Overview
A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
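Because a .docx file is an ordinary ZIP archive, its parts can be listed with Python's standard library. The following is a minimal sketch; the helper name list_docx_parts and the in-memory stand-in archive are illustrative assumptions, not part of any tool described here.

```python
import io
import zipfile

def list_docx_parts(source):
    """Return the names of all parts stored in a .docx archive (hypothetical helper)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()

# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
buf.seek(0)

parts = list_docx_parts(buf)
print(parts)
```

With a real document, pass the file path instead of the in-memory buffer.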
Workflow Decision Tree
Reading/Analyzing Content
Use the "Text extraction" or "Raw XML access" sections below
Creating New Document
Use the "Creating a new Word document" workflow
Editing Existing Document
- Your own document + simple changes: use the "Basic OOXML editing" workflow
- Someone else's document: use the "Redlining workflow" (recommended default)
- Legal, academic, business, or government docs: use the "Redlining workflow" (required)
Reading and analyzing content
Text extraction
If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
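If the conversion needs to be driven from a script, the pandoc call above could be wrapped roughly as follows. This is a sketch under assumptions: the function names pandoc_command and extract_markdown are hypothetical, and pandoc must be on the PATH for extract_markdown to work.

```python
import subprocess

# The three values the --track-changes option accepts, per the note above.
TRACK_MODES = ("accept", "reject", "all")

def pandoc_command(docx_path, out_path, track_changes="all"):
    """Build the pandoc argument list without running it (hypothetical helper)."""
    if track_changes not in TRACK_MODES:
        raise ValueError(f"track_changes must be one of {TRACK_MODES}")
    return ["pandoc", f"--track-changes={track_changes}", docx_path, "-o", out_path]

def extract_markdown(docx_path, out_path, track_changes="all"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(docx_path, out_path, track_changes), check=True)

cmd = pandoc_command("path-to-file.docx", "output.md")
print(" ".join(cmd))
```

Building the argument list separately from running it keeps the command easy to inspect before anything is executed.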
Raw XML access
You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata.
Key file structures
- word/document.xml - Main document contents
- word/comments.xml - Comments referenced in document.xml
- word/media/ - Embedded images and media files
- Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
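To illustrate how tracked changes appear in word/document.xml, here is a small sketch that collects inserted and deleted text from an XML string. The function name tracked_changes and the SAMPLE fragment are assumptions made for illustration. The sketch prefers defusedxml when it is installed and otherwise falls back to the standard library parser, which should only be used on trusted input.

```python
try:
    from defusedxml.ElementTree import fromstring  # safer parser, if installed
except ImportError:
    from xml.etree.ElementTree import fromstring  # fallback for trusted input only

# WordprocessingML namespace in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def tracked_changes(document_xml):
    """Return (inserted, deleted) text, one entry per <w:ins> / <w:del> element."""
    root = fromstring(document_xml)
    # Inserted runs keep their text in <w:t>; deleted runs use <w:delText>.
    inserted = ["".join(t.text or "" for t in ins.iter(W + "t"))
                for ins in root.iter(W + "ins")]
    deleted = ["".join(t.text or "" for t in d.iter(W + "delText"))
               for d in root.iter(W + "del")]
    return inserted, deleted

# Illustrative fragment, not taken from a real document.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t>The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t> days.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

inserted, deleted = tracked_changes(SAMPLE)
print(inserted, deleted)
```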
Creating a new Word document
When creating a new Word document from scratch, use docx-js, which allows you to create Word documents using JavaScript/TypeScript.
Workflow
- Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components
- Export as .docx using Packer.toBuffer()
Editing an existing Word document
When editing an existing Word document, work with the raw OOXML format by unpacking, editing XML, and repacking.
Workflow
- Unpack the document
- Create and run a Python script to edit the XML
- Pack the final document
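The three steps above can be sketched with the standard library alone. The helper names unpack, replace_text, and pack are hypothetical, and the stand-in document is built in a temporary directory so the example is self-contained. Note that this makes a plain, untracked edit; it does not produce <w:ins> or <w:del> markup.

```python
import os
import tempfile
import zipfile

try:
    from defusedxml.minidom import parse  # safer parser, if installed
except ImportError:
    from xml.dom.minidom import parse  # fallback for trusted input only

def unpack(docx_path, out_dir):
    """Step 1: extract every part of the archive into out_dir."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(out_dir)

def replace_text(xml_path, old, new):
    """Step 2: replace text inside <w:t> elements, keeping prefixes as written."""
    dom = parse(xml_path)
    for node in dom.getElementsByTagName("w:t"):
        if node.firstChild is not None:
            node.firstChild.data = node.firstChild.data.replace(old, new)
    with open(xml_path, "wb") as fh:
        fh.write(dom.toxml(encoding="UTF-8"))

def pack(src_dir, docx_path):
    """Step 3: zip the directory back up using forward-slash part names."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(folder, name)
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                archive.write(full, arcname)

# Stand-in document so the sketch runs without a real file.
work = tempfile.mkdtemp()
src, dst = os.path.join(work, "in.docx"), os.path.join(work, "out.docx")
with zipfile.ZipFile(src, "w") as archive:
    archive.writestr(
        "word/document.xml",
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        '<w:body><w:p><w:r><w:t>Hello draft</w:t></w:r></w:p></w:body></w:document>',
    )

unpacked = os.path.join(work, "unpacked")
unpack(src, unpacked)
replace_text(os.path.join(unpacked, "word", "document.xml"), "draft", "final")
pack(unpacked, dst)

with zipfile.ZipFile(dst) as archive:
    result = archive.read("word/document.xml").decode("utf-8")
print(result)
```

The sketch uses minidom rather than ElementTree because minidom writes namespace prefixes back exactly as it read them, which matters for real documents that declare many namespaces.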
Converting Documents to Images
To visually analyze Word documents, convert them to images using a two-step process:
- Convert DOCX to PDF:
soffice --headless --convert-to pdf document.docx
- Convert PDF pages to JPEG images:
pdftoppm -jpeg -r 150 document.pdf page
This creates files like page-1.jpg, page-2.jpg, etc.
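The two-step conversion could be scripted along these lines. This is a sketch, not a definitive implementation: the function names conversion_commands and docx_to_images are hypothetical, both soffice and pdftoppm must be installed, and the --outdir option is added here (an assumption beyond the commands above) so the PDF lands next to the source file.

```python
import subprocess
from pathlib import Path

def conversion_commands(docx_path, prefix="page", dpi=150):
    """Build the soffice and pdftoppm argument lists without running them."""
    docx = Path(docx_path)
    pdf = docx.with_suffix(".pdf")  # soffice names the PDF after the source file
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(docx.parent), str(docx)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf),
               str(docx.parent / prefix)]
    return to_pdf, to_jpeg

def docx_to_images(docx_path):
    """Run both steps in order; raises CalledProcessError if either one fails."""
    for cmd in conversion_commands(docx_path):
        subprocess.run(cmd, check=True)

to_pdf, to_jpeg = conversion_commands("document.docx")
print(to_pdf)
print(to_jpeg)
```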
Code Style Guidelines
IMPORTANT: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
Dependencies
Required dependencies (install if not available):
- pandoc: sudo apt-get install pandoc (for text extraction)
- docx: npm install -g docx (for creating new documents)
- LibreOffice: sudo apt-get install libreoffice (for PDF conversion)
- Poppler: sudo apt-get install poppler-utils (for pdftoppm to convert PDF to images)
- defusedxml: pip install defusedxml (for secure XML parsing)