pdf-word-reader-zh
PDF Word Reader ZH
Purpose
Turn a document path into three analysis artifacts:
- structured extraction
- chunked evidence index
- understanding report scaffold
Use this skill when output quality requires traceable evidence, not only a short summary.
Input Support
- .pdf
- .docx
- .pptx (convert to PDF first)
For .doc, convert to .docx before using this skill.
One-Command Execution
python scripts/prepare_document_context.py "<input-file>" --output-dir "output/document-understanding"
Decision Flow
- Detect input type by extension.
- If .pptx, convert to PDF:
  - Prefer soffice when available.
  - Fall back to Microsoft PowerPoint COM on Windows.
- Extract content:
  - PDF: page text + table extraction; use OCR fallback for low-text pages.
  - DOCX: paragraphs + table extraction.
- Normalize and chunk long text into stable chunk IDs (C001, C002, ...).
- Generate understanding scaffold with evidence index.
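The routing and chunking steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the names `route_for` and `chunk_text`, the route labels, and the 1200-character limit are all assumptions. IDs are stable in the sense that the same text and limit always yield the same `C001`, `C002`, ... sequence.

```python
from pathlib import Path

# Hypothetical route labels; the real script may name these differently.
SUPPORTED = {".pdf": "pdf", ".docx": "docx", ".pptx": "pptx_to_pdf"}


def route_for(path: str) -> str:
    """Pick a processing route from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported input type: {suffix or '(none)'}")
    return SUPPORTED[suffix]


def chunk_text(text: str, max_chars: int = 1200) -> list[dict]:
    """Pack paragraphs into chunks of at most max_chars, numbered C001, C002, ..."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces: list[str] = []
    current = ""
    for para in paragraphs:
        # Hard-split a paragraph that is longer than the limit on its own.
        while len(para) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = para
    if current:
        pieces.append(current)
    return [
        {"chunk_id": f"C{i:03d}", "char_count": len(piece), "text": piece}
        for i, piece in enumerate(pieces, start=1)
    ]
```

Splitting on paragraph boundaries first keeps most chunks readable as evidence; the hard split only applies to paragraphs that exceed the limit by themselves.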
Output Contract
Write the following files to the output directory:
- 01_extracted.json
- 02_chunks.json
- 03_understanding_report.md
01_extracted.json
Must include:
- source_file
- file_type
- full_text
- structure fields (pages or paragraphs/tables)
- warnings and runtime metadata when available
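A quick way to check an extraction file against this list is sketched below. The helper name `missing_extracted_fields` is hypothetical, and the rule that a top-level `pages` or `paragraphs` key satisfies the structure requirement is an assumption about how the JSON is laid out.

```python
REQUIRED_KEYS = ("source_file", "file_type", "full_text")


def missing_extracted_fields(doc: dict) -> list[str]:
    """Return the required fields absent from a parsed 01_extracted.json."""
    missing = [key for key in REQUIRED_KEYS if key not in doc]
    # Assumption: PDF output carries "pages", DOCX output carries "paragraphs".
    if "pages" not in doc and "paragraphs" not in doc:
        missing.append("pages or paragraphs")
    return missing
```

An empty list means the required fields are present; warnings and runtime metadata are optional, so they are not checked.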
02_chunks.json
Must include:
- chunk_count
- chunks[] with fields:
  - chunk_id
  - char_count
  - estimated_tokens
  - text
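The chunk file can be checked in the same spirit. In this sketch, `check_chunks` is a hypothetical helper, and the expectation that `chunk_count` equals the length of `chunks[]` is an assumption, not something the contract states outright.

```python
CHUNK_FIELDS = ("chunk_id", "char_count", "estimated_tokens", "text")


def check_chunks(payload: dict) -> list[str]:
    """Return a list of problems found in a parsed 02_chunks.json."""
    chunks = payload.get("chunks")
    if not isinstance(chunks, list):
        return ["chunks[] is missing or not a list"]
    problems = []
    # Assumption: chunk_count is meant to equal the number of chunk entries.
    if payload.get("chunk_count") != len(chunks):
        problems.append("chunk_count does not match len(chunks)")
    for index, chunk in enumerate(chunks):
        for field in CHUNK_FIELDS:
            if field not in chunk:
                problems.append(f"chunks[{index}] is missing {field}")
    return problems
```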
03_understanding_report.md
Must include:
- document profile
- key lines
- chunk evidence index
- deliverable template for final analysis
Recommended Parameters
- --disable-ocr: disable OCR fallback
- --max-pages N: quick verification on first N pages
- --fail-on-empty: stop when no text extracted
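When the script is driven from other Python code, the flags above can be assembled into an argument list. The wrapper below is only a sketch: `build_command` is a hypothetical helper, and it uses only the script path and flags named in this document.

```python
from typing import Optional


def build_command(
    input_file: str,
    output_dir: str,
    max_pages: Optional[int] = None,
    disable_ocr: bool = False,
    fail_on_empty: bool = False,
) -> list[str]:
    """Build the argument list for prepare_document_context.py."""
    cmd = [
        "python",
        "scripts/prepare_document_context.py",
        input_file,
        "--output-dir",
        output_dir,
    ]
    if disable_ocr:
        cmd.append("--disable-ocr")
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if fail_on_empty:
        cmd.append("--fail-on-empty")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.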
Error Handling Rules
If conversion/extraction fails:
- Return a concrete cause (missing dependency, unsupported format, conversion error).
- Provide exact next action (install command or format conversion step).
- Do not fabricate extracted content.
If extracted text is low quality:
- Keep warnings in output metadata.
- Continue chunking with available text.
- Explicitly mark uncertainty in report.
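One way to keep failure reports concrete is to map each failure type to a cause and a next action, as sketched below. The exception classes, the `Failure` record, and the wording of the suggested actions are all hypothetical; only the pip command comes from this document.

```python
from dataclasses import dataclass


class UnsupportedFormatError(Exception):
    """Hypothetical error for an input type the skill does not handle."""


class ConversionError(Exception):
    """Hypothetical error for a failed PPTX-to-PDF conversion."""


@dataclass
class Failure:
    cause: str
    next_action: str


def diagnose(exc: Exception) -> Failure:
    """Translate an exception into a concrete cause and an exact next step."""
    if isinstance(exc, ModuleNotFoundError):
        return Failure(
            cause=f"missing dependency: {exc.name}",
            next_action="python -m pip install -r requirements.txt",
        )
    if isinstance(exc, UnsupportedFormatError):
        return Failure(
            cause=f"unsupported format: {exc}",
            next_action="convert the file to .pdf or .docx, then rerun",
        )
    if isinstance(exc, ConversionError):
        return Failure(
            cause=f"conversion error: {exc}",
            next_action="install LibreOffice (soffice) or export the slides to PDF manually",
        )
    # Unknown failures are reported as-is; no extracted content is invented.
    return Failure(cause=f"unexpected error: {exc!r}", next_action="report the error unchanged")
```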
Dependency Baseline
Install Python deps:
python -m pip install -r requirements.txt
Recommended system tools:
- tesseract + chi_sim
- pdftoppm (Poppler)
- soffice (LibreOffice), or Microsoft PowerPoint (Windows)
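Before a run, the presence of these tools on `PATH` can be checked with the standard library. The sketch below uses `shutil.which`; the helper name `missing_tools` is hypothetical, and the check does not confirm that the chi_sim language data is installed.

```python
import shutil
from typing import Callable, Iterable, Optional

SYSTEM_TOOLS = ("tesseract", "pdftoppm", "soffice")


def missing_tools(
    tools: Iterable[str] = SYSTEM_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` with no arguments inspects the current environment; the `which` parameter exists only so the lookup can be replaced in tests.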
Final Analysis Rules
When producing final conclusions from output artifacts:
- Read all chunks before writing conclusions.
- Cite chunk IDs for key claims, e.g. [C003][C011].
- Separate facts from assumptions.
- List missing evidence and unresolved questions.
- Keep numeric claims exactly aligned with extracted text.
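Two of these rules lend themselves to a mechanical check: cited chunk IDs should exist, and numbers in a claim should appear in the cited text. The sketch below is an assumption-laden illustration; both helper names are hypothetical, and the number check is a plain substring match, so it can miss reformatted figures (for example `3400` versus `3,400`).

```python
import re

CITATION = re.compile(r"\[(C\d{3,})\]")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def unknown_citations(report_text: str, chunk_ids: set[str]) -> list[str]:
    """Return cited chunk IDs that do not exist in 02_chunks.json."""
    cited = CITATION.findall(report_text)
    return sorted({cid for cid in cited if cid not in chunk_ids})


def unsupported_numbers(claim: str, chunk_text: str) -> list[str]:
    """Return numbers in a claim that do not appear verbatim in the chunk text."""
    # Drop citation markers first so the digits in "[C003]" are not counted.
    bare_claim = CITATION.sub("", claim)
    return [num for num in NUMBER.findall(bare_claim) if num not in chunk_text]
```

A non-empty result from either function flags a claim for manual review against the extracted text.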
Minimal Working Examples
PDF:
python scripts/prepare_document_context.py "./report.pdf" --output-dir "./out"
DOCX:
python scripts/prepare_document_context.py "./report.docx" --output-dir "./out"
PPTX:
python scripts/prepare_document_context.py "./slides.pptx" --output-dir "./out"