# Large Document Processing & Intelligent Text Chunking
## ⚠️ Repo Reality Check (read this first)

The real components are:

- Top-level pipeline: `LargeDocumentProcessor`
- Structure-aware parser: `AdvancedDocumentParser`
- Streaming OCR with progress: `EnhancedOCRProcessor`
- Chunker: `IntelligentTextChunker` (see the intelligent-text-chunking skill)
- Training data generation: `AITrainingDataGenerator`
- Setup helper: `scripts/setup_large_document_processing.py`

Additional constraints:

- The NWT EPUB parser exposes only `get_verse(book_num, chapter, verse)` (`nwt_epub_parser.py`); there is no `get_chapter`/`get_book`. See the bible-epub-processing skill.
- Source data lives under `config/data/` (NOT a top-level `data/`).
- Always wrap chunking calls with `protect_scripture_references`/`restore_scripture_references` from `src/utils/scripture_parser.py` when input may contain Bible references.
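The protect/chunk/restore pattern can be sketched with minimal stand-ins. The real implementations live in `src/utils/scripture_parser.py`, so the token format and reference regex below are illustrative only:

```python
import re

# Hypothetical stand-ins for the real helpers in src/utils/scripture_parser.py.
# The token scheme (__REF0__) and reference pattern are illustrative only.
REF_PATTERN = re.compile(r'\b[A-Z][a-z]+ \d+:\d+\b')  # e.g. "John 3:16"

def protect_scripture_references(text: str) -> tuple[str, dict[str, str]]:
    # Replace each reference with an opaque token so a chunker
    # cannot split it across chunk boundaries.
    refs: dict[str, str] = {}
    def _sub(m):
        token = f'__REF{len(refs)}__'
        refs[token] = m.group(0)
        return token
    return REF_PATTERN.sub(_sub, text), refs

def restore_scripture_references(text: str, refs: dict[str, str]) -> str:
    # Put the original references back after chunking.
    for token, ref in refs.items():
        text = text.replace(token, ref)
    return text
```

Chunking happens between the two calls, operating on the tokenized text.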
## Overview

This skill combines two tightly related concerns:

- Large document parsing: DOCX/PDF/EPUB ingestion with structure preservation
- Intelligent text chunking: splitting parsed text into semantically coherent pieces for AI training or RAG
## Source Files

| File | Purpose |
|---|---|
| `src/utils/nwt_epub_parser.py` | EPUB parser for NWT Bible (English + Chuukese) |
| `scripts/extract_jwpub.py` | Extract JW publication `.jwpub` archives |
| `scripts/setup_large_document_processing.py` | One-time document pipeline setup |
| `output/processed_document/` | Output directory for processed content |
## Document Processing

### Supported Formats

- DOCX via `python-docx`
- PDF via PyMuPDF (imported as `fitz`); note: `fitz==0.0.1.dev2` is NOT in requirements, use `PyMuPDF` only
- EPUB via `ebooklib` + `NWTEpubParser`
- Plain text / CSV: direct read
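A simple extension-based dispatch over the formats above can be sketched as follows (the helper and mapping are illustrative, not the repo's API):

```python
from pathlib import Path

# Illustrative mapping from file extension to the parser named above.
PARSERS = {
    '.docx': 'python-docx',
    '.pdf': 'PyMuPDF (fitz)',
    '.epub': 'ebooklib + NWTEpubParser',
    '.txt': 'direct read',
    '.csv': 'direct read',
}

def pick_parser(path: str) -> str:
    # Normalize the extension so 'Report.PDF' dispatches like 'report.pdf'.
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f'Unsupported format: {ext}')
    return PARSERS[ext]
```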
### EPUB Pattern (NWT Bible)

```python
from src.utils.nwt_epub_parser import NWTEpubParser

# Source data lives under config/data/, not a top-level data/
parser = NWTEpubParser('config/data/bible/nwt_E.epub')

# The parser exposes only get_verse(book_num, chapter, verse);
# there is no get_chapter/get_book (see the Repo Reality Check).
verse_text = parser.get_verse(43, 3, 16)  # John is book 43
```
### PDF/DOCX Pattern

```python
import fitz  # installed as PyMuPDF, imported as fitz

doc = fitz.open('large_document.pdf')
for page_num, page in enumerate(doc):
    text = page.get_text()
    # process text...
doc.close()
```
## Intelligent Text Chunking

### Strategy Selection
| Strategy | Use case |
|---|---|
| Semantic | AI training data — respect topic/paragraph boundaries |
| Structural | Documents with clear headings/sections |
| Fixed-size | RAG systems needing predictable chunk sizes |
| Sliding window | QA tasks needing context overlap |
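For the fixed-size and sliding-window strategies in the table, a minimal character-level sketch (the function name and defaults are illustrative, not the repo's API):

```python
def sliding_window_chunks(text: str, size: int = 512, stride: int = 384) -> list[str]:
    # Fixed-size windows with (size - stride) characters of overlap
    # between consecutive chunks; with stride == size this degenerates
    # to plain fixed-size chunking.
    if len(text) <= size:
        return [text]
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```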
### Implementation Pattern

```python
import re

# Sentence-boundary-aware chunking with character overlap
def chunk_text(text: str, max_chars: int = 1024, overlap: int = 100) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ''
    for sent in sentences:
        if len(current) + len(sent) > max_chars and current:
            chunks.append(current.strip())
            current = current[-overlap:] + ' ' + sent  # carry overlap into next chunk
        else:
            current += ' ' + sent
    if current.strip():
        chunks.append(current.strip())
    return chunks
```
### Chuukese-aware chunking

```python
import re

# Chuukese uses the same sentence terminators as English
SENTENCE_ENDINGS = re.compile(r'(?<=[.!?])\s+')

def detect_language(text: str) -> str:
    # Heuristic: accented vowels occur in Chuukese orthography but not in English
    has_accents = bool(re.search(r'[áéíóú]', text))
    return 'chuukese' if has_accents else 'english'
```
## Memory Efficiency

- Process large PDFs page by page rather than loading the full document into memory
- Stream EPUB chapters; do not load the entire book at once
- Write chunk output incrementally to JSONL files rather than accumulating chunks in RAM
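The incremental JSONL write can be sketched as follows (the function name and record shape are illustrative):

```python
import json

def write_chunks_jsonl(chunks, path: str) -> None:
    # Write one JSON object per line as chunks are produced, so memory
    # use stays constant regardless of document size; accepts any
    # iterable, including a generator.
    with open(path, 'w', encoding='utf-8') as f:
        for i, chunk in enumerate(chunks):
            f.write(json.dumps({'id': i, 'text': chunk}, ensure_ascii=False) + '\n')
```

`ensure_ascii=False` keeps Chuukese accented characters readable in the output file instead of escaping them.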
## Output Formats

- JSONL: one JSON object per line; best for large training datasets
- JSON array: for smaller batches consumed by the frontend
- Plain text: cleaned extracted text for inspection
## Dependencies

- `PyMuPDF==1.23.8`: PDF processing (do NOT add `fitz==0.0.1.dev2`)
- `python-docx>=1.2.0`
- `ebooklib>=0.18`
- `beautifulsoup4>=4.12.0`