pdf-brain-ingest
PDF Brain Ingest v2 (ADR-0234)
Staged artifact-chain pipeline with durable NAS storage, nomic embeddings, and workload queue orchestration.
Pipeline v2 Architecture
PDF (source, immutable)
→ Stage 1: CONVERT — opendataloader-pdf → {docId}.md (NAS artifact)
→ Stage 2: CLASSIFY + SUMMARIZE — taxonomy + LLM summary → {docId}.meta.json (NAS artifact)
→ Stage 3: CHUNK — markdown-native headings, no overlap → {docId}.chunks.jsonl (NAS artifact)
→ Stage 4: INDEX — upsert to docs + docs_chunks_v2 (nomic-embed-text-v1.5, 768-dim)
Key properties:
- Durable: artifacts on NAS RAID5, survive reboots/crashes
- Resumable: each stage checks for existing artifacts, skips if present
- Recoverable: re-run any stage from existing artifacts without re-extracting
- Observable: OTEL event per stage per book
Artifacts dir: /Volumes/three-body/docs-artifacts/{docId}/
{docId}.md— structured markdown extraction{docId}.meta.json— taxonomy, summary, metadata{docId}.chunks.jsonl— chunk records, one per line
Core Workflow
1) Preflight
joelclaw status
joelclaw docs status
Status now shows both v1 and v2 collection stats plus artifact availability.
2) Single File Ingest (v1 pipeline)
joelclaw docs add "/absolute/path/to/file.pdf"
joelclaw docs add "/absolute/path/to/file.pdf" --title "Title" --tags "tag1,tag2" --category programming
3) Single File v2 Reindex (artifact pipeline)
joelclaw docs reindex-v2 "/absolute/path/to/file.pdf"
joelclaw docs reindex-v2 "/absolute/path/to/file.pdf" --title "Title" --skip-existing
Fires docs/reindex-v2.requested → 4-stage artifact pipeline → NAS artifacts + docs_chunks_v2.
4) Batch Reindex (full library)
# Reindex all PDFs from NAS /Volumes/three-body/books/
joelclaw docs batch-reindex --skip-existing
# Reindex from existing Typesense docs collection
joelclaw docs batch-reindex --from-collection --skip-existing
Fires docs/reindex-batch.requested → scans NAS/collection → dispatches individual reindex-v2 events in batches of 10, concurrency 3.
--skip-existing skips books that already have all 3 artifacts on NAS (default true).
5) Monitor Progress
# Artifact count on NAS
ls /Volumes/three-body/docs-artifacts/ | wc -l
# v2 collection chunk count
joelclaw docs status
# OTEL events from pipeline
joelclaw otel search "docs.reindex" --hours 4
joelclaw o11y session system-bus --hours 4
# Individual run trace
joelclaw runs --count 10
joelclaw run <run-id>
6) Inspect Artifacts
# Read the extracted markdown
joelclaw docs markdown <doc-id>
# Read the summary + taxonomy metadata
joelclaw docs summary <doc-id>
# Or directly on NAS
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.md
cat /Volumes/three-body/docs-artifacts/<docId>/<docId>.meta.json | jq
wc -l /Volumes/three-body/docs-artifacts/<docId>/<docId>.chunks.jsonl
7) Retrieval from v2
# Search uses docs_chunks_v2 (nomic 768-dim) by default
joelclaw docs search "distributed consensus" --limit 8
# Context expansion
joelclaw docs context <chunk-id> --mode snippet-window
joelclaw docs context <chunk-id> --mode parent-section
joelclaw docs context <chunk-id> --mode section-neighborhood
8) Coverage Reconcile
joelclaw docs reconcile --sample 20
9) Recovery
If the batch stalls or books fail:
- Inngest retries each step automatically (default retry policy)
--skip-existingmeans re-firing the batch only processes unfinished books- Check OTEL for errors:
joelclaw otel list --level error --hours 4 - Individual retry:
joelclaw docs reindex-v2 "/path/to/failed.pdf"
Extraction Details
- Primary: opendataloader-pdf v2.0.0 (Java-based, #1 in benchmarks, 0.90 accuracy)
- Fallback: pypdf (basic text extraction if Java unavailable)
- Requires: Java 11+ (OpenJDK 25 installed on panda, PATH configured in worker start.sh)
- Speed: ~2.5s per book on M4 Pro
Embedding Model
- v2: nomic-embed-text via ollama (GPU-accelerated on M4 Pro, 768-dim, retrieval-tuned). Pre-computed at ingest time, stored as raw float[] vectors. ~150x faster than Typesense CPU auto-embed.
- v1:
ts/all-MiniLM-L12-v2— 384-dim, general-purpose, Typesense auto-embed (legacy, still indocs_chunks) - Ollama runs on panda at
localhost:11434. System-bus-worker (host process) embeds at ingest time.
Chunking Strategy (ADR-0234)
- Markdown-native heading detection (
#markers, not heuristics) - Recursive splitting within sections exceeding target tokens
- No overlap (arxiv R100-0 finding: 45% higher precision)
- Two-level hierarchy: section chunks + snippet sub-chunks
- heading_path derived from actual markdown heading levels
- Context inheritance: retrieval_text includes
[DOC: title] [SUMMARY: ...] [PATH: heading > path] [CONCEPTS: ...]
Acquisition Pipeline (aa-book → ingest)
joelclaw send pipeline/book.download -d '{
"query": "designing data-intensive applications",
"format": "pdf",
"reason": "library expansion"
}'
Downloads via aa-book → NAS backup → fires docs/ingest.requested for immediate processing.
Inngest Events
| Event | Function | Purpose |
|---|---|---|
docs/ingest.requested |
docs-ingest | v1 pipeline (single file) |
docs/reindex-v2.requested |
docs-reindex-v2 | v2 artifact pipeline (single file) |
docs/reindex-batch.requested |
docs-reindex-batch | Batch orchestrator (all PDFs) |
docs/backlog.requested |
docs-backlog | Legacy manifest-based backfill |
docs/enrich.requested |
docs-enrich | Re-enrich metadata for existing doc |
pipeline/book.download |
book-download | Acquire + ingest new book |
More from joelhooks/joelclaw
cli-design
Design and build agent-first CLIs with HATEOAS JSON responses, context-protecting output, and self-documenting command trees. Use when creating new CLI tools, adding commands to existing CLIs (joelclaw, slog), or reviewing CLI design for agent-friendliness. Triggers on 'build a CLI', 'add a command', 'CLI design', 'agent-friendly output', or any task involving command-line tool creation.
129k8s
>-
88docker-sandbox
Create, manage, and execute agent tools (claude, codex) inside Docker sandboxes for isolated code execution. Use when running agent loops, spawning tool subprocesses, or any task requiring process isolation. Triggers on "sandbox", "isolated execution", "docker sandbox", "safe agent execution", or when working on agent loop infrastructure.
86joel-writing-style
Joel's writing voice and style guide for joelclaw.com content. Use when writing, editing, or reviewing any blog post, essay, book chapter, or prose content for joelclaw.com. Also use when asked to 'write like Joel,' 'match Joel's voice,' 'draft a post,' 'write content for the blog,' or 'review this for voice.' This skill captures Joel's specific writing patterns derived from ~90,000 words of published content spanning 2012–2026. Cross-reference with copy-editing and copywriting skills for marketing-specific copy.
81task-management
Manage Joel's task system in Todoist. Triggers on: 'add a task', 'create a todo', 'what's on my list', 'today's tasks', 'what do I need to do', 'remind me to', 'inbox', 'complete', 'mark done', 'weekly review', 'groom tasks', 'what's next', or when actionable items emerge from other work. Also triggers when Joel mentions something he needs to do in passing — capture it.
54skill-review
Audit and maintain the joelclaw skill inventory. Use when checking skill health, fixing broken symlinks, finding stale skills, or running the skill garden. Triggers: 'skill audit', 'check skills', 'stale skills', 'skill health', 'skill garden', 'broken skill', 'skill review', 'fix skills', 'garden skills', or any task involving skill inventory maintenance.
49