PDF Brain Ingest (Joelclaw)

This is the joelclaw-native replacement for the legacy pdf-brain and swarm queue operations.

Use joelclaw docs and Inngest events instead of ad hoc queue workers:

  • docs/ingest.requested -> docs-ingest (ingests the file named in the event)
  • docs/backlog.requested -> batch queueing from manifest
  • docs/backlog.drive.requested -> scheduled backlog driver with queue depth gates
  • docs/ingest.janitor.requested -> stuck-run detection and recovery

Core Workflow

1) Preflight

joelclaw status
joelclaw inngest status
joelclaw docs status

If the worker registration is stale, re-sync and restart the worker:

joelclaw inngest sync-worker --restart
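The three preflight checks can be chained so that a script stops at the first failure. This is a minimal sketch, assuming each status subcommand exits non-zero when something is unhealthy (the exit-code behavior is not documented here). The helper names `run_check` and `preflight` are hypothetical and not part of the joelclaw CLI.

```shell
# Sketch: run the preflight checks in order and stop at the first failure.
# Assumption: each `joelclaw ... status` command exits non-zero when unhealthy.

run_check() {
  # "$@" is the joelclaw subcommand to run, e.g. `inngest status`.
  if ! joelclaw "$@"; then
    echo "preflight failed: joelclaw $*" >&2
    return 1
  fi
}

preflight() {
  run_check status &&
    run_check inngest status &&
    run_check docs status &&
    echo "preflight ok"
}
```

Calling `preflight` before a backfill gives a single pass/fail signal. If the Inngest check is the one that fails, the sync-worker restart above is the first thing to try.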

2) Single File Ingest

joelclaw docs add "/absolute/path/to/file.pdf"
joelclaw docs add "/absolute/path/to/file.md"

Optional metadata:

joelclaw docs add "/absolute/path/to/file.pdf" \
  --title "Readable Title" \
  --tags "manifest,catalog-fill" \
  --category programming

Supported types: pdf, md, txt.
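A small guard can catch obviously bad inputs before anything is queued. The sketch below assumes that paths must be absolute (as every example here suggests) and that extensions are matched case-insensitively, which is not confirmed by this document. `is_ingestable` and `add_doc` are hypothetical helper names.

```shell
# Sketch: validate a path locally before handing it to `joelclaw docs add`.
# Assumptions: absolute paths are required; extension case does not matter.

is_ingestable() {
  local path ext
  path=$1
  case "$path" in
    /*) ;;  # absolute path, keep going
    *) echo "not an absolute path: $path" >&2; return 1 ;;
  esac
  # Take the text after the last dot and lowercase it.
  ext=$(printf '%s' "${path##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|md|txt) return 0 ;;
    *) echo "unsupported type: $path" >&2; return 1 ;;
  esac
}

add_doc() {
  # Extra arguments (--title, --tags, --category) are passed through untouched.
  is_ingestable "$1" && joelclaw docs add "$@"
}
```

For example, `add_doc "/absolute/path/to/file.pdf" --title "Readable Title"` forwards to the CLI, while `add_doc notes.md` is rejected locally.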

3) Bulk Backfill From Manifest

Queue a controlled batch:

joelclaw send docs/backlog.requested -d '{
  "maxEntries": 24,
  "booksOnly": true,
  "onlyMissing": true,
  "includePodcasts": false,
  "idempotencyPrefix": "manual"
}'

Alternatively, let the backlog driver decide whether to queue anything, based on the current queue depth:

joelclaw send docs/backlog.drive.requested -d '{
  "reason": "manual backfill kick",
  "maxEntries": 24,
  "force": false
}'
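When the batch size changes from run to run, it can help to generate the payload instead of editing JSON by hand. The sketch below reuses the field values from the docs/backlog.requested example and only varies maxEntries. `backlog_payload` and `queue_backlog` are hypothetical helper names.

```shell
# Sketch: build the docs/backlog.requested payload with a variable batch size.
# All other fields are fixed to the values used in the manual example.

backlog_payload() {
  local max_entries
  max_entries=${1:-24}
  # Accept only a plain non-negative integer without leading zeros,
  # so the value is a well-formed JSON number.
  case "$max_entries" in
    ''|*[!0-9]*|0?*)
      echo "maxEntries must be a non-negative integer: $max_entries" >&2
      return 1
      ;;
  esac
  printf '{"maxEntries":%s,"booksOnly":true,"onlyMissing":true,"includePodcasts":false,"idempotencyPrefix":"manual"}\n' "$max_entries"
}

queue_backlog() {
  local payload
  payload=$(backlog_payload "$1") || return 1
  joelclaw send docs/backlog.requested -d "$payload"
}
```

`queue_backlog 8` sends a batch of eight entries. Calling it with no argument falls back to 24, the size used in the example.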

4) Monitor + Verify

joelclaw runs --count 20 --hours 1
joelclaw run <run-id>
joelclaw docs list --limit 20
joelclaw docs show <doc-id>
joelclaw docs search "your query"
joelclaw docs context <chunk-id> --mode snippet-window

5) Coverage Reconcile

joelclaw docs reconcile --sample 20

Use the content_equivalent coverage to detect false-missing churn: documents reported as missing only because the same content already exists under an aliased path or category.

6) OTEL Verification

joelclaw otel search "docs.file.validated" --hours 1
joelclaw otel search "docs.taxonomy.classified" --hours 1
joelclaw otel search "docs.chunks.indexed" --hours 1
joelclaw otel search "docs.path.aliases.updated" --hours 24
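The four searches can be run in one pass with a labeled header per event, which makes it easier to see which stage has gone quiet. This is a sketch only: `otel_docs_check` is a hypothetical helper, and the time windows are copied from the commands above.

```shell
# Sketch: run the OTEL searches back to back, one labeled section per event.
# Each entry is "<event name>:<hours to look back>".

otel_docs_check() {
  local spec event hours
  for spec in \
    "docs.file.validated:1" \
    "docs.taxonomy.classified:1" \
    "docs.chunks.indexed:1" \
    "docs.path.aliases.updated:24"
  do
    event=${spec%%:*}  # text before the colon
    hours=${spec##*:}  # text after the colon
    echo "== $event (last ${hours}h) =="
    joelclaw otel search "$event" --hours "$hours"
  done
}
```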

7) Recovery / Maintenance

joelclaw send docs/ingest.janitor.requested -d '{"reason":"manual janitor sweep"}'
joelclaw docs enrich <doc-id>
joelclaw docs reindex --doc <doc-id>
joelclaw docs reindex

Legacy Mapping (Old -> Joelclaw)

  • pdf-brain add <file> --enrich -> joelclaw docs add <absolute-path>
  • pdf-brain ingest <dir> --enrich -> joelclaw send docs/backlog.requested -d '{...}'
  • swarm queue submit pdf-ingest '{"path":"..."}' -> joelclaw docs add <absolute-path>
  • pdf-brain-worker (nice -n10, concurrency 1) -> built into docs-ingest + backlog driver + janitor
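The mapping above replaces `<file>` with `<absolute-path>`, so scripts that passed relative paths to the old commands may need a conversion step. A minimal sketch, assuming a relative path should resolve against the current directory. `to_absolute` and `pdf_brain_add` are hypothetical names, and the shim does not normalize `..` segments.

```shell
# Sketch: resolve a possibly relative path before calling `joelclaw docs add`.

to_absolute() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;              # already absolute
    *)  printf '%s/%s\n' "$PWD" "$1" ;;    # resolve against the current directory
  esac
}

# Hypothetical shim for old habits: pdf_brain_add notes.md
pdf_brain_add() {
  joelclaw docs add "$(to_absolute "$1")"
}
```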

Acquisition Handoff (aa-book -> Inngest, end to end)

Use the event workflow so acquisition, inference, download, and docs queueing stay durable:

joelclaw send pipeline/book.download -d '{
  "query": "designing data-intensive applications",
  "format": "pdf",
  "reason": "memory backfill"
}'

Behavior:

  • Runs aa-book search
  • Uses pi inference (Sonnet 4.6 model alias from the system-bus model registry) to select the MD5 to download
  • Runs aa-book download <md5> <outputDir> --keep-local
  • Attempts a non-fatal NAS backup to /volume1/home/joel/books/<year>/... via SSH/SCP
  • Emits docs/ingest.requested with the local filePath for immediate ingest and nasPath when backup succeeds
  • Emits pipeline/book.downloaded

Optional direct MD5 mode:

joelclaw send pipeline/book.download -d '{
  "md5": "0123456789abcdef0123456789abcdef",
  "outputDir": "/Users/joel/clawd/data/pdf-brain/incoming"
}'
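In direct MD5 mode there is no search step to catch a mistyped hash, so a local format check is cheap insurance. The sketch below assumes an MD5 is always written as 32 hexadecimal characters, as in the example, and that the output directory contains no characters needing JSON escaping. `is_md5` and `download_by_md5` are hypothetical helper names.

```shell
# Sketch: check the MD5 format locally before sending pipeline/book.download.

is_md5() {
  # Exactly 32 characters, all hexadecimal.
  [ "${#1}" -eq 32 ] || return 1
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  return 0
}

download_by_md5() {
  local md5 output_dir
  md5=$1
  output_dir=$2
  if ! is_md5 "$md5"; then
    echo "not a 32-character hex MD5: $md5" >&2
    return 1
  fi
  # Assumption: output_dir has no quotes or backslashes, so no JSON escaping is done.
  joelclaw send pipeline/book.download -d "{\"md5\":\"$md5\",\"outputDir\":\"$output_dir\"}"
}
```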

For full operator details and troubleshooting traces, see:

  • references/operator-guide.md

First seen: Feb 27, 2026