Literature Engineer (evidence collector)

Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.

This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports or for network access, not to fabricate records.

Load Order

Always read:

references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

assets/domain_packs/llm_agents.json — pinned classic/survey arXiv IDs for LLM agent topics

Script Boundary

Use scripts/run.py only for:

multi-route offline import, normalization, and provenance tagging
online arXiv/Semantic Scholar API retrieval
snowball expansion and deduplication
retrieval report generation

Do not treat run.py as the place for:

hardcoded pinned arXiv ID lists (use domain packs)
hardcoded topic detection logic (use domain packs)

Inputs

queries.md
- keywords, exclude, max_results, time window
Optional offline sources (any combination; all are merged):
- papers/import.(csv|json|jsonl|bib)
- papers/arxiv_export.(csv|json|jsonl|bib)
- papers/imports/*.(csv|json|jsonl|bib)
Optional snowball exports (offline):
- papers/snowball/*.(csv|json|jsonl|bib)
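
The input layout above can be sketched as a small discovery helper. This is an illustrative sketch only, not the actual logic in run.py: the function name `discover_offline_inputs` and the route labels `"import"` and `"snowball"` are assumptions made for the example; only the paths and extensions come from the list above.

```python
from pathlib import Path

# Extensions and locations are taken from the Inputs list; everything else is hypothetical.
EXTENSIONS = ("csv", "json", "jsonl", "bib")


def discover_offline_inputs(workspace):
    """Return (route, path) pairs for every offline export found under <workspace>/papers."""
    papers = Path(workspace) / "papers"
    found = []
    for ext in EXTENSIONS:
        # Fixed-name exports: papers/import.* and papers/arxiv_export.*
        for stem in ("import", "arxiv_export"):
            candidate = papers / f"{stem}.{ext}"
            if candidate.is_file():
                found.append(("import", candidate))
        # Directory-based exports: papers/imports/* and papers/snowball/*
        for subdir, route in (("imports", "import"), ("snowball", "snowball")):
            directory = papers / subdir
            if directory.is_dir():
                for candidate in sorted(directory.glob(f"*.{ext}")):
                    found.append((route, candidate))
    return found
```

Files with other extensions are ignored, and a missing directory simply contributes nothing, which matches the "any combination; all are merged" behavior described above.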

Outputs

papers/papers_raw.jsonl
- 1 record per line; minimum fields:
  - title (str), authors (list[str]), year (int|""), url (str)
  - stable identifier(s): arxiv_id and/or doi
  - abstract (str; may be empty in offline mode)
  - source (str) + provenance (list[dict])
papers/papers_raw.csv (human scan)
papers/retrieval_report.md (route counts, missing-meta stats, next actions)
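
To make the minimum record shape concrete, here is a hedged sketch of one record and a validator for it. The field names come from the list above; the placeholder values, the keys inside each `provenance` entry (`route`, `file`), and the helper `validate_record` are assumptions for illustration and may differ from what run.py actually writes.

```python
import json

# Fields named in the Outputs list; arxiv_id/doi are checked separately.
REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract", "source", "provenance")


def validate_record(record):
    """Return a list of problems; an empty list means the minimum schema is met."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if not (record.get("arxiv_id") or record.get("doi")):
        problems.append("no stable identifier (arxiv_id or doi)")
    if not isinstance(record.get("authors", []), list):
        problems.append("authors must be a list of strings")
    year = record.get("year", "")
    if not (isinstance(year, int) or year == ""):
        problems.append('year must be an int or ""')
    if not record.get("provenance"):
        problems.append("provenance is empty")
    return problems


# Placeholder record: every value below is made up for the example.
example = {
    "title": "Example Paper Title",
    "authors": ["First Author", "Second Author"],
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "arxiv_id": "0000.00000",
    "abstract": "",  # may be empty in offline mode
    "source": "offline_import",
    "provenance": [{"route": "import", "file": "papers/imports/example.bib"}],
}

line = json.dumps(example)  # one record per line in papers_raw.jsonl
```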

Workflow (multi-route)

Offline-first merge: ingest all available offline exports (and label provenance per file).
Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
Report: write a concise retrieval report with coverage buckets and missing-meta counts.
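
The normalize + dedupe step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's real implementation: the key scheme (arXiv ID with its version suffix stripped, else lowercased DOI) and the helper names `canonical_key` and `merge_records` are hypothetical.

```python
import re


def canonical_key(record):
    """Pick a dedupe key: arXiv ID without its version suffix, else lowercased DOI, else None."""
    arxiv_id = (record.get("arxiv_id") or "").strip().lower()
    if arxiv_id:
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv_id)
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    return None


def merge_records(records):
    """Merge duplicates by canonical key, unioning provenance and filling empty fields."""
    merged = {}
    unkeyed = []  # records without a stable ID are passed through unmerged
    for record in records:
        key = canonical_key(record)
        if key is None:
            unkeyed.append(record)
            continue
        if key not in merged:
            # Copy so the caller's record and provenance list are not mutated.
            merged[key] = dict(record, provenance=list(record.get("provenance", [])))
            continue
        kept = merged[key]
        for entry in record.get("provenance", []):
            if entry not in kept["provenance"]:
                kept["provenance"].append(entry)
        # Backfill fields the first copy left empty (e.g. an abstract missing offline).
        for field, value in record.items():
            if field != "provenance" and not kept.get(field) and value:
                kept[field] = value
    return list(merged.values()) + unkeyed
```

One limitation of this sketch: a record carrying only a DOI will not merge with a copy of the same paper that carries only an arXiv ID, so a real pipeline would need a cross-reference between the two identifier types.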

Quality checklist

Candidate pool size target met (A150++: ≥1200) without fabrication.
Each record has a stable identifier (arxiv_id or doi, plus url).
Each record has provenance: which route/file/API produced it.

Script

Quick Start

python .codex/skills/literature-engineer/scripts/run.py --help

All Options

See python .codex/skills/literature-engineer/scripts/run.py --help.
Reads retrieval config from queries.md.
Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
Online expansion requires network: use --online and/or --snowball.
Online retrieval is best-effort: the arXiv API can be flaky in some environments, so the script will also attempt a Semantic Scholar route when needed.
For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still bootstrap itself without manual exports.
When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient; the pipeline should never fabricate records to fill the gap.

Examples

Offline imports only:
- Put exports under papers/imports/ then run:
  - python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
Explicit offline inputs (multi-route):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
Online arXiv retrieval (needs network):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
Snowballing (needs network unless you provide offline snowball exports):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:

papers/papers_raw.jsonl contains far fewer records than the target; later stages will then fail on mapping/bindings and citation density.

Causes:

Only a small offline export was provided.
Network is blocked so online retrieval/snowballing can't run.

Solutions:

Provide additional exports under papers/imports/ (multiple routes/queries).
Provide snowball exports under papers/snowball/.
Enable network and rerun with --online --snowball.

Issue: many records missing stable IDs

Symptom:

Report shows many entries with empty arxiv_id and doi.

Solutions:

Prefer arXiv/OpenReview/ACL exports that include stable IDs.
If you have network, rerun with --online to backfill arXiv IDs.
Filter out ID-less entries before downstream citation generation.
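
The last solution, filtering ID-less entries, might look like the sketch below. It assumes the JSONL layout described under Outputs; the helper name `split_by_stable_id` is hypothetical and not part of run.py.

```python
import json


def split_by_stable_id(lines):
    """Split JSONL lines into (with_id, without_id) lists of records."""
    with_id, without_id = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # A record counts as stable if either identifier is non-empty.
        if record.get("arxiv_id") or record.get("doi"):
            with_id.append(record)
        else:
            without_id.append(record)
    return with_id, without_id
```

Passing an open file handle for papers/papers_raw.jsonl as `lines` works, since iterating a text file yields one line at a time. Keeping the rejected records in a separate list rather than discarding them makes it easy to report how many were dropped and to backfill them later.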
