literature-engineer
SKILL.md
Literature Engineer (evidence collector)
Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.
This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports / enable network, not to fabricate.
Load Order
Always read:
- references/domain_pack_overview.md: how domain packs drive topic-specific behavior
Domain packs (loaded by topic match):
- assets/domain_packs/llm_agents.json: pinned classic/survey arXiv IDs for LLM agent topics
Script Boundary
Use scripts/run.py only for:
- multi-route offline import, normalization, and provenance tagging
- online arXiv/Semantic Scholar API retrieval
- snowball expansion and deduplication
- retrieval report generation
Do not treat run.py as the place for:
- hardcoded pinned arXiv ID lists (use domain packs)
- hardcoded topic detection logic (use domain packs)
Inputs
- queries.md: keywords, exclude, max_results, time window
- Optional offline sources (any combination; all are merged):
  - papers/import.(csv|json|jsonl|bib)
  - papers/arxiv_export.(csv|json|jsonl|bib)
  - papers/imports/*.(csv|json|jsonl|bib)
- Optional snowball exports (offline):
  - papers/snowball/*.(csv|json|jsonl|bib)
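The input locations above can be collected with a few glob calls. This is a minimal sketch, not the logic in run.py: the function name discover_offline_inputs is hypothetical, and it assumes the workspace contains a papers/ directory laid out as listed.

```python
from pathlib import Path

# File types named in the Inputs section.
SUFFIXES = {".csv", ".json", ".jsonl", ".bib"}


def discover_offline_inputs(workspace):
    """Return offline export files under <workspace>/papers in a stable order.

    Hypothetical helper for illustration; run.py may discover inputs differently.
    """
    papers = Path(workspace) / "papers"
    candidates = []
    if papers.is_dir():
        # Fixed-name exports: papers/import.* and papers/arxiv_export.*
        for stem in ("import", "arxiv_export"):
            candidates.extend(papers.glob(f"{stem}.*"))
    # Directory-based exports, including offline snowball exports.
    for sub in ("imports", "snowball"):
        folder = papers / sub
        if folder.is_dir():
            candidates.extend(folder.glob("*"))
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() in SUFFIXES
    )
```

Files with other extensions are ignored, and missing directories simply contribute nothing, which matches the "any combination" wording above.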
Outputs
- papers/papers_raw.jsonl
  - 1 record per line; minimum fields:
    - title (str), authors (list[str]), year (int|""), url (str)
    - stable identifier(s): arxiv_id and/or doi
    - abstract (str; may be empty in offline mode)
    - source (str) + provenance (list[dict])
- papers/papers_raw.csv (human scan)
- papers/retrieval_report.md (route counts, missing-meta stats, next actions)
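To make the minimum fields concrete, here is a sketch of one record and a presence check. The record values are placeholders, not a real paper. The keys inside each provenance entry (route, file) are assumptions for illustration, since the source only states that provenance is a list of dicts.

```python
import json

# Minimum fields listed for papers/papers_raw.jsonl.
REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract", "source", "provenance")


def missing_fields(record):
    """Return names of absent minimum fields, plus a marker when no stable ID is set."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    # year and abstract may be empty, so only presence is checked above.
    if not (record.get("arxiv_id") or record.get("doi")):
        missing.append("arxiv_id|doi")
    return missing


# Placeholder record; the provenance keys are hypothetical.
example = {
    "title": "Placeholder title",
    "authors": ["A. Author"],
    "year": 2023,
    "url": "https://arxiv.org/abs/0000.00000",
    "arxiv_id": "0000.00000",
    "abstract": "",
    "source": "offline",
    "provenance": [{"route": "import", "file": "papers/imports/example.bib"}],
}

# One JSON object per line is what a .jsonl file holds.
line = json.dumps(example, ensure_ascii=False)
```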
Workflow (multi-route)
- Offline-first merge: ingest all available offline exports (and label provenance per file).
- Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
- Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
- Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
- Report: write a concise retrieval report with coverage buckets and missing-meta counts.
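The normalize-and-dedupe step can be sketched as follows. This illustrates the idea (canonical key, then union of provenance) under assumed rules and is not the script's implementation: the key scheme, the version-suffix stripping, and both function names are hypothetical.

```python
import re


def canonical_key(record):
    """Build a dedupe key: arXiv ID first, then DOI, then normalized title."""
    arxiv_id = (record.get("arxiv_id") or "").strip().lower()
    if arxiv_id:
        # Treat versions of the same preprint (e.g. a trailing "v2") as one paper.
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv_id)
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    return "title:" + title


def merge_records(records):
    """Merge duplicates, unioning provenance and backfilling empty fields."""
    merged = {}
    for rec in records:
        key = canonical_key(rec)
        if key not in merged:
            # Copy so the caller's records are not mutated.
            merged[key] = {**rec, "provenance": list(rec.get("provenance", []))}
            continue
        kept = merged[key]
        for entry in rec.get("provenance", []):
            if entry not in kept["provenance"]:
                kept["provenance"].append(entry)
        for field, value in rec.items():
            if field != "provenance" and kept.get(field) in ("", None, []) and value:
                kept[field] = value
    return list(merged.values())
```

One limitation of this sketch: a record carrying only a DOI and another carrying only an arXiv ID for the same paper get different keys and are not merged.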
Quality checklist
- Candidate pool size target met (A150++: ≥1200) without fabrication.
- Each record has a stable identifier (arxiv_id or doi, plus url).
- Each record has provenance: which route/file/API produced it.
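The checklist can be turned into a small pool-level check. This is a sketch under the assumption that records are already loaded as dicts. The function name check_pool and the shape of the returned summary are hypothetical; the default target of 1200 comes from the checklist above.

```python
def check_pool(records, target=1200):
    """Summarize the quality checklist for an in-memory candidate pool."""
    missing_id = sum(1 for r in records if not (r.get("arxiv_id") or r.get("doi")))
    missing_url = sum(1 for r in records if not r.get("url"))
    missing_provenance = sum(1 for r in records if not r.get("provenance"))
    return {
        "size": len(records),
        "size_ok": len(records) >= target,
        "missing_id": missing_id,
        "missing_url": missing_url,
        "missing_provenance": missing_provenance,
    }
```

A pool passes when size_ok is true and the three missing counts are zero. If it does not pass, the intended response is to supply more exports or enable network, not to pad the pool.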
Script
Quick Start
python .codex/skills/literature-engineer/scripts/run.py --help
All Options
- See python .codex/skills/literature-engineer/scripts/run.py --help.
- Reads retrieval config from queries.md.
- Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
- Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
- Online expansion requires network: use --online and/or --snowball.
- Online retrieval is best-effort: the arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
- For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
- If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.
- When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
Examples
- Offline imports only:
  - Put exports under papers/imports/ then run:
    python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
- Explicit offline inputs (multi-route):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
- Online arXiv retrieval (needs network):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
- Snowballing (needs network unless you provide offline snowball exports):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball
Troubleshooting
Issue: can't reach ≥1200 papers
Symptom:
- papers/papers_raw.jsonl size is far below target; later stages will fail on mapping/bindings and citation density.
Causes:
- Only a small offline export was provided.
- Network is blocked so online retrieval/snowballing can't run.
Solutions:
- Provide additional exports under papers/imports/ (multiple routes/queries).
- Provide snowball exports under papers/snowball/.
- Enable network and rerun with --online --snowball.
Issue: many records missing stable IDs
Symptom:
- Report shows many entries with empty arxiv_id and doi.
Solutions:
- Prefer arXiv/OpenReview/ACL exports that include stable IDs.
- If you have network, rerun with --online to backfill arXiv IDs.
- Filter out ID-less entries before downstream citation generation.
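The last solution, filtering out ID-less entries, could look like the sketch below. It assumes the JSONL layout described under Outputs. The function names and the idea of writing to a separate destination file are hypothetical and are not documented options of run.py.

```python
import json
from pathlib import Path


def has_stable_id(record):
    """True when the record carries a non-empty arxiv_id or doi."""
    arxiv_id = (record.get("arxiv_id") or "").strip()
    doi = (record.get("doi") or "").strip()
    return bool(arxiv_id or doi)


def filter_jsonl(src, dst):
    """Copy records with a stable ID from src to dst; return (kept, dropped)."""
    kept = dropped = 0
    with Path(src).open(encoding="utf-8") as fin, \
            Path(dst).open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if has_stable_id(record):
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Writing to a new file rather than overwriting papers/papers_raw.jsonl keeps the dropped entries available in case a later online run can backfill their IDs.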