literature
Literature Skill
CRITICAL RULE: Every citation must be verified to exist before inclusion. Never include a paper you cannot find via web search. Hallucinated citations are worse than no citations.
DOI INTEGRITY RULE: Every DOI must be programmatically verified before entering any .bib file. Sub-agents hallucinate plausible-looking DOIs that resolve to wrong papers (e.g., correct journal prefix, wrong suffix). The ONLY reliable verification is scholarly_verify_dois with title-matching (see Phase 4). A DOI that resolves to a different title than expected is WRONG — treat it the same as a missing DOI.
CITATION KEY RULE: ALWAYS use Better BibTeX-format keys (e.g., Author2016-xx). When merging into an existing .bib, match existing keys. Never generate custom keys (AuthorYear, AuthorKamenica2017, etc.) or retain non-standard keys unless the user explicitly says otherwise.
Python: Always use uv run python. Never bare python, python3, pip, or pip3.
LIBRARY-FIRST RULE: ALWAYS check both Zotero (refpile MCP) and Paperpile (paperpile MCP) BEFORE any external search. Call mcp__refpile__search_library and mcp__paperpile__search_library for the topic in Phase 1. Do not skip this even if no .bib file exists yet. Papers already in either library should be reused, not re-discovered.
PREPRINT RULE: Always prefer the published version. If a paper is found on arXiv, SSRN, NBER, or any working paper series, search for a published journal/conference version using scholarly_search. Only cite a preprint if no published version can be found. This applies at every phase: Phase 2 (discovery), Phase 4 (verification), and Phase 6b (bib-validate runs the full preprint staleness check from bib-validate/references/preprint-check.md).
Comprehensive academic literature workflow: discover, verify, organize, synthesize. Uses parallel sub-agents to search multiple sources, verify citations, and fetch PDFs concurrently.
Shared References
- Concept validation gate:
shared/concept-validation-gate.md— validate concept before synthesis - Escalation protocol:
shared/escalation-protocol.md— escalate when research question is vague
When to Use
- Starting a new research project
- Writing a literature review section
- Building a reading list on a topic
- Finding specific citations
- Creating annotated bibliographies
Architecture: Orchestrator + Sub-Agents
You (orchestrator)
├── Phase 0: Session log & compact (mandatory — /session-log)
├── Phase 1: Pre-search check (direct — no sub-agent)
├── Phase 2: Parallel search (2-3 Explore agents)
├── Phase 2b: CLI Council search (optional — multi-model recall via cli-council)
├── Phase 3: Deduplicate + rank (direct — no sub-agent)
├── Phase 4: Parallel verification (general-purpose agents, batches of 5)
├── Phase 5: Parallel PDF download (Bash agents)
├── Phase 6: Assemble .bib (direct — no sub-agent)
├── Phase 6c: Sync to reference managers (Paperpile + Zotero via MCP)
└── Phase 7: Synthesize narrative (direct, or cli-council for multi-model synthesis)
Key principle: Sub-agents handle independent, parallelizable work. Merging, deduplication, and synthesis stay with you because they need the full picture.
Full agent prompt templates for all phases: references/agent-templates.md
Phase 0: Session Log & Compact (Mandatory)
Literature searches are context-heavy. Always run /session-log before starting to create a recovery checkpoint.
Phase 1: Pre-Search Check (Direct)
Check for existing .bib files in project root, /references, /bib, /bibliography:
- Parse existing entries to avoid duplicates and understand context
- Identify gaps — note if bibliography skews toward certain years/methods
- Compile list of existing citation keys to pass to sub-agents
- MANDATORY: Check Zotero library (active write target) — call
mcp__refpile__search_libraryfor the search topic. This finds papers the user already has, preventing re-discovery of known work. Mark these as ALREADY IN ZOTERO and reuse their citation keys. If refpile MCP is unavailable, log a warning and continue — but always attempt the call. - MANDATORY: Check Paperpile library (read-only cross-reference) — call
mcp__paperpile__search_libraryfor the search topic. Also callmcp__paperpile__get_items_by_labelif a relevant folder exists. Mark matches as ALREADY IN PAPERPILE. Items in Paperpile but not Zotero are flagged asMIGRATE_TO_ZOTEROcandidates. If Paperpile MCP is unavailable, log a warning and continue — but always attempt the call. - Resolve topic collection — read
zotero-collections.mdto find the collection key for the current topic (seeshared/reference-resolution.mdfor resolution logic). This key is used in Phase 6c for filing. - Check source availability — call
scholarly_source_status(bibliography MCP) to see which sources are active (OpenAlex always; Scopus and WoS if API keys are set). Report this so search agents know what coverage to expect.
Steps 4 and 5 are NOT optional. Every literature search must check both reference managers before external discovery. This prevents re-discovering papers already in the library and identifies migration candidates early.
Phase 2: Parallel Search (Sub-Agents)
MCP pre-fetch (main context, before spawning agents): Call these bibliography MCP tools from the main context before spawning agents. MCP tools are not available inside sub-agents — they are permission-scoped to the main conversation context only. Write results to /tmp/lit-search/.
scholarly_search— cross-source keyword search (OpenAlex + S2 + Scopus + WoS). Write to/tmp/lit-search/bibliography-results.json.scholarly_similar_works— ML-based recommendations (powered by S2 Recommendations API). Pass the topic description as text to find semantically related papers beyond keyword matches. Write to/tmp/lit-search/similar-results.json.scholarly_author_papers— if key authors are known, fetch their publication lists. Write to/tmp/lit-search/author-results.json.
Spawn 2-3 Explore agents in parallel in a single message, one per source. Read the full prompt templates from references/agent-templates.md.
Available search agents:
- Google Scholar — broad academic search via web (no MCP needed)
- Cross-Source via pre-fetched biblio data (recommended) — reads
/tmp/lit-search/bibliography-results.jsonand/tmp/lit-search/similar-results.json(pre-fetched by the orchestrator) and supplements with WebSearch - Semantic Scholar / arXiv (optional) — CS/ML focused, useful when topic has strong CS overlap (no MCP needed)
- Domain-specific (optional) — SSRN, NBER, specific journals (no MCP needed)
The MCP calls happen in the main context (Phase 2 pre-fetch), not inside sub-agents. Sub-agents read the pre-fetched results and supplement with web search.
Phase 2b: CLI Council Search (Optional)
Multi-model literature search via cli-council — runs the same query through Gemini, Codex, and Claude for maximum recall. Use for broad reviews (20+ papers) or interdisciplinary topics.
Full invocation, prompt template, and post-processing: references/cli-council-search.md
Phase 2.5: Snowball Search (Optional — Main Context)
After Phase 2 results are merged, use S2's citation graph to expand the candidate pool via snowballing. This finds seminal papers (backward) and recent follow-ups (forward) that keyword search misses.
- Identify seed papers — pick the 3-5 most relevant papers from Phase 2 results (highest citation count + relevance)
- Forward snowball — call
scholarly_citationsfor each seed to find papers that cite it. Useful for finding recent work building on foundational papers. - Backward snowball — call
scholarly_referencesfor each seed to find papers it cites. Useful for finding seminal/foundational works. - Filter — deduplicate against Phase 2 results, keep only papers with ≥5 citations (avoid noise)
- Add to candidate pool — merge into the main list before Phase 3 ranking
When to use: Literature reviews, broad topic surveys, or when Phase 2 returned <15 unique papers. Skip for narrow/targeted searches where the initial results are sufficient.
Paper detail enrichment: For top candidates, call scholarly_paper_detail to get TLDR summaries (one-line AI-generated descriptions) — useful for rapid screening without reading abstracts.
Phase 3: Deduplicate and Rank (Direct)
- Merge results from all search agents (Phase 2 + Phase 2b if used)
- Remove duplicates — match on title similarity and DOI
- Rank by relevance, citation count, and recency
- Select top N to verify (typically 25-30 candidates for 20-25 verified)
- Assign batches of ~5 for verification
Phase 4: Parallel Verification (Sub-Agents)
Step 1 — Batch DOI pre-verification via MCP: Collect all DOIs from Phase 3 candidates and call scholarly_verify_dois (bibliography MCP). This checks each DOI against all enabled sources (OpenAlex, Scopus, WoS). For each result:
- VERIFIED (2+ sources): Check that the returned title matches the expected paper. If the title doesn't match, the DOI is wrong — flag as DOI MISMATCH and find the correct DOI in Step 2.
- SINGLE_SOURCE: Needs manual verification — the DOI may be real but unconfirmed.
- NOT_FOUND: DOI is likely hallucinated. Find the correct DOI in Step 2.
Title-matching is mandatory. scholarly_verify_dois returns the title each DOI actually resolves to. Compare this against the title you expect. DOIs that are off by one character in the suffix (e.g., 02387 vs 02366, 2014.01.014 vs 2014.03.013) are the most common hallucination pattern — they resolve to real papers in the same journal but with different content.
Step 2 — Find correct DOIs for flagged papers: For any paper where the DOI was wrong, missing, or single-source, use these methods in order of reliability:
- Crossref API (most reliable):
curl -sL "https://api.crossref.org/works?query.bibliographic=[URL-encoded title+author]&rows=3"— returns the actual DOI from publisher metadata. scholarly_searchwith exact title — searches OpenAlex/Scopus/WoS for the paper.- Web search as last resort — but DOIs from web search must still be verified via
scholarly_verify_doisbefore use.
Step 3 — Manual verification for remaining papers: Spawn multiple general-purpose agents in parallel, each verifying ~5 papers. Read the full verification template from references/agent-templates.md. Include the Crossref instruction in the agent prompt — agents must use Crossref API (curl) for DOI lookup, not reconstruct DOIs from memory. Do NOT instruct sub-agents to call MCP tools (scholarly_search, scholarly_verify_dois) — MCP tools are not available in sub-agents. Sub-agents should use Crossref API and WebSearch/WebFetch only.
Key rules enforced by the template:
- DOI verification is mandatory (resolve and confirm)
- ALL authors must be listed (never "et al." in metadata)
- Preprint check: always search for published version; use
scholarly_searchMCP tool to find published versions of preprints - Results: VERIFIED / NOT FOUND / METADATA MISMATCH
Step 4 — Final DOI gate: Before proceeding to Phase 5/6, run scholarly_verify_dois one final time on ALL DOIs that will enter the .bib. This is the hard gate — no DOI enters a bibliography without passing this check with a matching title. Papers without DOIs (working papers, book chapters, old pre-DOI articles) are acceptable but must be explicitly flagged as % NO DOI in the .bib.
After all return: collect VERIFIED, drop NOT FOUND, check for remaining duplicates.
Phase 5: Parallel PDF Download (Sub-Agents)
Spawn Bash agents in parallel, 3-5 papers each. Read template from references/agent-templates.md. Best-effort — many papers are behind paywalls.
Phase 6: Assemble Bibliography (Direct)
Two outputs required:
docs/literature-review/literature_summary.bib— always created, standalone, self-contained- Project canonical bib (e.g.
paper/references.bib) — merge into it if it exists
BibTeX Format
@article{AuthorYear,
author = {Last, First and Last, First},
title = {Full Title},
journal = {Journal Name},
year = {2024},
volume = {XX},
pages = {1--20},
doi = {10.1000/example},
abstract = {Abstract text here.}
}
Rules:
- Citation keys: use Better BibTeX-format keys (e.g.,
Author2016-xx). If merging into an existing.bib, match the key format already in use. Never generateAuthorYearkeys. - Reuse existing Zotero citation keys — for entries marked ALREADY IN ZOTERO in Phase 1, use the Zotero
citationKeydirectly. Do not generate a new key. - Only VERIFIED papers — no METADATA MISMATCH entries
- List ALL authors explicitly — never "et al." in BibTeX
- Include abstracts when available
- S2 BibTeX seed: Call
scholarly_paper_detailfor each verified paper to get pre-formatted BibTeX via thecitationStylesfield. Use as a starting template, then enrich with missing fields (abstract, pages, volume) and correct the citation key to BBT format. This reduces manual entry errors.
Phase 6b: Validate Bibliography (Mandatory)
After assembling the .bib, always run /bib-validate. The Phase 4 verification checks that papers exist, but /bib-validate catches a different class of issues:
- Missing required BibTeX fields (journal, volume, pages)
- Preprint staleness (arXiv paper now published in a journal)
- Missing or incorrect DOIs
- Author formatting problems ("et al." in author field, corporate names needing braces)
- Unused entries and possible typos
This is not optional — every time new entries are added to a .bib file, run the validation before considering the bibliography complete.
Phase 6c: Sync to Reference Managers
After assembling and validating the .bib, sync new references to Zotero (active write target) and cross-reference with Paperpile (read-only). Handles migration candidates and post-run maintenance.
Full steps: references/reference-manager-sync.md
Phase 7: Synthesize Narrative (Direct or CLI Council)
- Identify themes — group papers by approach, finding, or debate
- Map intellectual lineage — how did thinking evolve?
- Note current debates — where do researchers disagree?
- Find gaps — what's missing?
Output types: narrative summary (LaTeX), literature deck, annotated bibliography, concise field synthesis.
Concise Field Synthesis (~400 words)
When the user asks for a "quick synthesis", "field overview", or "what does the literature say", produce a tight ~400-word synthesis instead of a full narrative. No paper-by-paper summaries — write about the field, not individual papers.
Structure:
- What the field collectively believes — established consensus (2-3 sentences)
- Where researchers disagree — active debates with camps identified (2-3 sentences)
- What has been proven — findings with strong, replicated evidence (2-3 sentences)
- The single most important unanswered question — one question, why it matters, why it's hard (2-3 sentences)
Cite papers parenthetically (Author, Year) but never summarise individual papers. The goal is a helicopter view that a newcomer could read in 2 minutes and understand where the field stands.
[VERIFY] Citation Tags
When synthesising, mark uncertain attributions with [VERIFY] tags for later resolution:
Meraz and Papacharissi (2013) argue that gatekeeping power shifted
from institutional positions to network centrality [VERIFY: exact claim on p. 12?].
- Drafting tier: [VERIFY] tags are acceptable — resolve before finalising
- Publication tier: All [VERIFY] tags must be resolved (read the actual source)
- Run
/bib-validateto catch any remaining [VERIFY] tags before submission
Multi-Model Synthesis (Optional)
For comprehensive literature reviews, run the synthesis through cli-council to get three independent interpretations of the literature landscape. Different models identify different themes, debates, and gaps.
cd "packages/cli-council"
uv run python -m cli_council \
--prompt-file /tmp/lit-synthesis-prompt.txt \
--context-file /tmp/lit-papers.txt \
--output-md /tmp/lit-synthesis-report.md \
--chairman claude \
--timeout 180
Where --context-file contains the verified paper list with titles, abstracts, and metadata, and the prompt asks for thematic grouping, intellectual lineage, and gap identification. The chairman synthesises three independent narratives into one.
Output Structure
project/
├── docs/
│ ├── literature-review/
│ │ ├── literature_summary.md # Thematic narrative (always)
│ │ └── literature_summary.bib # Standalone .bib (always)
│ └── readings/
│ ├── Smith2024.pdf # Downloaded PDFs
│ └── ...
└── paper/
└── references.bib # Canonical bib (merge if exists)
Sub-Agent Guidelines
- Python: ALWAYS use
uv run python. Include this in every sub-agent prompt. - Launch independent agents in a single message for parallelism
- Be explicit in prompts — sub-agents have no context
- Include skip lists of existing citation keys
- Batch sizes: 5 papers per verification agent, 3-5 per PDF agent
- Maximum 3 parallel agents at a time — spawn in waves, write results to disk between waves. Each agent should write to a temp file (e.g.,
/tmp/lit-search/agent-N.json) rather than returning large payloads in-context. Summarise from files to avoid context overflow. - Right agent type:
Explorefor search,general-purposefor verification,Bashfor downloads - Tolerate partial failures — continue with what you have
Bibliometric API Structured Queries
Four bibliometric sources are available. The bibliography MCP server (packages/mcp-bibliography/) is the preferred interface — scholarly_search queries all enabled sources in one call with automatic DOI-based dedup; scholarly_verify_dois batch-verifies DOIs across all sources.
Bibliography MCP Tools (preferred)
| Tool | What it does | When to use |
|---|---|---|
scholarly_search |
Cross-source keyword search (OpenAlex + S2 + Scopus + WoS) with dedup | Phase 2 pre-fetch |
scholarly_similar_works |
ML-based recommendations (S2 Recommendations API) | Phase 2 pre-fetch — finds papers beyond keyword matches |
scholarly_verify_dois |
Batch DOI verification across all sources | Phase 4 verification |
scholarly_citations |
Forward citation graph (papers citing a given paper) | Phase 2.5 snowball — find follow-up work |
scholarly_references |
Backward citation graph (papers referenced by a given paper) | Phase 2.5 snowball — find foundational works |
scholarly_paper_detail |
Full metadata + TLDR + BibTeX + OA PDF link | Phase 3 screening, Phase 6 BibTeX assembly |
scholarly_author_papers |
All papers by an author | Phase 2 pre-fetch — author-based search |
scholarly_source_status |
Check which sources are active | Phase 1 |
OpenAlex (always available)
Setup: .scripts/openalex/openalex_client.py + .scripts/openalex/query_helpers.py
| Workflow | What it does |
|---|---|
| Highly-cited papers | Top-cited papers on a topic (filtered by year) |
| Author output | Full publication record for a researcher |
| Institution output | Research output analysis for a university |
| Publication trends | Year-by-year counts for a topic |
| Open-access discovery | Find freely downloadable versions |
| Citation network | Forward citations for a given paper |
| Batch DOI lookup | Verify metadata for multiple papers |
Full recipes: references/openalex-workflows.md | API guide: references/openalex-api-guide.md
Scopus (requires SCOPUS_API_KEY + SCOPUS_INST_TOKEN)
Query syntax: TITLE-ABS-KEY("quoted phrases" OR terms), subject areas via SUBJAREA(CODE), year filters via PUBYEAR > N / PUBYEAR < N. Elsevier REST API with X-ELS-APIKey + X-ELS-Insttoken headers. Provides abstracts, author keywords, and citation counts in COMPLETE view. Pagination via start/count params (max 25 per page).
API guide: references/scopus-api-guide.md
Web of Science (requires WOS_API_KEY)
Query syntax: TS=(topic search), year filter via PY=(YYYY-YYYY). Two API tiers: Starter (/documents endpoint, page-based, max 50/page) and Expanded (root endpoint, firstRecord-based, max 100/page, includes abstracts). Auth via X-ApiKey header. Tier set via WOS_API_TIER env var (default: starter).
API guide: references/wos-api-guide.md
Reading Full Paper Text from arXiv
Download arXiv LaTeX source for full-text reading (equations, methodology, exact phrasing). Only works for arXiv papers with source available — for journal-only papers, use /split-pdf.
Full instructions: references/cli-council-search.md
Cross-References
| Skill / Package | When to use instead/alongside |
|---|---|
/scout generate |
Generate research questions first |
/interview-me |
Develop a specific idea before searching |
/bib-validate |
Mandatory after assembling .bib (Phase 6b) — metadata quality, preprint staleness, DOI checks |
/bib-coverage |
Compare project .bib vs Zotero topic collection — find uncited papers and unfiled references |
/split-pdf |
Deep-read a paper found during search |
cli-council |
Multi-model search (Phase 2b) and synthesis (Phase 7) — packages/cli-council/ |
refpile MCP |
Search personal Zotero library, extract PDF text/annotations, export BibTeX. Use in Phase 1 to check what's already in the library before searching externally. GROBID tools (parse_pdf_metadata, parse_pdf_references) extract structured metadata and bibliographies from PDFs — use after downloading to auto-extract refs without manual reading |
shared/reference-resolution.md |
Canonical lookup + filing sequence used by Phase 1 and Phase 6c |