Literature Skill

CRITICAL RULE: Every citation must be verified to exist before inclusion. Never include a paper you cannot find via web search. Hallucinated citations are worse than no citations.

DOI INTEGRITY RULE: Every DOI must be programmatically verified before entering any .bib file. Sub-agents hallucinate plausible-looking DOIs that resolve to wrong papers (e.g., correct journal prefix, wrong suffix). The ONLY reliable verification is scholarly_verify_dois with title-matching (see Phase 4). A DOI that resolves to a different title than expected is WRONG — treat it the same as a missing DOI.

CITATION KEY RULE: ALWAYS use Better BibTeX-format keys (e.g., Author2016-xx). When merging into an existing .bib, match existing keys. Never generate custom keys (AuthorYear, AuthorKamenica2017, etc.) or retain non-standard keys unless the user explicitly says otherwise.

Python: Always use uv run python. Never bare python, python3, pip, or pip3.

LIBRARY-FIRST RULE: ALWAYS check both Zotero (refpile MCP) and Paperpile (paperpile MCP) BEFORE any external search. Call mcp__refpile__search_library and mcp__paperpile__search_library for the topic in Phase 1. Do not skip this even if no .bib file exists yet. Papers already in either library should be reused, not re-discovered.

PREPRINT RULE: Always prefer the published version. If a paper is found on arXiv, SSRN, NBER, or any working paper series, search for a published journal/conference version using scholarly_search. Only cite a preprint if no published version can be found. This applies at every phase: Phase 2 (discovery), Phase 4 (verification), and Phase 6b (bib-validate runs the full preprint staleness check from bib-validate/references/preprint-check.md).

Comprehensive academic literature workflow: discover, verify, organize, synthesize. Uses parallel sub-agents to search multiple sources, verify citations, and fetch PDFs concurrently.

Shared References

Concept validation gate: shared/concept-validation-gate.md — validate concept before synthesis
Escalation protocol: shared/escalation-protocol.md — escalate when research question is vague

When to Use

Starting a new research project
Writing a literature review section
Building a reading list on a topic
Finding specific citations
Creating annotated bibliographies

Architecture: Orchestrator + Sub-Agents

You (orchestrator)
├── Phase 0: Session log & compact (mandatory — /session-log)
├── Phase 1: Pre-search check (direct — no sub-agent)
├── Phase 2: Parallel search (2-3 Explore agents)
├── Phase 2b: CLI Council search (optional — multi-model recall via cli-council)
├── Phase 2.5: Snowball search (optional; main context, citation graph)
├── Phase 3: Deduplicate + rank (direct — no sub-agent)
├── Phase 4: Parallel verification (general-purpose agents, batches of 5)
├── Phase 5: Parallel PDF download (Bash agents)
├── Phase 6: Assemble .bib (direct — no sub-agent)
├── Phase 6b: Validate .bib (mandatory; /bib-validate)
├── Phase 6c: Sync to reference managers (Paperpile + Zotero via MCP)
└── Phase 7: Synthesize narrative (direct, or cli-council for multi-model synthesis)

Key principle: Sub-agents handle independent, parallelizable work. Merging, deduplication, and synthesis stay with you because they need the full picture.

Full agent prompt templates for all phases: references/agent-templates.md

Phase 0: Session Log & Compact (Mandatory)

Literature searches are context-heavy. Always run /session-log before starting to create a recovery checkpoint.

Phase 1: Pre-Search Check (Direct)

Check for existing .bib files in project root, /references, /bib, /bibliography:

Parse existing entries to avoid duplicates and understand context
Identify gaps — note if bibliography skews toward certain years/methods
Compile list of existing citation keys to pass to sub-agents
MANDATORY: Check Zotero library (active write target) — call mcp__refpile__search_library for the search topic. This finds papers the user already has, preventing re-discovery of known work. Mark these as ALREADY IN ZOTERO and reuse their citation keys. If refpile MCP is unavailable, log a warning and continue — but always attempt the call.
MANDATORY: Check Paperpile library (read-only cross-reference) — call mcp__paperpile__search_library for the search topic. Also call mcp__paperpile__get_items_by_label if a relevant folder exists. Mark matches as ALREADY IN PAPERPILE. Items in Paperpile but not Zotero are flagged as MIGRATE_TO_ZOTERO candidates. If Paperpile MCP is unavailable, log a warning and continue — but always attempt the call.
Resolve topic collection — read zotero-collections.md to find the collection key for the current topic (see shared/reference-resolution.md for resolution logic). This key is used in Phase 6c for filing.
Check source availability — call scholarly_source_status (bibliography MCP) to see which sources are active (OpenAlex always; Scopus and WoS if API keys are set). Report this so search agents know what coverage to expect.

The Zotero and Paperpile checks (steps 4 and 5 above) are NOT optional. Every literature search must check both reference managers before external discovery. This prevents re-discovering papers already in the library and identifies migration candidates early.
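
The .bib scan at the start of this phase can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, the folder list mirrors the locations named above (taken relative to the project root), and a regular expression stands in for a real BibTeX parser.

```python
import re
from pathlib import Path

# Matches the entry type and key in lines such as "@article{Smith2016-ab,".
ENTRY_KEY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRIES = {"comment", "string", "preamble"}


def extract_citation_keys(bib_text: str) -> list[str]:
    """Return citation keys in order of appearance, skipping non-entry blocks."""
    return [
        key
        for kind, key in ENTRY_KEY.findall(bib_text)
        if kind.lower() not in NON_ENTRIES
    ]


def collect_existing_keys(project_root: str) -> set[str]:
    """Gather keys from *.bib files in the root and the usual bibliography folders."""
    root = Path(project_root)
    keys: set[str] = set()
    for folder in (root, root / "references", root / "bib", root / "bibliography"):
        if not folder.is_dir():
            continue
        for bib_file in folder.glob("*.bib"):
            text = bib_file.read_text(encoding="utf-8", errors="replace")
            keys.update(extract_citation_keys(text))
    return keys
```

The resulting set is what would be passed to sub-agents as the skip list of existing citation keys.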

Phase 2: Parallel Search (Sub-Agents)

MCP pre-fetch (main context, before spawning agents): Call the bibliography MCP tools below yourself and write the results to /tmp/lit-search/. MCP tools are not available inside sub-agents because they are permission-scoped to the main conversation context.

scholarly_search — cross-source keyword search (OpenAlex + S2 + Scopus + WoS). Write to /tmp/lit-search/bibliography-results.json.
scholarly_similar_works — ML-based recommendations (powered by S2 Recommendations API). Pass the topic description as text to find semantically related papers beyond keyword matches. Write to /tmp/lit-search/similar-results.json.
scholarly_author_papers — if key authors are known, fetch their publication lists. Write to /tmp/lit-search/author-results.json.

Spawn 2-3 Explore agents in parallel in a single message, one per source. Read the full prompt templates from references/agent-templates.md.

Available search agents:

Google Scholar — broad academic search via web (no MCP needed)
Cross-Source via pre-fetched biblio data (recommended) — reads /tmp/lit-search/bibliography-results.json and /tmp/lit-search/similar-results.json (pre-fetched by the orchestrator) and supplements with WebSearch
Semantic Scholar / arXiv (optional) — CS/ML focused, useful when topic has strong CS overlap (no MCP needed)
Domain-specific (optional) — SSRN, NBER, specific journals (no MCP needed)

The MCP calls happen in the main context (Phase 2 pre-fetch), not inside sub-agents. Sub-agents read the pre-fetched results and supplement with web search.

Phase 2b: CLI Council Search (Optional)

Multi-model literature search via cli-council — runs the same query through Gemini, Codex, and Claude for maximum recall. Use for broad reviews (20+ papers) or interdisciplinary topics.

Full invocation, prompt template, and post-processing: references/cli-council-search.md

Phase 2.5: Snowball Search (Optional — Main Context)

After Phase 2 results are merged, use S2's citation graph to expand the candidate pool via snowballing. This finds seminal papers (backward) and recent follow-ups (forward) that keyword search misses.

Identify seed papers — pick the 3-5 most relevant papers from Phase 2 results (highest citation count + relevance)
Forward snowball — call scholarly_citations for each seed to find papers that cite it. Useful for finding recent work building on foundational papers.
Backward snowball — call scholarly_references for each seed to find papers it cites. Useful for finding seminal/foundational works.
Filter — deduplicate against Phase 2 results, keep only papers with ≥5 citations (avoid noise)
Add to candidate pool — merge into the main list before Phase 3 ranking
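
The filter step can be sketched as below, assuming each snowballed paper is a dict with `doi` and `citation_count` fields. Those field names and the function name are hypothetical; the real tool output may be shaped differently.

```python
def filter_snowball(candidates, known_dois, min_citations=5):
    """Keep snowballed papers that are new and have at least min_citations citations."""
    seen = {doi.lower() for doi in known_dois}
    kept = []
    for paper in candidates:
        doi = (paper.get("doi") or "").lower()
        if doi and doi in seen:
            continue  # already in the Phase 2 pool, or repeated within this batch
        if paper.get("citation_count", 0) < min_citations:
            continue  # below the noise threshold
        if doi:
            seen.add(doi)
        kept.append(paper)
    return kept
```

DOIs are compared case-insensitively because the same DOI can arrive in different casing from different sources.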

When to use: Literature reviews, broad topic surveys, or when Phase 2 returned <15 unique papers. Skip for narrow/targeted searches where the initial results are sufficient.

Paper detail enrichment: For top candidates, call scholarly_paper_detail to get TLDR summaries (one-line AI-generated descriptions) — useful for rapid screening without reading abstracts.

Phase 3: Deduplicate and Rank (Direct)

Merge results from all search agents (Phase 2 + Phase 2b if used)
Remove duplicates — match on title similarity and DOI
Rank by relevance, citation count, and recency
Select the top N to verify (typically 25-30 candidates to yield 20-25 verified papers)
Assign batches of ~5 for verification
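
A minimal sketch of the merge, deduplicate, and batch steps. The 0.9 similarity threshold, the dict fields, and the function names are illustrative assumptions, not values prescribed by this skill.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so cosmetic differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Same DOI, or near-identical title, counts as a duplicate."""
    doi_a = (a.get("doi") or "").lower()
    doi_b = (b.get("doi") or "").lower()
    if doi_a and doi_a == doi_b:
        return True
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep the first occurrence of each paper (quadratic, fine for small pools)."""
    unique: list[dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique


def make_batches(papers: list, size: int = 5) -> list[list]:
    """Split the ranked list into verification batches of about `size` papers."""
    return [papers[i:i + size] for i in range(0, len(papers), size)]
```

Matching on title even when DOIs differ is deliberate: a preprint and its published version carry different DOIs but should be treated as one candidate.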

Phase 4: Parallel Verification (Sub-Agents)

Step 1 — Batch DOI pre-verification via MCP: Collect all DOIs from Phase 3 candidates and call scholarly_verify_dois (bibliography MCP). This checks each DOI against all enabled sources (OpenAlex, Scopus, WoS). For each result:

VERIFIED (2+ sources): Check that the returned title matches the expected paper. If the title doesn't match, the DOI is wrong — flag as DOI MISMATCH and find the correct DOI in Step 2.
SINGLE_SOURCE: Needs manual verification — the DOI may be real but unconfirmed.
NOT_FOUND: DOI is likely hallucinated. Find the correct DOI in Step 2.

Title-matching is mandatory. scholarly_verify_dois returns the title each DOI actually resolves to; compare it against the title you expect. DOIs that differ by only a digit or two in the suffix (e.g., 02387 vs 02366, 2014.01.014 vs 2014.03.013) are the most common hallucination pattern: they resolve to real papers in the same journal but with different content.
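
The title comparison can be sketched as below. The shape of each verification result (a dict with `status` and `title` keys), the 0.9 threshold, and the function names are assumptions made for illustration; the actual output of scholarly_verify_dois may differ.

```python
import re
from difflib import SequenceMatcher


def titles_match(expected: str, resolved: str, threshold: float = 0.9) -> bool:
    """Fuzzy comparison that ignores case and punctuation."""
    def norm(title: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

    return SequenceMatcher(None, norm(expected), norm(resolved)).ratio() >= threshold


def classify_doi(expected_title: str, result: dict) -> str:
    """Map one (hypothetical) verification result to the action taken in Step 1."""
    status = result.get("status")
    if status == "NOT_FOUND":
        return "NOT_FOUND"
    if not titles_match(expected_title, result.get("title") or ""):
        return "DOI MISMATCH"  # resolves to a different paper: treat as wrong
    return "VERIFIED" if status == "VERIFIED" else "NEEDS_MANUAL_CHECK"
```

A fuzzy match is used rather than string equality because sources differ in capitalisation, subtitles, and punctuation for the same paper.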

Step 2 — Find correct DOIs for flagged papers: For any paper where the DOI was wrong, missing, or single-source, use these methods in order of reliability:

Crossref API (most reliable): curl -sL "https://api.crossref.org/works?query.bibliographic=[URL-encoded title+author]&rows=3" — returns the actual DOI from publisher metadata.
scholarly_search with exact title — searches OpenAlex/Scopus/WoS for the paper.
Web search as last resort — but DOIs from web search must still be verified via scholarly_verify_dois before use.
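
For agents that prefer Python to curl, the Crossref lookup can be sketched as follows. The endpoint and the query.bibliographic and rows parameters come from the command above; the function names are hypothetical, and any DOI returned still has to pass scholarly_verify_dois.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(title: str, first_author: str = "", rows: int = 3) -> str:
    """Build the lookup URL with the title and author URL-encoded."""
    query = f"{title} {first_author}".strip()
    return f"{CROSSREF_WORKS}?{urlencode({'query.bibliographic': query, 'rows': rows})}"


def extract_candidates(payload: dict) -> list[tuple[str, str]]:
    """Pull (DOI, title) pairs out of a Crossref works response."""
    items = payload.get("message", {}).get("items", [])
    return [
        (item["DOI"], (item.get("title") or [""])[0])
        for item in items
        if "DOI" in item
    ]


def lookup(title: str, first_author: str = "") -> list[tuple[str, str]]:
    """Network call: fetch candidate DOIs for a paper from Crossref."""
    with urlopen(crossref_query_url(title, first_author), timeout=30) as response:
        return extract_candidates(json.load(response))
```

Run it with uv run python, and compare each returned title against the expected one before accepting the DOI.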

Step 3 — Manual verification for remaining papers: Spawn multiple general-purpose agents in parallel, each verifying ~5 papers. Read the full verification template from references/agent-templates.md. Include the Crossref instruction in the agent prompt — agents must use Crossref API (curl) for DOI lookup, not reconstruct DOIs from memory. Do NOT instruct sub-agents to call MCP tools (scholarly_search, scholarly_verify_dois) — MCP tools are not available in sub-agents. Sub-agents should use Crossref API and WebSearch/WebFetch only.

Key rules enforced by the template:

DOI verification is mandatory (resolve and confirm)
ALL authors must be listed (never "et al." in metadata)
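Preprint check: always search for a published version. Sub-agents search via Crossref and the web; the orchestrator follows up with the scholarly_search MCP tool in the main context, since MCP tools are unavailable to sub-agents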
Results: VERIFIED / NOT FOUND / METADATA MISMATCH

Step 4 — Final DOI gate: Before proceeding to Phase 5/6, run scholarly_verify_dois one final time on ALL DOIs that will enter the .bib. This is the hard gate — no DOI enters a bibliography without passing this check with a matching title. Papers without DOIs (working papers, book chapters, old pre-DOI articles) are acceptable but must be explicitly flagged as % NO DOI in the .bib.

After all agents return: collect VERIFIED papers, drop NOT FOUND papers, and check for remaining duplicates.

Phase 5: Parallel PDF Download (Sub-Agents)

Spawn Bash agents in parallel, 3-5 papers each. Read template from references/agent-templates.md. Best-effort — many papers are behind paywalls.

Phase 6: Assemble Bibliography (Direct)

Two outputs required:

docs/literature-review/literature_summary.bib — always created, standalone, self-contained
Project canonical bib (e.g. paper/references.bib) — merge into it if it exists

BibTeX Format

@article{Author2024-xx,
  author    = {Last, First and Last, First},
  title     = {Full Title},
  journal   = {Journal Name},
  year      = {2024},
  volume    = {XX},
  pages     = {1--20},
  doi       = {10.1000/example},
  abstract  = {Abstract text here.}
}

Rules:

Citation keys: use Better BibTeX-format keys (e.g., Author2016-xx). If merging into an existing .bib, match the key format already in use. Never generate AuthorYear keys.
Reuse existing Zotero citation keys — for entries marked ALREADY IN ZOTERO in Phase 1, use the Zotero citationKey directly. Do not generate a new key.
Only VERIFIED papers — no METADATA MISMATCH entries
List ALL authors explicitly — never "et al." in BibTeX
Include abstracts when available
S2 BibTeX seed: Call scholarly_paper_detail for each verified paper to get pre-formatted BibTeX via the citationStyles field. Use as a starting template, then enrich with missing fields (abstract, pages, volume) and correct the citation key to BBT format. This reduces manual entry errors.
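
Two of these rules lend themselves to a mechanical pre-check before /bib-validate runs. The sketch below is illustrative only: the key pattern is inferred from the Author2016-xx example and is an assumption (real Better BibTeX key formats are configurable, and surnames with hyphens or diacritics would need a wider pattern), and the function names are hypothetical.

```python
import re

# Assumed shape: capitalised surname, four-digit year, hyphen, two lowercase letters.
KEY_PATTERN = re.compile(r"^[A-Z][A-Za-z]*\d{4}-[a-z]{2}$")
ET_AL = re.compile(r"\bet\s+al\b\.?", re.IGNORECASE)


def key_looks_valid(key: str) -> bool:
    """True when the key matches the assumed Author2016-xx shape."""
    return bool(KEY_PATTERN.match(key))


def has_et_al(author_field: str) -> bool:
    """True when the author field abbreviates the author list."""
    return bool(ET_AL.search(author_field))
```

When merging into an existing .bib, the key check should be swapped for whatever format that file already uses.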

Phase 6b: Validate Bibliography (Mandatory)

After assembling the .bib, always run /bib-validate. The Phase 4 verification checks that papers exist, but /bib-validate catches a different class of issues:

Missing required BibTeX fields (journal, volume, pages)
Preprint staleness (arXiv paper now published in a journal)
Missing or incorrect DOIs
Author formatting problems ("et al." in author field, corporate names needing braces)
Unused entries and possible typos

This is not optional — every time new entries are added to a .bib file, run the validation before considering the bibliography complete.

Phase 6c: Sync to Reference Managers

After assembling and validating the .bib, sync new references to Zotero (active write target) and cross-reference with Paperpile (read-only). Handles migration candidates and post-run maintenance.

Full steps: references/reference-manager-sync.md

Phase 7: Synthesize Narrative (Direct or CLI Council)

Identify themes — group papers by approach, finding, or debate
Map intellectual lineage — how did thinking evolve?
Note current debates — where do researchers disagree?
Find gaps — what's missing?

Output types: narrative summary (LaTeX), literature deck, annotated bibliography, concise field synthesis.

Concise Field Synthesis (~400 words)

When the user asks for a "quick synthesis", "field overview", or "what does the literature say", produce a tight ~400-word synthesis instead of a full narrative. No paper-by-paper summaries — write about the field, not individual papers.

Structure:

What the field collectively believes — established consensus (2-3 sentences)
Where researchers disagree — active debates with camps identified (2-3 sentences)
What has been proven — findings with strong, replicated evidence (2-3 sentences)
The single most important unanswered question — one question, why it matters, why it's hard (2-3 sentences)

Cite papers parenthetically (Author, Year) but never summarise individual papers. The goal is a helicopter view that a newcomer could read in 2 minutes and understand where the field stands.

[VERIFY] Citation Tags

When synthesising, mark uncertain attributions with [VERIFY] tags for later resolution:

Meraz and Papacharissi (2013) argue that gatekeeping power shifted
from institutional positions to network centrality [VERIFY: exact claim on p. 12?].

Drafting tier: [VERIFY] tags are acceptable — resolve before finalising
Publication tier: All [VERIFY] tags must be resolved (read the actual source)
Run /bib-validate to catch any remaining [VERIFY] tags before submission
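
For a quick local check while drafting, unresolved tags can be listed with a few lines of Python. This is a standalone sketch with a hypothetical function name; it does not reproduce what /bib-validate does.

```python
import re

# Matches "[VERIFY]" and "[VERIFY: some note]".
VERIFY_TAG = re.compile(r"\[VERIFY(?::\s*([^\]]*))?\]")


def find_verify_tags(text: str) -> list[tuple[int, str]]:
    """Return (line number, note) for every [VERIFY] tag; the note may be empty."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in VERIFY_TAG.finditer(line):
            found.append((line_number, (match.group(1) or "").strip()))
    return found
```

An empty result is the condition for moving a draft to the publication tier.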

Multi-Model Synthesis (Optional)

For comprehensive literature reviews, run the synthesis through cli-council to get three independent interpretations of the literature landscape. Different models identify different themes, debates, and gaps.

cd "packages/cli-council"
uv run python -m cli_council \
    --prompt-file /tmp/lit-synthesis-prompt.txt \
    --context-file /tmp/lit-papers.txt \
    --output-md /tmp/lit-synthesis-report.md \
    --chairman claude \
    --timeout 180

Here --context-file contains the verified paper list with titles, abstracts, and metadata, and the prompt file asks for thematic grouping, intellectual lineage, and gap identification. The chairman model synthesises the three independent narratives into one.

Output Structure

project/
├── docs/
│   ├── literature-review/
│   │   ├── literature_summary.md      # Thematic narrative (always)
│   │   └── literature_summary.bib     # Standalone .bib (always)
│   └── readings/
│       ├── Smith2024.pdf              # Downloaded PDFs
│       └── ...
└── paper/
    └── references.bib                  # Canonical bib (merge if exists)

Sub-Agent Guidelines

Python: ALWAYS use uv run python. Include this in every sub-agent prompt.
Launch independent agents in a single message for parallelism
Be explicit in prompts — sub-agents have no context
Include skip lists of existing citation keys
Batch sizes: 5 papers per verification agent, 3-5 per PDF agent
Maximum 3 parallel agents at a time — spawn in waves, write results to disk between waves. Each agent should write to a temp file (e.g., /tmp/lit-search/agent-N.json) rather than returning large payloads in-context. Summarise from files to avoid context overflow.
Right agent type: Explore for search, general-purpose for verification, Bash for downloads
Tolerate partial failures — continue with what you have
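
The wave scheduling described above can be planned with a small helper. A minimal sketch: the function name and the dict layout are hypothetical, while the limit of 3 and the /tmp/lit-search/agent-N.json naming follow the guidelines.

```python
def plan_waves(batches: list, max_parallel: int = 3, out_dir: str = "/tmp/lit-search") -> list[list[dict]]:
    """Group batches into waves of at most max_parallel agents, each with its own output file."""
    waves = []
    for start in range(0, len(batches), max_parallel):
        stop = min(start + max_parallel, len(batches))
        waves.append([
            {
                "agent": index + 1,
                "papers": batches[index],
                "output": f"{out_dir}/agent-{index + 1}.json",
            }
            for index in range(start, stop)
        ])
    return waves
```

Each wave is spawned in a single message; the orchestrator reads the output files before starting the next wave.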

Bibliometric API Structured Queries

Four bibliometric sources are available. The bibliography MCP server (packages/mcp-bibliography/) is the preferred interface — scholarly_search queries all enabled sources in one call with automatic DOI-based dedup; scholarly_verify_dois batch-verifies DOIs across all sources.

Bibliography MCP Tools (preferred)

Tool	What it does	When to use
`scholarly_search`	Cross-source keyword search (OpenAlex + S2 + Scopus + WoS) with dedup	Phase 2 pre-fetch
`scholarly_similar_works`	ML-based recommendations (S2 Recommendations API)	Phase 2 pre-fetch — finds papers beyond keyword matches
`scholarly_verify_dois`	Batch DOI verification across all sources	Phase 4 verification
`scholarly_citations`	Forward citation graph (papers citing a given paper)	Phase 2.5 snowball — find follow-up work
`scholarly_references`	Backward citation graph (papers referenced by a given paper)	Phase 2.5 snowball — find foundational works
`scholarly_paper_detail`	Full metadata + TLDR + BibTeX + OA PDF link	Phase 3 screening, Phase 6 BibTeX assembly
`scholarly_author_papers`	All papers by an author	Phase 2 pre-fetch — author-based search
`scholarly_source_status`	Check which sources are active	Phase 1

OpenAlex (always available)

Setup: .scripts/openalex/openalex_client.py + .scripts/openalex/query_helpers.py

Workflow	What it does
Highly-cited papers	Top-cited papers on a topic (filtered by year)
Author output	Full publication record for a researcher
Institution output	Research output analysis for a university
Publication trends	Year-by-year counts for a topic
Open-access discovery	Find freely downloadable versions
Citation network	Forward citations for a given paper
Batch DOI lookup	Verify metadata for multiple papers

Full recipes: references/openalex-workflows.md | API guide: references/openalex-api-guide.md

Scopus (requires `SCOPUS_API_KEY` + `SCOPUS_INST_TOKEN`)

Query syntax: TITLE-ABS-KEY("quoted phrases" OR terms), subject areas via SUBJAREA(CODE), year filters via PUBYEAR > N / PUBYEAR < N. Elsevier REST API with X-ELS-APIKey + X-ELS-Insttoken headers. Provides abstracts, author keywords, and citation counts in COMPLETE view. Pagination via start/count params (max 25 per page).
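
A query string in this syntax can be assembled as below. The helper names and the example values are hypothetical; the header names are the ones listed above.

```python
def scopus_query(phrases, terms=(), subject_area=None, year_after=None, year_before=None) -> str:
    """Build a TITLE-ABS-KEY query with optional subject-area and year filters."""
    parts = [f'"{phrase}"' for phrase in phrases] + list(terms)
    query = f"TITLE-ABS-KEY({' OR '.join(parts)})"
    if subject_area:
        query += f" AND SUBJAREA({subject_area})"
    if year_after is not None:
        query += f" AND PUBYEAR > {year_after}"
    if year_before is not None:
        query += f" AND PUBYEAR < {year_before}"
    return query


def scopus_headers(api_key: str, inst_token: str) -> dict[str, str]:
    """Auth headers; the values would come from SCOPUS_API_KEY and SCOPUS_INST_TOKEN."""
    return {"X-ELS-APIKey": api_key, "X-ELS-Insttoken": inst_token}
```

For example, `scopus_query(["information design"], ["persuasion"], subject_area="ECON", year_after=2015)` yields `TITLE-ABS-KEY("information design" OR persuasion) AND SUBJAREA(ECON) AND PUBYEAR > 2015`.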

API guide: references/scopus-api-guide.md

Web of Science (requires `WOS_API_KEY`)

Query syntax: TS=(topic search), year filter via PY=(YYYY-YYYY). Two API tiers: Starter (/documents endpoint, page-based, max 50/page) and Expanded (root endpoint, firstRecord-based, max 100/page, includes abstracts). Auth via X-ApiKey header. Tier set via WOS_API_TIER env var (default: starter).
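
The query and the Expanded-tier paging offsets can be sketched as follows. Function names are hypothetical, and the paging helper assumes firstRecord is 1-based, which should be confirmed against the API guide.

```python
def wos_query(topic: str, year_from=None, year_to=None) -> str:
    """Build a TS=(...) topic query with an optional PY=(YYYY-YYYY) range."""
    query = f"TS=({topic})"
    if year_from is not None and year_to is not None:
        query += f" AND PY=({year_from}-{year_to})"
    return query


def first_records(total: int, page_size: int = 100) -> list[int]:
    """firstRecord values needed to page through `total` results (assumed 1-based)."""
    return list(range(1, total + 1, page_size))
```

With 250 hits on the Expanded tier, this gives three requests starting at records 1, 101, and 201.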

API guide: references/wos-api-guide.md

Reading Full Paper Text from arXiv

Download arXiv LaTeX source for full-text reading (equations, methodology, exact phrasing). Only works for arXiv papers with source available — for journal-only papers, use /split-pdf.

Full instructions: references/cli-council-search.md

Cross-References

Skill / Package	When to use instead/alongside
`/scout generate`	Generate research questions first
`/interview-me`	Develop a specific idea before searching
`/bib-validate`	Mandatory after assembling `.bib` (Phase 6b) — metadata quality, preprint staleness, DOI checks
`/bib-coverage`	Compare project `.bib` vs Zotero topic collection — find uncited papers and unfiled references
`/split-pdf`	Deep-read a paper found during search
`cli-council`	Multi-model search (Phase 2b) and synthesis (Phase 7) — `packages/cli-council/`
`refpile` MCP	Search personal Zotero library, extract PDF text/annotations, export BibTeX. Use in Phase 1 to check what's already in the library before searching externally. GROBID tools (`parse_pdf_metadata`, `parse_pdf_references`) extract structured metadata and bibliographies from PDFs — use after downloading to auto-extract refs without manual reading
`shared/reference-resolution.md`	Canonical lookup + filing sequence used by Phase 1 and Phase 6c
