Traversing Citation Networks

Overview

Intelligently follow citations backward (references) and forward (citing papers) using the Semantic Scholar API.

Core principle: Only follow citations relevant to the user's query. Avoid exponential explosion by filtering before traversing.

When to Use

Use this skill when:

  • Found a highly relevant paper (score ≥ 7)
  • Need to find related work
  • User asks "what papers cite this?"
  • Building comprehensive understanding of a topic

When NOT to use:

  • Paper scored < 7 (not relevant enough to follow)
  • Already at 50 papers (check with user first)
  • Citations look off-topic from abstract

Citation Traversal Strategy

1. Get Paper ID from Semantic Scholar

Lookup by DOI:

curl "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example.2023?fields=paperId,title,year"

Response:

{
  "paperId": "abc123def456",
  "title": "Paper Title",
  "year": 2023
}

Save the paperId; it is needed for the citations and references queries.

2. Backward Traversal (References)

Get references from paper:

curl "https://api.semanticscholar.org/graph/v1/paper/abc123def456/references?fields=contexts,intents,title,year,abstract,externalIds&limit=100"

Response format:

{
  "data": [
    {
      "citedPaper": {
        "paperId": "xyz789",
        "title": "Referenced Paper Title",
        "year": 2020,
        "abstract": "...",
        "externalIds": {
          "DOI": "10.5678/referenced.2020",
          "PubMed": "87654321"
        }
      },
      "contexts": [
        "...as described in previous work [15]...",
        "...we used the method from [15] to..."
      ],
      "intents": ["methodology", "background"]
    }
  ]
}

Filter for relevance:

For each reference, check:

  1. Context keywords: Do citation contexts mention user's query terms?
    • Example: If user asks about "IC50 values", look for contexts mentioning "IC50", "activity", "potency"
  2. Title match: Does title contain relevant keywords?
  3. Intent: Is intent "methodology" or "result" (more relevant) vs "background" (less relevant)?

Scoring:

  • Context keywords match: +3 points
  • Title keywords match: +2 points
  • Intent is methodology/result: +2 points
  • Recent (< 5 years old): +1 point

Only add to queue if score ≥ 5
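The reference-scoring heuristic above can be sketched as a small function. This is a minimal sketch, assuming a `score_reference` helper (the name, `query_terms` parameter, and simple substring matching are illustrative, not part of the skill):

```python
import datetime

def score_reference(ref_entry, query_terms, current_year=None):
    """Score one entry from the /references response using the
    heuristic above: contexts +3, title +2, intent +2, recency +1."""
    current_year = current_year or datetime.date.today().year
    paper = ref_entry["citedPaper"]
    terms = [t.lower() for t in query_terms]
    score = 0

    # Context keywords: do any citation contexts mention query terms?
    contexts = " ".join(ref_entry.get("contexts", [])).lower()
    if any(t in contexts for t in terms):
        score += 3

    # Title keywords
    title = (paper.get("title") or "").lower()
    if any(t in title for t in terms):
        score += 2

    # Intent: methodology/result more relevant than background
    if set(ref_entry.get("intents", [])) & {"methodology", "result"}:
        score += 2

    # Recency: published within the last 5 years
    year = paper.get("year")
    if year and current_year - year < 5:
        score += 1

    return score
```

A reference is then queued only when `score_reference(entry, terms) >= 5`.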

3. Forward Traversal (Citations)

Get papers citing this one:

curl "https://api.semanticscholar.org/graph/v1/paper/abc123def456/citations?fields=title,year,abstract,externalIds&limit=100"

Response format:

{
  "data": [
    {
      "citingPaper": {
        "paperId": "def456ghi",
        "title": "Newer Paper Citing This",
        "year": 2024,
        "abstract": "We extended the work of [original paper]...",
        "externalIds": {
          "DOI": "10.9012/citing.2024"
        }
      }
    }
  ]
}

Filter for relevance:

For each citing paper:

  1. Title match: Keywords present in title?
  2. Abstract match: User's query terms in abstract?
  3. Recency: Newer papers often build on the findings (prioritize papers < 2 years old)
  4. Citation count: If Semantic Scholar provides a citation count, highly cited papers are more likely to be relevant

Scoring:

  • Title keywords match: +3 points
  • Abstract keywords match: +2 points
  • Recent (< 2 years): +2 points
  • Moderate recency (2-5 years): +1 point

Only add to queue if score ≥ 5
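The forward-traversal scoring can be sketched the same way. A minimal sketch, assuming a hypothetical `score_citing_paper` helper with substring matching on title and abstract:

```python
import datetime

def score_citing_paper(cite_entry, query_terms, current_year=None):
    """Score one entry from the /citations response:
    title +3, abstract +2, < 2 years +2, 2-5 years +1."""
    current_year = current_year or datetime.date.today().year
    paper = cite_entry["citingPaper"]
    terms = [t.lower() for t in query_terms]
    score = 0

    # Title keywords
    title = (paper.get("title") or "").lower()
    if any(t in title for t in terms):
        score += 3

    # Abstract keywords (abstract may be null in the API response)
    abstract = (paper.get("abstract") or "").lower()
    if any(t in abstract for t in terms):
        score += 2

    # Recency bands
    year = paper.get("year")
    if year is not None:
        age = current_year - year
        if age < 2:
            score += 2
        elif age <= 5:
            score += 1

    return score
```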

4. Deduplication

Before adding to queue:

Check papers-reviewed.json:

doi = paper.get("externalIds", {}).get("DOI")
if doi in papers_reviewed:
    pass  # already processed; skip
else:
    queue.append(paper)

CRITICAL: After evaluating any paper from citation traversal, add it to papers-reviewed.json regardless of score. This prevents re-processing the same paper from multiple sources.

Track citation relationship in citations/citation-graph.json:

{
  "10.1234/example.2023": {
    "references": ["10.5678/ref1.2020", "10.5678/ref2.2021"],
    "cited_by": ["10.9012/cite1.2024", "10.9012/cite2.2024"]
  }
}

CRITICAL: Use ONLY citation-graph.json for citation tracking. Do NOT create custom files like forward_citation_pmids.txt or citation_analysis.md. All findings go in SUMMARY.md.

5. Process Queue

Add relevant citations to processing queue:

{
  "doi": "10.5678/referenced.2020",
  "title": "Referenced Paper",
  "relevance_score": 7,
  "source": "backward_from:10.1234/example.2023",
  "context": "Method citation - describes IC50 measurement protocol"
}

Then:

  • Evaluate using evaluating-paper-relevance skill
  • If relevant, extract data and potentially traverse its citations too

Smart Traversal Limits

To avoid explosion:

  • Only traverse papers scoring ≥ 7 in initial evaluation
  • Only follow citations scoring ≥ 5 in relevance filtering
  • Limit traversal depth to 2 levels (original → references → references of references)
  • Check with user after every 50 papers total

Breadth-first strategy:

  1. Get all references + citations for current paper
  2. Filter and score them
  3. Add high-scoring ones to queue
  4. Process next paper in queue
  5. Repeat until queue empty or hit limit
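The breadth-first loop with the depth and paper limits can be sketched as follows. The `get_neighbors` and `score` callables are stand-ins (hypothetical) for the API fetch plus the relevance filters above:

```python
from collections import deque

def traverse_citations(seed_doi, get_neighbors, score, reviewed,
                       max_depth=2, max_papers=50):
    """Breadth-first traversal sketch. get_neighbors(doi) returns
    candidate papers (dicts with a 'doi' key); score(paper) applies
    the relevance filter; reviewed is the papers-reviewed DOI set."""
    queue = deque([(seed_doi, 0)])
    accepted = []
    while queue and len(accepted) < max_papers:
        doi, depth = queue.popleft()
        if depth >= max_depth:
            continue  # limit traversal to 2 levels
        for paper in get_neighbors(doi):
            p_doi = paper.get("doi")
            if not p_doi or p_doi in reviewed:
                continue  # no external ID, or already processed
            reviewed.add(p_doi)  # record regardless of score
            if score(paper) >= 5:
                accepted.append(paper)
                queue.append((p_doi, depth + 1))
    return accepted
```

At `max_papers`, stop and check with the user before continuing.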

Progress Reporting

Report as you traverse:

🔗 Analyzing citations for: "Original Paper Title"
   → Found 45 references, 12 look relevant
   → Found 23 citing papers, 8 look relevant
   → Adding 20 papers to queue

📄 [51/127] Following reference: "Method for measuring IC50"
   Source: Referenced by original paper in Methods section
   Abstract score: 7 → Fetching full text...

API Rate Limiting

Semantic Scholar limits:

  • Free tier: 100 requests per 5 minutes
  • With API key: 1000 requests per 5 minutes

Be efficient:

  • Request multiple fields in one call (?fields=title,abstract,externalIds,year)
  • Use limit=100 to get more results per request
  • Cache responses - don't re-fetch same paper

If rate limited:

  • Wait 5 minutes
  • Report to user: "⏸️ Rate limited by Semantic Scholar API. Waiting 5 minutes..."
  • Consider getting API key for higher limits
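The wait-and-retry behavior can be sketched as a wrapper around the request call. This is a sketch, not a prescribed implementation: `do_request` is any callable returning a response object with a `.status_code` attribute (e.g. a `requests.get` closure), and the injectable `sleep`/`report` parameters are illustrative:

```python
import time

def fetch_with_backoff(do_request, max_retries=3, wait_seconds=300,
                       sleep=time.sleep, report=print):
    """Call do_request(); on HTTP 429, wait out Semantic Scholar's
    5-minute window and retry."""
    for _ in range(max_retries):
        response = do_request()
        if response.status_code != 429:
            return response
        report("⏸️ Rate limited by Semantic Scholar API. Waiting 5 minutes...")
        sleep(wait_seconds)
    raise RuntimeError("Still rate limited after retries")
```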

Integration with Other Skills

After traversing citations:

  1. Queue now has N new papers to evaluate
  2. For each, use evaluating-paper-relevance skill
  3. If relevant, extract to SUMMARY.md
  4. If highly relevant (≥9), traverse its citations too
  5. Update citation-graph.json to track relationships

Quick Reference

| Task | API Endpoint |
|------|--------------|
| Get paper by DOI | GET /graph/v1/paper/DOI:{doi}?fields=paperId,title |
| Get references | GET /graph/v1/paper/{paperId}/references?fields=contexts,title,abstract,externalIds |
| Get citations | GET /graph/v1/paper/{paperId}/citations?fields=title,abstract,externalIds |
| Check if processed | Look up DOI in papers-reviewed.json |
| Filter relevance | Score based on context/title/intent/recency |

Relevance Filtering Checklist

Before adding citation to queue:

  • Check if already in papers-reviewed.json (skip if yes)
  • Score based on context/title keywords (need ≥ 5)
  • Verify external ID (DOI or PMID) exists
  • Add source tracking ("backward_from:DOI" or "forward_from:DOI")
  • Add to queue with metadata

Common Mistakes

  • Not tracking all evaluated papers: Only adding relevant papers to papers-reviewed.json → Add EVERY paper after evaluation to prevent re-review
  • Creating custom analysis files: Making forward_citation_pmids.txt, CITATION_ANALYSIS.md, etc. → Use ONLY citation-graph.json and SUMMARY.md
  • Following all citations: Exponential explosion → Filter before adding to queue
  • Ignoring context: Citation might be tangential → Read context strings
  • Not deduplicating: Re-processing the same papers → Always check papers-reviewed.json before and after evaluation
  • Too deep: Following 5+ levels → Limit to 2 levels, check with user
  • Missing forward citations: Only checking references → Use both backward and forward
  • No rate limiting awareness: API blocks you → Add delays, handle 429 errors

Example Workflow

1. User asks: "Find selectivity data for BTK inhibitors"
2. Search finds Paper A (score: 9, has great IC50 data)
3. Traverse citations for Paper A:
   - References: 45 total, 12 relevant (mention "selectivity", "IC50")
   - Citations: 23 total, 8 relevant (newer papers on BTK)
4. Add 20 papers to queue
5. Evaluate first queued paper (score: 8)
6. Extract data, traverse its citations (add 5 more)
7. Continue until queue empty or user says stop

Next Steps

After traversing citations:

  • Process queued papers with evaluating-paper-relevance
  • Update SUMMARY.md with new findings
  • Check if reached checkpoint (50 papers or 5 minutes)
  • If checkpoint: ask user to continue or stop