# audit-oe: OpenEvidence Citation Audit v2

Independently verify every citation in an OpenEvidence response using parallel PubMed/bioRxiv/ClinicalTrials lookups, detect transitive citation errors, find cross-citation contradictions, and produce a structured accuracy report.
## Prerequisites
- OpenEvidence MCP server connected and authenticated
- PubMed MCP server available (citation verification + full text)
- bioRxiv MCP server available (preprint verification)
- Clinical Trials MCP server available (endpoint verification)
- WebFetch available (DOI resolution fallback)
## Trigger
Use this skill when:
- User asks to "audit", "verify", or "check" an OpenEvidence response
- User asks to query OpenEvidence and validate its citations
- User wants to critically appraise OE's evidence base
## Input
Either:
- A topic/question -- skill queries OE then audits the response
- An existing OE article ID -- skill fetches and audits it directly
## Output Structure

```
{topic-slug}/
  original.md        # Raw OE extracted answer
  assets/
    citations.bib    # BibTeX block from OE
  report.md          # Structured audit report with provenance + contradictions
```
## Known OE Failure Modes (why this skill exists)

OE uses a full-text RAG system with these indexed corpora (decoded from the ROT-1-obfuscated `origin` field):
| Origin Key | Decoded | Risk |
|---|---|---|
| `qvcnfe_bctusbdut_*` | `pubmed_abstracts_hindex_35pct_ada3small` | Low -- abstract-level claims |
| `kbdd_gvmmufyu_tdsbqfe` | `jacc_fulltext_scraped` | Medium -- full-text chunks |
| `mbodfu_gvmmufyu_tdsbqfe` | `lancet_fulltext_scraped` | HIGH -- reviews quoting others |
| `ofkn_sfwjfx_bsujdmf_*` | `nejm_review_article_fulltext_sftp` | HIGH -- reviews quoting others |
| `hvjefmjoft_gvmmufyu_*` | `guidelines_fulltext_usa_manual` | Medium -- guideline recommendations |
| `nfejb_boopubufe_hfnjoj` | `media_annotated_gemini` | Medium -- AI-annotated figures |
Primary failure mode: Transitive Citation. When a review (e.g., Nauck 2026) writes "GLP-1 RA reduce stroke by 13%" citing Kristensen 2019, OE retrieves that chunk and attributes the stroke finding to Nauck. But Nauck is just quoting -- it's not their finding. If another meta-analysis (Galli 2025) finds NO stroke benefit, OE doesn't detect the contradiction.
## Workflow

### Phase 1: Query OpenEvidence
1. Call `oe_auth_status()` -- abort if invalid
2. Call `oe_ask` with:
   - `question`: <user's topic>
   - `include_bibtex: true`
   - `crossref_validate: true`
   - `wait_for_completion: true`
   - `timeout_sec: 120`
3. Save `extracted_answer_raw` -> `{topic-slug}/original.md`
4. Save the BibTeX block -> `{topic-slug}/assets/citations.bib`
5. Record: `article_id`, `citationCount`, `crossrefValidatedCount`
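A minimal orchestration sketch of Phase 1, assuming a thin Python client `oe` that proxies the OpenEvidence MCP tools by name; the `bibtex` response key and `valid` auth field are assumptions, not documented OE fields:

```python
from pathlib import Path

# Hypothetical wrapper: `oe` proxies the OpenEvidence MCP tools by name.
if not oe.oe_auth_status().get("valid"):
    raise SystemExit("OpenEvidence auth invalid -- aborting audit")

resp = oe.oe_ask(
    question=topic,
    include_bibtex=True,
    crossref_validate=True,
    wait_for_completion=True,
    timeout_sec=120,
)

out = Path(topic_slug)
(out / "assets").mkdir(parents=True, exist_ok=True)
(out / "original.md").write_text(resp["extracted_answer_raw"])
(out / "assets" / "citations.bib").write_text(resp["bibtex"])  # key name assumed
audit_meta = {k: resp.get(k) for k in ("article_id", "citationCount", "crossrefValidatedCount")}
```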
### Phase 2: Parse, Map, and Decode Provenance

The orchestrator extracts spans from the `structured_article` (not just the BibTeX):
```python
# Access structured spans with citation metadata
sections = article.output.structured_article.articlesection_set
for section in sections:
    for para in section.articleparagraph_set:
        for span in para.articlespan_set:
            text = span.text
            for citation in span.citations:
                # Decode ROT-1 provenance: shift each character back by one
                raw_origin = citation.metadata.origin
                origin = ''.join(chr(ord(c) - 1) for c in raw_origin)
                impact = citation.metadata.why_cited.impact_score
```
For each citation, build an enhanced descriptor:
```yaml
- index: N
  authors: "LastName et al."
  title: "..."
  journal: "..."
  year: YYYY
  doi: "10.xxxx/..."
  pmid: "NNNNNNN"
  claim_text: "The exact sentence from the OE span"
  strategy: pubmed_pmid | pubmed_doi | pubmed_title | biorxiv | web_doi
  # NEW v2 fields:
  oe_origin: "lancet_fulltext_scraped"
  oe_impact_score: 21.54
  risk_level: HIGH | MEDIUM | LOW
  has_quantitative_claim: true   # contains HR, CI, %, or p-value
  needs_transitive_check: true   # review/guideline + quantitative = yes
```
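One way to populate `has_quantitative_claim` is a regex heuristic over the span text. This sketch is our own (the pattern is not part of the skill spec); it flags hazard/odds/risk ratios, percentages, confidence intervals, and p-values:

```python
import re

# Heuristic: any HR/OR/RR, percentage, 95% CI, or p-value in the
# claim text marks the span as quantitative.
QUANT_RE = re.compile(
    r"\b(HR|OR|RR)\s*[=:]?\s*\d"   # HR = 0.87, OR: 1.2, ...
    r"|\b\d{1,3}(\.\d+)?\s*%"      # 13%, 17.5%
    r"|\b95%\s*CI\b"               # 95% CI
    r"|\bp\s*[<=>]\s*0?\.\d+",     # p < 0.05
    re.IGNORECASE,
)

def has_quantitative_claim(text: str) -> bool:
    return bool(QUANT_RE.search(text))
```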
Risk level assignment:
- HIGH: origin contains "fulltext_scraped" AND study_type is review or guideline
- MEDIUM: origin contains "fulltext" OR "media_annotated"
- LOW: origin is "pubmed_abstracts"
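As a sketch, the assignment above reduces to a small helper (hypothetical; the MEDIUM fallback for unknown provenance mirrors the Error Handling table):

```python
def assign_risk(origin: str, study_type: str) -> str:
    """Map decoded OE provenance + study type to a risk tier."""
    if "fulltext_scraped" in origin and study_type in ("review", "guideline"):
        return "HIGH"
    if "fulltext" in origin or "media_annotated" in origin:
        return "MEDIUM"
    if origin.startswith("pubmed_abstracts"):
        return "LOW"
    return "MEDIUM"  # undecodable/unknown origin -> MEDIUM (see Error Handling)
```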
### Phase 3: Parallel Citation Verification (Enhanced)

Launch one agent per citation using `model: "haiku"`. Each agent now has ENHANCED instructions:
#### Step A: Verify Paper Exists (unchanged)

`get_article_metadata` with PMID → `search_articles` by title → bioRxiv `get_preprint` → WebFetch DOI
#### Step B: Fetch Content (AGGRESSIVE)

1. `get_article_metadata` → check for a PMCID
2. If a PMCID exists: ALWAYS call `get_full_text_article` (don't skip)
3. If no PMCID and the abstract is empty/generic:
   → `WebFetch("https://doi.org/{DOI}", prompt="Extract the structured abstract, key findings, and conclusions")`
4. If the abstract mentions an NCT number:
   → Call the Clinical Trials MCP: `get_trial_details(nct_id)`
   → Extract primary/secondary endpoints, sample size, status
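Sketched as code, the Step B cascade looks like this, assuming thin Python wrappers `pubmed`, `trials`, and `webfetch` around the respective MCP tools (only the tool names come from this skill; the wrapper shapes and return types are ours):

```python
import re

def fetch_content(pubmed, trials, webfetch, citation: dict) -> dict:
    """Step B cascade: PMC full text > DOI resolution > abstract, plus registry lookup."""
    meta = pubmed.get_article_metadata(pmid=citation["pmid"])
    result = {"data_source": "abstract_only", "text": meta.get("abstract", "")}
    if meta.get("pmcid"):  # always prefer full text when a PMCID exists
        result = {"data_source": "full_text",
                  "text": pubmed.get_full_text_article(pmcid=meta["pmcid"])}
    elif not result["text"].strip():  # no PMCID, empty abstract -> resolve the DOI
        result = {"data_source": "doi_resolution",
                  "text": webfetch(f"https://doi.org/{citation['doi']}",
                                   prompt="Extract the structured abstract, "
                                          "key findings, and conclusions")}
    nct = re.search(r"NCT\d{8}", result["text"])
    if nct:  # verify RCT endpoints against the registry
        result["trial"] = trials.get_trial_details(nct_id=nct.group())
    return result
```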
#### Step C: Score Claim Accuracy (Multi-dimensional)
| Dimension | Score | Weight | How to assess |
|---|---|---|---|
| Paper exists | 0 or 1 | required | PubMed/DOI lookup |
| Metadata match | 0-1 | 10% | Author, year, journal correct? |
| Claim direction | 0-1 | 25% | Does paper support the direction of the claim? |
| Numbers verified | 0-1 | 35% | Specific HRs, CIs, % match? |
| Correct attribution | 0-1 | 20% | Is this the paper's OWN finding (not quoting another)? |
| No contradiction | 0-1 | 10% | Does any other evidence contradict? (filled in Phase 4) |
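A plausible roll-up of the table into a single composite. The gate-then-weight structure is implied by the table; treating `no_contradiction` as 1.0 until Phase 4 fills it in is our assumption:

```python
WEIGHTS = {
    "metadata_match": 0.10,
    "claim_direction": 0.25,
    "numbers_verified": 0.35,
    "correct_attribution": 0.20,
    "no_contradiction": 0.10,
}

def composite_score(exists: bool, dims: dict) -> float:
    """Weighted composite; 'paper exists' is a hard gate, not a weighted term."""
    if not exists:
        return 0.0
    # no_contradiction defaults to 1.0 until the Phase 4 scan fills it in
    return sum(w * dims.get(k, 1.0 if k == "no_contradiction" else 0.0)
               for k, w in WEIGHTS.items())
```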
#### Step D: Classify and Flag
- study_type: RCT | meta-analysis | cohort | review | guideline | editorial | preprint
- is_primary_source: true/false # Did this paper GENERATE the data, or just CITE it?
- transitive_risk: true/false # Review + quantitative claim about a specific trial
- trial_name_mentioned: "LEADER" | "SUSTAIN-6" | null # For trace-back
#### Agent Output Format (v2)
```
CITATION_REPORT:
- citation_index: [i]
- exists: true/false
- existence_details: "..."
- correct_doi: "..."
- correct_pmid: "..."
- claim_text: "..."
- dimensions:
    metadata_match: [0-1]
    claim_direction: [0-1]
    numbers_verified: [0-1]
    correct_attribution: [0-1]
- composite_score: [0-1]
- study_type: "..."
- is_primary_source: true/false
- transitive_risk: true/false
- trial_name_mentioned: "..." or null
- sample_size: "..."
- journal: "..."
- peer_reviewed: true/false
- full_text_available: true/false
- data_source: abstract_only / full_text / doi_resolution / clinical_trials_registry
- key_findings_from_source: "..."   # What the paper ACTUALLY found (for the contradiction check)
- warnings: ["..."]
END_CITATION_REPORT
```
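The orchestrator has to pull these blocks back out of free-form agent output. A minimal parser sketch, using the delimiters from the format above (the flat key/value extraction is a simplification that ignores nesting under `dimensions`):

```python
import re

BLOCK_RE = re.compile(r"CITATION_REPORT:(.*?)END_CITATION_REPORT", re.DOTALL)
FIELD_RE = re.compile(r"^\s*-?\s*(\w+):\s*(.+)$", re.MULTILINE)

def parse_citation_reports(agent_output: str) -> list[dict]:
    """Extract each CITATION_REPORT block as a flat {field: raw_value} dict."""
    return [
        {key: value.strip() for key, value in FIELD_RE.findall(block)}
        for block in BLOCK_RE.findall(agent_output)
    ]
```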
### Phase 3b: Transitive Trace-Back (NEW)

Triggered for citations where `transitive_risk: true` AND `trial_name_mentioned` is not null.

Launch additional haiku agents to find and verify the ORIGINAL source:
Prompt: "The review [Nauck 2026] claims 'stroke reduction 13-17%'
referencing what appears to be [LEADER / SUSTAIN-6 / Kristensen 2019].
Search PubMed for the ORIGINAL trial/meta-analysis.
Verify if the number matches the original source."
Output:

```
TRACEBACK_REPORT:
- original_citation_index: [i]
- traced_to_pmid: "..."
- traced_to_title: "..."
- traced_to_study_type: "RCT" or "meta-analysis"
- number_matches_original: true/false
- original_finding: "..."
- attribution_correct: true/false   # Should OE have cited the original instead?
END_TRACEBACK_REPORT
```
### Phase 4: Cross-Citation Contradiction Scan (NEW)

Launch one sonnet-model agent that receives ALL Phase 3 and 3b reports.

Instructions:
1. Read all CITATION_REPORT entries
2. Extract key_findings_from_source for each
3. For EACH quantitative claim in the OE response:
   - Check whether multiple citations report DIFFERENT findings on the same outcome
   - Flag contradictions with severity (see the sketch after this list):
     - CRITICAL: one source confirms, another explicitly denies
     - WARNING: sources report different magnitudes (>20% difference)
     - NOTE: sources use different populations/timeframes (may explain the difference)
4. Produce a CONTRADICTION_REPORT
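The severity rules can be sketched as a comparator over two parsed findings on the same outcome (field names `direction`, `effect_pct`, and `population` are illustrative, not part of the agent format):

```python
def classify_severity(a: dict, b: dict) -> str | None:
    """Apply the CRITICAL/WARNING/NOTE rules to two findings on one outcome."""
    if a["direction"] != b["direction"]:
        return "CRITICAL"  # one source confirms, the other explicitly denies
    if a.get("effect_pct") and b.get("effect_pct"):
        gap = abs(a["effect_pct"] - b["effect_pct"]) / max(a["effect_pct"], b["effect_pct"])
        if gap > 0.20:
            if a.get("population") != b.get("population"):
                return "NOTE"  # population/timeframe mismatch may explain the gap
            return "WARNING"   # same outcome, >20% difference in magnitude
    return None  # no contradiction on this outcome
```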
Output:

```
CONTRADICTION_REPORT:
- contradictions_found: N
- items:
  - outcome: "stroke reduction"
    claim_in_oe: "13-17% reduction"
    source_a: {pmid: X, finding: "no difference", origin: "jacc_fulltext"}
    source_b: {pmid: Y, finding: "13-17%", origin: "lancet_fulltext (quoting others)"}
    severity: CRITICAL
    explanation: "Source A is a comprehensive meta-analysis of 99,599 patients finding no stroke benefit. Source B is a review article quoting older, smaller meta-analyses."
END_CONTRADICTION_REPORT
```
### Phase 5: Collation and Multi-Dimensional Report

The orchestrator computes:
| Metric | Formula |
|---|---|
| Citation Existence Rate | exists_count / total x 100% |
| Mean Composite Score | average(all composite_score) |
| Transitive Citation Rate | transitive_risk_count / total x 100% |
| Contradiction Count | from Phase 4 |
| Full-Text Verification Rate | full_text_count / total x 100% |
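Assuming the parsed reports have been coerced to typed fields, the roll-up is direct (a sketch; an empty report list is not handled):

```python
def summary_metrics(reports: list[dict], contradiction_count: int) -> dict:
    """Compute the Phase 5 table from typed CITATION_REPORT dicts."""
    total = len(reports)
    return {
        "citation_existence_rate": 100 * sum(r["exists"] for r in reports) / total,
        "mean_composite_score": sum(r["composite_score"] for r in reports) / total,
        "transitive_citation_rate": 100 * sum(r["transitive_risk"] for r in reports) / total,
        "contradiction_count": contradiction_count,
        "full_text_verification_rate": 100 * sum(r["full_text_available"] for r in reports) / total,
    }
```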
### Grading (v2)

| Grade | Criteria |
|---|---|
| PASS | >= 90% exist AND mean composite >= 0.8 AND 0 CRITICAL contradictions |
| CAUTION | >= 75% exist AND mean composite >= 0.6 AND <= 1 CRITICAL contradiction |
| FAIL | < 75% exist OR mean composite < 0.6 OR > 1 CRITICAL contradiction OR fabricated citations |
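A direct encoding of the grading table (hypothetical helper; FAIL conditions are checked first so they dominate the PASS/CAUTION thresholds):

```python
def grade(exist_rate: float, mean_composite: float,
          critical_contradictions: int, fabricated: bool) -> str:
    """Grade an audit; exist_rate and mean_composite are on 0-1 scales."""
    if (fabricated or exist_rate < 0.75 or mean_composite < 0.6
            or critical_contradictions > 1):
        return "FAIL"
    if exist_rate >= 0.90 and mean_composite >= 0.80 and critical_contradictions == 0:
        return "PASS"
    return "CAUTION"
```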
### Report Template (v2)

```markdown
# OpenEvidence Citation Audit Report

**Query:** "{question}"
**Date:** {YYYY-MM-DD}
**OE Article ID:** `{uuid}`
**OE Crossref Self-Validation:** {n}/{total}

## Executive Summary

| Metric | Result |
|--------|--------|
| Citations verified | X/N (Y%) |
| Mean composite score | Z.ZZ/1.0 |
| Transitive citations detected | M |
| Cross-citation contradictions | K (C critical) |
| Full-text verification rate | P% |

## Overall Assessment: {PASS|CAUTION|FAIL}

## Provenance Analysis

| # | Paper | OE Data Source | Risk Level | Primary Source? |
|---|-------|----------------|------------|-----------------|

## Cross-Citation Contradictions

| Outcome | OE Claims | Source A Says | Source B Says | Severity |
|---------|-----------|---------------|---------------|----------|

## Transitive Citation Trace

| # | OE Cites (Review) | Claim Actually From | Verified Against Original? |
|---|-------------------|---------------------|----------------------------|

## Citation-by-Citation Results

| # | Paper | Exists | Composite | Direction | Numbers | Attribution | Flags |
|---|-------|--------|-----------|-----------|---------|-------------|-------|

## Detailed Findings

### Citation [i]: Author (Year) -- Score: X.XX
- Claim: "..."
- Verified: ...
- Provenance: {origin} (risk: {level})
- Attribution: Primary / Transitive (traced to: ...)
- Verdict: ...

## Evidence Strength Summary

## Methodology Notes
```
## Error Handling
| Failure | Action |
|---|---|
| OE auth invalid | Abort with clear message |
| DOI doesn't resolve | Try PubMed title search, then WebSearch. Mark exists=false if all fail |
| Paper not in PubMed | Try bioRxiv, then WebFetch on DOI |
| Paywalled (no PMCID) | WebFetch on DOI for publisher abstract; note limitation |
| BibTeX metadata wrong | Note discrepancy, still verify actual paper |
| Paper retracted | Flag CRITICALLY in warnings |
| Preprint | Flag peer_reviewed=false, check for published version |
| Non-academic source | Skip verification, note in report |
| Agent timeout | Retry once; if still fails, report partial results |
| OE structured data missing | Fall back to BibTeX-only parsing (v1 behavior) |
| Origin field not decodable | Mark provenance as "unknown", apply MEDIUM risk |
## Configuration
| Parameter | Default | Description |
|---|---|---|
| model_verify | haiku | Model for Phase 3 verification agents |
| model_contradict | sonnet | Model for Phase 4 contradiction agent |
| timeout_sec | 120 | OE query timeout |
| max_parallel | 15 | Max concurrent verification agents |
| full_text | aggressive | Always attempt PMC + DOI fallback |
| trace_transitive | true | Run Phase 3b for reviews with quantitative claims |
| check_contradictions | true | Run Phase 4 cross-citation scan |
| clinical_trials | true | Verify RCT endpoints against registry |
| output_dir | ./{topic-slug}/ | Where to write results |