audit-oe

OpenEvidence Citation Audit v2

Independently verify every citation in an OpenEvidence response using parallel PubMed/bioRxiv/ClinicalTrials lookups, detect transitive citation errors, find cross-citation contradictions, and produce a structured accuracy report.

Prerequisites

  • OpenEvidence MCP server connected and authenticated
  • PubMed MCP server available (citation verification + full text)
  • bioRxiv MCP server available (preprint verification)
  • Clinical Trials MCP server available (endpoint verification)
  • WebFetch available (DOI resolution fallback)

Trigger

Use this skill when:

  • User asks to "audit", "verify", or "check" an OpenEvidence response
  • User asks to query OpenEvidence and validate its citations
  • User wants to critically appraise OE's evidence base

Input

Either:

  1. A topic/question -- skill queries OE then audits the response
  2. An existing OE article ID -- skill fetches and audits it directly

Output Structure

{topic-slug}/
  original.md          # Raw OE extracted answer
  assets/
    citations.bib      # BibTeX block from OE
  report.md            # Structured audit report with provenance + contradictions

Known OE Failure Modes (why this skill exists)

OE uses a full-text RAG system over the following indexed corpora (origin keys decoded from the ROT-1-obfuscated `origin` field):

| Origin Key | Decoded | Risk |
|------------|---------|------|
| `qvcnfe_bctusbdut_*` | `pubmed_abstracts_hindex_35pct_ada3small` | Low -- abstract-level claims |
| `kbdd_gvmmufyu_tdsbqfe` | `jacc_fulltext_scraped` | Medium -- full-text chunks |
| `mbodfu_gvmmufyu_tdsbqfe` | `lancet_fulltext_scraped` | HIGH -- reviews quoting others |
| `ofkn_sfwjfx_bsujdmf_*` | `nejm_review_article_fulltext_sftp` | HIGH -- reviews quoting others |
| `hvjefmjoft_gvmmufyu_*` | `guidelines_fulltext_usa_manual` | Medium -- guideline recommendations |
| `nfejb_boopubufe_hfnjoj` | `media_annotated_gemini` | Medium -- AI-annotated figures |
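A minimal sketch of the decode implied by the table: letters shift back one code point while separators such as `_` stay as-is (the function name is illustrative):

```python
def decode_origin(raw: str) -> str:
    """Undo the ROT-1 obfuscation: shift letters back one code point,
    leaving digits and separators untouched."""
    return "".join(chr(ord(c) - 1) if c.isalpha() else c for c in raw)
```

For example, `decode_origin("mbodfu_gvmmufyu_tdsbqfe")` yields `lancet_fulltext_scraped`.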

Primary failure mode: Transitive Citation. When a review (e.g., Nauck 2026) writes "GLP-1 RA reduce stroke by 13%" citing Kristensen 2019, OE retrieves that chunk and attributes the stroke finding to Nauck. But Nauck is just quoting -- it's not their finding. If another meta-analysis (Galli 2025) finds NO stroke benefit, OE doesn't detect the contradiction.


Workflow

Phase 1: Query OpenEvidence

1. Call oe_auth_status() -- abort if invalid
2. Call oe_ask with:
   - question: <user's topic>
   - include_bibtex: true
   - crossref_validate: true
   - wait_for_completion: true
   - timeout_sec: 120
3. Save extracted_answer_raw -> {topic-slug}/original.md
4. Save BibTeX block -> {topic-slug}/assets/citations.bib
5. Record: article_id, citationCount, crossrefValidatedCount

Phase 2: Parse, Map, and Decode Provenance

Orchestrator extracts from the structured_article (not just BibTeX):

# Access structured spans with citation metadata
sections = article.output.structured_article.articlesection_set
for section in sections:
    for para in section.articleparagraph_set:
        for span in para.articlespan_set:
            text = span.text
            for citation in span.citations:
                # Decode provenance
                raw_origin = citation.metadata.origin
                origin = ''.join(chr(ord(c) - 1) if c.isalpha() else c for c in raw_origin)  # ROT-1 decode; keep '_' separators
                impact = citation.metadata.why_cited.impact_score

For each citation, build an enhanced descriptor:

- index: N
  authors: "LastName et al."
  title: "..."
  journal: "..."
  year: YYYY
  doi: "10.xxxx/..."
  pmid: "NNNNNNN"
  claim_text: "The exact sentence from OE span"
  strategy: pubmed_pmid | pubmed_doi | pubmed_title | biorxiv | web_doi
  # NEW v2 fields:
  oe_origin: "lancet_fulltext_scraped"
  oe_impact_score: 21.54
  risk_level: HIGH | MEDIUM | LOW
  has_quantitative_claim: true  # contains HR, CI, %, p-value
  needs_transitive_check: true  # review/guideline + quantitative = yes

Risk level assignment:

  • HIGH: origin contains "fulltext_scraped" AND (study_type is review OR guideline)
  • MEDIUM: origin contains "fulltext" OR "media_annotated"
  • LOW: origin is "pubmed_abstracts"
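The risk rules and the quantitative-claim flag above can be sketched as helpers (function names and the exact regex are illustrative, not part of the OE API):

```python
import re

# Matches HR/RR/OR values, percentages, confidence intervals, and p-values.
QUANT_PATTERN = re.compile(
    r"\b(HR|RR|OR)\s*[=:]?\s*\d|\d+(\.\d+)?\s*%|95%\s*CI|p\s*[<=>]\s*0?\.\d+",
    re.IGNORECASE,
)

def has_quantitative_claim(text: str) -> bool:
    """True if the span contains an HR/RR/OR, CI, percentage, or p-value."""
    return bool(QUANT_PATTERN.search(text))

def risk_level(origin: str, study_type: str) -> str:
    """Apply the HIGH/MEDIUM/LOW rules listed above."""
    if "fulltext_scraped" in origin and study_type in ("review", "guideline"):
        return "HIGH"
    if "fulltext" in origin or "media_annotated" in origin:
        return "MEDIUM"
    if "pubmed_abstracts" in origin:
        return "LOW"
    return "MEDIUM"  # undecodable provenance defaults to MEDIUM (see Error Handling)
```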

Phase 3: Parallel Citation Verification (Enhanced)

Launch one Agent per citation using model: "haiku".

Each agent now has ENHANCED instructions:

Step A: Verify Paper Exists (unchanged)

  • get_article_metadata with PMID → search_articles → bioRxiv get_preprint → WebFetch DOI

Step B: Fetch Content (AGGRESSIVE)

1. get_article_metadata → check for PMCID
2. If PMCID exists: ALWAYS call get_full_text_article (don't skip)
3. If no PMCID and abstract is empty/generic:
   → WebFetch("https://doi.org/{DOI}", prompt="Extract the structured abstract, key findings, and conclusions")
4. If abstract mentions NCT number:
   → Call Clinical Trials MCP: get_trial_details(nct_id)
   → Extract primary/secondary endpoints, sample size, status
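Steps 1-4 above, reduced to a planning helper (the metadata field names `pmcid` and `abstract` are assumptions about the get_article_metadata output; the action names are shorthand for the MCP calls listed above):

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_RE = re.compile(r"\bNCT\d{8}\b")

def fetch_plan(metadata: dict) -> list[str]:
    """Return the ordered fetch actions for one citation, per Step B."""
    plan = []
    if metadata.get("pmcid"):
        plan.append("get_full_text_article")  # always fetch PMC full text when available
    elif not metadata.get("abstract"):
        plan.append("webfetch_doi")           # publisher-page fallback for empty abstracts
    if NCT_RE.search(metadata.get("abstract", "") or ""):
        plan.append("get_trial_details")      # verify endpoints against the registry
    return plan
```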

Step C: Score Claim Accuracy (Multi-dimensional)

| Dimension | Score | Weight | How to assess |
|-----------|-------|--------|---------------|
| Paper exists | 0 or 1 | required | PubMed/DOI lookup |
| Metadata match | 0-1 | 10% | Author, year, journal correct? |
| Claim direction | 0-1 | 25% | Does the paper support the direction of the claim? |
| Numbers verified | 0-1 | 35% | Do the specific HRs, CIs, and percentages match? |
| Correct attribution | 0-1 | 20% | Is this the paper's OWN finding (not a quote of another)? |
| No contradiction | 0-1 | 10% | Does any other evidence contradict? (filled in Phase 4) |
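Assuming "Paper exists" acts as a gate and the remaining dimensions combine as a weighted average (weights from the table above), the composite score might be computed as:

```python
WEIGHTS = {
    "metadata_match": 0.10,
    "claim_direction": 0.25,
    "numbers_verified": 0.35,
    "correct_attribution": 0.20,
    "no_contradiction": 0.10,
}

def composite_score(exists: bool, dims: dict) -> float:
    """Existence gates the score to 0; otherwise take the weighted average
    of the remaining dimensions (missing dimensions count as 0)."""
    if not exists:
        return 0.0
    return round(sum(w * dims.get(k, 0.0) for k, w in WEIGHTS.items()), 2)
```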

Step D: Classify and Flag

- study_type: RCT | meta-analysis | cohort | review | guideline | editorial | preprint
- is_primary_source: true/false  # Did this paper GENERATE the data, or just CITE it?
- transitive_risk: true/false    # Review + quantitative claim about a specific trial
- trial_name_mentioned: "LEADER" | "SUSTAIN-6" | null  # For trace-back

Agent Output Format (v2)

CITATION_REPORT:
- citation_index: [i]
- exists: true/false
- existence_details: "..."
- correct_doi: "..."
- correct_pmid: "..."
- claim_text: "..."
- dimensions:
    metadata_match: [0-1]
    claim_direction: [0-1]
    numbers_verified: [0-1]
    correct_attribution: [0-1]
- composite_score: [0-1]
- study_type: "..."
- is_primary_source: true/false
- transitive_risk: true/false
- trial_name_mentioned: "..." or null
- sample_size: "..."
- journal: "..."
- peer_reviewed: true/false
- full_text_available: true/false
- data_source: abstract_only / full_text / doi_resolution / clinical_trials_registry
- key_findings_from_source: "..." # What the paper ACTUALLY found (for contradiction check)
- warnings: ["..."]
END_CITATION_REPORT

Phase 3b: Transitive Trace-Back (NEW)

Triggered for: citations where transitive_risk: true AND trial_name_mentioned is not null.

Launch additional haiku agents to find and verify the ORIGINAL source:

Prompt: "The review [Nauck 2026] claims 'stroke reduction 13-17%' 
         referencing what appears to be [LEADER / SUSTAIN-6 / Kristensen 2019].
         Search PubMed for the ORIGINAL trial/meta-analysis.
         Verify if the number matches the original source."

Output:

TRACEBACK_REPORT:
- original_citation_index: [i]
- traced_to_pmid: "..."
- traced_to_title: "..."
- traced_to_study_type: "RCT" or "meta-analysis"
- number_matches_original: true/false
- original_finding: "..."
- attribution_correct: true/false  # Should OE have cited the original instead?
END_TRACEBACK_REPORT

Phase 4: Cross-Citation Contradiction Scan (NEW)

Launch 1 sonnet-model agent that receives ALL Phase 3 + 3b reports and:

Instructions:
1. Read all CITATION_REPORT entries
2. Extract key_findings_from_source for each
3. For EACH quantitative claim in the OE response:
   - Check if multiple citations report DIFFERENT findings on the same outcome
   - Flag contradictions with severity:
     - CRITICAL: One source confirms, another explicitly denies
     - WARNING: Sources report different magnitudes (>20% difference)
     - NOTE: Sources use different populations/timeframes (may explain difference)
4. Produce a CONTRADICTION_REPORT

Output:

CONTRADICTION_REPORT:
- contradictions_found: N
- items:
  - outcome: "stroke reduction"
    claim_in_oe: "13-17% reduction"
    source_a: {pmid: X, finding: "no difference", origin: "jacc_fulltext"}
    source_b: {pmid: Y, finding: "13-17%", origin: "lancet_fulltext (quoting others)"}
    severity: CRITICAL
    explanation: "Source A is a comprehensive meta-analysis of 99,599 patients finding no stroke benefit. Source B is a review article quoting older, smaller meta-analyses."
END_CONTRADICTION_REPORT
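One way to encode the severity rules above, assuming effect sizes are compared as relative differences (the >20% threshold comes from the WARNING rule; treating population/timeframe differences as downgrading WARNING to NOTE is an interpretation, and the function name is illustrative):

```python
from typing import Optional

def contradiction_severity(effect_a: float, effect_b: float,
                           explicit_denial: bool,
                           populations_differ: bool) -> Optional[str]:
    """Classify two findings on the same outcome per the Phase 4 rules.
    Effects are relative reductions, e.g. 0.13 for a 13% reduction."""
    if explicit_denial:
        return "CRITICAL"  # one source confirms, the other explicitly denies
    denom = max(abs(effect_a), abs(effect_b))
    if denom and abs(effect_a - effect_b) / denom > 0.20:
        # Different populations or timeframes may explain the gap.
        return "NOTE" if populations_differ else "WARNING"
    return None  # findings agree within tolerance: no contradiction
```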

Phase 5: Collation and Multi-Dimensional Report

Orchestrator computes:

| Metric | Formula |
|--------|---------|
| Citation Existence Rate | exists_count / total × 100% |
| Mean Composite Score | average(all composite_score) |
| Transitive Citation Rate | transitive_risk_count / total × 100% |
| Contradiction Count | from Phase 4 |
| Full-Text Verification Rate | full_text_count / total × 100% |
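A sketch of the collation over the Phase 3 reports (field names follow the CITATION_REPORT format; the list-of-dicts shape is an assumption about how the orchestrator holds the parsed reports):

```python
def collate(reports: list[dict]) -> dict:
    """Compute the Phase 5 summary metrics from parsed CITATION_REPORT dicts."""
    n = len(reports)
    if n == 0:
        return {}
    return {
        "existence_rate_pct": 100.0 * sum(r["exists"] for r in reports) / n,
        "mean_composite": sum(r["composite_score"] for r in reports) / n,
        "transitive_rate_pct": 100.0 * sum(r["transitive_risk"] for r in reports) / n,
        "full_text_rate_pct": 100.0 * sum(r["full_text_available"] for r in reports) / n,
    }
```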

Grading (v2)

| Grade | Criteria |
|-------|----------|
| PASS | >= 90% exist AND mean composite >= 0.8 AND 0 CRITICAL contradictions |
| CAUTION | >= 75% exist AND mean composite >= 0.6 AND <= 1 CRITICAL contradiction |
| FAIL | < 75% exist OR mean composite < 0.6 OR > 1 CRITICAL contradiction OR fabricated citations |
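The grading table transcribed directly into a helper (thresholds as stated above; the `fabricated` flag covers the "fabricated citations" clause):

```python
def grade(exist_pct: float, mean_composite: float,
          critical_contradictions: int, fabricated: bool = False) -> str:
    """Map audit metrics to PASS / CAUTION / FAIL per the v2 criteria."""
    if fabricated:
        return "FAIL"
    if exist_pct >= 90 and mean_composite >= 0.8 and critical_contradictions == 0:
        return "PASS"
    if exist_pct >= 75 and mean_composite >= 0.6 and critical_contradictions <= 1:
        return "CAUTION"
    return "FAIL"
```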

Report Template (v2)

# OpenEvidence Citation Audit Report

**Query:** "{question}"
**Date:** {YYYY-MM-DD}
**OE Article ID:** `{uuid}`
**OE Crossref Self-Validation:** {n}/{total}

## Executive Summary
| Metric | Result |
|--------|--------|
| Citations verified | X/N (Y%) |
| Mean composite score | Z.ZZ/1.0 |
| Transitive citations detected | M |
| Cross-citation contradictions | K (C critical) |
| Full-text verification rate | P% |

## Overall Assessment: {PASS|CAUTION|FAIL}

## Provenance Analysis
| # | Paper | OE Data Source | Risk Level | Primary Source? |
|---|-------|---------------|------------|-----------------|

## Cross-Citation Contradictions
| Outcome | OE Claims | Source A Says | Source B Says | Severity |
|---------|-----------|-------------|-------------|----------|

## Transitive Citation Trace
| # | OE Cites (Review) | Claim Actually From | Verified Against Original? |
|---|-------------------|--------------------|-----------------------------|

## Citation-by-Citation Results
| # | Paper | Exists | Composite | Direction | Numbers | Attribution | Flags |
|---|-------|--------|-----------|-----------|---------|-------------|-------|

## Detailed Findings
### Citation [i]: Author (Year) -- Score: X.XX
- Claim: "..."
- Verified: ...
- Provenance: {origin} (risk: {level})
- Attribution: Primary / Transitive (traced to: ...)
- Verdict: ...

## Evidence Strength Summary
## Methodology Notes

Error Handling

| Failure | Action |
|---------|--------|
| OE auth invalid | Abort with a clear message |
| DOI doesn't resolve | Try PubMed title search, then WebSearch; mark exists=false if all fail |
| Paper not in PubMed | Try bioRxiv, then WebFetch on the DOI |
| Paywalled (no PMCID) | WebFetch on the DOI for the publisher abstract; note the limitation |
| BibTeX metadata wrong | Note the discrepancy, still verify the actual paper |
| Paper retracted | Flag CRITICALLY in warnings |
| Preprint | Flag peer_reviewed=false, check for a published version |
| Non-academic source | Skip verification, note in report |
| Agent timeout | Retry once; if it still fails, report partial results |
| OE structured data missing | Fall back to BibTeX-only parsing (v1 behavior) |
| Origin field not decodable | Mark provenance as "unknown", apply MEDIUM risk |

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| model_verify | haiku | Model for Phase 3 verification agents |
| model_contradict | sonnet | Model for Phase 4 contradiction agent |
| timeout_sec | 120 | OE query timeout |
| max_parallel | 15 | Max concurrent verification agents |
| full_text | aggressive | Always attempt PMC + DOI fallback |
| trace_transitive | true | Run Phase 3b for reviews with quantitative claims |
| check_contradictions | true | Run Phase 4 cross-citation scan |
| clinical_trials | true | Verify RCT endpoints against the registry |
| output_dir | ./{topic-slug}/ | Where to write results |