Reference Hallucination Arena Skill
Evaluate how accurately LLMs recommend real academic references using the
OpenJudge RefArenaPipeline:
- Load queries — from JSON/JSONL dataset
- Collect responses — BibTeX-formatted references from target models
- Extract references — parse BibTeX entries from model output
- Verify references — cross-check against Crossref / PubMed / arXiv / DBLP
- Score & rank — compute verification rate, per-field accuracy, discipline breakdown
- Generate report — Markdown report + visualization charts
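The "Extract references" step can be pictured with a small parser. The following is a minimal sketch, not the OpenJudge implementation: `extract_bibtex_entries` is a hypothetical helper that handles only flat fields written as `{...}`, `"..."`, or a bare number, and it skips values with nested braces.

```python
import re

# Hypothetical helper for illustration; not part of OpenJudge.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')


def extract_bibtex_entries(text: str) -> list[dict[str, str]]:
    """Return one dict per BibTeX entry with its type, key, and flat fields."""
    entries = []
    starts = list(ENTRY_START.finditer(text))
    for i, match in enumerate(starts):
        # An entry body runs until the next "@type{key," or the end of the text.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = text[match.end():end]
        entry = {"type": match.group(1).lower(), "key": match.group(2)}
        for name, braced, quoted, bare in FIELD.findall(body):
            value = braced or quoted or bare
            # Collapse internal whitespace so wrapped titles compare cleanly.
            entry[name.lower()] = " ".join(value.split())
        entries.append(entry)
    return entries
```

A production parser would also need to handle nested braces, string concatenation, and prose that follows an entry; a dedicated BibTeX library is usually the safer choice.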
Prerequisites
```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib
```
Gather from user before running
| Info | Required? | Notes |
|---|---|---|
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| CrossRef email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: ./evaluation_results/ref_hallucination_arena |
| Report language | No | "zh" (default) or "en" |
| Tavily API key | No | Required only if using tool-augmented mode |
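Before launching a run, the required items in the table above can be checked programmatically. This is a minimal sketch under stated assumptions: `preflight` is a hypothetical helper, and the list of environment variable names to check depends on which endpoints your config actually uses.

```python
import os
from pathlib import Path


def preflight(config_path, env_names, environ=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if environ is None else environ
    problems = []
    if not Path(config_path).is_file():
        problems.append(f"config file not found: {config_path}")
    for name in env_names:
        # Treat an empty value the same as an unset variable.
        if not env.get(name):
            problems.append(f"environment variable not set: {name}")
    return problems
```

For example, `preflight("config.yaml", ["OPENAI_API_KEY", "DASHSCOPE_API_KEY"])` returns an empty list only when the file exists and both keys are set.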
Quick start
CLI
```bash
# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save
```
Python API
```python
import asyncio
from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline


async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()
    # Rankings are (model, score) pairs, best first
    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")


asyncio.run(main())
```
CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | (none) | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
Minimal config file
```yaml
task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
```
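The config mixes two placeholder styles: `${ENV_VAR}` in `api_key` and `{num_refs}` in `system_prompt`. As a rough illustration of how `${ENV_VAR}` substitution could work (an assumption about the behavior, not OpenJudge's actual loader), the hypothetical `expand_env` below replaces only the `${...}` form and leaves `{num_refs}` untouched.

```python
import os
import re

# Matches ${NAME} but not {name}; hypothetical helper for illustration.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, environ=None) -> str:
    """Replace each ${NAME} in value with the matching environment variable."""
    env = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Fail loudly rather than sending an empty API key.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, value)
```

Raising on a missing variable is a design choice for this sketch; it surfaces a forgotten `export` before any request is sent.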
Full config reference
task
| Field | Required | Description |
|---|---|---|
| `description` | Yes | Evaluation task description |
| `scenario` | No | Usage scenario |
dataset
| Field | Default | Description |
|---|---|---|
| `path` | (none) | Path to JSON/JSONL dataset file (required) |
| `shuffle` | `false` | Shuffle queries before evaluation |
| `max_queries` | `null` | Max queries to use (`null` = all) |
target_endpoints.<name>
| Field | Default | Description |
|---|---|---|
| `base_url` | (none) | API base URL (required) |
| `api_key` | (none) | API key, supports `${ENV_VAR}` (required) |
| `model` | (none) | Model name (required) |
| `system_prompt` | built-in | System prompt; use `{num_refs}` placeholder |
| `max_concurrency` | `5` | Max concurrent requests for this endpoint |
| `extra_params` | (none) | Extra API request params (e.g. `temperature`) |
| `tool_config.enabled` | `false` | Enable ReAct agent with Tavily web search |
| `tool_config.tavily_api_key` | env var | Tavily API key |
| `tool_config.max_iterations` | `10` | Max ReAct iterations (1–30) |
| `tool_config.search_depth` | `"advanced"` | `"basic"` or `"advanced"` |
verification
| Field | Default | Description |
|---|---|---|
| `crossref_mailto` | (none) | Email for Crossref polite pool |
| `pubmed_api_key` | (none) | PubMed API key |
| `max_workers` | `10` | Concurrent verification threads (1–50) |
| `timeout` | `30` | Per-request timeout in seconds |
| `verified_threshold` | `0.7` | Min composite score to count as VERIFIED |
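To show where `crossref_mailto` ends up, here is a sketch of a Crossref lookup URL. It uses the public Crossref REST API parameters `query.bibliographic`, `query.author`, `rows`, and `mailto`; the helper name `build_crossref_query` is hypothetical, and how OpenJudge actually builds its requests or computes the composite score is not specified in this document.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(title, author=None, mailto=None, rows=3):
    """Build a Crossref works search URL for one candidate reference."""
    params = {"query.bibliographic": title, "rows": rows}
    if author:
        params["query.author"] = author
    if mailto:
        # Supplying a contact email routes requests to Crossref's polite pool.
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"
```

The returned URL can be fetched with any HTTP client, using the `timeout` value from the table above as the per-request timeout.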
evaluation
| Field | Default | Description |
|---|---|---|
| `timeout` | `120` | Model API request timeout in seconds |
| `retry_times` | `3` | Number of retry attempts |
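The interaction between `timeout` and `retry_times` can be sketched with `asyncio`. This is an illustration only: `call_with_retry` is a hypothetical helper, and it assumes `retry_times` counts total attempts, which this document does not confirm.

```python
import asyncio


async def call_with_retry(make_call, timeout=120.0, retry_times=3, backoff=0.0):
    """Await make_call() with a per-attempt timeout, retrying on failure."""
    last_error = None
    for attempt in range(1, retry_times + 1):
        try:
            # Each attempt gets its own timeout budget.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retry_times:
                # Linear backoff; zero by default so the sketch stays fast.
                await asyncio.sleep(backoff * attempt)
    raise RuntimeError(f"gave up after {retry_times} attempts") from last_error
```

With the defaults shown in the table, one model request could therefore occupy up to 3 × 120 seconds before it is reported as failed, under the assumption above.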
output
| Field | Default | Description |
|---|---|---|
| `output_dir` | `./evaluation_results/ref_hallucination_arena` | Output directory |
| `save_queries` | `true` | Save loaded queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save verification details |
report
| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable report generation |
| `language` | `"zh"` | Report language: `"zh"` or `"en"` |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate charts |
| `chart.orientation` | `"vertical"` | `"horizontal"` or `"vertical"` |
| `chart.show_values` | `true` | Show values on bars |
| `chart.highlight_best` | `true` | Highlight best model |
Dataset format
Each query in the JSON/JSONL dataset:
```json
{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}
```
| Field | Required | Description |
|---|---|---|
| `query` | Yes | Prompt for reference recommendation |
| `discipline` | No | `computer_science`, `biomedical`, `physics`, `chemistry`, `social_science`, `interdisciplinary`, `other` |
| `num_refs` | No | Expected number of references (default: 5) |
| `language` | No | `"zh"` or `"en"` (default: `"zh"`) |
| `year_constraint` | No | `{"exact": 2023}`, `{"min_year": 2020}`, `{"max_year": 2015}`, or `{"min_year": 2020, "max_year": 2024}` |
Official dataset: OpenJudge/ref-hallucination-arena
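A custom dataset can be sanity-checked against the field table before a run. The sketch below is illustrative: `normalize_query` and `satisfies_year_constraint` are hypothetical helpers, the defaults come from the table above, and treating `min_year` and `max_year` as inclusive bounds is an assumption.

```python
from typing import Optional

ALLOWED_DISCIPLINES = {
    "computer_science", "biomedical", "physics", "chemistry",
    "social_science", "interdisciplinary", "other",
}


def normalize_query(record: dict) -> dict:
    """Validate one dataset record and fill in the documented defaults."""
    if not record.get("query"):
        raise ValueError("'query' is required")
    discipline = record.get("discipline")
    if discipline is not None and discipline not in ALLOWED_DISCIPLINES:
        raise ValueError(f"unknown discipline: {discipline}")
    return {
        "query": record["query"],
        "discipline": discipline,
        "num_refs": record.get("num_refs", 5),
        "language": record.get("language", "zh"),
        "year_constraint": record.get("year_constraint"),
    }


def satisfies_year_constraint(year: int, constraint: Optional[dict]) -> bool:
    """Check a publication year against a year_constraint (bounds inclusive)."""
    if not constraint:
        return True
    if "exact" in constraint:
        return year == constraint["exact"]
    if "min_year" in constraint and year < constraint["min_year"]:
        return False
    if "max_year" in constraint and year > constraint["max_year"]:
        return False
    return True
```

For a JSONL file, each line can be passed through `json.loads` and then `normalize_query` to catch malformed records early.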
Interpreting results
Overall accuracy (verification rate):
- > 75% — Excellent: model rarely hallucinates references
- 60–75% — Good: most references are real, some fabrication
- 40–60% — Fair: significant hallucination, use with caution
- < 40% — Poor: model frequently fabricates references
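The bands above translate into a small lookup. In this sketch, `accuracy_band` is a hypothetical helper and the handling of exact boundary values (75% counts as Good, 60% as Good, 40% as Fair) is an assumption, since the list does not say which side the boundaries fall on.

```python
def accuracy_band(rate: float) -> str:
    """Map a verification rate in [0, 1] to the qualitative band."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate > 0.75:
        return "Excellent"
    if rate >= 0.60:
        return "Good"
    if rate >= 0.40:
        return "Fair"
    return "Poor"
```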
Per-field accuracy:
- `title_accuracy`: % of titles matching real papers
- `author_accuracy`: % of correct author lists
- `year_accuracy`: % of correct publication years
- `doi_accuracy`: % of valid DOIs
Verification status:
- `VERIFIED`: title + author + year all exactly match a real paper
- `SUSPECT`: partial match (e.g. title matches but authors differ)
- `NOT_FOUND`: no match in any database
- `ERROR`: API timeout or network failure
Ranking order: overall accuracy → year compliance rate → avg confidence → completeness
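The tie-breaking order can be expressed as a tuple sort key. This is a sketch, not the pipeline's code: the `ModelScore` dataclass and its field names are hypothetical, and it assumes a higher value is better for every criterion.

```python
from dataclasses import dataclass


@dataclass
class ModelScore:
    # Hypothetical container; field names are illustrative only.
    model: str
    overall_accuracy: float
    year_compliance_rate: float
    avg_confidence: float
    completeness: float


def rank_models(scores):
    """Sort best-first; later fields only matter when earlier ones tie."""
    return sorted(
        scores,
        key=lambda s: (
            s.overall_accuracy,
            s.year_compliance_rate,
            s.avg_confidence,
            s.completeness,
        ),
        reverse=True,
    )
```

Two models with the same overall accuracy are thus separated by year compliance rate first, and by the remaining criteria only if that also ties.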
Output files
```
evaluation_results/ref_hallucination_arena/
├── evaluation_report.md        # Detailed Markdown report
├── evaluation_results.json     # Rankings, per-field accuracy, scores
├── verification_chart.png      # Per-field accuracy bar chart
├── discipline_chart.png        # Per-discipline accuracy chart
├── queries.json                # Loaded evaluation queries
├── responses.json              # Raw model responses
├── extracted_refs.json         # Extracted BibTeX references
├── verification_results.json   # Per-reference verification details
└── checkpoint.json             # Pipeline checkpoint for resume
```
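After a run, a quick check can confirm which of the files listed above were written. The helper `missing_outputs` below is hypothetical; note that some files may legitimately be absent, for example charts when `chart.enabled` is `false` or saved queries when `save_queries` is `false`.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = [
    "evaluation_report.md",
    "evaluation_results.json",
    "verification_chart.png",
    "discipline_chart.png",
    "queries.json",
    "responses.json",
    "extracted_refs.json",
    "verification_results.json",
    "checkpoint.json",
]


def missing_outputs(output_dir):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_outputs("./evaluation_results/ref_hallucination_arena")` on the default output directory returns an empty list when every file is present.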
API key by model
| Model prefix | Environment variable |
|---|---|
| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |
| `claude-*` | `ANTHROPIC_API_KEY` |
| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |
| `deepseek-*` | `DEEPSEEK_API_KEY` |
| Custom endpoint | set `api_key` + `base_url` in config |
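The prefix table can be read as a glob lookup. The sketch below applies the patterns literally with `fnmatch`; `env_var_for_model` is a hypothetical helper, and whether OpenJudge resolves keys this way is an assumption.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table above, checked in order.
PREFIX_TO_ENV = [
    (("gpt-*", "o1-*", "o3-*"), "OPENAI_API_KEY"),
    (("claude-*",), "ANTHROPIC_API_KEY"),
    (("qwen-*", "dashscope/*"), "DASHSCOPE_API_KEY"),
    (("deepseek-*",), "DEEPSEEK_API_KEY"),
]


def env_var_for_model(model: str):
    """Return the environment variable name for a model, or None if unknown."""
    for patterns, env_name in PREFIX_TO_ENV:
        if any(fnmatchcase(model, pattern) for pattern in patterns):
            return env_name
    # Custom endpoint: api_key and base_url must be set in the config.
    return None
```

Read literally, a name such as `qwen3-max` does not match `qwen-*`, so in this sketch it falls through to the custom-endpoint case, where `api_key` is set explicitly in the config.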
Additional resources
- Full config examples: cookbooks/ref_hallucination_arena/examples/
- Documentation: docs/validating_graders/ref_hallucination_arena.md
- Official dataset: OpenJudge/ref-hallucination-arena (HuggingFace)
- Leaderboard: openjudge.me/leaderboard