AI Scientist Evaluator

Use this skill when Codex should behave like a skeptical reviewer panel rather than a research generator. Evaluate completed outputs, not just plans.

Instructions

  1. Confirm the request is evaluative. Use this skill to audit or compare existing outputs, not to perform the original research task.
  2. Restate the exact task in one or two sentences so the review stays anchored to the real objective and required deliverables.
  3. Inventory the submitted artifacts and note what is missing. Prefer primary artifacts over summaries:
    • notebooks, code, scripts, and workflow files
    • environment files, package versions, and runtime logs
    • figures, tables, and manuscript drafts
    • data provenance, accession lists, database versions, and citations
    • benchmark results, hardware notes, and task constraints
  4. Choose the closest task profile from references/task_profiles.md and load the matching weights from assets/default_weight_profiles.yaml. Use the primary scientific profile first for composite tasks, then add manuscript comments as a secondary layer.
  5. Review with a four-person panel and synthesize a consensus:
    • scientific validity reviewer
    • computational and reproducibility reviewer
    • domain biology reviewer
    • writing and editorial reviewer
  6. Apply hard gates before generous scoring. A submission is not publication-ready if required deliverables are missing, claims are not supported by visible outputs, provenance is untraceable, the core method is not rerunnable, or the submission solves an easier adjacent problem.
  7. Interrogate the submission with the relevant sections of references/question_bank.md. Always include the universal questions, then add the profile-specific and multi-submission questions when needed.
  8. Scan for integrity, rigor, and validity problems using references/red_flags.md. Penalize missing evidence, task drift, unsupported biological claims, fabricated identifiers, and unverifiable citations more than polished narrative.
  9. Score each category on the anchored 0 to 5 scale in references/score_scale.md. Use references/category_definitions.md if category meaning is unclear. A score of 5 earns the full category weight.
  10. Convert the category scores to a weighted total out of 100; a scoring sketch follows this list. Apply explicit penalties sparingly and explain them when they are not already captured by the category scores.
  11. For multiple submissions, score each one independently before ranking. Use tie-breaks in this order:
    • fewer integrity or reproducibility problems
    • better satisfaction of the task's main objective
    • stronger validation or benchmarking
    • clearer limitation handling
    • better writing only after science and evidence are settled
  12. Produce a concise consensus verdict with actionable revisions. Ground the review in concrete evidence from files, notebook cells, figure numbers, accessions, parameters, and versioned tools whenever possible.
  13. When a structured artifact is useful, start from assets/evaluation_template.json and validate the shape against assets/evaluation_schema.json. Use assets/report_template.md for markdown reports. For completed JSON reviews, you may aggregate rankings with python scripts/aggregate_reviews.py review1.json review2.json --out_md leaderboard.md.
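
A minimal scoring sketch for steps 4 and 10, assuming each profile in the weights file maps category names to weights that sum to 100; the YAML layout and key names shown here are illustrative assumptions, not the actual structure of assets/default_weight_profiles.yaml.

```python
# Sketch only: the YAML layout and key names are assumptions, not the real
# structure of assets/default_weight_profiles.yaml.
import yaml  # pip install pyyaml


def weighted_total(profile_name, category_scores, penalties=0.0,
                   weights_path="assets/default_weight_profiles.yaml"):
    """Convert anchored 0-5 category scores into a weighted total out of 100."""
    with open(weights_path) as fh:
        profiles = yaml.safe_load(fh)
    weights = profiles[profile_name]  # assumed shape: {category_name: weight}, weights sum to 100
    total = 0.0
    for category, weight in weights.items():
        score = category_scores.get(category, 0)  # unscored categories count as 0
        total += (score / 5.0) * weight           # a score of 5 earns the full category weight
    return max(0.0, total - penalties)            # explicit penalties, applied sparingly


# Illustrative call with hypothetical category names:
# weighted_total("scientific-analysis",
#                {"scientific_validity": 4, "reproducibility": 3, "writing": 5})
```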

Quick Reference

| Task | Action |
| --- | --- |
| General scientific audit | Use profile scientific-analysis |
| Phylogenomics or comparative genomics review | Use profile phylogenomics-comparative-genomics |
| Viral functional genomics review | Use profile viral-functional-genomics |
| Methods or software benchmark review | Use profile methods-software |
| Manuscript or short communication review | Use profile manuscript-packaging |
| Pick scoring weights | Read assets/default_weight_profiles.yaml |
| Interpret category names | Read references/category_definitions.md |
| Ask evidence-forcing review questions | Read references/question_bank.md |
| Check integrity and rigor failures | Read references/red_flags.md |
| Score consistently | Read references/score_scale.md |
| Draft a report | Use assets/report_template.md |
| Produce structured JSON | Use assets/evaluation_template.json and assets/evaluation_schema.json |
| Rank finished JSON reviews | Run python scripts/aggregate_reviews.py review1.json review2.json --out_md leaderboard.md |
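
When producing the structured JSON artifact, a quick validation pass catches shape errors before the review is shared. This sketch assumes assets/evaluation_schema.json is a standard JSON Schema document; beyond that it only uses the stock jsonschema API.

```python
# Sketch only: assumes assets/evaluation_schema.json is standard JSON Schema.
import json

from jsonschema import validate  # pip install jsonschema

with open("assets/evaluation_template.json") as fh:
    review = json.load(fh)  # start from the template, then fill in the review

with open("assets/evaluation_schema.json") as fh:
    schema = json.load(fh)

validate(instance=review, schema=schema)  # raises ValidationError if the shape is wrong
```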

Input Requirements

  • The original task statement, success criteria, and any explicit constraints
  • One or more completed submissions or artifacts to review
  • Enough evidence to audit claims when available:
    • notebooks, code, scripts, workflows, or repositories
    • figures, tables, and manuscript drafts
    • environment files, runtime notes, and benchmark context
    • accession lists, database versions, citations, and provenance notes
  • Submission names or IDs when comparing multiple AI scientists

If key artifacts are missing, continue the review and mark the evidence gap explicitly instead of pretending certainty.

Output

For a single submission, produce:

  • a verdict paragraph
  • a gate-check table
  • a weighted score table
  • reviewer panel comments by category
  • answers to the most important critical questions
  • required revisions
  • a final recommendation label

For multiple submissions, produce:

  • a consensus ranking table
  • per-submission totals and category scores
  • pairwise comparison notes
  • best-in-class awards for science, reproducibility, writing, and engineering
  • a winner with justification
  • a merge recommendation when combining strengths would outperform any one entry

Use these recommendation labels:

  • 90-100: Outstanding / near publication-ready
  • 75-89: Strong but needs minor to moderate revision
  • 60-74: Promising but major revision needed
  • 40-59: Weak / unreliable in important respects
  • <40: Not trustworthy for scientific use
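
A minimal helper that maps the weighted total onto these labels; the thresholds simply mirror the bands above.

```python
def recommendation_label(total):
    """Map a weighted total out of 100 onto the recommendation bands."""
    if total >= 90:
        return "Outstanding / near publication-ready"
    if total >= 75:
        return "Strong but needs minor to moderate revision"
    if total >= 60:
        return "Promising but major revision needed"
    if total >= 40:
        return "Weak / unreliable in important respects"
    return "Not trustworthy for scientific use"
```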

Quality Gates

  • The review is anchored to the exact task rather than an easier adjacent one
  • Artifact inventory and missing evidence are stated explicitly
  • A task profile and weight set were chosen deliberately
  • Hard gates were checked before final scoring
  • Questions and red flags were grounded in the provided artifacts
  • Scores follow the anchored 0 to 5 scale and sum to a weighted total out of 100
  • Multi-submission rankings were done only after independent scoring
  • Final recommendations distinguish absent, flawed, weakly validated, and well-supported work

Examples

Example 1: Compare five AI scientist submissions

Use $ai-scientist-evaluator to review five AI scientist submissions for the
same task. Inspect notebooks, code, figures, runtime notes, and manuscripts.
Score each submission with the appropriate weight profile, answer the critical
questions, identify red flags, and produce a ranked consensus table with
best-in-class awards.

Example 2: Audit one submission for publication readiness

Use $ai-scientist-evaluator to review this AI scientist submission as if you are
a skeptical reviewer panel. Tell me whether the notebook and manuscript really
support the main claims, score the work, and list the revisions required before
I would trust it.

Example 3: Rank finished JSON evaluations

python scripts/aggregate_reviews.py review_a.json review_b.json --out_md leaderboard.md

Troubleshooting

Issue: The submission includes only a polished manuscript and no underlying artifacts. Solution: Continue the review, but mark reproducibility and claim-evidence gaps explicitly and do not award publication-ready status.

Issue: The task spans more than one domain profile. Solution: Score with the closest primary scientific profile first, then add manuscript or secondary-domain comments without inventing a new weight set unless the user asks for one.

Issue: Multiple submissions look close in total score. Solution: Break ties with integrity, task completion, validation strength, and limitation handling before writing quality.
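
One way to encode that tie-break order is a sort key that ranks by weighted total first and then by the step 11 criteria; the per-submission field names here are hypothetical and not part of the evaluation schema.

```python
# Sketch only: field names are hypothetical, not taken from evaluation_schema.json.
def ranking_key(submission):
    """Sort key: weighted total first, then the step 11 tie-breaks in order."""
    return (
        -submission["weighted_total"],      # higher total ranks first
        submission["integrity_issues"],     # fewer integrity or reproducibility problems
        -submission["objective_score"],     # better satisfaction of the main objective
        -submission["validation_score"],    # stronger validation or benchmarking
        -submission["limitations_score"],   # clearer limitation handling
        -submission["writing_score"],       # writing only after science and evidence
    )


submissions = [  # illustrative entries only
    {"name": "A", "weighted_total": 78, "integrity_issues": 1, "objective_score": 4,
     "validation_score": 4, "limitations_score": 4, "writing_score": 5},
    {"name": "B", "weighted_total": 78, "integrity_issues": 0, "objective_score": 4,
     "validation_score": 3, "limitations_score": 3, "writing_score": 4},
]
ranked = sorted(submissions, key=ranking_key)  # B outranks A: equal totals, fewer integrity issues
```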

Issue: A claim looks impressive but evidence is thin or missing. Solution: Penalize unsupported claims, cite the missing evidence directly, and keep the verdict skeptical.
