ai-scientist-evaluator
AI Scientist Evaluator
Use this skill when Codex should behave like a skeptical reviewer panel rather than a research generator. Evaluate completed outputs, not just plans.
Instructions
- Confirm the request is evaluative. Use this skill to audit or compare existing outputs, not to perform the original research task.
- Restate the exact task in one or two sentences so the review stays anchored to the real objective and required deliverables.
- Inventory the submitted artifacts and note what is missing. Prefer primary
artifacts over summaries:
- notebooks, code, scripts, and workflow files
- environment files, package versions, and runtime logs
- figures, tables, and manuscript drafts
- data provenance, accession lists, database versions, and citations
- benchmark results, hardware notes, and task constraints
- Choose the closest task profile from references/task_profiles.md and load the matching weights from assets/default_weight_profiles.yaml. For composite tasks, use the primary scientific profile first, then add manuscript comments as a secondary layer.
- Review with a four-person panel and synthesize a consensus:
- scientific validity reviewer
- computational and reproducibility reviewer
- domain biology reviewer
- writing and editorial reviewer
- Apply hard gates before generous scoring. A submission is not publication-ready if required deliverables are missing, claims are not supported by visible outputs, provenance is untraceable, the core method is not rerunnable, or the submission solves an easier adjacent problem.
- Interrogate the submission with the relevant sections of references/question_bank.md. Always include the universal questions, then add the profile-specific and multi-submission questions when needed.
- Scan for integrity, rigor, and validity problems using references/red_flags.md. Penalize missing evidence, task drift, unsupported biological claims, fabricated identifiers, and unverifiable citations more heavily than polished narrative.
- Score each category on the anchored 0 to 5 scale in references/score_scale.md. Use references/category_definitions.md if a category's meaning is unclear. A score of 5 earns the full category weight.
- Convert the category scores to a weighted total out of 100 (the scoring and ranking sketch after this list shows the arithmetic). Apply explicit penalties sparingly and explain them when they are not already captured by the category scores.
- For multiple submissions, score each one independently before ranking. Use
tie-breaks in this order:
- fewer integrity or reproducibility problems
- better satisfaction of the task's main objective
- stronger validation or benchmarking
- clearer limitation handling
- better writing only after science and evidence are settled
- Produce a concise consensus verdict with actionable revisions. Ground the review in concrete evidence from files, notebook cells, figure numbers, accessions, parameters, and versioned tools whenever possible.
- When a structured artifact is useful, start from assets/evaluation_template.json and validate the shape against assets/evaluation_schema.json (a minimal validation sketch follows this list). Use assets/report_template.md for markdown reports. For completed JSON reviews, you may aggregate rankings with python scripts/aggregate_reviews.py review1.json review2.json --out_md leaderboard.md.
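The arithmetic behind scoring and ranking is simple enough to show directly. The sketch below is illustrative only: the category names, weights, and review fields are assumptions made for this example, not the contents of assets/default_weight_profiles.yaml or of the evaluation template. It shows how anchored 0 to 5 category scores become a weighted total out of 100, and how the tie-break order can be expressed as a sort key.

```python
"""Illustrative scoring and ranking sketch.

The category names, weights, and review fields below are assumptions for this
example; the real weights live in assets/default_weight_profiles.yaml and the
real review fields in assets/evaluation_template.json.
"""


def weighted_total(scores, weights, penalty=0.0):
    """Convert anchored 0-5 category scores into a weighted total out of 100.

    A score of 5 earns the full category weight; lower scores earn a
    proportional share. Explicit penalties are subtracted afterwards and
    should be applied sparingly.
    """
    total = sum(weights[cat] * scores.get(cat, 0.0) / 5.0 for cat in weights)
    return max(0.0, total - penalty)


def ranking_key(review):
    """Sort key for multi-submission ranking, following the tie-break order:
    integrity problems first, then task completion, validation strength,
    limitation handling, and writing quality last. Field names are
    hypothetical."""
    return (
        -review["weighted_total"],
        review["integrity_issues"],   # fewer integrity/reproducibility problems wins
        -review["task_completion"],
        -review["validation"],
        -review["limitations"],
        -review["writing"],
    )


# Hypothetical weights that sum to 100 (not the real profile values).
weights = {"scientific_validity": 40, "reproducibility": 25,
           "domain_interpretation": 20, "writing": 15}

# Hypothetical 0-5 category scores for one submission.
scores = {"scientific_validity": 4, "reproducibility": 3,
          "domain_interpretation": 4, "writing": 5}

print(weighted_total(scores, weights))  # 32 + 15 + 16 + 15 = 78.0

# For multiple submissions, score each one independently and then:
# ranked = sorted(reviews, key=ranking_key)
```

In practice, read the weights for the chosen profile from assets/default_weight_profiles.yaml rather than hard-coding them as above.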
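For the structured JSON artifact, the shape check against assets/evaluation_schema.json can be done with the generic jsonschema package. This is a minimal sketch, assuming the schema file is a standard JSON Schema document and the filled-in review is saved as my_review.json (a hypothetical filename); it is not part of the skill's own scripts.

```python
import json

from jsonschema import validate, ValidationError  # pip install jsonschema

with open("assets/evaluation_schema.json") as fh:
    schema = json.load(fh)
with open("my_review.json") as fh:  # a filled-in copy of assets/evaluation_template.json
    review = json.load(fh)

try:
    validate(instance=review, schema=schema)
    print("review matches the evaluation schema")
except ValidationError as err:
    print(f"schema violation: {err.message}")
```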
Quick Reference
| Task | Action |
|---|---|
| General scientific audit | Use profile scientific-analysis |
| Phylogenomics or comparative genomics review | Use profile phylogenomics-comparative-genomics |
| Viral functional genomics review | Use profile viral-functional-genomics |
| Methods or software benchmark review | Use profile methods-software |
| Manuscript or short communication review | Use profile manuscript-packaging |
| Pick scoring weights | Read assets/default_weight_profiles.yaml |
| Interpret category names | Read references/category_definitions.md |
| Ask evidence-forcing review questions | Read references/question_bank.md |
| Check integrity and rigor failures | Read references/red_flags.md |
| Score consistently | Read references/score_scale.md |
| Draft a report | Use assets/report_template.md |
| Produce structured JSON | Use assets/evaluation_template.json and assets/evaluation_schema.json |
| Rank finished JSON reviews | Run python scripts/aggregate_reviews.py review1.json review2.json --out_md leaderboard.md |
Input Requirements
- The original task statement, success criteria, and any explicit constraints
- One or more completed submissions or artifacts to review
- Enough evidence to audit claims when available:
- notebooks, code, scripts, workflows, or repositories
- figures, tables, and manuscript drafts
- environment files, runtime notes, and benchmark context
- accession lists, database versions, citations, and provenance notes
- Submission names or IDs when comparing multiple AI scientists
If key artifacts are missing, continue the review and mark the evidence gap explicitly instead of pretending certainty.
Output
For a single submission, produce:
- a verdict paragraph
- a gate-check table
- a weighted score table
- reviewer panel comments by category
- answers to the most important critical questions
- required revisions
- a final recommendation label
For multiple submissions, produce:
- a consensus ranking table
- per-submission totals and category scores
- pairwise comparison notes
- best-in-class awards for science, reproducibility, writing, and engineering
- a winner with justification
- a merge recommendation when combining strengths would outperform any one entry
Use these recommendation labels:
- 90-100: Outstanding / near publication-ready
- 75-89: Strong but needs minor to moderate revision
- 60-74: Promising but major revision needed
- 40-59: Weak / unreliable in important respects
- <40: Not trustworthy for scientific use
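A minimal helper that maps a weighted total onto these bands; the function name is an assumption, while the thresholds are exactly those listed above.

```python
def recommendation_label(total):
    """Map a weighted total out of 100 to the recommendation bands above."""
    if total >= 90:
        return "Outstanding / near publication-ready"
    if total >= 75:
        return "Strong but needs minor to moderate revision"
    if total >= 60:
        return "Promising but major revision needed"
    if total >= 40:
        return "Weak / unreliable in important respects"
    return "Not trustworthy for scientific use"
```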
Quality Gates
- The review is anchored to the exact task rather than an easier adjacent one
- Artifact inventory and missing evidence are stated explicitly
- A task profile and weight set were chosen deliberately
- Hard gates were checked before final scoring
- Questions and red flags were grounded in the provided artifacts
- Scores follow the anchored 0 to 5 scale and sum to a weighted total out of 100
- Multi-submission rankings were done only after independent scoring
- Final recommendations distinguish absent, flawed, weakly validated, and well-supported work
Examples
Example 1: Compare five AI scientist submissions
Use $ai-scientist-evaluator to review five AI scientist submissions for the
same task. Inspect notebooks, code, figures, runtime notes, and manuscripts.
Score each submission with the appropriate weight profile, answer the critical
questions, identify red flags, and produce a ranked consensus table with
best-in-class awards.
Example 2: Audit one submission for publication readiness
Use $ai-scientist-evaluator to review this AI scientist submission as if you are
a skeptical reviewer panel. Tell me whether the notebook and manuscript really
support the main claims, score the work, and list the revisions required before
I would trust it.
Example 3: Rank finished JSON evaluations
python scripts/aggregate_reviews.py review_a.json review_b.json --out_md leaderboard.md
Troubleshooting
Issue: The submission includes only a polished manuscript and no underlying artifacts. Solution: Continue the review, but mark reproducibility and claim-evidence gaps explicitly and do not award publication-ready status.
Issue: The task spans more than one domain profile. Solution: Score with the closest primary scientific profile first, then add manuscript or secondary-domain comments without inventing a new weight set unless the user asks for one.
Issue: Multiple submissions look close in total score. Solution: Break ties with integrity, task completion, validation strength, and limitation handling before writing quality.
Issue: A claim looks impressive but evidence is thin or missing. Solution: Penalize unsupported claims, cite the missing evidence directly, and keep the verdict skeptical.