ai-research-reproduction

Installation
SKILL.md

ai-research-reproduction

Use when

  • The user wants the agent to reproduce an AI paper repository.
  • The target is a code repository with a README, scripts, configs, or documented commands.
  • The goal is a minimal trustworthy run, not unlimited experimentation.
  • The user needs standardized outputs that another human or model can audit quickly.
  • The task spans more than one stage, such as intake plus setup, or setup plus execution plus reporting.

Do not use when

  • The task is a general literature review or paper summary.
  • The task is to design a new model, benchmark suite, or training pipeline from scratch.
  • The repository is not centered on AI or does not expose a documented reproduction path.
  • The user primarily wants a deep code refactor rather than README-first reproduction.
  • The user is explicitly asking for only one narrow phase that a sub-skill already covers cleanly.
  • The user is explicitly authorizing exploratory branch-only experimentation instead of trusted reproduction.

Success criteria

  • README is treated as the primary source of reproduction intent.
  • A minimum trustworthy target is selected and justified.
  • Documented inference is preferred over evaluation, and evaluation is preferred over training.
  • Any repo edits remain conservative, explicit, and auditable.
  • Assumptions, protocol deviations, and human decision points are surfaced rather than hidden.
  • repro_outputs/ is generated with consistent structure and stable machine-readable fields.
  • Final user-facing explanation is short and follows the user's language when practical.

Interaction and usability policy

  • Keep the workflow simple enough for a new user to understand quickly.
  • Prefer short, concrete plans over exhaustive research.
  • Expose commands, assumptions, blockers, and evidence.
  • Avoid turning the skill into an opaque automation layer.
  • Preserve a low learning cost for both humans and downstream agents.

Language policy

  • Human-readable Markdown outputs should follow the user's language when it is clear.
  • If the user's language is unclear, default to concise English.
  • Machine-readable fields, filenames, keys, and enum values stay in stable English.
  • Paths, package names, CLI commands, config keys, and code identifiers remain unchanged.

See references/language-policy.md.

Reproduction policy

Core priority order:

  1. documented inference
  2. documented evaluation
  3. documented training startup or partial verification
  4. full training only when the user explicitly asks later

Rules:

  • README-first: use repository files to clarify, not casually override, the README.
  • Aim for minimal trustworthy reproduction rather than maximum task coverage.
  • Treat smoke tests, startup verification, and early-step checks as valid training evidence when full training is not appropriate.
  • In trusted reproduction, a documented training command should first be checked through startup verification or a short monitoring window, then paused for explicit human confirmation before broader training continues.
  • In explicitly authorized explore-lane execution, the training record can continue without the trusted-lane confirmation pause, but it must stay isolated from trusted conclusions.
  • Record unresolved gaps rather than fabricating confidence.

Patch policy

  • Prefer no code changes.
  • Prefer safer adjustments first:
    • command-line arguments
    • environment variables
    • path fixes
    • dependency version fixes
    • dependency file fixes such as requirements.txt or environment.yml
  • Avoid changing:
    • model architecture
    • core inference semantics
    • core training logic
    • loss functions
    • experiment meaning
  • If repository files must change:
    • create a patch branch first using repro/YYYY-MM-DD-short-task
    • apply low-risk changes before medium-risk changes
    • avoid high-risk changes by default
    • commit only verified groups of changes
    • keep verified patch commits sparse, usually 0-2
    • use commit messages in the form repro: <scope> for documented <command>

See references/patch-policy.md.

Research safety boundary

  • Preserve experiment meaning over convenience.
  • Do not silently change dataset, split, checkpoint, preprocessing, metric, loss, or model semantics.
  • Distinguish direct evidence from inference and from user-approved decisions.
  • Prefer a recorded blocker over an unrecorded workaround.
  • Escalate for explicit human review before any change that could alter scientific meaning or reported conclusions.

See references/research-safety-principles.md.

Workflow

  1. Read README and repo signals.
  2. Call repo-intake-and-plan to scan the repository and extract documented commands.
  3. Select the smallest trustworthy reproduction target.
  4. Call env-and-assets-bootstrap to prepare environment assumptions and asset paths.
  5. Call analyze-project only when repo structure, insertion points, or suspicious implementation patterns need a read-only pass before continuing.
  6. Run a conservative smoke check or documented inference or evaluation command with minimal-run-and-audit.
  7. If the selected trustworthy target is documented training startup, short-run verification, or resume, hand execution to run-train instead of minimal-run-and-audit.
  8. When training is selected inside trusted reproduction, let run-train capture the startup evidence first, then surface a human review checkpoint before any fuller training claim.
  9. Stop for human review if protocol meaning, model semantics, or result interpretation would otherwise be changed implicitly.
  10. Use paper-context-resolver only if README and repo files leave a narrow reproduction-critical gap that blocks the current target.
  11. Never auto-route into explore-code or explore-run; exploration requires explicit user authorization.
  12. Write the standardized outputs with evidence, assumptions, deviations, and next safe action.
  13. Give the user a short final note in the user's language.

Required outputs

Always target:

repro_outputs/
  SUMMARY.md
  COMMANDS.md
  LOG.md
  status.json
  PATCHES.md   # only if patches were applied

Use the templates under assets/ and the field rules in references/output-spec.md.

Reporting policy

  • Put the shortest high-value summary in SUMMARY.md.
  • Put copyable commands in COMMANDS.md.
  • Put process evidence, assumptions, failures, and decisions in LOG.md.
  • Put durable machine-readable state in status.json.
  • Put branch, commit, validation, and README-fidelity impact in PATCHES.md when needed.
  • Distinguish verified facts from inferred guesses.

Maintainability notes

  • Keep this skill narrow: README-first AI repo reproduction only.
  • Push specialized logic into sub-skills or helper scripts.
  • Prefer stable templates and simple schemas over ad hoc prose.
  • Keep machine-readable outputs backward compatible when possible.
  • Add new evidence sources only when they improve auditability without raising learning cost.
  • Treat repo-intake-and-plan and paper-context-resolver as narrow helpers, not primary public entrypoints.
Related skills

More from lllllllama/ai-paper-reproduction-skill

Installs
140
GitHub Stars
10
First Seen
Apr 5, 2026