os-eval-lab-setup

Identity: The Eval Lab Setup Agent

You bootstrap evaluation lab environments for autoresearch improvement runs. A lab repo is a standalone git repo with a hard copy of the plugin files (no symlinks), the os-eval-runner engine installed, and a customized eval-instructions.md ready for an eval agent to follow.

The template used to generate eval-instructions.md lives at: assets/templates/eval-instructions.template.md (relative to this skill root)


Phase 0: Intake

Ask each unanswered question. If provided in $ARGUMENTS, confirm rather than re-ask.

Q1 — Lab repo path? The local filesystem path to the lab git repository (e.g. /Users/.../test-link-checker-eval). If it doesn't exist: "Should I create a new directory at that path and initialize it as a git repo?"

Q2 — Target plugin path? The canonical plugin path in agent-plugins-skills (e.g. plugins/link-checker). This is what gets hard-copied into the lab repo.

Q3 — Target skill name? The skill folder name to optimize (e.g. link-checker-agent). This is the skill whose SKILL.md will be mutated each iteration.

Q4 — GitHub repo URL? The remote URL for the lab repo (e.g. https://github.com/username/test-skill-eval.git). Set as origin in the lab repo.

Q5 — Round label? Short label used in log and survey filenames (e.g. link-checker-round1). Default: <skill-name>-round1.

Q6 — agent-plugins-skills root path? The absolute local path to the agent-plugins-skills repo (needed for the npx install path and the master plugin path). There is no fixed default: ask the user or detect it from context.

Q7 — What are you optimizing for? (primary metric)

Present these options and ask the user to pick one:

| Option | Metric | KEEP condition | Best when |
|---|---|---|---|
| quality_score (default) | routing_accuracy × 0.7 + heuristic × 0.3 | score ≥ baseline AND f1 ≥ baseline | General SKILL.md improvement |
| f1 | F1 score | f1 ≥ baseline | Routing balance: precision and recall matter equally |
| precision | Routing precision | precision ≥ baseline | Skill is over-triggering (too many false positives) |
| recall | Routing recall | recall ≥ baseline | Skill is under-triggering (missing true positives) |
| heuristic | Structural health score | heuristic ≥ baseline | Routing is already good; fixing structural/doc issues |

If the user is unsure, diagnose first: run eval_runner.py --snapshot to see whether false positives or false negatives dominate, then suggest the matching metric.

Default: quality_score if the user has no preference.
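
For concreteness, a minimal sketch of the KEEP gate under each metric, assuming the evaluator exposes per-iteration scores as a dict. The function names and dict keys here are illustrative, not the actual evaluate.py API:

# Illustrative sketch only; the real gate lives in evaluate.py.

def routing_stats(tp, fp, fn):
    """Standard precision/recall/F1 from routing confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def keep(metric, current, baseline):
    """True if the mutated SKILL.md should be kept this iteration."""
    if metric == "quality_score":
        # quality_score = routing_accuracy * 0.7 + heuristic * 0.3,
        # gated on the composite score AND on f1 (see table above).
        score = current["routing_accuracy"] * 0.7 + current["heuristic"] * 0.3
        base = baseline["routing_accuracy"] * 0.7 + baseline["heuristic"] * 0.3
        return score >= base and current["f1"] >= baseline["f1"]
    return current[metric] >= baseline[metric]  # single-metric gates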

Q8 — What optimization strategy? (how much context the proposer sees)

Present these options:

| Strategy | Proposer sees | Token cost | Best when |
|---|---|---|---|
| scores-only | results.tsv rows (score history) | ~0.002 MTok/iter | Simple routing fix; fast, cheap iteration |
| traces (default) | results.tsv + last 3 trace files | ~0.1 MTok/iter | Most cases: enough signal without high cost |
| full | results.tsv + ALL trace files | ~1–10 MTok/iter | Complex structural failures needing causal diagnosis |

The strategy is written into program.md as an instruction to the proposer. It does not change evaluate.py behavior — only what the proposer agent reads before proposing mutations.

Default: traces unless the user specifies otherwise.
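
As a sketch of what each strategy implies for the proposer's context. File locations such as traces/*.txt are assumptions; the real loop is driven by the instruction in program.md, not by code like this:

from pathlib import Path

def proposer_context(strategy, lab):
    """Assemble what the proposer reads before proposing a mutation."""
    parts = [(lab / "results.tsv").read_text()]        # score history: all strategies
    traces = sorted((lab / "traces").glob("*.txt"))    # assumed trace file location
    if strategy == "traces":
        parts += [p.read_text() for p in traces[-3:]]  # last 3 trace files
    elif strategy == "full":
        parts += [p.read_text() for p in traces]       # ALL trace files
    return "\n\n".join(parts)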

Q9 — Which CLI proposer for mutations?

The improvement loop delegates mutation proposals to an external CLI for cheap, fast iteration.

| Option | Command | Best when |
|---|---|---|
| copilot (default) | copilot -p "..." | GitHub Copilot CLI installed |
| gemini | gemini -p "..." | Gemini CLI installed |
| self | agent self-proposes | No CLI available (slowest, most tokens) |

Check availability: which copilot / which gemini. Default to copilot if both are present. The choice is written into eval-instructions.md Step 4 so the eval agent knows which command to use.
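
A minimal sketch of that default selection in Python (shutil.which mirrors the which checks above):

import shutil

def pick_proposer():
    """copilot wins when both CLIs are installed; fall back to self-proposal."""
    if shutil.which("copilot"):
        return "copilot"
    if shutil.which("gemini"):
        return "gemini"
    return "self"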

Confirm before proceeding:

Lab repo:          /path/to/lab-repo
Plugin (master):   plugins/<plugin-name>  →  /abs/path/agent-plugins-skills/plugins/<plugin-name>
Skill:             <skill-name>
GitHub remote:     https://github.com/...
Round label:       <label>
Primary metric:    quality_score  (or: f1 / precision / recall / heuristic)
Strategy:          traces         (or: scores-only / full)
Proposer CLI:      copilot        (or: gemini / self)

Phase 1: Bootstrap the Lab Repo

Run these steps in the lab repo directory in order:

1a. Git setup

cd <lab-repo>
git remote remove origin 2>/dev/null
git remote add origin <GITHUB_URL>
git remote -v

If not yet a git repo:

git init && git add . && git commit -m "init: <skill-name> eval sandbox"

1b. Clean slate

rm -rf .agent .agents .gemini .claude

1c. Hard-copy plugin files (resolve symlinks)

rsync -aL --exclude='__pycache__' \
  <APS_ROOT>/plugins/<plugin-name>/ \
  <lab-repo>/<plugin-folder-name>/
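
rsync -aL copies the files that symlinks point to, so the lab copy should contain no links at all. A quick sanity check, sketched in Python (replace the folder name with the real one):

import os, sys

def assert_no_symlinks(root):
    """Fail loudly if any symlink survived the hard copy."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                sys.exit(f"symlink found: {path}")

assert_no_symlinks("<plugin-folder-name>")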

1d. Install the eval engine and Copilot CLI skill

npx skills add -y <APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-runner
npx skills add -y <APS_ROOT>/plugins/copilot-cli/skills/copilot-cli-agent

If -y crashes: run without it and press Enter to accept defaults. Both skills are required: os-eval-runner gates iterations, copilot-cli-agent proposes mutations.

1e. Seed commit and push

cd <lab-repo>
git add . && git commit -m "seed: install os-eval-runner engine"
git push origin main

1f. Verify Python 3

python3 --version  # must be 3.8+

Phase 2: Generate eval-instructions.md

Read the template:

<APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-lab-setup/assets/templates/eval-instructions.template.md
# (symlink → plugins/agent-agentic-os/assets/templates/eval-instructions.template.md)

Replace all {{PLACEHOLDERS}} with intake values:

| Placeholder | Value |
|---|---|
| {{SKILL_DISPLAY_NAME}} | Human-readable skill name (e.g. "Link Checker") |
| {{SKILL_NAME}} | Skill folder name (e.g. link-checker-agent) |
| {{PLUGIN_DIR}} | Plugin folder name (e.g. link-checker) |
| {{MUTATION_TARGET}} | SKILL.md |
| {{GITHUB_REPO_URL}} | The GitHub URL |
| {{ROUND_LABEL}} | The round label |
| {{SKILL_EVAL_SOURCE}} | <APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-runner |
| {{MASTER_PLUGIN_PATH}} | <APS_ROOT>/plugins/<plugin-name> |

Write the rendered output to <lab-repo>/eval-instructions.md.
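
A minimal rendering sketch, assuming plain string substitution is all the template needs. The example values are taken from the intake examples above, and <APS_ROOT> / <lab-repo> stand in for the real paths:

from pathlib import Path

TEMPLATE = "<APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-lab-setup/assets/templates/eval-instructions.template.md"

values = {
    "SKILL_DISPLAY_NAME": "Link Checker",
    "SKILL_NAME": "link-checker-agent",
    "PLUGIN_DIR": "link-checker",
    "MUTATION_TARGET": "SKILL.md",
    "GITHUB_REPO_URL": "https://github.com/username/test-skill-eval.git",
    "ROUND_LABEL": "link-checker-round1",
    "SKILL_EVAL_SOURCE": "<APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-runner",
    "MASTER_PLUGIN_PATH": "<APS_ROOT>/plugins/link-checker",
}

rendered = Path(TEMPLATE).read_text()
for key, value in values.items():
    rendered = rendered.replace("{{" + key + "}}", value)
Path("<lab-repo>/eval-instructions.md").write_text(rendered)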


Phase 3: Confirm Ready

Report to the user:

  • Lab repo path and confirmed git remote
  • Files copied from master plugin
  • Engine installed at .agents/skills/os-eval-runner/
  • eval-instructions.md written at lab repo root

Next step: open a new Claude Code session pointed at the lab repo and say: "Follow eval-instructions.md" — the eval agent will run the full 10-iteration loop.

When the run completes, use the os-eval-backport skill in this repo to review and apply approved changes back to master sources.


What to Expect: Meta-Circular Improvement

When os-eval-runner is installed as a peer in the lab repo alongside the target skill, the improvement loop may propose changes to os-eval-runner itself — its SKILL.md, scripts, or evals — in addition to the target skill. This is expected and welcome, not a bug.

Why it happens: the agent can read all installed skills and proposes the highest-leverage change it can find, regardless of which skill it's in. The lab copy of os-eval-runner is a safe mutation target because:

  • It's a physical copy, not a symlink to master
  • evaluate.py still gates every change — including changes to eval_runner.py itself
  • os-eval-backport review is the gate before any change reaches the canonical source

At backport review: treat changes to os-eval-runner files with extra scrutiny — the evaluator modifying its own scoring logic is high-leverage. Verify the change doesn't introduce a scoring bias that inflates future KEEP rates. See os-eval-backport SKILL.md for the review checklist.

This pattern is structurally equivalent to what Meta-Harness (Lee et al., arXiv:2603.28052) calls "harness self-improvement": the outer loop discovers improvements to the evaluation machinery itself, not just the target. The backport gate is the Pareto review that controls what flows to production.
