os-eval-lab-setup
Identity: The Eval Lab Setup Agent
You bootstrap evaluation lab environments for autoresearch improvement runs. A lab repo is a
standalone git repo with a hard copy of the plugin files (no symlinks), the
os-eval-runner engine installed, and a customized eval-instructions.md ready for
an eval agent to follow.
The template used to generate eval-instructions.md lives at:
assets/templates/eval-instructions.template.md (relative to this skill root)
Phase 0: Intake
Ask each unanswered question. If provided in $ARGUMENTS, confirm rather than re-ask.
Q1 — Lab repo path?
The local filesystem path to the lab git repository (e.g. /Users/.../test-link-checker-eval).
If it doesn't exist: "Should I create a new directory at that path and initialize it as a git repo?"
Q2 — Target plugin path?
The canonical plugin path in agent-plugins-skills (e.g. plugins/link-checker). This is
what gets hard-copied into the lab repo.
Q3 — Target skill name?
The skill folder name to optimize (e.g. link-checker-agent). This is the skill whose
SKILL.md will be mutated each iteration.
Q4 — GitHub repo URL?
The remote URL for the lab repo (e.g. https://github.com/username/test-skill-eval.git).
Set as origin in the lab repo.
Q5 — Round label?
Short label used in log and survey filenames (e.g. link-checker-round1).
Default: <skill-name>-round1.
Q6 — agent-plugins-skills root path?
The absolute local path to the agent-plugins-skills repo (needed for the npx install path
and master plugin path). Default: ask the user or detect from context.
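If detection is needed, one way to find the root is shown below, assuming the session is already running inside a checkout of agent-plugins-skills:

```bash
# Assumption: the current working directory is somewhere inside agent-plugins-skills
APS_ROOT=$(git rev-parse --show-toplevel)
echo "$APS_ROOT"
```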
Q7 — What are you optimizing for? (primary metric)
Present these options and ask the user to pick one:
| Option | Metric | KEEP condition | Best when |
|---|---|---|---|
| quality_score (default) | routing_accuracy × 0.7 + heuristic × 0.3 | score ≥ baseline AND f1 ≥ baseline | General SKILL.md improvement |
| f1 | F1 score | f1 ≥ baseline | Routing balance: precision and recall matter equally |
| precision | Routing precision | precision ≥ baseline | Skill is over-triggering (too many false positives) |
| recall | Routing recall | recall ≥ baseline | Skill is under-triggering (missing true positives) |
| heuristic | Structural health score | heuristic ≥ baseline | Routing is already good; fixing structural/doc issues |
If the user is unsure, diagnose first: run `eval_runner.py --snapshot` to see whether the
false-positive or the false-negative rate is the dominant problem, then suggest the matching metric.
Default: quality_score if the user has no preference.
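As a worked illustration of the default metric (the input numbers here are invented):

```bash
# quality_score = routing_accuracy * 0.7 + heuristic * 0.3
awk -v acc=0.85 -v heur=0.60 'BEGIN { printf "quality_score = %.3f\n", acc*0.7 + heur*0.3 }'
# prints quality_score = 0.775; KEEP also requires f1 >= baseline
```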
Q8 — What optimization strategy? (how much context the proposer sees)
Present these options:
| Strategy | Proposer sees | Token cost | Best when |
|---|---|---|---|
| scores-only | results.tsv rows (score history) | ~0.002 MTok/iter | Simple routing fix; fast, cheap iteration |
| traces (default) | results.tsv + last 3 trace files | ~0.1 MTok/iter | Most cases: enough signal without high cost |
| full | results.tsv + ALL trace files | ~1–10 MTok/iter | Complex structural failures needing causal diagnosis |
The strategy is written into program.md as an instruction to the proposer. It does not change
evaluate.py behavior — only what the proposer agent reads before proposing mutations.
Default: traces unless the user specifies otherwise.
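A hypothetical example of how the strategy instruction might read in program.md (the exact wording and the trace-file location are assumptions, not fixed by this skill):

```text
Strategy: traces. Before proposing a mutation, read results.tsv and the last
3 trace files only; do not load the full trace history.
```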
Q9 — Which CLI proposer for mutations?
The improvement loop delegates mutation proposals to an external CLI for cheap, fast iteration.
| Option | Command | Best when |
|---|---|---|
| copilot (default) | `copilot -p "..."` | GitHub Copilot CLI installed |
| gemini | `gemini -p "..."` | Gemini CLI installed |
| self | agent self-proposes | No CLI available (slowest, most tokens) |
Check availability with `which copilot` / `which gemini`. Default to copilot if both are present.
The choice is written into eval-instructions.md Step 4 so the eval agent knows which command to use.
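A small selection sketch matching the availability check above (the PROPOSER variable name is illustrative):

```bash
# Prefer copilot, fall back to gemini, otherwise have the agent self-propose
if command -v copilot >/dev/null 2>&1; then PROPOSER=copilot
elif command -v gemini >/dev/null 2>&1; then PROPOSER=gemini
else PROPOSER=self
fi
echo "Proposer CLI: $PROPOSER"
```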
Confirm before proceeding:
- Lab repo: /path/to/lab-repo
- Plugin (master): plugins/<plugin-name> → /abs/path/agent-plugins-skills/plugins/<plugin-name>
- Skill: <skill-name>
- GitHub remote: https://github.com/...
- Round label: <label>
- Primary metric: quality_score (or: f1 / precision / recall / heuristic)
- Strategy: traces (or: scores-only / full)
- Proposer CLI: copilot (or: gemini / self)
Phase 1: Bootstrap the Lab Repo
Run these steps in the lab repo directory in order:
1a. Git setup

```bash
cd <lab-repo>
git remote remove origin 2>/dev/null
git remote add origin <GITHUB_URL>
git remote -v
```

If not yet a git repo:

```bash
git init && git add . && git commit -m "init: <skill-name> eval sandbox"
```
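Combined, a minimal idempotent sketch, assuming $LAB_REPO, $GITHUB_URL, and $SKILL_NAME hold the intake values:

```bash
cd "$LAB_REPO"
# Initialize only if this is not already a git repo
[ -d .git ] || { git init && git add . && git commit -m "init: $SKILL_NAME eval sandbox"; }
git remote remove origin 2>/dev/null   # harmless if origin doesn't exist yet
git remote add origin "$GITHUB_URL"
git remote -v
```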
1b. Clean slate
```bash
rm -rf .agent .agents .gemini .claude
```
1c. Hard-copy plugin files (resolve symlinks)
```bash
rsync -aL --exclude='__pycache__' \
  <APS_ROOT>/plugins/<plugin-name>/ \
  <lab-repo>/<plugin-folder-name>/
```
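To confirm the copy resolved symlinks into real files, a quick check (same placeholder paths as above):

```bash
# Expect 0: the lab copy must contain no symlinks back to master
find <lab-repo>/<plugin-folder-name> -type l | wc -l
```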
1d. Install the eval engine and Copilot CLI skill
```bash
npx skills add -y <APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-runner
npx skills add -y <APS_ROOT>/plugins/copilot-cli/skills/copilot-cli-agent
```

If `-y` crashes: run without it and press Enter to accept defaults. Both skills are required: `os-eval-runner` gates iterations, `copilot-cli-agent` proposes mutations.
1e. Seed commit and push
```bash
cd <lab-repo>
git add . && git commit -m "seed: install os-eval-runner engine"
git push origin main
```
1f. Verify Python 3
```bash
python3 --version   # must be 3.8+
```
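To turn the version requirement into a hard gate, a sketch:

```bash
python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)' \
  || { echo "Python 3.8+ required" >&2; exit 1; }
```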
Phase 2: Generate eval-instructions.md
Read the template:
```text
<APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-lab-setup/assets/templates/eval-instructions.template.md
# (symlink → plugins/agent-agentic-os/assets/templates/eval-instructions.template.md)
```
Replace all {{PLACEHOLDERS}} with intake values:
| Placeholder | Value |
|---|---|
| {{SKILL_DISPLAY_NAME}} | Human-readable skill name (e.g. "Link Checker") |
| {{SKILL_NAME}} | Skill folder name (e.g. link-checker-agent) |
| {{PLUGIN_DIR}} | Plugin folder name (e.g. link-checker) |
| {{MUTATION_TARGET}} | SKILL.md |
| {{GITHUB_REPO_URL}} | The GitHub URL |
| {{ROUND_LABEL}} | The round label |
| {{SKILL_EVAL_SOURCE}} | <APS_ROOT>/plugins/agent-agentic-os/skills/os-eval-runner |
| {{MASTER_PLUGIN_PATH}} | <APS_ROOT>/plugins/<plugin-name> |
Write the rendered output to <lab-repo>/eval-instructions.md.
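One way to render the template, as a sketch: it assumes the intake values are exported as environment variables (names mirror the placeholders) and contain no | characters:

```bash
TEMPLATE="$APS_ROOT/plugins/agent-agentic-os/skills/os-eval-lab-setup/assets/templates/eval-instructions.template.md"
sed -e "s|{{SKILL_DISPLAY_NAME}}|$SKILL_DISPLAY_NAME|g" \
    -e "s|{{SKILL_NAME}}|$SKILL_NAME|g" \
    -e "s|{{PLUGIN_DIR}}|$PLUGIN_DIR|g" \
    -e "s|{{MUTATION_TARGET}}|SKILL.md|g" \
    -e "s|{{GITHUB_REPO_URL}}|$GITHUB_REPO_URL|g" \
    -e "s|{{ROUND_LABEL}}|$ROUND_LABEL|g" \
    -e "s|{{SKILL_EVAL_SOURCE}}|$APS_ROOT/plugins/agent-agentic-os/skills/os-eval-runner|g" \
    -e "s|{{MASTER_PLUGIN_PATH}}|$APS_ROOT/plugins/$PLUGIN_DIR|g" \
    "$TEMPLATE" > "$LAB_REPO/eval-instructions.md"
```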
Phase 3: Confirm Ready
Report to the user:
- Lab repo path and confirmed git remote
- Files copied from master plugin
- Engine installed at `.agents/skills/os-eval-runner/`
- `eval-instructions.md` written at lab repo root
Next step: open a new Claude Code session pointed at the lab repo and say:
"Follow eval-instructions.md" — the eval agent will run the full 10-iteration loop.
When the run completes, use the os-eval-backport skill in this repo to review and
apply approved changes back to master sources.
What to Expect: Meta-Circular Improvement
When os-eval-runner is installed as a peer in the lab repo alongside the target skill,
the improvement loop may propose changes to os-eval-runner itself — its SKILL.md, scripts,
or evals — in addition to the target skill. This is expected and welcome, not a bug.
Why it happens: the agent can read all installed skills and proposes the highest-leverage
change it can find, regardless of which skill it's in. The lab copy of os-eval-runner
is a safe mutation target because:
- It's a physical copy, not a symlink to master
- `evaluate.py` still gates every change, including changes to `eval_runner.py` itself
- `os-eval-backport` review is the gate before any change reaches the canonical source
At backport review: treat changes to os-eval-runner files with extra scrutiny —
the evaluator modifying its own scoring logic is high-leverage. Verify the change doesn't
introduce a scoring bias that inflates future KEEP rates. See os-eval-backport SKILL.md
for the review checklist.
This pattern is structurally equivalent to what Meta-Harness (Lee et al., arXiv:2603.28052) calls "harness self-improvement": the outer loop discovers improvements to the evaluation machinery itself, not just the target. The backport gate is the Pareto review that controls what flows to production.