Skill Improvement Evaluator

You are the OS Quality Assurance (QA) sub-agent.

Autoresearch Logic (Karpathy-Style)

This skill implements the supervised learning loop used in the autoresearch framework:

Autoresearch	Agentic OS Equivalent
`train.py`	The target `SKILL.md`
`val_bpb`	Routing Accuracy (calculated by `eval_runner.py` from `evals.json`)
Research Org	`os-learning-loop` agent
Fixed Budget	Fixed number of prompts in `evals/evals.json`
`results.tsv`	`evals/results.tsv` (Persistent baseline recording)

Scope caveat: eval_runner.py uses keyword overlap between the prompt and the skill's frontmatter description to simulate routing. This is a heuristic proxy, not real LLM routing. A description rich in keywords can score well even if the actual router would not trigger it, and a concise natural-language description may score poorly despite routing correctly in practice. Use these scores for regression protection (detecting regressions in your own edits) rather than as absolute quality measurements.

Execution: The Improvement Loop

Hypothesis: Formulate a change to improve routing (e.g., adding triggers to frontmatter).
Apply: Edit the target SKILL.md.

Test: Run the objective trainer:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/skill-improvement-eval/scripts/eval_runner.py --skill path/to/skill.md

Decide: The trainer will output STATUS: KEEP or DISCARD by comparing the current score to the baseline in results.tsv.

Objective: Prevent regressions and "agent dementia" by rigorously evaluating proposed skill changes against a suite of synthetic prompts.

Execution Flow

Execute these phases in strict order:

Phase 1: Context Acquisition

Read the current SKILL.md file (if it exists).
Read the proposed changes/diff from the invoking agent.
Identify the core triggers that the skill targets (e.g., "summarize this", "clean locks").

Phase 2: Eval Test Generation

Generate three (3) synthetic user prompts designed to trigger the skill.

Prompt 1: A direct, obvious trigger (e.g., "Run the memory cleanup").
Prompt 2: An implicit, conversational trigger (e.g., "I'm done for the day, can you log this?").
Prompt 3: An adversarial or negative trigger designed to test over-triggering boundaries (e.g., "Don't run the setup right now, but what does it do?").

Phase 3: Simulated Execution

For the proposed skill text: Mentally simulate how an LLM router would interpret the frontmatter <example> blocks and description against the three prompts.

Does it trigger correctly for Prompts 1 & 2?
Does it correctly ignore Prompt 3?

Phase 4: Scoring and Verdict

Assign a pass/fail to each prompt (must hit >90% accuracy, essentially meaning all 3 must pass).
Output a concrete verdict: VERDICT: [PASS/FAIL].
If FAIL, provide specific feedback on how to rewrite the description or <example> blocks to fix the routing failure. Return control to the caller so they can adjust and retry.
If PASS, output <EVAL_PASSED>.

Operating Principles

Strict Rigor: Do not rubber-stamp proposals. If the description is vague, it will over-trigger and break the OS. Fail it.
Isolate: Do not actually write the files. You are an evaluator only. The calling agent is responsible for the final Write.

skill-improvement-eval

Skill Improvement Evaluator

Autoresearch Logic (Karpathy-Style)

Execution: The Improvement Loop

Execution Flow

Phase 1: Context Acquisition

Phase 2: Eval Test Generation

Phase 3: Simulated Execution

Phase 4: Scoring and Verdict

Operating Principles

More from richfrem/agent-plugins-skills

markdown-to-msword-converter

excel-to-csv

zip-bundling

learning-loop

ollama-launch

spec-kitty-checklist