# skill-refiner

**Skill Refiner: Iterative Self-Improvement Loop**

Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch. Orchestrates repeated score-improve-verify cycles using skill-creator as the engine and mandatory peer review as an adversarial check (cross-model when a secondary harness is available, fresh-context self-review as the minimum fallback).
## When to use
- Batch-improving the entire skill collection after a period of manual edits
- Targeted improvement of one named skill when the user explicitly asks for skill-refiner, multiple iterations, a score target, or "no ceiling" polish
- Running quality sweeps before a release or publish
- Triggering a self-improvement cycle where skills bootstrap each other
- After adding several new skills that need polish and consistency alignment
- When cross-model perspective would catch single-model blind spots
- Periodic maintenance: scheduled improvement runs to keep skills current
## When NOT to use
- Creating a new skill from scratch - use skill-creator (Mode 1)
- Single-pass review of one skill without iteration, scoring, or peer review - use skill-creator (Mode 2)
- One-off collection audit without iteration - use skill-creator (Mode 3)
- Full codebase review (code, not skills) - use full-review
- Style/slop audit on application code - use anti-slop
## Configuration

```
skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]
```

| Flag | Default | Description |
|---|---|---|
| `--iterations` | 10 | Maximum iterations for phase 1 |
| `--mode` | circuit-breaker | `auto`, `circuit-breaker`, or `step` |
| `--secondary` | auto-detect | Secondary harness for cross-model review, or `none` |
| `--threshold` | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| `--plateau` | 2 | Minimum score delta to keep iterating |

Environment override: `SKILL_REFINER_SECONDARY=<harness>` (CLI flag takes precedence)
## Checkpoint Modes

- `circuit-breaker` (default): runs autonomously; auto-pauses on score regression, contested major flags, or plateau. Always pauses before phase 2.
- `auto`: fully autonomous through phase 1. Still pauses before phase 2 and on contested major flags (non-configurable).
- `step`: pauses after every iteration for manual review. Best for a first run or for learning the workflow.
## Performance
- Batch similar edits and validations to reduce repeated full-collection scans.
- Prioritize low-scoring or recently changed skills before polishing already-healthy ones.
- Use focused diffs and line-count checks after each batch to avoid late cleanup churn.
## Best Practices
- Snapshot evaluation criteria before editing the skills that define the criteria.
- Revert changes that add complexity without improving behavior.
- Keep run history factual and free of unverifiable score inflation.
## Workflow

### Phase 0: Setup

1. Create feature branch: `skill-refiner/YYYY-MM-DD-HHMMSS` from current HEAD. Preserve dirty worktrees; branching isolates the run, it does not imply cleanup. If already on a run branch for this sweep, record it instead of nesting another branch. Do not mix unrelated dirty files into refiner commits; isolate them with a path-limited stash only when authorized.
2. Load run history: read `.refiner-runs.json` from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since the last run - a new model is a new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
3. Build skill inventory: list all skills, excluding phase-2 targets (skill-creator, skill-refiner) from the improvement pool.
4. Detect primary harness: check the environment to identify which AI CLI is running this session.
5. Probe for secondary harness: run the three-step validation (PATH check, config check, smoke test) per `references/harness-detection.md`. Announce the result.
6. If no secondary is found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from `references/harness-detection.md`. Label it "same-model fresh-context review" in scoring and weight it at 3% instead of 5% (the composite becomes gate/40/55/3; renormalize the missing 2% proportionally across AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (`claude -p`, `codex exec`, etc.).
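The proportional renormalization in the self-review fallback works out as follows (a sketch of the arithmetic only; the weight names are taken from the composite description above):

```python
def fallback_weights(self_check=40.0, behavioral=55.0, cross_model=5.0,
                     fallback_cross=3.0):
    """Redistribute the weight lost when cross-model review (5%) is
    replaced by a same-model fresh-context review (3%).

    The missing 2% is split proportionally between AI Self-Check (40)
    and Behavioral (55), so the three weights still sum to 100.
    """
    missing = cross_model - fallback_cross          # 2.0
    base = self_check + behavioral                  # 95.0
    return {
        "self_check": self_check + missing * self_check / base,
        "behavioral": behavioral + missing * behavioral / base,
        "cross_model": fallback_cross,
    }

w = fallback_weights()
# weights sum back to 100; self_check ~40.84, behavioral ~56.16
assert abs(sum(w.values()) - 100.0) < 1e-9
```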
### Phase 1: Regular Iterations

7. Iteration 1 - full sweep: score every skill in the pool using the four-component model from `references/evaluation-criteria.md`:
   - Structural: run lint-skills.sh + validate-spec.sh
   - AI Self-Check: invoke skill-creator review mode on each skill
   - Behavioral: run test prompts from `references/test-cases.md`. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and the quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
   - Cross-model: skip on the first iteration (no diff to review yet)
8. Log baseline scores: record per-skill and aggregate scores.
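One way to auto-generate the fallback test prompts from a skill's "When to use" section, sketched here under the assumption that the section is a flat bullet list (the prompt template is illustrative):

```python
import re

def generate_test_prompts(skill_md: str, max_prompts: int = 3) -> list[str]:
    """Turn a skill's "When to use" bullets into rough behavioral test
    prompts. A fallback for skills without hand-written test cases;
    the heading and bullet format are assumed, not guaranteed."""
    # Capture the text between "When to use" and the next section.
    match = re.search(r"When to use\n(.*?)(?=\nWhen NOT to use|\n#|\Z)",
                      skill_md, re.DOTALL)
    if not match:
        return []
    bullets = [line.lstrip("- ").strip()
               for line in match.group(1).splitlines()
               if line.strip().startswith("-")]
    return [f"Scenario: {b}. Respond as this skill would."
            for b in bullets[:max_prompts]]
```

As the workflow notes, prompts produced this way are weaker than hand-written cases and should be logged as such.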
9. Iteration 2+: enter adaptive focus mode. For a user-requested single-skill run, treat that skill as the whole phase-1 pool, run at least the requested iteration count, and keep iterating until the explicit score target is reached, quality plateaus, or a circuit breaker fires.
   - Select targets: identify skills scoring below the focus threshold.
10. For each targeted skill, run the improvement cycle:
    a. Read the current SKILL.md and all reference files
    b. Invoke skill-creator review mode - collect findings
    c. Run behavioral test - score current output quality
    d. Propose targeted improvements based on findings (not random changes)
    e. Apply changes to SKILL.md (and references if needed)
    f. Re-score: run lint + AI Self-Check + behavioral test
    g. Karpathy gate: if the score improved, keep. If not, revert. No exceptions.
    h. If cross-model review is available, send the diff to the secondary harness
    i. Process flags per the `references/harness-detection.md` verification protocol
    j. If the secondary flags a major issue and the primary agrees: revert
    k. If the secondary flags a major issue and the primary disagrees: escalate to circuit breaker
11. Commit iteration: one commit with all improvements from this iteration.
    Format: `refactor(skill-refiner): iteration N - skill1(+X), skill2(+Y)`
12. Log iteration summary:

    ```
    --- iteration N / max -------------------------------------------
    improved:  skill1 (72 > 80 | G:pass A:76 B:78 X:90),
               skill2 (68 > 73 | G:pass A:70 B:72 X:100)
    gated:     skillZ (lint/spec failed - excluded from scoring)
    skipped:   M skills above threshold
    reverted:  skill3 (proposed change scored -2, rolled back | G:pass A:74 B:69 X:100)
    contested: skill4 (secondary flagged major, primary disagreed)
    plateau:   yes/no (max delta: +X)
    -----------------------------------------------------------------
    ```

13. Check termination conditions (phase 1 always flows into phase 2 on termination, except on circuit-breaker pauses, which wait for user input first):
    - Plateau detected (max delta < plateau threshold)? Terminate phase 1.
    - All skills above focus threshold? Bump the threshold by 5 and continue. If the threshold is already at max (95) and all skills still clear it, terminate phase 1.
    - Iteration cap reached? Terminate phase 1.
    - Circuit breaker triggered? Pause for user input.
14. Repeat from step 9 until terminated.
### Phase 2: Meta-Improvement

15. Announce: "Entering phase 2 - meta-improvement. This always requires human review."
16. Snapshot evaluation criteria:
    - Copy skill-creator's AI Self-Check section to a temp location
    - Copy `references/evaluation-criteria.md` to a temp location
    - Copy skill-creator's `conventions.md` reference to a temp location
    These snapshots are the evaluation baseline for phase 2.
17. Improve skill-creator: run the improvement cycle (steps 10a-10k) using the snapshot as the evaluation criteria, not skill-creator's live version.
18. Improve skill-refiner: same process, using the snapshot.
19. Improve lint scripts (lint-skills.sh, validate-spec.sh):
    - Capture baseline: run both scripts, save full output
    - Propose improvements
    - Apply changes
    - Run regression: compare output to baseline
    - If false positives or false negatives were introduced: revert
    - If clean: keep
20. Commit phase 2: one commit per target.
    Format: `refactor(skill-refiner): meta - improve <target> (+N)`
21. Pause for human review: display phase 2 changes, wait for approval. This checkpoint is non-configurable - it fires even in `--mode auto`. A direct user approval such as "continue" or "proceed" counts as approval to resume.
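The baseline/regression comparison in step 19 amounts to diffing captured output before and after a script change. A minimal sketch, using a stand-in script since the real lint-skills.sh is assumed to live in the collection:

```python
import os
import subprocess
import tempfile

def script_output(path: str) -> str:
    """Run a lint script and capture everything it prints."""
    r = subprocess.run(["sh", path], capture_output=True, text=True)
    return r.stdout + r.stderr

# Stand-in lint script; the real lint-skills.sh would be used instead.
demo = os.path.join(tempfile.mkdtemp(), "lint-skills.sh")
with open(demo, "w") as f:
    f.write('echo "WARN skill-x: line too long"\n')

baseline = script_output(demo)      # capture baseline before editing
# ... propose and apply changes to the script here ...
after = script_output(demo)         # re-run after changes

# Identical output means no new false positives or negatives: keep.
verdict = "keep" if after == baseline else "revert"
```

In practice a unified diff of the two captures is more useful than a boolean, since it shows exactly which findings appeared or vanished.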
### Phase 3: Summary

22. Final report: write a human-readable report first, then machine-readable run history. Include branch, pool, config, every changed skill, score before/after, delta, files changed, verification commands, peer-review flags, reverted changes, private-skill handling, and skipped checks. Do not output only JSON.

    ```
    === skill-refiner run complete ===================================
    Branch:     skill-refiner/YYYY-MM-DD-HHMMSS
    Primary:    <harness> <version> (<model>, effort: <level>)
    Secondary:  <harness> <version> (<model>, effort: <level>) | none
    Pool:       N skills (skill-creator, skill-refiner excluded)
    Config:     iterations=M, threshold=T, mode=MODE, plateau=P
    Iterations: N (of max M)
    Terminated: plateau / threshold / cap / user
    Score changes:
      skill1: 62 > 88 (+26) [G:pass A:84 B:86 X:90]
      skill2: 71 > 85 (+14) [G:pass A:82 B:79 X:100]
      ...
      skill-creator: 80 > 84 (+4) [G:pass A:82 B:81 X:100] [meta]
      skill-refiner: 78 > 83 (+5) [G:pass A:80 B:79 X:100] [meta]
    Aggregate:  avg X.X | min X.X | max X.X
    Reverted:   X changes across Y iterations
    Contested:  Z flags escalated to human
    =================================================================
    ```

23. Write run history: append this run's metadata to `.refiner-runs.json` in the collection root. Include: run_id, branch, date, primary/secondary harness+model+effort, config, pool size, termination reason, cross-model flag counts, before/after per-skill scores (component breakdown + composite, or clearly labeled estimates if the run used a targeted manual rubric instead of the full automated sweep), and a changes summary. When updating an existing history file, append the new object without reserializing the whole file; do not normalize or rewrite old entries just because a JSON writer changes escaping, commas, or whitespace. Commit with the phase 3 summary.
24. Announce branch: remind the user to review and merge when ready.
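Appending without reserializing can be done by splicing the new entry in front of the array's closing bracket, so earlier entries stay byte-for-byte intact. A sketch, assuming `.refiner-runs.json` holds a top-level JSON array (the skill's actual file layout may differ):

```python
import json
import os

def append_run(path: str, run: dict) -> None:
    """Append one run object to a JSON-array history file by splicing
    before the final ']', leaving earlier entries untouched."""
    entry = json.dumps(run, indent=2)
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write("[\n" + entry + "\n]\n")
        return
    with open(path, "r+") as f:
        text = f.read()
        end = text.rindex("]")            # closing bracket of the array
        head = text[:end].rstrip()
        # Need a comma only if at least one entry already exists.
        sep = ",\n" if head.endswith("}") else "\n"
        f.seek(0)
        f.write(head + sep + entry + "\n]\n")
        f.truncate()
```

This keeps old entries' escaping, commas, and whitespace exactly as written, which is the point of the no-normalization rule above.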
## AI Self-Check
Before committing any skill modification, verify:
- Lint passes: lint-skills.sh exits 0 for the modified skill
- Spec valid: validate-spec.sh exits 0 for the modified skill
- Score improved: composite score is strictly higher than before the change
- No content regression: change does not remove critical sections, warnings, or cross-references without replacement
- Simplicity maintained: change does not add unnecessary complexity for marginal gains
- Cross-references intact: all skill names in bold still resolve to existing skills
- Target ~500 lines: modified SKILL.md stays near 500 lines. Hard max 600
- ASCII only: no non-ASCII characters introduced (except allowed emoji indicators)
- Immutability respected: no phase-1 modification to evaluation criteria, test cases, lint scripts, skill-creator, or skill-refiner
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Score discipline kept: changes are kept only when they improve measured quality or fix a verified defect
- Local-only scope respected: public and private skills are separated before commits or release notes
## Output Contract

See skills/_shared/output-contract.md for the full contract.

- Skill name: SKILL-REFINER
- Deliverable bucket: `audits`
- Mode: conditional. When invoked to analyze, review, audit, or improve existing repo content outside the refiner workflow, emit the full contract -- boxed inline header, body summary inline plus per-finding detail in the deliverable file, boxed conclusion, conclusion table -- and write the deliverable to `docs/local/audits/skill-refiner/<YYYY-MM-DD>-<slug>.md`. When invoked to run the refiner workflow (its primary mode), use the existing Phase 3 "Final report" format described in the workflow; that build-mode output is unchanged by this contract.
- Severity scale: `P0 | P1 | P2 | P3 | info` (see shared contract; only used in audit/review mode).
## Rules

- Immutability in phase 1: never modify `references/evaluation-criteria.md`, `references/test-cases.md`, lint-skills.sh, validate-spec.sh, skill-creator, or skill-refiner during phase 1. Violation = abort the run.
- Karpathy gate: only directional improvements survive. If a change does not improve the composite score, revert it. No exceptions, no "it looks better."
- Verify flags: never take cross-model flags at face value. The primary reviews every flag independently. Disagreements on major flags go to a human.
- Snapshot before meta: always snapshot evaluation criteria before phase 2. Evaluate against the snapshot, never the live version being modified.
- Phase 2 always pauses: even in `--mode auto`. Non-configurable.
- Contested major flags always pause: even in `--mode auto`. Non-configurable.
- Simplicity criterion: all else being equal, simpler is better. Deletions that maintain the score are preferred over additions that marginally improve it.
- One commit per iteration: bundle improvements, include score deltas in the message.
- Branch isolation: all work happens on a feature branch. Never modify main directly.
- Human-readable report required: every run ends with a report that names changes, before/after scores, verification, peer-review flags, skipped checks, and the next action.
- Read before edit: always read the full skill before proposing changes. Never edit from memory or assumption.
## Related Skills

- **skill-creator** - the evaluation and improvement engine. skill-refiner invokes skill-creator's review mode (Mode 2) for scoring and its improve mode for generating changes. skill-creator handles individual skill quality; skill-refiner handles iteration, prioritization, and orchestration. Primary dependency.
- **full-review** - one-off collection audit across code-review, anti-slop, security-audit, and update-docs. Use full-review for a single pass over application code; use skill-refiner for iterative improvement of skill files.
- **anti-slop** - code quality patterns. skill-refiner may invoke anti-slop principles through skill-creator during improvement, but does not call anti-slop directly. Different domain: anti-slop audits application code, skill-refiner audits skill files.