# skill-refiner

**Skill Refiner: Iterative Self-Improvement Loop**

Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch. Orchestrates repeated score-improve-verify cycles using skill-creator as the engine and mandatory peer review as an adversarial check (cross-model when a secondary harness is available, fresh-context self-review as the minimum fallback).
## When to use
- Batch-improving the entire skill collection after a period of manual edits
- Targeted improvement of one named skill when the user explicitly asks for skill-refiner, multiple iterations, a score target, or "no ceiling" polish
- Running quality sweeps before a release or publish
- Triggering a self-improvement cycle where skills bootstrap each other
- After adding several new skills that need polish and consistency alignment
- When cross-model perspective would catch single-model blind spots
- Periodic maintenance: scheduled improvement runs to keep skills current
## When NOT to use
- Creating a new skill from scratch - use skill-creator (Mode 1)
- Single-pass review of one skill without iteration, scoring, or peer review - use skill-creator (Mode 2)
- One-off collection audit without iteration - use skill-creator (Mode 3)
- Full codebase review (code, not skills) - use full-review
- Style/slop audit on application code - use anti-slop
## Configuration

```
skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]
```

| Flag | Default | Description |
|---|---|---|
| `--iterations` | 10 | Maximum iterations for phase 1 |
| `--mode` | circuit-breaker | `auto`, `circuit-breaker`, or `step` |
| `--secondary` | auto-detect | Secondary harness for cross-model review, or `none` |
| `--threshold` | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| `--plateau` | 2 | Minimum score delta to keep iterating |

Environment override: `SKILL_REFINER_SECONDARY=<harness>` (CLI flag takes precedence)
## Checkpoint Modes

- `circuit-breaker` (default): runs autonomously; auto-pauses on score regression, contested major flags, or plateau. Always pauses before phase 2.
- `auto`: fully autonomous through phase 1. Still pauses before phase 2 and on contested major flags (non-configurable).
- `step`: pauses after every iteration for manual review. Best for a first run or for learning the workflow.
## Performance
- Batch similar edits and validations to reduce repeated full-collection scans.
- Prioritize low-scoring or recently changed skills before polishing already-healthy ones.
- Use focused diffs and line-count checks after each batch to avoid late cleanup churn.
## Best Practices
- Snapshot evaluation criteria before editing the skills that define the criteria.
- Revert changes that add complexity without improving behavior.
- Keep run history factual and free of unverifiable score inflation.
## Workflow

### Phase 0: Setup

1. Create feature branch: `skill-refiner/YYYY-MM-DD-HHMMSS` from current HEAD. Preserve dirty worktrees; branching isolates the run, it does not imply cleanup. If already on a run branch for this sweep, record it instead of nesting another branch. Do not mix unrelated dirty files into refiner commits; isolate them with a path-limited stash only when authorized.
2. Load run history: read `.refiner-runs.json` from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since the last run - a new model is a new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
3. Build skill inventory: list all skills, excluding phase-2 targets (skill-creator, skill-refiner) from the improvement pool.
4. Detect primary harness: check the environment to identify which AI CLI is running this session.
5. Probe for secondary harness: run the three-step validation (PATH check, config check, smoke test) per `references/harness-detection.md`. Announce the result.
6. If no secondary is found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from `references/harness-detection.md`. Label it "same-model fresh-context review" in scoring and weight it at 3% instead of 5% (the composite becomes gate/40/55/3; renormalize the missing 2% proportionally across AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (`claude -p`, `codex exec`, etc.).
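The proportional renormalization in the self-review fallback works out as follows (a sketch of the arithmetic only; the weight names are taken from the composite description above):

```python
def fallback_weights(self_check=40.0, behavioral=55.0, cross_model=5.0,
                     fallback_cross=3.0):
    """Redistribute the weight lost when cross-model review (5%) is
    replaced by a same-model fresh-context review (3%).

    The missing 2% is split proportionally between AI Self-Check (40)
    and Behavioral (55), so the three weights still sum to 100.
    """
    missing = cross_model - fallback_cross          # 2.0
    base = self_check + behavioral                  # 95.0
    return {
        "self_check": self_check + missing * self_check / base,
        "behavioral": behavioral + missing * behavioral / base,
        "cross_model": fallback_cross,
    }

w = fallback_weights()
# weights sum back to 100; self_check ~40.84, behavioral ~56.16
assert abs(sum(w.values()) - 100.0) < 1e-9
```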
### Phase 1: Regular Iterations

7. Iteration 1 - full sweep: score every skill in the pool using the four-component model from `references/evaluation-criteria.md`:
   - Structural: run lint-skills.sh + validate-spec.sh
   - AI Self-Check: invoke skill-creator review mode on each skill
   - Behavioral: run test prompts from `references/test-cases.md`. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and the quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
   - Cross-model: skip on the first iteration (no diff to review yet)
8. Log baseline scores: record per-skill and aggregate scores.
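One way to auto-generate the fallback test prompts from a skill's "When to use" section, sketched here under the assumption that the section is a flat bullet list (the prompt template is illustrative):

```python
import re

def generate_test_prompts(skill_md: str, max_prompts: int = 3) -> list[str]:
    """Turn a skill's "When to use" bullets into rough behavioral test
    prompts. A fallback for skills without hand-written test cases;
    the heading and bullet format are assumed, not guaranteed."""
    # Capture the text between "When to use" and the next section.
    match = re.search(r"When to use\n(.*?)(?=\nWhen NOT to use|\n#|\Z)",
                      skill_md, re.DOTALL)
    if not match:
        return []
    bullets = [line.lstrip("- ").strip()
               for line in match.group(1).splitlines()
               if line.strip().startswith("-")]
    return [f"Scenario: {b}. Respond as this skill would."
            for b in bullets[:max_prompts]]
```

As the workflow notes, prompts produced this way are weaker than hand-written cases and should be logged as such.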
9. Iteration 2+: enter adaptive focus mode. For a user-requested single-skill run, treat that skill as the whole phase-1 pool, run at least the requested iteration count, and keep iterating until the explicit score target is reached, quality plateaus, or a circuit breaker fires.
   - Select targets: identify skills scoring below the focus threshold.
10. For each targeted skill, run the improvement cycle:
    a. Read the current SKILL.md and all reference files
    b. Invoke skill-creator review mode - collect findings
    c. Run behavioral test - score current output quality
    d. Propose targeted improvements based on findings (not random changes)
    e. Apply changes to SKILL.md (and references if needed)
    f. Re-score: run lint + AI Self-Check + behavioral test
    g. Karpathy gate: if the score improved, keep. If not, revert. No exceptions.
    h. If cross-model review is available, send the diff to the secondary harness
    i. Process flags per the `references/harness-detection.md` verification protocol
    j. If the secondary flags a major issue and the primary agrees: revert
    k. If the secondary flags a major issue and the primary disagrees: escalate to circuit breaker
11. Commit iteration: one commit with all improvements from this iteration.
    Format: `refactor(skill-refiner): iteration N - skill1(+X), skill2(+Y)`
12. Log iteration summary:

    ```
    --- iteration N / max -------------------------------------------
    improved:  skill1 (72 > 80 | G:pass A:76 B:78 X:90),
               skill2 (68 > 73 | G:pass A:70 B:72 X:100)
    gated:     skillZ (lint/spec failed - excluded from scoring)
    skipped:   M skills above threshold
    reverted:  skill3 (proposed change scored -2, rolled back | G:pass A:74 B:69 X:100)
    contested: skill4 (secondary flagged major, primary disagreed)
    plateau:   yes/no (max delta: +X)
    -----------------------------------------------------------------
    ```

13. Check termination conditions (phase 1 always flows into phase 2 on termination, except on circuit-breaker pauses, which wait for user input first):
    - Plateau detected (max delta < plateau threshold)? Terminate phase 1.
    - All skills above focus threshold? Bump the threshold by 5 and continue. If the threshold is already at max (95) and all skills still clear it, terminate phase 1.
    - Iteration cap reached? Terminate phase 1.
    - Circuit breaker triggered? Pause for user input.
14. Repeat from step 9 until terminated.
### Phase 2: Meta-Improvement

15. Announce: "Entering phase 2 - meta-improvement. This always requires human review."
16. Snapshot evaluation criteria:
    - Copy skill-creator's AI Self-Check section to a temp location
    - Copy `references/evaluation-criteria.md` to a temp location
    - Copy skill-creator's `conventions.md` reference to a temp location
    These snapshots are the evaluation baseline for phase 2.
17. Improve skill-creator: run the improvement cycle (steps 10a-10k) using the snapshot as the evaluation criteria, not skill-creator's live version.
18. Improve skill-refiner: same process, using the snapshot.
19. Improve lint scripts (lint-skills.sh, validate-spec.sh):
    - Capture baseline: run both scripts, save full output
    - Propose improvements
    - Apply changes
    - Run regression: compare output to baseline
    - If false positives or false negatives were introduced: revert
    - If clean: keep
20. Commit phase 2: one commit per target.
    Format: `refactor(skill-refiner): meta - improve <target> (+N)`
21. Pause for human review: display phase 2 changes, wait for approval. This checkpoint is non-configurable - it fires even in `--mode auto`. A direct user approval such as "continue" or "proceed" counts as approval to resume.
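The baseline/regression comparison in step 19 amounts to diffing captured output before and after a script change. A minimal sketch, using a stand-in script since the real lint-skills.sh is assumed to live in the collection:

```python
import os
import subprocess
import tempfile

def script_output(path: str) -> str:
    """Run a lint script and capture everything it prints."""
    r = subprocess.run(["sh", path], capture_output=True, text=True)
    return r.stdout + r.stderr

# Stand-in lint script; the real lint-skills.sh would be used instead.
demo = os.path.join(tempfile.mkdtemp(), "lint-skills.sh")
with open(demo, "w") as f:
    f.write('echo "WARN skill-x: line too long"\n')

baseline = script_output(demo)      # capture baseline before editing
# ... propose and apply changes to the script here ...
after = script_output(demo)         # re-run after changes

# Identical output means no new false positives or negatives: keep.
verdict = "keep" if after == baseline else "revert"
```

In practice a unified diff of the two captures is more useful than a boolean, since it shows exactly which findings appeared or vanished.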
### Phase 3: Summary

22. Final report: write a human-readable report first, then machine-readable run history. Include branch, pool, config, every changed skill, score before/after, delta, files changed, verification commands, peer-review flags, reverted changes, private-skill handling, and skipped checks. Do not output only JSON.

    ```
    === skill-refiner run complete ===================================
    Branch:     skill-refiner/YYYY-MM-DD-HHMMSS
    Primary:    <harness> <version> (<model>, effort: <level>)
    Secondary:  <harness> <version> (<model>, effort: <level>) | none
    Pool:       N skills (skill-creator, skill-refiner excluded)
    Config:     iterations=M, threshold=T, mode=MODE, plateau=P
    Iterations: N (of max M)
    Terminated: plateau / threshold / cap / user
    Score changes:
      skill1: 62 > 88 (+26) [G:pass A:84 B:86 X:90]
      skill2: 71 > 85 (+14) [G:pass A:82 B:79 X:100]
      ...
      skill-creator: 80 > 84 (+4) [G:pass A:82 B:81 X:100] [meta]
      skill-refiner: 78 > 83 (+5) [G:pass A:80 B:79 X:100] [meta]
    Aggregate:  avg X.X | min X.X | max X.X
    Reverted:   X changes across Y iterations
    Contested:  Z flags escalated to human
    =================================================================
    ```

23. Write run history: append this run's metadata to `.refiner-runs.json` in the collection root. Include: run_id, branch, date, primary/secondary harness+model+effort, config, pool size, termination reason, cross-model flag counts, before/after per-skill scores (component breakdown + composite, or clearly labeled estimates if the run used a targeted manual rubric instead of the full automated sweep), and a changes summary. When updating an existing history file, append the new object without reserializing the whole file; do not normalize or rewrite old entries just because a JSON writer changes escaping, commas, or whitespace. Commit with the phase 3 summary.
24. Announce branch: remind the user to review and merge when ready.
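Appending without reserializing can be done by splicing the new entry in front of the array's closing bracket, so earlier entries stay byte-for-byte intact. A sketch, assuming `.refiner-runs.json` holds a top-level JSON array (the skill's actual file layout may differ):

```python
import json
import os

def append_run(path: str, run: dict) -> None:
    """Append one run object to a JSON-array history file by splicing
    before the final ']', leaving earlier entries untouched."""
    entry = json.dumps(run, indent=2)
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write("[\n" + entry + "\n]\n")
        return
    with open(path, "r+") as f:
        text = f.read()
        end = text.rindex("]")            # closing bracket of the array
        head = text[:end].rstrip()
        # Need a comma only if at least one entry already exists.
        sep = ",\n" if head.endswith("}") else "\n"
        f.seek(0)
        f.write(head + sep + entry + "\n]\n")
        f.truncate()
```

This keeps old entries' escaping, commas, and whitespace exactly as written, which is the point of the no-normalization rule above.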
## AI Self-Check
Before committing any skill modification, verify:
- Lint passes: lint-skills.sh exits 0 for the modified skill
- Spec valid: validate-spec.sh exits 0 for the modified skill
- Score improved: composite score is strictly higher than before the change
- No content regression: change does not remove critical sections, warnings, or cross-references without replacement
- Simplicity maintained: change does not add unnecessary complexity for marginal gains
- Cross-references intact: all skill names in bold still resolve to existing skills
- Target ~500 lines: modified SKILL.md stays near 500 lines. Hard max 600
- ASCII only: no non-ASCII characters introduced (except allowed emoji indicators)
- Immutability respected: no phase-1 modification to evaluation criteria, test cases, lint scripts, skill-creator, or skill-refiner
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Score discipline kept: changes are kept only when they improve measured quality or fix a verified defect
- Local-only scope respected: public and private skills are separated before commits or release notes
## Output Contract

See skills/_shared/output-contract.md for the full contract.

- Skill name: SKILL-REFINER
- Deliverable bucket: `audits`
- Mode: conditional. When invoked to analyze, review, audit, or improve existing repo content outside the refiner workflow, emit the full contract -- boxed inline header, body summary inline plus per-finding detail in the deliverable file, boxed conclusion, conclusion table -- and write the deliverable to `docs/local/audits/skill-refiner/<YYYY-MM-DD>-<slug>.md`. When invoked to run the refiner workflow (its primary mode), use the existing Phase 3 "Final report" format described in the workflow; that build-mode output is unchanged by this contract.
- Severity scale: `P0 | P1 | P2 | P3 | info` (see shared contract; only used in audit/review mode).
## Rules

- Immutability in phase 1: never modify `references/evaluation-criteria.md`, `references/test-cases.md`, lint-skills.sh, validate-spec.sh, skill-creator, or skill-refiner during phase 1. Violation = abort the run.
- Karpathy gate: only directional improvements survive. If a change does not improve the composite score, revert it. No exceptions, no "it looks better."
- Verify flags: never take cross-model flags at face value. The primary reviews every flag independently. Disagreements on major flags go to a human.
- Snapshot before meta: always snapshot evaluation criteria before phase 2. Evaluate against the snapshot, never the live version being modified.
- Phase 2 always pauses: even in `--mode auto`. Non-configurable.
- Contested major flags always pause: even in `--mode auto`. Non-configurable.
- Simplicity criterion: all else being equal, simpler is better. Deletions that maintain the score are preferred over additions that marginally improve it.
- One commit per iteration: bundle improvements, include score deltas in the message.
- Branch isolation: all work happens on a feature branch. Never modify main directly.
- Human-readable report required: every run ends with a report that names changes, before/after scores, verification, peer-review flags, skipped checks, and the next action.
- Read before edit: always read the full skill before proposing changes. Never edit from memory or assumption.
## Related Skills

- **skill-creator** - the evaluation and improvement engine. skill-refiner invokes skill-creator's review mode (Mode 2) for scoring and its improve mode for generating changes. skill-creator handles individual skill quality; skill-refiner handles iteration, prioritization, and orchestration. Primary dependency.
- **full-review** - one-off collection audit across code-review, anti-slop, security-audit, and update-docs. Use full-review for a single pass over application code; use skill-refiner for iterative improvement of skill files.
- **anti-slop** - code quality patterns. skill-refiner may invoke anti-slop principles through skill-creator during improvement, but does not call anti-slop directly. Different domain: anti-slop audits application code, skill-refiner audits skill files.