/evaluate:improve

Analyze evaluation results and suggest concrete improvements to a skill.

When to Use This Skill

Use this skill when...	Use alternative when...
Have eval results and want to improve the skill	Need to run evals first -> `/evaluate:skill`
Want to improve skill description for better triggering	Want to view raw results -> `/evaluate:report`
Iterating on a skill to increase pass rate	Want to file a bug -> `/feedback:session`
Optimizing skill instructions after benchmarking	Need structural fixes -> `plugin-compliance-check.sh`

Parameters

Parse these from $ARGUMENTS:

Parameter	Default	Description
`<plugin/skill-name>`	required	Path as `plugin-name/skill-name`
`--apply`	false	Apply approved changes to SKILL.md
`--description-only`	false	Focus on description improvements only

Execution

Step 1: Load eval results

Read the most recent benchmark from:

<plugin-name>/skills/<skill-name>/eval-results/benchmark.json

If no results exist, suggest running /evaluate:skill first and stop.

Also read the current SKILL.md to understand the skill.

Step 2: Analyze results

Delegate analysis to the eval-analyzer agent via Task:

Task subagent_type: eval-analyzer
Prompt: Analyze these evaluation results and identify improvement opportunities.
  Skill: <path to SKILL.md>
  Benchmark: <benchmark.json contents>
  Mode: comparison (if baseline data exists) or benchmark (otherwise)

The analyzer produces categorized suggestions:

instructions: Execution flow improvements
description: Better intent-matching text
examples: Missing or insufficient examples
error_handling: Missing edge cases
tools: Better tool configurations
structure: Organizational improvements

Step 3: Filter suggestions

If --description-only, filter to only description category suggestions.

Sort remaining suggestions by priority (high > medium > low).

Step 4: Present suggestions

Present the categorized suggestions to the user:

## Improvement Suggestions: <plugin/skill-name>

Current pass rate: 72%

### High Priority

1. **[instructions]** Add explicit error handling for missing git config
   Evidence: eval-003 fails because the skill doesn't check for git user.name

2. **[description]** Add "conventional commit" as trigger phrase
   Evidence: Skill not selected when user says "make a conventional commit"

### Medium Priority

3. **[examples]** Add breaking change example to execution steps
   Evidence: eval-004 inconsistently handles breaking changes

### Low Priority

4. **[structure]** Move flag reference to Quick Reference table
   Evidence: Flags scattered across multiple sections

If --apply is NOT set, stop here.

Step 5: Apply changes (if --apply)

Use AskUserQuestion to let the user select which suggestions to apply:

Which improvements should I apply?
[x] Add error handling for missing git config
[x] Add trigger phrases to description
[ ] Add breaking change example
[ ] Restructure flag reference

For each approved suggestion:

Read the current SKILL.md
Apply the change using Edit
Update the modified date in frontmatter

After applying changes, update (or create) the history file at:

<plugin-name>/skills/<skill-name>/eval-results/history.json

Add a new iteration entry recording:

Version number (increment from previous)
Timestamp
Pass rate from current benchmark
Summary of changes made

Step 6: Suggest re-evaluation

After applying changes, suggest:

Changes applied. Run `/evaluate:skill <plugin/skill-name>` to measure improvement.

Agentic Optimizations

Context	Command
Read benchmark	`cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq .summary`
Read skill	`cat <plugin>/skills/<skill>/SKILL.md`
Read history	`cat <plugin>/skills/<skill>/eval-results/history.json \| jq '.iterations[-1]'`
Check pass rate	`cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq '.summary.with_skill.mean_pass_rate'`

Quick Reference

Flag	Description
`--apply`	Apply approved changes to SKILL.md
`--description-only`	Focus on description improvements only

evaluate-improve