
skill-forge-evolve


Skill Evolution & Improvement

Process

Step 1: Diagnose the Issue

Ask the user or analyze logs to identify the problem category:

Category A: Triggering Issues

  • Under-triggering: Skill doesn't activate when it should
  • Over-triggering: Skill activates when it shouldn't
  • Mis-triggering: Wrong sub-skill activates

Category B: Execution Issues

  • Incomplete workflows: Skill stops before finishing
  • Incorrect output: Results don't match expectations
  • Missing error handling: Failures not handled gracefully
  • Performance: Too slow or uses too many tokens

Category C: Architecture Issues

  • Missing capability: New use case not covered
  • Scale issues: Skill too large, needs decomposition
  • Cross-reference problems: Links to non-existent files

Category D: Quality Issues

  • Inconsistent results: Different outputs for same input
  • Vague instructions: Claude interprets differently each time
  • Missing examples: No concrete guidance

Step 2: Apply Fix by Category

Fix: Under-Triggering

  1. Read current description
  2. Identify missing trigger phrases
  3. Add domain keywords and paraphrases
  4. Add file type mentions if relevant
  5. Test with 5 queries that should now trigger

Common causes:

  • Description too generic
  • Missing common paraphrases
  • Technical jargon without lay terms

Fix template:

# Before (under-triggers)
description: Analyzes code quality

# After (specific triggers)
description: >
  Static code analysis and quality assessment. Checks code style,
  complexity, security vulnerabilities, and test coverage. Use when
  user says "code review", "code quality", "lint", "static analysis",
  "code smell", "code audit", or "check my code".

Fix: Over-Triggering

  1. Read current description
  2. Identify why unrelated queries trigger it
  3. Add negative triggers ("Do NOT use for...")
  4. Make description more specific
  5. Test with 5 queries that should NOT trigger

Fix template:

# Before (over-triggers)
description: Processes documents for review

# After (specific + negative triggers)
description: >
  Processes PDF legal documents for contract clause extraction and
  compliance review. Use for legal contracts, NDAs, terms of service.
  Do NOT use for general document editing, formatting, or non-legal PDFs.
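As a quick smoke test for both the under- and over-triggering fixes, a rough keyword-overlap check can flag obvious misses before running the full eval set. This is only a heuristic sketch; the real trigger decision is semantic, and the queries below are placeholders.

import re

# Hypothetical candidate description and test queries; keyword overlap is only
# a proxy for the model's semantic trigger decision.
CANDIDATE = (
    "Processes PDF legal documents for contract clause extraction and "
    "compliance review. Use for legal contracts, NDAs, terms of service. "
    "Do NOT use for general document editing, formatting, or non-legal PDFs."
)
SHOULD_TRIGGER = [
    "review this NDA for risky clauses",
    "check this contract for compliance issues",
]
SHOULD_NOT_TRIGGER = [
    "summarize this research paper",
    "translate this page into French",
]

def keywords(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def overlaps(query, description):
    return bool(keywords(query) & keywords(description))

for q in SHOULD_TRIGGER:
    print(("OK  " if overlaps(q, CANDIDATE) else "MISS") + f" should trigger: {q}")
for q in SHOULD_NOT_TRIGGER:
    print(("WARN" if overlaps(q, CANDIDATE) else "OK  ") + f" should NOT trigger: {q}")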

Fix: Execution Issues

  1. Identify the failing step in the workflow
  2. Add explicit validation gates between steps
  3. Add error handling with clear recovery instructions
  4. Add "If X fails, then Y" fallback paths
  5. Consider adding a script for fragile operations
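A minimal sketch of what such a script might look like, assuming a hypothetical fragile step (extract_tables.py and its arguments are placeholders). The exit codes let SKILL.md state an explicit fallback: "If the script exits non-zero, fall back to manual extraction."

#!/usr/bin/env python3
"""Wrap a fragile operation in a script with clear errors and exit codes."""
import json
import sys
from pathlib import Path

def extract_tables(pdf_path: Path) -> list[dict]:
    # Placeholder for the fragile operation (e.g. a PDF table parser).
    raise NotImplementedError("plug in the real extraction here")

def main() -> int:
    if len(sys.argv) != 2:
        print("usage: extract_tables.py <pdf>", file=sys.stderr)
        return 2
    pdf = Path(sys.argv[1])
    if not pdf.is_file():
        print(f"error: {pdf} not found", file=sys.stderr)
        return 2
    try:
        tables = extract_tables(pdf)
    except Exception as exc:  # fragile step failed: report and signal the fallback path
        print(f"extraction failed: {exc}", file=sys.stderr)
        return 1
    json.dump(tables, sys.stdout, indent=2)
    return 0

if __name__ == "__main__":
    sys.exit(main())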

Fix: Quality Issues

  1. Replace vague instructions with specific ones
  2. Add concrete examples of expected input/output
  3. Add explicit "do this, not that" comparisons
  4. Add quality check steps before final output
  5. Consider adding a validation script
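A minimal sketch of a quality-gate check the skill could run before delivering output. The required sections and word threshold are assumptions; adapt them to the skill.

REQUIRED_SECTIONS = ["## Summary", "## Findings", "## Recommendations"]

def quality_gate(report: str) -> list[str]:
    """Return a list of failed checks; an empty list means the output passes."""
    failures = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in report]
    if len(report.split()) < 100:
        failures.append("report shorter than 100 words")
    return failures

draft = "## Summary\n...\n## Findings\n..."
for failure in quality_gate(draft):
    print("FAIL:", failure)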

Step 3: Iteration Workspace Protocol

Use structured workspaces to track improvements across iterations:

eval-workspace/
  iteration-1/          # First version
    eval-0/with_skill/  # Eval results
    eval-0/baseline/
    benchmark.json      # Aggregated metrics
    benchmark.md        # Human-readable report
    feedback.json       # User feedback
  iteration-2/          # After first improvement
    eval-0/with_skill/
    eval-0/baseline/
    benchmark.json
    benchmark.md
    feedback.json

The iteration loop:

  1. Apply the fix to the skill
  2. Run /skill-forge eval <path>, writing results into iteration-<N+1>/
  3. Run /skill-forge benchmark <path> with --previous iteration-<N>/
  4. Review benchmark comparison for regressions
  5. Collect user feedback into feedback.json
  6. Read feedback and iterate (back to Step 2)

Stop iterating when:

  • User says they're happy
  • All feedback is empty (everything looks good)
  • Benchmark shows no meaningful progress between iterations
  • Pass rate meets the defined thresholds
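A sketch of the last two stop conditions as a script, assuming benchmark.json carries a top-level "pass_rate" field (the real schema may differ). The threshold and delta values are illustrative.

import json
from pathlib import Path

WORKSPACE = Path("eval-workspace")
THRESHOLD = 0.90      # target pass rate
MIN_DELTA = 0.02      # below this, progress is not meaningful

def pass_rate(iteration: int) -> float:
    data = json.loads((WORKSPACE / f"iteration-{iteration}" / "benchmark.json").read_text())
    return data["pass_rate"]

current, previous = pass_rate(2), pass_rate(1)
if current >= THRESHOLD:
    print(f"stop: pass rate {current:.0%} meets the threshold")
elif current - previous < MIN_DELTA:
    print(f"stop: only {current - previous:+.1%} change since the last iteration")
else:
    print("continue iterating")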

Step 3b: Self-Annealing Loop

For quick fixes without full eval pipeline:

1. Apply the fix
2. Test with the original failing case
3. Test with 3 other cases (regression check)
4. If fix works:
   -> Update the directive/SKILL.md
   -> Document the learning in references or SKILL.md
5. If fix fails:
   -> Diagnose why
   -> Try alternative approach
   -> Repeat

Step 3c: Description Optimization Loop

For triggering issues (Category A), use the automated optimization loop:

  1. Generate trigger eval set: python scripts/generate_eval_set.py <path>
  2. Review and refine the eval set with the user
  3. Run optimization: python scripts/optimize_description.py <path> --eval-set evals.json
  4. Review the train/test split scores and improvement suggestions
  5. Apply suggested description changes
  6. Re-run optimization to measure improvement
  7. Select the description with the highest test score (not train — avoids overfitting)
  8. Iterate up to 5 times or until test score plateaus
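The sketch below illustrates the train/test selection idea in step 7; it is not the internals of optimize_description.py. The evals.json fields, the candidate descriptions, and the triggers() stand-in are all assumptions.

import json
import random
from pathlib import Path

def triggers(description, query):
    # Stand-in for the real check (e.g. asking the model whether the skill activates).
    return any(w in description.lower() for w in query.lower().split() if len(w) > 3)

# Assumed format: [{"query": "...", "should_trigger": true}, ...]
cases = json.loads(Path("evals.json").read_text())
random.seed(0)
random.shuffle(cases)
split = int(0.7 * len(cases))
train, test = cases[:split], cases[split:]

def score(description, subset):
    hits = sum(triggers(description, c["query"]) == c["should_trigger"] for c in subset)
    return hits / len(subset)

candidates = ["<candidate description A>", "<candidate description B>"]
best = max(candidates, key=lambda d: score(d, test))   # select on test score, not train
print("best by test score:", best)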

Step 4: Architecture Evolution

When a skill outgrows its tier:

Tier 1 -> Tier 2 (needs scripts):

  1. Identify the fragile/deterministic operation
  2. Create script in scripts/
  3. Update SKILL.md to reference the script
  4. Test script independently
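For step 4, a small smoke test can exercise the new script before it is wired into SKILL.md. The script name, arguments, and fixture path below are placeholders.

import subprocess
import sys

result = subprocess.run(
    [sys.executable, "scripts/extract_tables.py", "tests/fixtures/sample.pdf"],
    capture_output=True, text=True,
)
assert result.returncode == 0, result.stderr
assert result.stdout.strip(), "script produced no output"
print("script smoke test passed")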

Tier 2 -> Tier 3 (needs sub-skills):

  1. Identify distinct workflows that can be separated
  2. Extract each into its own skills/{parent}-{child}/SKILL.md
  3. Update main SKILL.md with routing table
  4. Move shared knowledge to references/
  5. Update install.sh

Tier 3 -> Tier 4 (needs agents):

  1. Identify workflows that can run in parallel
  2. Create agent definitions in agents/
  3. Update the audit/full-analysis sub-skill to delegate to agents
  4. Test parallel execution

Step 5: Version Management

After evolution:

  1. Update metadata.version in frontmatter (if present)
  2. Add learning to SKILL.md or reference file
  3. Update any affected cross-references
  4. Re-run validation: python scripts/validate_skill.py <path>
  5. Test full workflow end-to-end
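A sketch of step 1 as a one-off patch bump, assuming the frontmatter contains a plain "version: X.Y.Z" line (real frontmatter may nest it under metadata:, in which case adjust the pattern).

import re
from pathlib import Path

skill = Path("SKILL.md")
text = skill.read_text()

def bump_patch(match: re.Match) -> str:
    major, minor, patch = map(int, match.group(1).split("."))
    return f"version: {major}.{minor}.{patch + 1}"

new_text, count = re.subn(r"version:\s*(\d+\.\d+\.\d+)", bump_patch, text, count=1)
if count:
    skill.write_text(new_text)
    print("version bumped")
else:
    print("no version field found; update metadata.version manually")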

Common Evolution Patterns

Pattern: Adding Industry Detection

When a skill needs to adapt behavior by user type:

## Industry Detection
Detect user type from context:
- **Type A**: [signals] -> [behavior]
- **Type B**: [signals] -> [behavior]

Pattern: Adding Quality Gates

When output quality is inconsistent:

## Quality Gates
Before delivering output:
- [ ] [Check 1]
- [ ] [Check 2]
- [ ] [Check 3]

Pattern: Adding Scoring

When users need measurable output:

## Scoring (0-100)
| Category | Weight |
|----------|--------|
| Category A | 30% |
| Category B | 30% |
| Category C | 20% |
| Category D | 20% |
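The table implies a simple weighted average; a minimal sketch (category names and sub-scores are placeholders):

weights = {"Category A": 0.30, "Category B": 0.30, "Category C": 0.20, "Category D": 0.20}
scores = {"Category A": 80, "Category B": 65, "Category C": 90, "Category D": 70}

total = sum(weights[c] * scores[c] for c in weights)   # weighted average on a 0-100 scale
print(f"overall score: {total:.0f}/100")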