running-skills-edd-cycle
Running Skills EDD Cycle
Run evaluation-driven development cycle for agent skills.
Workflow
Step 1: Build Evaluations First
Create evaluations BEFORE writing documentation. This ensures skills solve real problems.
- Run Claude on representative tasks WITHOUT the skill
- Document specific failures or missing context
- Create 3+ evaluation scenarios that test these gaps
Evaluation scenarios are saved to tests/scenarios.md as the final step of /creating-effective-skills workflow.
Step 2: Establish Baseline
Measure Claude's performance WITHOUT the skill:
- Run each evaluation scenario
- Record: success/failure, missing context, wrong approaches
- This becomes comparison baseline
Step 3: Write Minimal Instructions
Create just enough content to address the gaps:
- Start with core workflow only
- Add detail only when tests fail
- Avoid over-explaining
REQUIRED: Use the Skill tool to invoke creating-effective-skills before writing any skill content. This ensures proper naming, description format, and structure from the start.
Step 4: Evaluate with Multiple Models
Note: This step requires Claude Code CLI. Skip if using Claude.ai.
REQUIRED: Use the Skill tool to invoke evaluating-skills-with-models with the skill path.
This will:
- Auto-load scenarios from
tests/scenarios.md - Execute with sub-agents across models (sonnet, opus, haiku)
- Evaluate against expected behaviors
- Determine recommended model (least capable with full compatibility)
After evaluation: Document recommended model in skill's metadata.
REQUIRED: Use the Skill tool to invoke improving-skills when observations reveal issues.
Step 5: Final Review
Before considering the skill complete:
REQUIRED: Use the Skill tool to invoke reviewing-skills to verify compliance with best practices.
- Address all compliance issues identified
- Re-run evaluations after fixes
- Repeat until skill passes review
Step 6: User Validation Guide
After all reviews pass, output instructions for user to validate in a fresh session:
## Test Your Skill
Run this command in a new terminal to test with a fresh Claude session:
claude --model {recommended_model} "{evaluation_query}"
After testing, paste the output file or result back to this session for final confirmation.
Replace:
{recommended_model}: Model determined in Step 4 (e.g.,sonnet){evaluation_query}: A representative query from your evaluations
Quick Reference
Cycle
Identify gaps -> Create evaluations -> Baseline -> Write minimal -> Model eval (sub-agents) -> Review -> User validation
What Observations Indicate
| Observation | Indicates |
|---|---|
| Unexpected file reading order | Structure not intuitive |
| Missed references | Links need to be explicit |
| Repeated reads of same file | Move content to SKILL.md |
| Never accessed file | Unnecessary or poorly signaled |