# Running Skills EDD Cycle

Run the evaluation-driven development (EDD) cycle for agent skills.
## Workflow
### Step 1: Build Evaluations First

Create evaluations BEFORE writing documentation. This ensures the skill solves real problems.
- Run Claude on representative tasks WITHOUT the skill
- Document specific failures or missing context
- Create 3+ evaluation scenarios that test these gaps
Evaluation scenarios are saved to `tests/scenarios.md` as the final step of the `/creating-effective-skills` workflow.
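The exact shape of `tests/scenarios.md` is set by the `/creating-effective-skills` workflow; as a purely illustrative sketch (the scenario name and field labels below are hypothetical), one entry might look like:

```markdown
## Scenario 1: summarize-release-notes
- Query: "Summarize what changed in the latest release."
- Baseline failure: scanned the whole repository instead of CHANGELOG.md
- Expected behavior: reads CHANGELOG.md and reports only the newest entry
```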
### Step 2: Establish Baseline
Measure Claude's performance WITHOUT the skill:
- Run each evaluation scenario
- Record: success/failure, missing context, wrong approaches
- This becomes the comparison baseline
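One lightweight way to keep baseline records comparable across runs is a fixed structure per scenario. This is a minimal sketch, not part of the workflow spec; the field names and scenario entries are invented for illustration:

```python
# Hypothetical baseline records: one entry per evaluation scenario,
# capturing the fields listed above (success, missing context, approach).
baseline = [
    {
        "scenario": "summarize-release-notes",
        "success": False,
        "missing_context": ["changelog location"],
        "wrong_approach": "scanned every file instead of CHANGELOG.md",
    },
    {
        "scenario": "bump-version",
        "success": True,
        "missing_context": [],
        "wrong_approach": None,
    },
]

def pass_rate(results):
    """Fraction of scenarios Claude already handles without the skill."""
    return sum(r["success"] for r in results) / len(results)

print(f"baseline pass rate: {pass_rate(baseline):.0%}")
```

Re-running the same scenarios after writing the skill against the same fields makes the before/after comparison mechanical rather than impressionistic.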
### Step 3: Write Minimal Instructions
Create just enough content to address the gaps:
- Start with core workflow only
- Add detail only when tests fail
- Avoid over-explaining
REQUIRED: Use the Skill tool to invoke creating-effective-skills before writing any skill content. This ensures proper naming, description format, and structure from the start.
### Step 4: Evaluate with Multiple Models
Note: This step requires Claude Code CLI. Skip if using Claude.ai.
REQUIRED: Use the Skill tool to invoke evaluating-skills-with-models with the skill path.
This will:
- Auto-load scenarios from
tests/scenarios.md - Execute with sub-agents across models (sonnet, opus, haiku)
- Evaluate against expected behaviors
- Determine recommended model (least capable with full compatibility)
After evaluation: document the recommended model in the skill's metadata.
REQUIRED: Use the Skill tool to invoke improving-skills when observations reveal issues.
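The "least capable with full compatibility" rule can be sketched as follows. This assumes the evaluation returns a per-model score and that full compatibility means a perfect score; the scores and threshold here are invented for illustration:

```python
# Models ordered from least to most capable.
CAPABILITY_ORDER = ["haiku", "sonnet", "opus"]

def recommend_model(scores, threshold=100):
    """Return the least capable model whose score meets the threshold,
    or None if no model is fully compatible (the skill needs improvement)."""
    for model in CAPABILITY_ORDER:
        if scores.get(model, 0) >= threshold:
            return model
    return None

# Example scores (hypothetical): haiku misses behaviors, sonnet and opus pass.
scores = {"haiku": 70, "sonnet": 100, "opus": 100}
print(recommend_model(scores))  # -> sonnet under these example scores
```

Picking the least capable compatible model keeps the skill usable at the lowest cost tier that still passes all expected behaviors.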
### Step 5: Final Review
Before considering the skill complete:
REQUIRED: Use the Skill tool to invoke reviewing-skills to verify compliance with best practices.
- Address all compliance issues identified
- Re-run evaluations after fixes
- Repeat until skill passes review
### Step 6: User Validation Guide
After all reviews pass, output instructions for the user to validate in a fresh session:
## Test Your Skill
Run this command in a new terminal to test with a fresh Claude session:
    claude --model {recommended_model} "{evaluation_query}"
After testing, paste the output file or result back to this session for final confirmation.
Replace:
- `{recommended_model}`: the model determined in Step 4 (e.g., `sonnet`)
- `{evaluation_query}`: a representative query from your evaluations
## Quick Reference
### Cycle
Identify gaps -> Create evaluations -> Baseline -> Write minimal -> Model eval (sub-agents) -> Review -> User validation
### What Observations Indicate
| Observation | Indicates |
|---|---|
| Unexpected file reading order | Structure not intuitive |
| Missed references | Links need to be explicit |
| Repeated reads of same file | Move content to SKILL.md |
| Never accessed file | Unnecessary or poorly signaled |