skill-creator
Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Evaluate the results
- which can be through automated evals, but also it's totally fine and good for them to be evaluated by the human by hand and that's often the only way
- Rewrite the skill based on feedback from the evaluation
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Cool? Cool.
Building Blocks
The skill-creator operates on composable building blocks. Each has well-defined inputs and outputs.
| Building Block | Input | Output | Agent |
|---|---|---|---|
| Eval Run | skill + eval prompt + files | transcript, outputs, metrics | agents/executor.md |
| Grade Expectations | outputs + expectations | pass/fail per expectation | agents/grader.md |
| Blind Compare | output A, output B, eval prompt | winner + reasoning | agents/comparator.md |
| Post-hoc Analysis | winner + skills + transcripts | improvement suggestions | agents/analyzer.md |
Eval Run
Executes a skill on an eval prompt and produces measurable outputs.
- Input: Skill path, eval prompt, input files
- Output: transcript.md, outputs/, metrics.json
- Metrics captured: Tool calls, execution steps, output size, errors
Grade Expectations
Evaluates whether outputs meet defined expectations.
- Input: Expectations list, transcript, outputs directory
- Output: grading.json with pass/fail per expectation plus evidence
- Purpose: Objective measurement of skill performance
Blind Compare
Compares two outputs without knowing which skill produced them.
- Input: Output A path, Output B path, eval prompt, expectations (optional)
- Output: Winner (A/B/TIE), reasoning, quality scores
- Purpose: Unbiased comparison between skill versions
Post-hoc Analysis
After blind comparison, analyzes WHY the winner won.
- Input: Winner identity, both skills, both transcripts, comparison result
- Output: Winner strengths, loser weaknesses, improvement suggestions
- Purpose: Generate actionable improvements for next iteration
Environment Capabilities
Check whether you can spawn subagents — independent agents that execute tasks in parallel. If you can, you'll delegate work to executor, grader, comparator, and analyzer agents. If not, you'll do all work inline, sequentially.
This affects which modes are available and how they execute. The core workflows are the same — only the execution strategy changes.
Mode Workflows
Building blocks combine into higher-level workflows for each mode:
| Mode | Purpose | Workflow |
|---|---|---|
| Eval | Test skill performance | Executor → Grader → Results |
| Improve | Iteratively optimize skill | Executor → Grader → Comparator → Analyzer → Apply |
| Create | Interactive skill development | Interview → Research → Draft → Run → Refine |
| Benchmark | Standardized performance measurement (requires subagents) | 3x runs per configuration → Aggregate → Analyze |
See references/mode-diagrams.md for detailed visual workflow diagrams.
Task Tracking
Use tasks to track progress on multi-step workflows.
Task Lifecycle
Each eval run becomes a task with stage progression:
pending → planning → implementing → reviewing → verifying → completed
(prep) (executor) (grader) (validate)
Creating Tasks
When running evals, create a task per eval run:
TaskCreate(
subject="Eval 0, run 1 (with_skill)",
description="Execute skill eval 0 with skill and grade expectations",
activeForm="Preparing eval 0"
)
Updating Stages
Progress through stages as work completes:
TaskUpdate(task, status="planning") # Prepare files, stage inputs
TaskUpdate(task, status="implementing") # Spawn executor subagent
TaskUpdate(task, status="reviewing") # Spawn grader subagent
TaskUpdate(task, status="verifying") # Validate outputs exist
TaskUpdate(task, status="completed") # Done
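As a rough illustration of the verifying stage, a check like the one below could confirm that a run left behind the files the later steps rely on. This is a minimal sketch, not part of the skill-creator scripts: `missing_run_artifacts` and its default file list are hypothetical, based on the Eval Run and Grade Expectations outputs described above, and the exact layout varies by mode.

```python
from pathlib import Path

# Assumed default: the executor's transcript and outputs directory plus the
# grader's grading.json. Adjust the tuple for the mode you are running.
DEFAULT_ARTIFACTS = ("transcript.md", "outputs", "grading.json")


def missing_run_artifacts(run_dir: Path, expected=DEFAULT_ARTIFACTS) -> list[str]:
    """Return the expected entries that do not exist under run_dir."""
    return [name for name in expected if not (run_dir / name).exists()]
```

An empty list means the run can move to completed; anything else is worth reporting before marking the task done.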
Comparison Tasks
For blind comparisons (after all runs complete):
TaskCreate(
subject="Compare skill-v1 vs skill-v2"
)
# planning = gather outputs
# implementing = spawn blind comparators
# reviewing = tally votes, handle ties
# verifying = if tied, run more comparisons or use efficiency
# completed = declare winner
Architecture
The coordinator (this skill):
- Asks the user what they want to do and which skill to work on
- Determines workspace location (ask if not obvious)
- Creates workspace and tasks for tracking progress
- Delegates work to subagents when available, otherwise executes inline
- Tracks the best version (not necessarily the latest)
- Reports results with evidence and metrics
Agent Types
| Agent | Role | Reference |
|---|---|---|
| Executor | Run skill on a task, produce transcript + outputs + metrics | agents/executor.md |
| Grader | Evaluate expectations against transcript and outputs | agents/grader.md |
| Comparator | Blind A/B comparison between two outputs | agents/comparator.md |
| Analyzer | Post-hoc analysis of comparison results | agents/analyzer.md |
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
If you're unsure whether the user will understand a term, it's OK to add a short definition.
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies.
Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
Initialize
Run the initialization script:
scripts/init_skill.py <skill-name> --path <output-directory>
This creates:
- SKILL.md template with frontmatter
- scripts/, references/, assets/ directories
- Example files to customize or delete
Fill SKILL.md Frontmatter
Based on interview, fill:
- name: Skill identifier
- description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
Format constraint: The description MUST be a single-line value -- either plain (description: Some text) or double-quoted (description: "Some text"). NEVER use YAML block scalar syntax (>, >-, |, |-) because many marketplace parsers do not expand them and will display the raw indicator character instead of the description text. Also avoid special Unicode characters like → or — in the description; use ASCII equivalents (->, --) instead.
- compatibility: Required tools, dependencies (optional, rarely needed)
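The format constraint on the description can be checked mechanically before packaging. The sketch below is one way to do it, assuming you already have the raw `description:` line from the frontmatter as a string. `check_description_line` is a hypothetical helper, not one of the bundled scripts.

```python
import re

# The four block scalar indicators named in the format constraint.
BLOCK_SCALAR_INDICATORS = (">", ">-", "|", "|-")


def check_description_line(line: str) -> list[str]:
    """Return a list of problems found in a `description:` frontmatter line."""
    match = re.match(r"^description:\s*(.*)$", line)
    if match is None:
        return ["line does not start with 'description:'"]
    value = match.group(1).strip()
    problems = []
    if not value:
        problems.append("description is empty")
    if value in BLOCK_SCALAR_INDICATORS:
        problems.append("uses a YAML block scalar indicator")
    if not value.isascii():
        problems.append("contains non-ASCII characters")
    return problems
```

A line such as `description: >-` is flagged because the real text would sit on the following lines, which is exactly the case the constraint warns about.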
Skill Writing Guide
Anatomy of a Skill
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
What NOT to include: README.md, INSTALLATION_GUIDE.md, CHANGELOG.md, or any auxiliary documentation. Skills are for AI agents, not human onboarding.
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - In context whenever skill triggers (<500 lines ideal)
- Bundled resources - As needed (unlimited, scripts can execute without loading)
These word counts are approximate and you can feel free to go longer if needed.
Key patterns:
- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
Claude reads only the relevant reference file.
Principle of Lack of Surprise
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not surprise the user in their intent if described. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.
Writing Patterns
Prefer using the imperative form in instructions.
Defining output formats - You can do it like this:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Immediate Feedback Loop
Always have something cooking. Every time user adds an example or input:
- Immediately start running it - don't wait for full specification
- Show outputs in workspace - tell user: "The output is at X, take a look"
- First runs in main agent loop - not subagent, so user sees the transcript
- Seeing what Claude does helps user understand and refine requirements
Writing Style
Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
If the user wants evals, create evals/evals.json with this structure:
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": [],
"assertions": [
"The output includes X",
"The skill correctly handles Y"
]
}
]
}
You can initialize with scripts/init_json.py evals evals/evals.json and validate with scripts/validate_json.py evals/evals.json. See references/schemas.md for the full schema.
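For a quick sanity check while drafting evals by hand, something like the following may help. It is only a sketch of the shape shown in the example above, not the logic of scripts/validate_json.py, and the full schema in references/schemas.md may require more than this. `find_eval_problems` is a hypothetical name.

```python
# Field names taken from the example structure; treat the list as an assumption.
REQUIRED_EVAL_FIELDS = ("id", "prompt", "expected_output", "files", "assertions")


def find_eval_problems(data: dict) -> list[str]:
    """Return human-readable problems found in an evals.json-style dict."""
    problems = []
    if not isinstance(data.get("skill_name"), str):
        problems.append("skill_name must be a string")
    evals = data.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        return problems
    for index, entry in enumerate(evals):
        if not isinstance(entry, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        for field in REQUIRED_EVAL_FIELDS:
            if field not in entry:
                problems.append(f"evals[{index}] is missing '{field}'")
    return problems


example = {
    "skill_name": "example-skill",
    "evals": [
        {
            "id": 1,
            "prompt": "User's task prompt",
            "expected_output": "Description of expected result",
            "files": [],
            "assertions": ["The output includes X"],
        }
    ],
}
```

Calling `find_eval_problems(example)` returns an empty list for the structure shown above.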
Transition to Automated Iteration
Once gradable criteria are defined (expectations, success metrics), Claude can:
- More aggressively suggest improvements
- Run tests automatically (via subagents in the background if available, otherwise sequentially)
- Present results: "I tried X, it improved pass rate by Y%"
Package and Present (only if present_files tool is available)
Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
scripts/package_skill.py <path/to/skill-folder>
After packaging, direct the user to the resulting .skill file path so they can install it.
Improving a skill
When user asks to improve a skill, ask:
- Which skill? - Identify the skill to improve
- How much time? - How long can Claude spend iterating?
- What's the goal? - Target quality level, specific issues to fix, or general improvement
Claude should then autonomously iterate using the building blocks (run, grade, compare, analyze) to drive the skill toward the goal within the time budget.
Some advice on writing style when improving a skill:
- Try to generalize from the feedback, rather than fixing specific examples one by one. The big picture thing that's happening here is that we're trying to create "skills" that can be used a million times (maybe literally, maybe even more who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples inside and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
- Keep the prompt lean; remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs -- if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
- Last but not least, try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind and, when given a good harness, can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then try to transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag - try to reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft skill and then looking at it anew and making improvements. Really try to get into the head of the user and understand what they want and need. Best of luck.
Setup Phase
- Read output schemas:
Read references/schemas.md # JSON structures for grading, history, comparison, analysis
This ensures you understand the structure of outputs you'll produce and validate.
- Choose workspace location:
Ask the user where to put the workspace. Suggest <skill-name>-workspace/ as a sibling to the skill directory, but let the user choose. If the workspace ends up inside a git repo, check .gitignore first. If an existing pattern already covers the workspace path (e.g., *-workspace/), do not add a duplicate entry. Only add the workspace path if no existing pattern matches.
- Copy skill to v0:
scripts/copy_skill.py <skill-path> <skill-name>-workspace/v0 --iteration 0
- Verify or create evals:
  - Check for existing evals/evals.json
  - If missing, ask user for 2-3 example tasks and create evals
  - Use scripts/init_json.py evals to create with correct structure
- Create tasks for baseline:
for run in range(3):
    TaskCreate(
        subject=f"Eval baseline, run {run+1}"
    )
- Initialize history.json:
scripts/init_json.py history <workspace>/history.json
Then edit to fill in skill_name. See references/schemas.md for full structure.
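The .gitignore check from the workspace step above could be approximated like this. It is a deliberately simple sketch: `workspace_already_ignored` is a hypothetical helper that uses shell-style matching from `fnmatch`, which covers common patterns such as `*-workspace/` but not the full gitignore rules (negation, nested paths, `**`). When git is available, `git check-ignore` is the more reliable way to ask the same question.

```python
from fnmatch import fnmatch


def workspace_already_ignored(gitignore_text: str, workspace_dir: str) -> bool:
    """Approximate check: does any .gitignore pattern already match the workspace?"""
    target = workspace_dir.strip("/")
    for raw_line in gitignore_text.splitlines():
        pattern = raw_line.strip()
        # Skip blanks, comments, and negated patterns (not handled in this sketch).
        if not pattern or pattern.startswith(("#", "!")):
            continue
        if fnmatch(target, pattern.strip("/")):
            return True
    return False
```

Only append the workspace path to .gitignore when this returns False, so an existing broader pattern is not duplicated.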
Iteration Loop
For each iteration (0, 1, 2, ...):
Step 1: Execute (3 Parallel Runs)
Spawn 3 executor subagents in parallel (or run sequentially without subagents — see "Without subagents" below). Update task to implementing stage.
Spawn a subagent for each run with these instructions:
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this task:
- Skill path: workspace/v<N>/skill/
- Task: <eval prompt from evals.json>
- Test files: <eval files if any>
- Save transcript to: workspace/v<N>/runs/run-<R>/transcript.md
- Save outputs to: workspace/v<N>/runs/run-<R>/outputs/
Step 2: Grade Assertions
Spawn grader subagents (or grade inline — see "Without subagents" below). Update task to reviewing stage.
Purpose: Grading produces structured pass/fail results for tracking pass rates over iterations. The grader also extracts claims and reads user_notes to surface issues that expectations might miss.
Set the grader up for success: The grader needs to actually inspect the outputs, not just read the transcript. If the outputs aren't plain text, tell the grader how to read them — check the skill for inspection tools it already uses and pass those as hints in the grader prompt.
Spawn a subagent with these instructions:
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from evals.json>
- Transcript: workspace/v<N>/runs/run-<R>/transcript.md
- Outputs: workspace/v<N>/runs/run-<R>/outputs/
- Save grading to: workspace/v<N>/runs/run-<R>/grading.json
To inspect output files:
<include inspection hints from the skill, e.g.:>
<"Use python -m markitdown <file> to extract text content">
Review grading.json: Check user_notes_summary for uncertainties and workarounds flagged by the executor. Also check eval_feedback — if the grader flagged lax assertions or missing coverage, update evals.json before continuing. Improving evals mid-loop is fine and often necessary; you can't meaningfully improve a skill if the evals don't measure anything real.
Eval quality loop: If eval_feedback has suggestions, tighten the assertions and rerun the evals. Keep iterating as long as the grader keeps finding issues. Once eval_feedback says the evals look solid (or has no suggestions), move on to skill improvement. Consult the user about what you're doing, but don't block on approval for each round — just keep making progress.
When picking which eval to use for the quality loop, prefer one where the skill partially succeeds — some expectations pass, some fail. An eval where everything fails gives the grader nothing to critique (there are no false positives to catch). The feedback is most useful when some expectations pass and the grader can assess whether those passes reflect genuine quality or surface-level compliance.
Step 3: Blind Compare (If N > 0)
For iterations after baseline, use blind comparison:
Purpose: While grading tracks expectation pass rates, the comparator judges holistic output quality using a rubric. Two outputs might both pass all expectations, but one could still be clearly better. The comparator uses expectations as secondary evidence, not the primary decision factor.
Blind A/B Protocol:
- Randomly assign: 50% chance v<N> is A, 50% chance v<N> is B
- Record the assignment in workspace/grading/v<N>-vs-best/assignment.json
- Comparator sees only "Output A" and "Output B" - never version names
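The protocol above might be sketched as follows. The function names and the shape of assignment.json (a plain mapping from label to version) are assumptions for illustration; check references/schemas.md for the real structure before relying on it.

```python
import json
import random
from pathlib import Path


def assign_blind_labels(new_version: str, best_version: str, rng: random.Random) -> dict:
    """Map the two versions to labels A and B, each ordering equally likely."""
    versions = [new_version, best_version]
    rng.shuffle(versions)
    return {"A": versions[0], "B": versions[1]}


def save_assignment(assignment: dict, grading_dir: Path) -> Path:
    """Write the label mapping to assignment.json inside grading_dir."""
    grading_dir.mkdir(parents=True, exist_ok=True)
    path = grading_dir / "assignment.json"
    path.write_text(json.dumps(assignment, indent=2))
    return path
```

The coordinator keeps the mapping to itself and passes only the two output paths, labelled A and B, to the comparator.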
Spawn a subagent with these instructions:
Read agents/comparator.md at: <skill-creator-path>/agents/comparator.md
Blind comparison:
- Eval prompt: <the task that was executed>
- Output A: <path to one version's output>
- Output B: <path to other version's output>
- Assertions: <list from evals.json>
You do NOT know which is old vs new. Judge purely on quality.
Determine winner by majority vote:
- If 2+ comparators prefer A: A wins
- If 2+ comparators prefer B: B wins
- Otherwise: TIE
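The vote rule above is small enough to write down directly. A minimal sketch, assuming three comparator verdicts that are each "A", "B", or "TIE" (`majority_winner` is a hypothetical name):

```python
from collections import Counter


def majority_winner(votes: list[str]) -> str:
    """Apply the 2-or-more rule to comparator verdicts; assumes three votes."""
    counts = Counter(votes)
    if counts["A"] >= 2:
        return "A"
    if counts["B"] >= 2:
        return "B"
    return "TIE"
```

Translate the winning label back to a version name with the recorded assignment only after the vote is tallied.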
Step 4: Post-hoc Analysis
After blind comparison, analyze results. Spawn a subagent with these instructions:
Read agents/analyzer.md at: <skill-creator-path>/agents/analyzer.md
Analyze:
- Winner: <A or B>
- Winner skill: workspace/<winner-version>/skill/
- Winner transcript: workspace/<winner-version>/runs/run-1/transcript.md
- Loser skill: workspace/<loser-version>/skill/
- Loser transcript: workspace/<loser-version>/runs/run-1/transcript.md
- Comparison result: <from comparator>
Step 5: Update State
Update task to completed stage. Record results:
if new_version wins majority:
current_best = new_version
# Update history.json
history.iterations.append({
"version": "v<N>",
"parent": "<previous best>",
"expectation_pass_rate": 0.85,
"grading_result": "won" | "lost" | "tie",
"is_current_best": bool
})
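A runnable version of the update above might look like this. It is a sketch under assumptions: history is an in-memory dict with an "iterations" list, the field names are copied from the pseudocode, and `record_iteration` and `current_best` are hypothetical helpers rather than bundled scripts.

```python
from typing import Optional

VALID_RESULTS = ("won", "lost", "tie")


def record_iteration(history: dict, version: str, parent: str,
                     pass_rate: float, result: str) -> dict:
    """Append one iteration entry; a win moves the best-version flag to it."""
    if result not in VALID_RESULTS:
        raise ValueError(f"unknown grading_result: {result!r}")
    is_best = result == "won"
    if is_best:
        # Only a majority win replaces the previous best.
        for entry in history["iterations"]:
            entry["is_current_best"] = False
    entry = {
        "version": version,
        "parent": parent,
        "expectation_pass_rate": pass_rate,
        "grading_result": result,
        "is_current_best": is_best,
    }
    history["iterations"].append(entry)
    return entry


def current_best(history: dict) -> Optional[str]:
    """Return the version flagged as best, or None if nothing is flagged."""
    for entry in history["iterations"]:
        if entry["is_current_best"]:
            return entry["version"]
    return None
```

Because a loss or tie leaves the flag where it was, the next version is always copied from the best performer rather than the most recent attempt.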
Step 6: Create New Version (If Continuing)
- Copy current best to new version:
scripts/copy_skill.py workspace/<current_best>/skill workspace/v<N+1> \
    --parent <current_best> \
    --iteration <N+1>
- Apply improvements from analyzer suggestions
- Create new tasks for next iteration
- Continue loop or stop if:
  - Time budget exhausted: Track elapsed time, stop when approaching limit
  - Goal achieved: Target quality level or pass rate reached
  - Diminishing returns: No significant improvement in last 2 iterations
  - User requests stop: Check for user input between iterations
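The stopping criteria above can be combined into one check between iterations. The thresholds here are assumptions, not values from this skill: "approaching limit" is read as 90% of the budget, and "significant improvement" as a gain of at least 0.02 in pass rate. `should_stop` is a hypothetical helper.

```python
from typing import Optional, Sequence


def should_stop(elapsed_minutes: float, budget_minutes: float,
                pass_rates: Sequence[float], target_pass_rate: float,
                min_gain: float = 0.02,
                user_requested_stop: bool = False) -> Optional[str]:
    """Return the reason to stop iterating, or None to continue."""
    if user_requested_stop:
        return "user requested stop"
    if elapsed_minutes >= 0.9 * budget_minutes:
        return "time budget exhausted"
    if pass_rates and pass_rates[-1] >= target_pass_rate:
        return "goal achieved"
    if len(pass_rates) >= 3:
        # Compare the best of the last two iterations with the best before them.
        gain = max(pass_rates[-2:]) - max(pass_rates[:-2])
        if gain < min_gain:
            return "diminishing returns"
    return None
```

Returning the reason rather than a bare boolean makes it easy to include in the final report.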
Final Report
When iterations complete:
- Best Version: Which version performed best (not necessarily the last)
- Score Progression: Assertion pass rates across iterations
- Key Improvements: What changes had the most impact
- Recommendation: Whether to adopt the improved skill
Copy best skill back to main location:
cp -r workspace/<best_version>/skill/* ./
Check whether you have access to the present_files tool. If you do, package and present the improved skill, and direct the user to the resulting .skill file path so they can install it:
scripts/package_skill.py <path/to/skill-folder>
(If you don't have the present_files tool, don't run package_skill.py)
Without Subagents
Without subagents, Improve mode still works but with reduced rigor:
- Single run per iteration (not 3) — variance analysis isn't possible with one run
- Inline execution: Read agents/executor.md and follow the procedure directly in your main loop. Then read agents/grader.md and follow it directly to grade the results.
- No blind comparison: You can't meaningfully blind yourself since you have full context. Instead, compare outputs by re-reading both versions' results and analyzing the differences directly.
- No separate analyzer: Do the analysis inline after comparing — identify what improved, what regressed, and what to try next.
- Keep everything else: Version tracking, copy-iterate-grade loop, history.json, stopping criteria all work the same.
- Acknowledge reduced rigor: Without independent agents, grading is less rigorous — the same context that executed the task also grades it. Results are directional, not definitive.
Eval Mode
Run individual evals to test skill performance and grade expectations.
IMPORTANT: Before running evals, read the full documentation:
Read references/eval-mode.md # Complete Eval workflow
Read references/schemas.md # JSON output structures
Use Eval mode when:
- Testing a specific eval case
- Comparing with/without skill on a single task
- Quick validation during development
The workflow: Setup → Check Dependencies → Prepare → Execute → Grade → Display Results
Without subagents, execute and grade sequentially in the main loop. Read the agent reference files (agents/executor.md, agents/grader.md) and follow the procedures directly.
Benchmark Mode
Run standardized performance measurement with variance analysis.
Requires subagents. Benchmark mode relies on parallel execution of many runs to produce statistically meaningful results. Without subagents, use Eval mode for individual eval testing instead.
IMPORTANT: Before running benchmarks, read the full documentation:
Read references/benchmark-mode.md # Complete Benchmark workflow
Read references/schemas.md # JSON output structures
Use Benchmark mode when:
- "How does my skill perform?" - Understanding overall performance
- "Compare Sonnet vs Haiku" - Cross-model comparison
- "Has performance regressed?" - Tracking changes over time
- "Does the skill add value?" - Validating skill impact
Key differences from Eval:
- Runs all evals (not just one)
- Runs each 3 times per configuration for variance
- Always includes no-skill baseline
- Uses most capable model for analysis
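To make the variance point concrete, here is a small sketch of how three pass rates per configuration could be summarized. The input shape and `summarize_pass_rates` are assumptions for illustration; the real benchmark.json structure is defined in references/schemas.md.

```python
from statistics import mean, stdev


def summarize_pass_rates(runs: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Compute mean and sample standard deviation of pass rates per configuration."""
    summary = {}
    for config, rates in runs.items():
        summary[config] = {
            "mean": mean(rates),
            # stdev needs at least two values; report 0.0 for a single run.
            "stddev": stdev(rates) if len(rates) > 1 else 0.0,
        }
    return summary
```

A difference between with_skill and without_skill means more when it is large relative to the spread across the three runs.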
Workspace Structure
Workspaces are created as sibling directories to the skill being worked on.
parent-directory/
├── skill-name/ # The skill
│ ├── SKILL.md
│ ├── evals/
│ │ ├── evals.json
│ │ └── files/
│ └── scripts/
│
└── skill-name-workspace/ # Workspace (sibling directory)
│
│── [Eval mode]
├── eval-0/
│ ├── with_skill/
│ │ ├── inputs/ # Staged input files
│ │ ├── outputs/ # Skill outputs
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md # Executor uncertainties
│ │ │ ├── metrics.json
│ │ │ └── [output files]
│ │ ├── grading.json # Assertions + claims + user_notes_summary
│ │ └── timing.json # Wall clock timing
│ └── without_skill/
│ └── ...
├── comparison.json # Blind comparison (A/B testing)
├── summary.json # Aggregate metrics
│
│── [Improve mode]
├── history.json # Score progression across versions
├── v0/
│ ├── META.yaml # Version metadata
│ ├── skill/ # Copy of skill at this version
│ └── runs/
│ ├── run-1/
│ │ ├── transcript.md
│ │ ├── user_notes.md
│ │ ├── outputs/
│ │ └── grading.json
│ ├── run-2/
│ └── run-3/
├── v1/
│ ├── META.yaml
│ ├── skill/
│ ├── improvements/
│ │ └── suggestions.md # From analyzer
│ └── runs/
└── grading/
└── v1-vs-v0/
├── assignment.json # Which version is A vs B
├── comparison-1.json # Blind comparison results
├── comparison-2.json
├── comparison-3.json
└── analysis.json # Post-hoc analysis
│
│── [Benchmark mode]
└── benchmarks/
└── 2026-01-15T10-30-00/ # Timestamp-named directory
├── benchmark.json # Structured results (see schema)
├── benchmark.md # Human-readable summary
└── runs/
├── eval-1/
│ ├── with_skill/
│ │ ├── run-1/
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md
│ │ │ ├── outputs/
│ │ │ └── grading.json
│ │ ├── run-2/
│ │ └── run-3/
│ └── without_skill/
│ ├── run-1/
│ ├── run-2/
│ └── run-3/
└── eval-2/
└── ...
Key files:
- transcript.md - Execution log from executor
- user_notes.md - Uncertainties and workarounds flagged by executor
- metrics.json - Tool calls, output size, step count
- grading.json - Assertion pass/fail, notes, user_notes summary
- timing.json - Wall clock duration
- comparison-N.json - Blind rubric-based comparison
- analysis.json - Post-hoc analysis with improvement suggestions
- history.json - Version progression with pass rates and winners
- benchmark.json - Structured benchmark results with runs, run_summary, notes
- benchmark.md - Human-readable benchmark summary
Coordinator Responsibilities
The coordinator must:
- Delegate to subagents when available; otherwise execute inline - In Improve, Eval, and Benchmark modes, use subagents for executor/grader work when possible. Without subagents, read the agent reference files and follow the procedures directly.
- Create mode exception - Run examples in main loop so user sees the transcript (interactive feedback matters more than consistency)
- Use independent grading when possible - Spawn separate grader/comparator agents for unbiased evaluation. Without subagents, grade inline but acknowledge the limitation.
- Track progress with tasks - Create tasks, update stages, mark complete
- Track best version - The best performer, not the latest iteration
- Run multiple times for variance - 3 runs per configuration when subagents are available; 1 run otherwise
- Parallelize independent work - When subagents are available, spawn independent work in parallel
- Report results clearly - Display pass/fail with evidence and metrics
- Review user_notes - Check executor's user_notes.md for issues that passed expectations might miss
- Capture execution metrics - In Benchmark mode, record tokens/time/tool_calls from each execution
- Use most capable model for analysis - Benchmark analyzer should use the smartest available model
Delegating Work
There are two patterns for delegating work to building blocks:
With subagents: Spawn an independent agent with the reference file instructions. Include the reference file path in the prompt so the subagent knows its role. When tasks are independent (like 3 runs of the same version), spawn all subagents in the same turn for parallelism.
Without subagents: Read the agent reference file (e.g., agents/executor.md) and follow the procedure directly in your main loop. Execute each step sequentially — the procedures are designed to work both as subagent instructions and as inline procedures.
Conclusion
Just pasting in the overall workflow again for reference:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Evaluate the results
- which can be through automated evals, but also it's totally fine and good for them to be evaluated by the human by hand and that's often the only way
- Rewrite the skill based on feedback from the evaluation
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Good luck!