skill-creator
Skill Creator
A skill for creating new skills and iteratively improving them in OpenCode.
The process at a high level:
- Decide what the skill should do and roughly how it should work
- Write a draft of the skill
- Create a few test prompts and run them via Task subagents
- Help the user evaluate the results (qualitatively, and quantitatively if applicable)
- Rewrite the skill based on feedback
- Repeat until satisfied
- Expand the test set and try again at larger scale
Your job is to figure out where the user is in this process and help them progress. Maybe they want to create a skill from scratch — help narrow scope, draft it, write test cases, run them, iterate. Maybe they already have a draft — jump straight to evaluation. Be flexible: if the user says "just vibe with me," skip the formal eval loop.
Communicating with the User
Pay attention to context cues about the user's technical familiarity. People from all backgrounds now use AI coding tools. In the default case:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion," watch for cues from the user before using them without explanation
When in doubt, briefly explain a term with a short definition.
Creating a Skill
Capture Intent
Start by understanding the user's intent. The conversation might already contain a workflow the user wants to capture (e.g., "turn this into a skill"). If so, extract answers from conversation history first — tools used, step sequences, corrections made, input/output formats observed. The user fills gaps and confirms before proceeding.
- What should this skill enable the agent to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases? Skills with objectively verifiable outputs (file transforms, data extraction, code generation) benefit from test cases. Subjective skills (writing style, creative work) often don't. Suggest the appropriate default, but let the user decide.
Interview and Research
Proactively ask about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until this part is ironed out.
Check available MCPs — if useful for research (searching docs, finding similar skills), research in parallel via Task subagents. Come prepared with context to reduce burden on the user.
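For example, a research subagent could be spawned with a prompt along these lines (the wording is illustrative; the Task format is covered in more detail below):
Task(subagent_type="general", prompt="""
Research how <topic> is typically handled: search available docs and any similar existing skills.
Return: a short summary of common approaches, pitfalls, and anything that should shape the skill's instructions.
""")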
Write the SKILL.md
Based on the interview, fill in these components:
- name: Skill identifier (lowercase, hyphens, matches directory name)
- description: When to trigger, what it does. This is the primary triggering mechanism — include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Make descriptions slightly "pushy" to combat under-triggering. For example, instead of "Build dashboards for internal data," write "Build dashboards for internal data. Use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of data, even if they don't explicitly ask for a 'dashboard.'"
- compatibility: Required tools, dependencies (optional, rarely needed)
- body: the rest of the skill — the Markdown instructions the agent follows
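For example, the frontmatter for the dashboard skill described above might look like this (the name is illustrative):
---
name: internal-dashboards
description: Build dashboards for internal data. Use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of data, even if they don't explicitly ask for a "dashboard."
---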
Skill Writing Guide
Anatomy of a Skill
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
For detailed anatomy patterns, see anatomy.md. For YAML frontmatter spec, see frontmatter.md. For bundled resource details, see bundled-resources.md.
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) — Always in context (~100 words)
- SKILL.md body — Loaded when skill triggers (<500 lines ideal)
- Bundled resources — Loaded on demand (unlimited, scripts execute without loading)
Word counts are approximate; go longer if needed.
Key patterns:
- Keep SKILL.md under 500 lines; as you approach the limit, add hierarchy (move detail into clearly referenced files)
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
For more, see progressive-disclosure.md.
Domain organization — when a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
The agent reads only the relevant reference file.
Principle of Lack of Surprise
Skills must not contain malware, exploit code, or content that could compromise security. A skill's contents, when described to the user, should not surprise them or misrepresent its intent. Don't create misleading skills or skills designed for unauthorized access or data exfiltration. Roleplay-style skills are fine.
Writing Patterns
Prefer the imperative form in instructions.
Defining output formats:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Writing Style
Explain to the model why things are important rather than piling on heavy-handed MUSTs. Use theory of mind and make the skill general, not narrow to specific examples. Write a draft, look at it with fresh eyes, and improve.
Test Cases
After writing the draft, come up with 2-3 realistic test prompts — things a real user would actually say. Share them: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"
Save test cases to evals/evals.json:
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": []
}
]
}
Running and Evaluating Test Cases
This section is one continuous sequence — don't stop partway through.
Put results in a -workspace/ directory as a sibling to the skill directory. Organize by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Create directories as you go.
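One possible layout, assuming a skill named my-skill (exact file placement is described in the steps below):
my-skill-workspace/
├── iteration-1/
│   ├── eval-0/
│   │   ├── eval_metadata.json
│   │   ├── with_skill/outputs/
│   │   └── without_skill/outputs/
│   └── eval-1/
│       └── ...
└── iteration-2/
    └── ...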
Step 1: Spawn All Runs in the Same Turn
For each test case, launch two Task subagents in the same turn — one with the skill, one without. Launch everything at once so it finishes around the same time.
With-skill run — use the general subagent type:
Task(subagent_type="general", prompt="""
Execute this task using the skill instructions found at <path-to-skill/SKILL.md>.
Read the SKILL.md first, then follow its guidance to complete:
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save all outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Return: summary of what was produced, any issues encountered
""")
Baseline run — same prompt, no skill reference:
- New skill: no skill at all. Same prompt; save to without_skill/outputs/.
- Improving existing skill: the old version. Snapshot the skill before editing (cp -r), point the baseline at the snapshot, and save to old_skill/outputs/.
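A sketch of the baseline call — it mirrors the with-skill run, minus the skill reference:
Task(subagent_type="general", prompt="""
Complete this task:
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save all outputs to: <workspace>/iteration-<N>/eval-<ID>/without_skill/outputs/
- Return: summary of what was produced, any issues encountered
""")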
Write an eval_metadata.json for each test case. Give each eval a descriptive name:
{
"eval_id": 0,
"eval_name": "descriptive-name-here",
"prompt": "The user's task prompt",
"assertions": []
}
Step 2: While Runs Are in Progress, Draft Assertions
Use this time productively. Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable with descriptive names that read clearly at a glance.
Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
Update eval_metadata.json and evals/evals.json with the assertions once drafted.
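One way to fill the assertions field, sketched with plain descriptive strings (the eval name is hypothetical):
{
  "eval_id": 0,
  "eval_name": "quarterly-report-summary",
  "prompt": "The user's task prompt",
  "assertions": [
    "Output contains valid JSON",
    "Response under 200 lines"
  ]
}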
Step 3: Grade and Review
Once all runs are done:
- Grade each run — evaluate assertions against outputs, either inline or via a thrifty Task subagent. Save results to grading.json in each run directory with this format:
{
  "expectations": [
    {"text": "Output contains valid JSON", "passed": true, "evidence": "File output.json parsed successfully"},
    {"text": "Response under 200 lines", "passed": false, "evidence": "Output was 247 lines"}
  ]
}
For assertions that can be checked programmatically, write and run a script — scripts are faster, more reliable, and reusable across iterations (see the sketch after this list).
- Launch the eval viewer to let the user review outputs in their browser:
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>. If no display is available, use --static <path>.html to write a standalone HTML file instead of starting a server.
The viewer has two tabs:
- Outputs — navigate test cases, see prompts/outputs/grading, leave feedback per case
- Benchmark — quantitative comparison (pass rates, timing, tokens)
Tell the user: "I've opened the results in your browser. Review each test case and leave feedback, then come back and tell me you're done."
- Analyze patterns — surface things aggregate stats might hide: assertions that always pass regardless of skill (non-discriminating), high-variance evals, time/token tradeoffs.
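As an example of the programmatic grading mentioned above, a minimal Python sketch that checks one line-count assertion and writes grading.json (paths, file names, and the 200-line threshold are all illustrative):
import json
from pathlib import Path

# Illustrative run directory; in practice this comes from the workspace layout above
run_dir = Path("my-skill-workspace/iteration-1/eval-0/with_skill")
output_file = run_dir / "outputs" / "report.md"  # hypothetical output artifact

line_count = len(output_file.read_text().splitlines())
grading = {
    "expectations": [
        {
            "text": "Response under 200 lines",
            "passed": line_count < 200,
            "evidence": f"Output was {line_count} lines",
        }
    ]
}
(run_dir / "grading.json").write_text(json.dumps(grading, indent=2))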
Step 4: Collect Feedback
When the user says they're done, read feedback.json from the workspace:
{
"reviews": [
{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
{"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
],
"status": "complete"
}
Empty feedback means the user thought it was fine. Focus improvements on test cases with specific complaints.
Kill the viewer server when done:
kill $VIEWER_PID 2>/dev/null
Improving the Skill
This is the heart of the loop. You've run test cases, the user has reviewed results, now make it better.
How to Think About Improvements
- Generalize from feedback. The skill will be used many times across many different prompts. You and the user are iterating on a few examples because it's fast, but if the skill only works for those examples, it's useless. Rather than fiddly, overfitting changes or oppressive MUSTs, if something is stubborn, try different metaphors or recommend different working patterns. It's cheap to experiment.
- Keep the prompt lean. Remove things that aren't pulling their weight. Read the transcripts, not just final outputs — if the skill makes the agent waste time on unproductive tangents, remove those instructions.
- Explain the why. Try hard to explain the why behind everything. LLMs are smart: they have good theory of mind and, given a good harness, can go beyond rote instructions. Even if user feedback is terse, understand the task and transmit that understanding into the instructions. If you're writing ALWAYS or NEVER in caps or using rigid structures, that's a yellow flag — reframe and explain the reasoning so the model understands why it matters. That's more effective.
- Look for repeated work across test cases. Read transcripts and notice whether agents independently wrote similar helper scripts or took the same multi-step approach. If all test cases produce similar boilerplate, bundle that script in scripts/ and tell the skill to use it (as sketched below). That saves every future invocation from reinventing the wheel.
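For instance, if every run re-implements the same cleanup step, the bundled version might look like this (names are hypothetical):
my-skill/
├── SKILL.md          ← instructs: "Run scripts/clean_input.py <file> before analysis"
└── scripts/
    └── clean_input.py
The skill body then tells the agent to run the script instead of rewriting it each time.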
This task matters. Take your time and mull things over. Write a draft revision, look at it fresh, and improve. Get into the user's head and understand what they need.
The Iteration Loop
After improving the skill:
- Apply improvements
- Rerun all test cases into a new iteration-<N>/ directory, including baselines
- Present results with comparison to previous iteration
- Wait for user review
- Read new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- Feedback is all empty (everything looks good)
- You're not making meaningful progress
Description Optimization
The description field in SKILL.md frontmatter is the primary mechanism that determines whether the agent invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Step 1: Generate Trigger Eval Queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Queries must be realistic, specific, and detailed. Include file paths, personal context, column names, URLs, backstory. Mix lengths and focus on edge cases.
Bad: "Format this data", "Extract text from PDF", "Create a chart"
Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"
Should-trigger (8-10): Different phrasings of the same intent — formal, casual. Cases where the user doesn't name the skill but clearly needs it. Uncommon use cases. Cases where this skill competes with another but should win.
Should-not-trigger (8-10): Near-misses sharing keywords but needing something different. Adjacent domains, ambiguous phrasing where naive keyword match would trigger but shouldn't. Don't make them obviously irrelevant — they should be genuinely tricky.
Save as JSON:
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
Step 2: Review with User
Present the eval set. Let them edit queries, toggle should-trigger, add/remove entries. This step matters — bad eval queries lead to bad descriptions.
Step 3: Iterate on the Description
For each query, assess whether the current description would trigger correctly. For failures, analyze why and propose improvements. Re-test. Iterate until triggering accuracy is high on both should-trigger and should-not-trigger cases.
Focus on held-out test performance (not just the queries you tuned against) to avoid overfitting.
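A small Python sketch for tallying accuracy once each query has been judged — it assumes a hypothetical triggered field has been added to each entry of the eval-query JSON, recording whether the description fired:
import json

with open("trigger_evals.json") as f:  # file name is illustrative
    entries = json.load(f)

def accuracy(subset):
    return sum(e["triggered"] == e["should_trigger"] for e in subset) / len(subset)

should = [e for e in entries if e["should_trigger"]]
should_not = [e for e in entries if not e["should_trigger"]]
print(f"should-trigger accuracy:     {accuracy(should):.0%}")
print(f"should-not-trigger accuracy: {accuracy(should_not):.0%}")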
Step 4: Apply the Result
Update the skill's SKILL.md frontmatter with the improved description. Show the user before/after and report accuracy.
How Skill Triggering Works
Skills appear in the agent's available_skills list with name + description. The agent decides whether to load a skill based on that description. The agent only consults skills for tasks it can't easily handle on its own — simple one-step queries may not trigger even if the description matches, because the agent handles them directly. Complex, multi-step, or specialized queries reliably trigger when the description matches.
Eval queries should be substantive enough that the agent would benefit from consulting a skill.
Updating an Existing Skill
If the user is updating an existing skill (not creating new):
- Preserve the original name. Note the directory name and name: frontmatter field — use them unchanged.
- Copy to a writable location before editing if the installed path is read-only. Copy to /tmp/skill-name/, edit there, then copy back.
Skill Locations in OpenCode
| Priority | Location |
|---|---|
| 1 | .opencode/skills/<name>/ (project-level) |
| 2 | ~/.config/opencode/skills/<name>/ (global) |
Discovery walks up from CWD to git root. First match wins for duplicate names.
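For example, scaffolding a skill by hand at either location (the directory name must match the name: field):
mkdir -p .opencode/skills/my-skill             # project-level
mkdir -p ~/.config/opencode/skills/my-skill    # global
Or use scripts/init_skill.sh, listed below.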
Reference Files
| File | Purpose |
|---|---|
| anatomy.md | Skill directory structures |
| frontmatter.md | YAML spec, naming, validation |
| progressive-disclosure.md | Token-efficient design |
| bundled-resources.md | scripts/, references/, assets/ |
| patterns.md | Real-world skill patterns |
| gotchas.md | Common mistakes + fixes |
Scripts
| Script | Purpose |
|---|---|
| scripts/init_skill.sh | Scaffold new skill directory |
| scripts/validate_skill.sh | Validate skill structure and frontmatter |
| scripts/package_skill.sh | Create distributable zip |
Pre-Flight Checklist
Before finalizing a skill:
- SKILL.md starts with --- (line 1, no blank lines)
- name: field present, matches directory name
- description: includes what + when to use (slightly pushy)
- Closing --- after frontmatter
- SKILL.md under 500 lines (use references/ for more)
- All internal links resolve
- Description tested against realistic trigger queries
Run: ./scripts/validate_skill.sh ./my-skill
Core Loop Summary
- Figure out what the skill is about
- Draft or edit the skill
- Run test prompts via Task subagents (with-skill + baseline, in parallel)
- Evaluate outputs with the user (qualitative review + quantitative evals if applicable)
- Improve the skill based on feedback
- Repeat until satisfied
- Optimize the description for triggering accuracy
- Validate and package
Track progress through this loop via TodoWrite to make sure steps don't get skipped.
See Also
- Cloudflare Skill — Reference implementation of a large platform skill