skill-maker
Skill Maker
Overview
Creating skills IS Test-Driven Development applied to process documentation.
Write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
What is a Skill?
Skills are modular packages that extend Claude's capabilities with specialized knowledge, workflows, and tools — like "onboarding guides" for specific domains.
Skills provide: Specialized workflows, tool integrations, domain expertise, bundled resources (scripts, references, assets)
Skills are: Reusable techniques, patterns, tools, reference guides
Skills are NOT: Narratives, one-off solutions, or project-specific conventions (use CLAUDE.md for those)
Skill Types: Technique (concrete steps), Pattern (mental models), Reference (API docs/guides), Discipline-Enforcing (rules/requirements)
Create when: Technique wasn't obvious, applies broadly, would be referenced again, ensures consistent behavior across instances
Don't create for: One-off solutions, standard well-documented practices, project-specific conventions
Skill Anatomy
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter: name (required), description (required), allowed-tools (optional)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code (deterministic, token-efficient)
├── references/ - Documentation loaded as needed (schemas, APIs)
└── assets/ - Files used in output (templates, boilerplate)
Frontmatter rules:
name: Letters, numbers, hyphens only (max 64 chars)description: Third-person, starts with "Use when...", includes triggers AND purpose (max 1024 chars)allowed-tools: Optional list restricting tool access
Writing style: Use imperative/infinitive form ("To accomplish X, do Y"), not second person
Progressive Disclosure: Metadata always loaded (~100 words) → SKILL.md on trigger (<5k words) → Resources as needed
Reference best practices: Keep SKILL.md lean; if reference files >10k words, include grep patterns in SKILL.md; avoid duplicating content between SKILL.md and references
The Iron Law
NO SKILL WITHOUT A FAILING TEST FIRST
Applies to NEW skills AND EDITS. Write/edit skill before testing? Delete it. Start over. No exceptions — not for "simple additions", "just a section", or "documentation updates". Delete means delete.
RED-GREEN-REFACTOR for Skills
RED: Baseline Testing
Run pressure scenarios WITHOUT the skill. Document exact behavior:
- What choices did the agent make?
- What rationalizations did they use (verbatim)?
- Which pressures triggered violations?
Method: Create new conversation/subagent → present scenario → apply pressure (time constraints, sunk cost, authority, exhaustion) → document failures
GREEN: Write Minimal Skill
Write skill addressing those specific rationalizations. Don't add content for hypothetical cases.
Run same scenarios WITH skill present. Agent should now comply. Document any new violations.
REFACTOR: Close Loopholes
For each new violation: identify rationalization → add explicit counter → re-test until bulletproof.
For discipline-enforcing skills: Build rationalization table, create "Red Flags" list, add "No exceptions" counters, address "spirit vs letter" arguments.
Skill Creation Process
Step 1: Understanding
Clarify concrete examples of usage. Ask: "What should this skill support?", "What triggers it?", "How would it be used?"
UltraThink before creating structure:
"Let me ultrathink what this skill should really accomplish and how it fits the ecosystem."
Question fundamentals: Is this a skill or project docs? What's reusable vs one-off? How will Claude discover it? What rationalizations will agents use? What are we NOT including?
Step 2: Plan Resources
For each use case: How would you execute from scratch? What scripts, references, or assets would help when executing repeatedly?
Step 3: Initialize Structure
mkdir -p dev-workflow/skills/skill-name/{scripts,references,assets}
Create SKILL.md with frontmatter template:
---
name: skill-name
description: Use when [triggers] - [what it does]
allowed-tools: Read, Write, Edit, Glob, Grep
---
# Skill Name
## Overview
[Core principle in 1-2 sentences]
## When to Use
[Activation triggers; when NOT to use]
## [Main content sections]
Step 4: Execute RED-GREEN-REFACTOR
Follow the cycle above: baseline test (RED) → write content addressing failures (GREEN) → close loopholes (REFACTOR).
Skill content must answer: What is the purpose? When should it be used? How should Claude use it?
Step 5: Iterate After Deployment
Use skill on real tasks → notice struggles → baseline test again → update → verify.
Claude Search Optimization (CSO)
Description field: Start with "Use when...", use concrete triggers/symptoms, describe the problem not language-specific symptoms, write in third person.
Keywords: Include error messages, symptoms, synonyms, and tool names Claude would search for.
Naming: Use active voice, verb-first, gerunds work well (e.g., creating-skills, testing-async-code).
Token efficiency targets: Frequently-loaded skills <500 words, others <1000 words, getting-started <200 words. Move large content to references, cross-reference other skills, use one excellent example over many mediocre ones.
Testing by Skill Type
| Type | Test With | Success Criteria |
|---|---|---|
| Discipline | Academic questions + pressure scenarios (3+ combined: time, sunk cost, exhaustion) | Agent follows rule under maximum pressure |
| Technique | Application scenarios, variations, missing info tests | Agent applies technique to new scenario |
| Pattern | Recognition, application, counter-examples | Agent knows when/how to apply (and when NOT to) |
| Reference | Retrieval, application, gap testing | Agent finds and correctly applies information |
Skill Creation Checklist
Use TodoWrite to create todos for EACH item.
RED Phase:
- Create pressure scenarios (3+ combined pressures for discipline skills)
- Run scenarios WITHOUT skill - document baseline verbatim
- Identify patterns in rationalizations/failures
GREEN Phase:
- Valid frontmatter: name (letters/numbers/hyphens), description ("Use when...", third person, <1024 chars)
- Keywords throughout for search (errors, symptoms, tools)
- Clear overview with core principle
- Address specific baseline failures from RED
- One excellent example (not multi-language)
- Run scenarios WITH skill - verify compliance
REFACTOR Phase:
- Add counters for new rationalizations
- Build rationalization table and red flags list
- Re-test until bulletproof
Quality Checks:
- Overview answers: What? When? How?
- Quick reference table or bullets
- Common mistakes section
- No narrative storytelling
- Token count within targets
Deployment:
- Commit skill to git
- Update documentation if needed
Anti-Patterns
- Narrative examples ("In session 2025-10-03...") — too specific, not reusable
- Multi-language dilution (example-js, example-py, example-go) — mediocre quality, maintenance burden
- Generic labels (helper1, step3) — use semantic names
- Untested skills — guarantees issues in production
- Vague descriptions ("Helps with coding") — Claude won't know when to activate
Integration with Dev-Workflow
Leverage existing skills: test-driven-development for testing, review for code review, documentation for docs, brainstorming for design exploration.
Activation context: Consider when skill activates in the workflow (planning, implementation, quality) and whether it auto-activates or requires explicit invocation.
Tool restrictions by phase:
- Planning:
Read, Write, Glob, Grep - Implementation:
Read, Write, Edit, Glob, Grep, Bash - Quality:
Read, Grep, Glob
The Bottom Line
Creating skills IS TDD for process documentation. Same Iron Law, same cycle (RED → GREEN → REFACTOR), same benefits. If you follow TDD for code, follow it for skills.