Skill Maker

Overview

Creating skills IS Test-Driven Development applied to process documentation.

Write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).

Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.

What is a Skill?

Skills are modular packages that extend Claude's capabilities with specialized knowledge, workflows, and tools — like "onboarding guides" for specific domains.

Skills provide: Specialized workflows, tool integrations, domain expertise, bundled resources (scripts, references, assets)

Skills are: Reusable techniques, patterns, tools, reference guides

Skills are NOT: Narratives, one-off solutions, or project-specific conventions (use CLAUDE.md for those)

Skill Types: Technique (concrete steps), Pattern (mental models), Reference (API docs/guides), Discipline-Enforcing (rules/requirements)

Create when: Technique wasn't obvious, applies broadly, would be referenced again, ensures consistent behavior across instances

Don't create for: One-off solutions, standard well-documented practices, project-specific conventions

Skill Anatomy

skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter: name (required), description (required), allowed-tools (optional)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/     - Executable code (deterministic, token-efficient)
    ├── references/  - Documentation loaded as needed (schemas, APIs)
    └── assets/      - Files used in output (templates, boilerplate)

Frontmatter rules:

name: Letters, numbers, hyphens only (max 64 chars)
description: Third-person, starts with "Use when...", includes triggers AND purpose (max 1024 chars)
allowed-tools: Optional list restricting tool access

Writing style: Use imperative/infinitive form ("To accomplish X, do Y"), not second person

Progressive Disclosure: Metadata always loaded (~100 words) → SKILL.md on trigger (<5k words) → Resources as needed

Reference best practices: Keep SKILL.md lean; if reference files >10k words, include grep patterns in SKILL.md; avoid duplicating content between SKILL.md and references

The Iron Law

NO SKILL WITHOUT A FAILING TEST FIRST

Applies to NEW skills AND EDITS. Write/edit skill before testing? Delete it. Start over. No exceptions — not for "simple additions", "just a section", or "documentation updates". Delete means delete.

RED-GREEN-REFACTOR for Skills

RED: Baseline Testing

Run pressure scenarios WITHOUT the skill. Document exact behavior:

What choices did the agent make?
What rationalizations did they use (verbatim)?
Which pressures triggered violations?

Method: Create new conversation/subagent → present scenario → apply pressure (time constraints, sunk cost, authority, exhaustion) → document failures

GREEN: Write Minimal Skill

Write skill addressing those specific rationalizations. Don't add content for hypothetical cases.

Run same scenarios WITH skill present. Agent should now comply. Document any new violations.

REFACTOR: Close Loopholes

For each new violation: identify rationalization → add explicit counter → re-test until bulletproof.

For discipline-enforcing skills: Build rationalization table, create "Red Flags" list, add "No exceptions" counters, address "spirit vs letter" arguments.

Skill Creation Process

Step 1: Understanding

Clarify concrete examples of usage. Ask: "What should this skill support?", "What triggers it?", "How would it be used?"

UltraThink before creating structure:

"Let me ultrathink what this skill should really accomplish and how it fits the ecosystem."

Question fundamentals: Is this a skill or project docs? What's reusable vs one-off? How will Claude discover it? What rationalizations will agents use? What are we NOT including?

Step 2: Plan Resources

For each use case: How would you execute from scratch? What scripts, references, or assets would help when executing repeatedly?

Step 3: Initialize Structure

mkdir -p dev-workflow/skills/skill-name/{scripts,references,assets}

Create SKILL.md with frontmatter template:

---
name: skill-name
description: Use when [triggers] - [what it does]
allowed-tools: Read, Write, Edit, Glob, Grep
---

# Skill Name

## Overview
[Core principle in 1-2 sentences]

## When to Use
[Activation triggers; when NOT to use]

## [Main content sections]

Step 4: Execute RED-GREEN-REFACTOR

Follow the cycle above: baseline test (RED) → write content addressing failures (GREEN) → close loopholes (REFACTOR).

Skill content must answer: What is the purpose? When should it be used? How should Claude use it?

Step 5: Iterate After Deployment

Use skill on real tasks → notice struggles → baseline test again → update → verify.

Claude Search Optimization (CSO)

Description field: Start with "Use when...", use concrete triggers/symptoms, describe the problem not language-specific symptoms, write in third person.

Keywords: Include error messages, symptoms, synonyms, and tool names Claude would search for.

Naming: Use active voice, verb-first, gerunds work well (e.g., creating-skills, testing-async-code).

Token efficiency targets: Frequently-loaded skills <500 words, others <1000 words, getting-started <200 words. Move large content to references, cross-reference other skills, use one excellent example over many mediocre ones.

Testing by Skill Type

Type	Test With	Success Criteria
Discipline	Academic questions + pressure scenarios (3+ combined: time, sunk cost, exhaustion)	Agent follows rule under maximum pressure
Technique	Application scenarios, variations, missing info tests	Agent applies technique to new scenario
Pattern	Recognition, application, counter-examples	Agent knows when/how to apply (and when NOT to)
Reference	Retrieval, application, gap testing	Agent finds and correctly applies information

Skill Creation Checklist

Use TodoWrite to create todos for EACH item.

RED Phase:

Create pressure scenarios (3+ combined pressures for discipline skills)
Run scenarios WITHOUT skill - document baseline verbatim
Identify patterns in rationalizations/failures

GREEN Phase:

Valid frontmatter: name (letters/numbers/hyphens), description ("Use when...", third person, <1024 chars)
Keywords throughout for search (errors, symptoms, tools)
Clear overview with core principle
Address specific baseline failures from RED
One excellent example (not multi-language)
Run scenarios WITH skill - verify compliance

REFACTOR Phase:

Add counters for new rationalizations
Build rationalization table and red flags list
Re-test until bulletproof

Quality Checks:

Overview answers: What? When? How?
Quick reference table or bullets
Common mistakes section
No narrative storytelling
Token count within targets

Deployment:

Commit skill to git
Update documentation if needed

Anti-Patterns

Narrative examples ("In session 2025-10-03...") — too specific, not reusable
Multi-language dilution (example-js, example-py, example-go) — mediocre quality, maintenance burden
Generic labels (helper1, step3) — use semantic names
Untested skills — guarantees issues in production
Vague descriptions ("Helps with coding") — Claude won't know when to activate

Integration with Dev-Workflow

Leverage existing skills: test-driven-development for testing, review for code review, documentation for docs, brainstorming for design exploration.

Activation context: Consider when skill activates in the workflow (planning, implementation, quality) and whether it auto-activates or requires explicit invocation.

Tool restrictions by phase:

Planning: Read, Write, Glob, Grep
Implementation: Read, Write, Edit, Glob, Grep, Bash
Quality: Read, Grep, Glob

The Bottom Line

Creating skills IS TDD for process documentation. Same Iron Law, same cycle (RED → GREEN → REFACTOR), same benefits. If you follow TDD for code, follow it for skills.

skill-maker