# skill-forge-review
Skill Review & Validation

## Process
### Step 1: Locate Skill Files

Accept input as:

- Path to a skill directory
- Skill name (searched in `~/.claude/skills/`)
- URL to a GitHub repository

Read all `.md` files, scripts, and asset files.
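The input-resolution logic above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation; the `~/.claude/skills/` search location comes from the step, while the function name and error handling are assumptions.

```python
from pathlib import Path

def resolve_skill_path(user_input: str) -> Path:
    """Resolve the three accepted input forms to a local skill directory.

    Hypothetical helper: a real implementation would also clone GitHub
    URLs to a temporary checkout before re-running the resolution.
    """
    p = Path(user_input).expanduser()
    if p.is_dir():
        # Form 1: a direct path to a skill directory
        return p
    if user_input.startswith(("http://", "https://")):
        # Form 3: a GitHub URL -- cloning is out of scope for this sketch
        raise NotImplementedError("clone the repo, then re-run on the checkout")
    # Form 2: a bare skill name, searched in ~/.claude/skills/
    candidate = Path.home() / ".claude" / "skills" / user_input
    if candidate.is_dir():
        return candidate
    raise FileNotFoundError(f"no skill found for input: {user_input!r}")
```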
### Step 2: Structure Validation

Run `python scripts/validate_skill.py <path>` for programmatic checks.

Manual verification:

- SKILL.md exists (exact case)
- No README.md inside the skill folder
- Folder name matches the `name` field
- Valid kebab-case naming (1-64 chars)
- No "claude" or "anthropic" in the name
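Most of these manual checks are mechanical enough to script. A minimal sketch, assuming the check runs against the skill directory itself (matching the folder name against the frontmatter `name` field would additionally require parsing SKILL.md):

```python
import re
from pathlib import Path

# kebab-case: lowercase alphanumerics separated by single hyphens,
# no leading or trailing hyphen
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def check_structure(skill_dir: Path) -> list:
    """Return a list of structure violations (empty list = pass)."""
    issues = []
    name = skill_dir.name
    if not (skill_dir / "SKILL.md").is_file():
        issues.append("SKILL.md missing (filename must be exact case)")
    if (skill_dir / "README.md").exists():
        issues.append("README.md must not live inside the skill folder")
    if not NAME_RE.fullmatch(name) or not (1 <= len(name) <= 64):
        issues.append(f"folder name {name!r} is not valid kebab-case (1-64 chars)")
    if "claude" in name or "anthropic" in name:
        issues.append('name must not contain "claude" or "anthropic"')
    return issues
```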
### Step 3: Frontmatter Audit

| Check | Pass Criteria |
|---|---|
| Name format | kebab-case, 1-64 chars, no leading/trailing hyphens |
| Description present | Non-empty, 1-1024 characters |
| Description has WHAT | Explains capabilities |
| Description has WHEN | Includes trigger phrases |
| Description has keywords | Domain-specific terms included |
| No XML tags | No < or > characters |
| Optional fields valid | license, compatibility (<500 chars), metadata |
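The purely mechanical rows of this table can be automated; the WHAT/WHEN/keyword rows need human (or model) judgment, so they are omitted here. A sketch, assuming the YAML frontmatter has already been parsed into a dict:

```python
import re

def audit_frontmatter(fm: dict) -> dict:
    """Run the mechanical frontmatter checks; keys mirror the table rows.

    Illustrative only -- semantic checks (WHAT, WHEN, keywords) are left
    to the reviewer.
    """
    name = fm.get("name") or ""
    desc = fm.get("description") or ""
    return {
        # kebab-case, 1-64 chars, no leading/trailing hyphens
        "name_format": bool(re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", name))
                       and 1 <= len(name) <= 64,
        # non-empty, 1-1024 characters
        "description_present": 1 <= len(desc) <= 1024,
        # no XML-like angle brackets
        "no_xml_tags": "<" not in desc and ">" not in desc,
        # optional compatibility field under 500 chars
        "compatibility_valid": len(str(fm.get("compatibility", ""))) < 500,
    }
```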
### Step 4: Triggering Analysis

Assess the description for activation quality:
Under-triggering risks:
- Too generic ("Helps with projects")
- Missing common paraphrases
- No domain keywords
- Missing file type mentions (if relevant)
Over-triggering risks:
- Too broad ("Processes documents")
- Overlaps with built-in Claude capabilities
- Missing negative triggers for disambiguation
Generate test queries:
- 5 queries that SHOULD trigger the skill
- 5 queries that SHOULD NOT trigger
- 3 edge cases (ambiguous queries)
### Step 5: Instruction Quality

| Criterion (score 0-10) | What to check |
|---|---|
| Specificity | Are instructions actionable? (not "validate properly") |
| Completeness | All workflows covered? |
| Error handling | Common failures addressed? |
| Examples | Concrete examples provided? |
| Progressive disclosure | Detailed docs in references/ not SKILL.md? |
| Length | Under 500 lines / 5000 tokens? |
| Cross-references | Clear links to references/scripts? |
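The length criterion is the one row that can be checked mechanically. A small sketch; the ~4-characters-per-token ratio is an assumption (use a real tokenizer for an exact figure):

```python
from pathlib import Path

def length_check(skill_md: Path) -> dict:
    """Check the length criterion: under 500 lines and roughly 5000 tokens.

    Token count is approximated at ~4 characters per token.
    """
    text = skill_md.read_text(encoding="utf-8")
    lines = text.count("\n") + 1
    approx_tokens = len(text) // 4
    return {
        "lines": lines,
        "approx_tokens": approx_tokens,
        "within_budget": lines < 500 and approx_tokens < 5000,
    }
```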
### Step 6: Architecture Review (Multi-skill)

For skills with sub-skills:

- Main skill has a clear routing table
- Sub-skills have focused responsibilities
- Cross-references are valid (files exist)
- Naming follows the `parent-child` convention
- Shared references live in the parent, not duplicated
- Agents have clear roles (if Tier 4)
### Step 7: Script Quality (if present)

- Docstrings with purpose, input, output
- CLI interface (argparse or similar)
- Structured output (JSON)
- Error handling (try/except with clear messages)
- No hardcoded paths or secrets
- Minimal dependencies
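A skeleton that satisfies the checklist above might look like this. The filename and behavior are illustrative, not one of the real skill-forge scripts:

```python
"""check_file.py -- illustrative skeleton only.

Purpose: report the size of one input file.
Input:   a file path (CLI argument).
Output:  a JSON object on stdout.
"""
import argparse
import json
import sys
from pathlib import Path

def main(argv=None) -> int:
    # CLI interface via argparse; docstring doubles as help text
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("path", type=Path, help="file to check")
    args = parser.parse_args(argv)
    try:
        # No hardcoded paths or secrets: every input arrives via the CLI
        size = args.path.stat().st_size
        # Structured JSON output for downstream tooling
        print(json.dumps({"ok": True, "bytes": size}))
        return 0
    except OSError as exc:
        # Clear, structured error message instead of a raw traceback
        print(json.dumps({"ok": False, "error": str(exc)}), file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```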
### Step 8: Generate Skill Health Score

Scoring methodology (0-100):
| Category | Weight | Checks |
|---|---|---|
| Frontmatter Quality | 25% | Name, description, format |
| Trigger Accuracy | 20% | WHAT + WHEN + keywords |
| Instruction Quality | 25% | Specificity, completeness, examples |
| Structure Compliance | 15% | File naming, organization, references |
| Script Quality | 10% | If applicable (full marks if no scripts needed) |
| Progressive Disclosure | 5% | Proper use of 3-level system |
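The weighting above combines into a single 0-100 score. A sketch, assuming each category has been rated 0.0-1.0 by the earlier steps; per the table, a missing category (e.g. no scripts needed) defaults to full marks:

```python
# Weights from the scoring table (must sum to 1.0)
WEIGHTS = {
    "frontmatter": 0.25,
    "triggering": 0.20,
    "instructions": 0.25,
    "structure": 0.15,
    "scripts": 0.10,
    "disclosure": 0.05,
}

def health_score(category_scores: dict) -> int:
    """Combine per-category ratings (0.0-1.0) into a 0-100 health score.

    Categories absent from category_scores get full marks, matching the
    'full marks if no scripts needed' rule.
    """
    total = sum(WEIGHTS[c] * category_scores.get(c, 1.0) for c in WEIGHTS)
    return round(total * 100)
```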
### Step 9: Generate Trigger Eval Set

After reviewing, generate a structured trigger eval set for ongoing testing:
- Run `python scripts/generate_eval_set.py <path>` to auto-generate a starter set
- Review and refine the generated queries:
  - Ensure 8-10 should-trigger queries cover different phrasings and edge cases
  - Ensure 8-10 should-not-trigger queries are near-misses (not obviously irrelevant)
  - Include casual speech, typos, and uncommon domain uses in the should-trigger set
- Save the eval set to `evals/evals.json` in the skill directory. Good queries are realistic and specific (include file paths, context, domain details); bad queries are overly generic ("format this data") or obviously irrelevant.
- Run `python scripts/optimize_description.py <path> --eval-set evals/evals.json` to score the current description and get improvement suggestions
- Recommend running `/skill-forge eval <path>` for full functional evaluation
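The steps above can be sketched end to end. The eval-set schema here is hypothetical (the real shape produced by `generate_eval_set.py` may differ); the example queries follow the realistic-and-specific guidance, including a casual-speech typo and a near-miss:

```python
import json
from pathlib import Path

# Hypothetical eval-set shape -- treat as an illustration of the
# should-trigger / should-not-trigger split, not the canonical schema.
eval_set = {
    "skill": "skill-forge-review",
    "should_trigger": [
        "Review the skill in ~/.claude/skills/pdf-tools and give it a health score",
        "can u audit my SKILL.md frontmater for trigger problems",  # casual + typo
    ],
    "should_not_trigger": [
        "Review this pull request for style issues",  # near-miss: 'review', but not a skill
    ],
}

def save_eval_set(skill_dir: Path, data: dict) -> Path:
    """Write the eval set to evals/evals.json inside the skill directory."""
    out = skill_dir / "evals" / "evals.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return out
```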
### Step 10: Generate Report

```markdown
# Skill Review: [name]

## Health Score: [X]/100

## Summary

[2-3 sentence assessment]

## Scores by Category

| Category | Score | Notes |
|----------|-------|-------|
| Frontmatter | X/25 | [issues] |
| Triggering | X/20 | [issues] |
| Instructions | X/25 | [issues] |
| Structure | X/15 | [issues] |
| Scripts | X/10 | [issues] |
| Disclosure | X/5 | [issues] |

## Critical Issues (fix immediately)

- [issue 1]
- [issue 2]

## High Priority (fix within 1 week)

- [issue 1]

## Recommendations

- [suggestion 1]
- [suggestion 2]

## Suggested Test Queries

### Should Trigger

1. [query]
2. [query]
3. [query]

### Should NOT Trigger

1. [query]
2. [query]
3. [query]
```