# skill-test: Claude Code Skill Testing Framework

Automated testing and quality evaluation for Claude Code skills. Helps you discover, test, compare, and select the best skills for your needs.

## When to Use This Skill

Use this skill when you want to:

  • Discover skills: Search for skills related to a specific topic or task
  • Evaluate quality: Test a skill's output quality, speed, and reliability
  • Compare options: Benchmark multiple skills side-by-side
  • Make decisions: Get data-driven recommendations on which skill to use

## Trigger Phrases

  • 测试 [topic] 相关skill - Search and test skills related to a topic
  • 测试 [skill-name] - Test a specific skill directly
  • test [topic] skills - English version of topic search
  • test [skill-name] - English version of direct test

## How It Works

### 1. Skill Discovery

When you say "测试 写文章 相关skill" ("test skills related to article writing"):

  1. Local search: Scans installed skills from system context
  2. Remote search: Queries skills.sh via find-skills
  3. Merge & rank: Combines results, removes duplicates, ranks by install count
  4. Present options: Shows a table with skill names, sources, install counts, and descriptions
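The merge-and-rank step above can be sketched as follows. This is a minimal sketch: the dict shapes, field names, and the locally-installed-first ordering are assumptions inferred from the example output, not the skill's actual data model.

```python
def merge_and_rank(local, remote):
    """Combine local and remote search results, drop duplicates by name,
    and rank: locally installed skills first, then remote skills by
    install count, descending."""
    seen = {}
    for skill in local + remote:
        seen.setdefault(skill["name"], skill)  # first occurrence wins
    return sorted(
        seen.values(),
        key=lambda s: (s["source"] != "local", -s.get("installs", 0)),
    )

# Illustrative data mirroring the example output below.
local = [{"name": "humanizer-zh", "source": "local"}]
remote = [
    {"name": "latex-thesis-zh", "source": "find-skills", "installs": 315},
    {"name": "humanizer-zh", "source": "find-skills", "installs": 500},
    {"name": "xiaohongshu-converter", "source": "find-skills", "installs": 269},
]
ranked = merge_and_rank(local, remote)
```

The duplicate remote entry for `humanizer-zh` is dropped in favor of the locally installed copy, which sorts first regardless of install count.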

Example output:

Found the following skills related to "写文章" (article writing):

| # | Skill | Source | Installs | Description |
|---|---|---|---|---|
| 1 | humanizer-zh | Installed locally | - | Removes traces of AI generation |
| 2 | latex-thesis-zh | find-skills | 315 | Chinese LaTeX academic thesis writing |
| 3 | xiaohongshu-converter | find-skills | 269 | Xiaohongshu-style article conversion |

Which would you like to test? (Multiple selections allowed; enter the numbers, e.g. 1 2 3)

### 2. Test Execution

For each selected skill:

#### Step 1: Setup

  • Creates directory: [category]/[skill-name]/
  • Generates realistic test input: input.md

#### Step 2: Execution

  • Invokes the skill using the Skill tool
  • Captures output and execution time
  • Handles errors gracefully

#### Step 3: Evaluation

  • Saves output to output.*
  • Generates REPORT.md with:
    • 4-dimension scoring (1-5 scale)
    • Highlights and weaknesses
    • Conclusion and recommendations

#### Step 4: Summary

  • Outputs brief test summary
  • Continues to next skill
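The per-skill flow above can be sketched as one helper. This is a minimal sketch under stated assumptions: `invoke` stands in for the real Skill-tool call, the generated prompt is a placeholder, and error capture follows the "testing continues" behavior described under Failed Tests.

```python
import time
from pathlib import Path

def run_test(category, skill_name, invoke, test_input="# Test input\n"):
    """Set up the test directory, invoke the skill, and capture its
    output, timing, and any error so the loop can move on to the
    next skill instead of aborting."""
    skill_dir = Path(category) / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    (skill_dir / "input.md").write_text(test_input)  # generated prompt

    start = time.monotonic()
    try:
        output, error = invoke(), None
    except Exception as exc:  # a failed test is recorded, not fatal
        output, error = "", str(exc)
    elapsed = time.monotonic() - start

    (skill_dir / "output.md").write_text(output)
    return {"skill": skill_name, "seconds": elapsed, "error": error}

# Stand-in for a real Skill-tool invocation:
result = run_test("writing", "humanizer-zh", lambda: "rewritten text")
```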

### 3. Report Generation

After all tests complete, generates SUMMARY.md with:

  • Comparison table: All skills scored side-by-side
  • Key findings: Major insights from testing
  • Recommendations: Which skill to use for which scenario
  • Detailed analysis: Strengths, weaknesses, and trade-offs

## Scoring Dimensions

Each skill is scored 1-5 on four dimensions:

| Dimension | What It Measures |
|---|---|
| Output Quality | Accuracy, completeness, and polish of results |
| Response Speed | Execution time and performance |
| Instruction Following | How well it adheres to its documented behavior |
| Practicality | Real-world usefulness and value |

Total Score: 20 points maximum
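Assuming the total is a plain sum of the four dimensions (the source states only the 1-5 scale and the 20-point maximum), the scoring could look like:

```python
DIMENSIONS = ("Output Quality", "Response Speed",
              "Instruction Following", "Practicality")

def total_score(scores):
    """Sum the four 1-5 dimension scores into the 20-point total."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1-5")
    return sum(scores[dim] for dim in DIMENSIONS)

score = total_score({
    "Output Quality": 5,
    "Response Speed": 4,
    "Instruction Following": 5,
    "Practicality": 5,
})  # 19/20
```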

## Output Structure

```
[category]/
├── [skill-name-1]/
│   ├── input.md      ← Generated test prompt
│   ├── output.*      ← Skill output
│   └── REPORT.md     ← Individual evaluation
├── [skill-name-2]/
│   ├── input.md
│   ├── output.*
│   └── REPORT.md
└── SUMMARY.md        ← Comparative analysis
```

## Example Usage

### Example 1: Find the best writing skill

Input:

测试 写文章 相关skill

Process:

  1. Finds 8 writing-related skills
  2. User selects 3 to test
  3. Tests each with realistic writing tasks
  4. Generates comparative report

Output:

  • writing/humanizer-zh/ - Score: 19/20
  • writing/xiaohongshu-converter/ - Score: 16/20
  • writing/wechat-converter/ - Score: 18/20
  • writing/SUMMARY.md - Recommendation: Use humanizer-zh for general use

### Example 2: Verify a specific skill

Input:

测试 humanizer-zh

Process:

  1. Skips discovery phase
  2. Generates test case for AI text humanization
  3. Executes skill and captures output
  4. Produces detailed evaluation report

Output:

  • writing/humanizer-zh/REPORT.md with full analysis

## Automation Rules

  • ✅ Directories created automatically
  • ✅ Test inputs generated based on skill purpose
  • ✅ Outputs saved to files (not shown in chat)
  • ✅ Progress updates after each test
  • ✅ Summary generated when all tests complete

## Handling Special Cases

### Non-invocable Skills

Some skills have user-invocable: false and cannot be called directly.

Behavior: skill-test will:

  1. Read the skill's documentation
  2. Manually apply its instructions
  3. Note the limitation in the report
  4. Score "Response Speed" lower to reflect the manual process
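Detecting the flag might look like the following minimal sketch. Everything here is an assumption: the real skill reads its system context rather than parsing files, and this simple line scan over `---`-delimited frontmatter stands in for a proper YAML parser.

```python
def is_user_invocable(skill_md: str) -> bool:
    """Return False if the SKILL.md frontmatter sets `user-invocable: false`."""
    in_frontmatter = False
    for line in skill_md.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if in_frontmatter:
                break  # closing delimiter: flag not found
            in_frontmatter = True
        elif in_frontmatter and stripped.replace(" ", "") == "user-invocable:false":
            return False
    return True
```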

### Failed Tests

If a skill fails or errors:

  1. Error is captured in the report
  2. Scoring reflects the failure
  3. Testing continues with remaining skills

### Missing Skills

If a skill needs installation:

  1. Prompts user for confirmation
  2. Runs npx skills add [package] -g -y
  3. Proceeds with testing
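A sketch of the confirm-then-install flow. Only the `npx skills add [package] -g -y` command comes from the source; the `confirm` callable and the return convention are illustrative.

```python
import subprocess

def install_skill(package: str, confirm) -> bool:
    """Ask for confirmation, then run the install command shown above.

    `confirm` is any callable taking a prompt string and returning a
    bool (e.g. a yes/no prompt to the user).
    """
    if not confirm(f"Install {package} before testing?"):
        return False
    result = subprocess.run(
        ["npx", "skills", "add", package, "-g", "-y"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```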

## Best Practices

  1. Test multiple skills: Compare at least 2-3 options for better insights
  2. Review outputs: Check the actual output files, not just scores
  3. Consider context: A lower-scored skill might be better for your specific use case
  4. Read summaries: The SUMMARY.md provides the most valuable insights

## Limitations

  • Each skill is tested with only one scenario (may not cover all features)
  • Scoring has subjective elements based on AI judgment
  • Non-invocable skills require manual execution (slower, less automated)
  • Test quality depends on generated input.md relevance

## Commands Reference

| Command | Action |
|---|---|
| 测试 [topic] 相关skill | Discover and test skills for a topic |
| 测试 [skill-name] | Test a specific skill |
| 进度 | Show testing progress |
| 报告 [skill-name] | View a skill's report |
| 对比 [skill1] [skill2] | Compare two skills |

## Technical Details

  • Type: Claude Code Agent
  • Tools: All tools available
  • Trigger: User message matching patterns
  • Execution: Runs as subagent via Agent tool
  • State: Stateless (each invocation is independent)

## Future Enhancements

  • Multi-scenario testing per skill
  • Custom test case support
  • Performance benchmarking
  • Regression testing for skill updates
  • Export reports to JSON/CSV

Version: 1.0.0
Last Updated: 2026-03-10
Compatibility: Claude Code CLI
