# skill-test: Claude Code Skill Testing Framework

Automated testing and quality evaluation for Claude Code skills. Helps you discover, test, compare, and select the best skills for your needs.
## When to Use This Skill

Use this skill when you want to:

- **Discover skills**: Search for skills related to a specific topic or task
- **Evaluate quality**: Test a skill's output quality, speed, and reliability
- **Compare options**: Benchmark multiple skills side-by-side
- **Make decisions**: Get data-driven recommendations on which skill to use
## Trigger Phrases

- `测试 [topic] 相关skill` - Search and test skills related to a topic
- `测试 [skill-name]` - Test a specific skill directly
- `test [topic] skills` - English version of the topic search
- `test [skill-name]` - English version of the direct test
## How It Works

### 1. Skill Discovery

When you say "测试 写文章 相关skill":

- **Local search**: Scans installed skills from the system context
- **Remote search**: Queries skills.sh via `find-skills`
- **Merge & rank**: Combines results, removes duplicates, and ranks by install count
- **Present options**: Shows a table with skill names, sources, install counts, and descriptions
Example output:

```text
找到以下与「写文章」相关的 skill:

| # | skill 名称 | 来源 | 安装量 | 简介 |
|---|---|---|---|---|
| 1 | humanizer-zh | 本地已安装 | - | 去除 AI 生成痕迹 |
| 2 | latex-thesis-zh | find-skill | 315 | 中文 LaTeX 学术论文写作 |
| 3 | xiaohongshu-converter | find-skill | 269 | 小红书风格文章转换 |

请确认要测试哪些?(可多选,输入编号,如:1 2 3)
```
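The merge-and-rank step above could be sketched as follows. This is an illustrative sketch, not the skill's actual internals; the record shape (`name`, `source`, `installs` keys) is an assumption:

```python
# Hypothetical sketch of merge & rank: combine local and remote results,
# drop duplicates by name (local entries win), sort by install count.
def merge_and_rank(local, remote):
    seen = {}
    for skill in local + remote:            # local listed first, so it wins collisions
        seen.setdefault(skill["name"], skill)
    # Locally installed skills report no install count; treat missing as 0.
    return sorted(seen.values(), key=lambda s: s.get("installs", 0), reverse=True)

local = [{"name": "humanizer-zh", "source": "local"}]
remote = [
    {"name": "latex-thesis-zh", "source": "find-skills", "installs": 315},
    {"name": "humanizer-zh", "source": "find-skills", "installs": 420},
    {"name": "xiaohongshu-converter", "source": "find-skills", "installs": 269},
]
ranked = merge_and_rank(local, remote)
```

Preferring the local entry on a name collision matches the example table above, where an installed skill is shown with its local source rather than its remote install count.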
### 2. Test Execution

For each selected skill:

**Step 1: Setup**

- Creates directory: `[category]/[skill-name]/`
- Generates a realistic test input: `input.md`

**Step 2: Execution**

- Invokes the skill using the Skill tool
- Captures output and execution time
- Handles errors gracefully

**Step 3: Evaluation**

- Saves output to `output.*`
- Generates `REPORT.md` with:
  - 4-dimension scoring (1-5 scale)
  - Highlights and weaknesses
  - Conclusion and recommendations

**Step 4: Summary**

- Outputs a brief test summary
- Continues to the next skill
### 3. Report Generation

After all tests complete, generates `SUMMARY.md` with:

- **Comparison table**: All skills scored side-by-side
- **Key findings**: Major insights from testing
- **Recommendations**: Which skill to use for which scenario
- **Detailed analysis**: Strengths, weaknesses, and trade-offs
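A minimal sketch of how the comparison table in `SUMMARY.md` might be rendered, assuming each test produced a score and a one-line verdict (field names are illustrative):

```python
# Hypothetical SUMMARY.md table renderer: best score first.
def comparison_table(results):
    rows = sorted(results, key=lambda r: r["score"], reverse=True)
    lines = ["| Skill | Score | Verdict |", "|---|---|---|"]
    for r in rows:
        lines.append(f"| {r['skill']} | {r['score']}/20 | {r['verdict']} |")
    return "\n".join(lines)

table = comparison_table([
    {"skill": "xiaohongshu-converter", "score": 16, "verdict": "Niche use"},
    {"skill": "humanizer-zh", "score": 19, "verdict": "Recommended"},
])
```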
## Scoring Dimensions

Each skill is scored 1-5 on four dimensions:

| Dimension | What It Measures |
|---|---|
| Output Quality | Accuracy, completeness, and polish of results |
| Response Speed | Execution time and performance |
| Instruction Following | How well it adheres to its documented behavior |
| Practicality | Real-world usefulness and value |

**Total score**: 20 points maximum
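The scoring model reduces to a simple sum, since the four dimensions are weighted equally (an assumption; the document does not state weights). A minimal sketch:

```python
# Four dimensions, each scored 1-5, summed to a /20 total.
DIMENSIONS = ("output_quality", "response_speed",
              "instruction_following", "practicality")

def total_score(scores):
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5")
    return sum(scores[dim] for dim in DIMENSIONS)

# The 19/20 from Example 1 below corresponds to e.g. one 4 among three 5s.
example = {"output_quality": 5, "response_speed": 4,
           "instruction_following": 5, "practicality": 5}
```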
## Output Structure

```text
[category]/
├── [skill-name-1]/
│   ├── input.md      ← Generated test prompt
│   ├── output.*      ← Skill output
│   └── REPORT.md     ← Individual evaluation
├── [skill-name-2]/
│   ├── input.md
│   ├── output.*
│   └── REPORT.md
└── SUMMARY.md        ← Comparative analysis
```
## Example Usage

### Example 1: Find the best writing skill

Input:

```text
测试 写文章 相关skill
```

Process:

- Finds 8 writing-related skills
- User selects 3 to test
- Tests each with realistic writing tasks
- Generates a comparative report

Output:

- `writing/humanizer-zh/` - Score: 19/20
- `writing/xiaohongshu-converter/` - Score: 16/20
- `writing/wechat-converter/` - Score: 18/20
- `writing/SUMMARY.md` - Recommendation: use humanizer-zh for general use
### Example 2: Verify a specific skill

Input:

```text
测试 humanizer-zh
```

Process:

- Skips the discovery phase
- Generates a test case for AI text humanization
- Executes the skill and captures output
- Produces a detailed evaluation report

Output:

- `writing/humanizer-zh/REPORT.md` with full analysis
## Automation Rules

- ✅ Directories created automatically
- ✅ Test inputs generated based on each skill's purpose
- ✅ Outputs saved to files (not shown in chat)
- ✅ Progress updates after each test
- ✅ Summary generated when all tests complete
## Handling Special Cases

### Non-invocable Skills

Some skills have `user-invocable: false` and cannot be called directly. In that case, skill-test will:

- Read the skill's documentation
- Manually apply its instructions
- Note the limitation in the report
- Score "Response Speed" lower due to the manual process
### Failed Tests

If a skill fails or errors:

- The error is captured in the report
- Scoring reflects the failure
- Testing continues with the remaining skills
### Missing Skills

If a skill needs installation, skill-test:

- Prompts the user for confirmation
- Runs `npx skills add [package] -g -y`
- Proceeds with testing
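The confirm-then-install flow can be sketched as below. The `npx skills add` command line comes from the text; everything else (function names, the injectable `runner`) is illustrative, and the stub runner keeps the sketch from actually installing anything:

```python
import subprocess

def install_command(package):
    """The command line skill-test would run after user confirmation."""
    return ["npx", "skills", "add", package, "-g", "-y"]

def install_skill(package, confirmed, runner=subprocess.run):
    if not confirmed:                      # never install without consent
        return False
    runner(install_command(package), check=True)
    return True

# Dry run with a stub runner so nothing is actually installed.
calls = []
install_skill("humanizer-zh", True, runner=lambda cmd, check: calls.append(cmd))
```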
## Best Practices

- **Test multiple skills**: Compare at least 2-3 options for better insights
- **Review outputs**: Check the actual output files, not just the scores
- **Consider context**: A lower-scored skill might be better for your specific use case
- **Read summaries**: `SUMMARY.md` provides the most valuable insights
## Limitations

- Each skill is tested with only one scenario, which may not cover all of its features
- Scoring has subjective elements based on AI judgment
- Non-invocable skills require manual execution (slower and less automated)
- Test quality depends on the relevance of the generated `input.md`
## Commands Reference

| Command | Action |
|---|---|
| `测试 [topic] 相关skill` | Discover and test skills for a topic |
| `测试 [skill-name]` | Test a specific skill |
| `进度` | Show testing progress |
| `报告 [skill-name]` | View a skill's report |
| `对比 [skill1] [skill2]` | Compare two skills |
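The command dispatch implied by this table could be sketched with simple regexes. The real agent matches user messages in natural language, so this is only an approximation; pattern order matters so that the more specific `相关skill` form is tried before the bare skill-name form:

```python
import re

# Illustrative trigger patterns; the topic-search form must come first,
# or "测试 X 相关skill" would match the single-skill pattern instead.
PATTERNS = [
    (re.compile(r"^测试 (?P<topic>.+) 相关skill$"), "discover"),
    (re.compile(r"^测试 (?P<skill>\S+)$"), "test"),
    (re.compile(r"^进度$"), "progress"),
    (re.compile(r"^对比 (?P<a>\S+) (?P<b>\S+)$"), "compare"),
]

def dispatch(message):
    for pattern, action in PATTERNS:
        m = pattern.match(message)
        if m:
            return action, m.groupdict()
    return None, {}
```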
## Technical Details

- **Type**: Claude Code agent
- **Tools**: All tools available
- **Trigger**: User messages matching the patterns above
- **Execution**: Runs as a subagent via the Agent tool
- **State**: Stateless (each invocation is independent)
## Future Enhancements
- Multi-scenario testing per skill
- Custom test case support
- Performance benchmarking
- Regression testing for skill updates
- Export reports to JSON/CSV
**Version**: 1.0.0 | **Last Updated**: 2026-03-10 | **Compatibility**: Claude Code CLI
**Repository**: hardydou/hardy-skill