skill-judge
Skill Judge
Evaluate Agent Skills against official specifications and best practices using an 8-dimensional scoring framework (120 points total).
Decision Tree
Choose evaluation approach based on context:
| Evaluation Context | Primary Focus | When to Use |
|---|---|---|
| Quick review (5-10 min) | Description + Knowledge Delta | Initial screening, triage |
| Full audit (20-30 min) | All 8 dimensions | Comprehensive quality assessment |
| Compliance check (5 min) | Frontmatter + Description | Format validation only |
| Improvement guidance (30+ min) | All dimensions + detailed feedback | Skill optimization |
Workflow
Step 1: Load Evaluation Framework
MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read evaluation-guide.md completely from start to finish. NEVER set any range limits when reading this file.
Step 2: Quick Scan (5 minutes)
Read SKILL.md completely and identify:
- Skill type: Mindset (~50 lines), Navigation (~30 lines), Philosophy (~150 lines), Process (~200 lines), Tool (~300 lines)
- Line count: Is it appropriate for the type?
- Description quality: Does it have WHAT, WHEN, and keywords?
- Knowledge delta: Any obvious "explaining basics" sections?
Step 3: Dimension Evaluation (15-20 minutes)
Evaluate each dimension in order:
| Priority | Dimension | Points | Why This Order |
|---|---|---|---|
| 1 | D4: Specification Compliance (Description) | 15 | Poor description = skill never used |
| 2 | D1: Knowledge Delta | 20 | Core dimension - determines value |
| 3 | D7: Pattern Recognition | 10 | Sets expectations for structure |
| 4 | D5: Progressive Disclosure | 15 | Checks if references are used properly |
| 5 | D2: Mindset + Procedures | 15 | Evaluates thinking patterns |
| 6 | D3: Anti-Pattern Quality | 15 | Checks for NEVER lists |
| 7 | D6: Freedom Calibration | 15 | Matches freedom to task fragility |
| 8 | D8: Practical Usability | 15 | Can an Agent actually use it? |
Step 4: Score Calculation
Sum all dimension scores (max 120 points), calculate the percentage, and assign a grade. The grade floors correspond to 80%, 70%, 60%, and 50% of the maximum:
| Score Range | Grade | Interpretation |
|---|---|---|
| 96-120 | A | Excellent - Production ready |
| 84-95 | B | Good - Minor improvements needed |
| 72-83 | C | Acceptable - Moderate improvements needed |
| 60-71 | D | Poor - Significant improvements needed |
| <60 | F | Fail - Major redesign required |
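The calculation in Step 4 can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `grade_skill` and the short dimension keys `D1` to `D8` are assumptions made for the example, while the maximum points and grade floors are copied from the tables above.

```python
# Maximum points per dimension, taken from the Step 3 table.
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}

# Lower bound of each grade band, taken from the Step 4 table.
GRADE_FLOORS = [(96, "A"), (84, "B"), (72, "C"), (60, "D")]


def grade_skill(scores):
    """Return (total, percentage, grade) for a dict of dimension scores.

    `grade_skill` is a hypothetical helper name used only for this sketch.
    """
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly D1..D8")
    for dim, value in scores.items():
        if not 0 <= value <= MAX_POINTS[dim]:
            raise ValueError(f"{dim} score {value} is outside 0..{MAX_POINTS[dim]}")
    total = sum(scores.values())
    percentage = round(100 * total / sum(MAX_POINTS.values()), 1)
    # First floor the total reaches wins; anything below 60 is an F.
    grade = next((g for floor, g in GRADE_FLOORS if total >= floor), "F")
    return total, percentage, grade


example = {"D1": 16, "D2": 12, "D3": 10, "D4": 13,
           "D5": 11, "D6": 12, "D7": 8, "D8": 12}
print(grade_skill(example))  # (94, 78.3, 'B')
```

Comparing the raw total against the floors, rather than the rounded percentage, avoids a boundary score such as 95 being rounded into the next band.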
Step 5: Generate Report
MANDATORY - READ ENTIRE FILE: Before generating the report, you MUST read scoring-guide.md completely.
Output a structured report in this format:
# Skill Evaluation Report
## Overview
- **Skill**: [skill-name]
- **Type**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Total Score**: [X]/120 ([X]%)
- **Grade**: [A/B/C/D/F]
## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | [X] | 20 | [brief notes] |
| D2: Mindset + Procedures | [X] | 15 | [brief notes] |
| D3: Anti-Pattern Quality | [X] | 15 | [brief notes] |
| D4: Specification Compliance | [X] | 15 | [brief notes] |
| D5: Progressive Disclosure | [X] | 15 | [brief notes] |
| D6: Freedom Calibration | [X] | 15 | [brief notes] |
| D7: Pattern Recognition | [X] | 10 | [brief notes] |
| D8: Practical Usability | [X] | 15 | [brief notes] |
## Critical Issues (Must Fix)
1. [Issue 1]
2. [Issue 2]
## Improvement Suggestions (Should Fix)
1. [Suggestion 1]
2. [Suggestion 2]
## Strengths (Keep)
1. [Strength 1]
2. [Strength 2]
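If the report is assembled programmatically rather than written by hand, the template above could be filled in along these lines. This is a sketch under stated assumptions: `render_report`, `numbered`, and the argument layout are hypothetical names chosen for the example. Only the section titles, dimension labels, and maximum points come from the template.

```python
# Dimension labels and maximum points, in the order used by the report table.
DIMENSIONS = [
    ("D1: Knowledge Delta", 20),
    ("D2: Mindset + Procedures", 15),
    ("D3: Anti-Pattern Quality", 15),
    ("D4: Specification Compliance", 15),
    ("D5: Progressive Disclosure", 15),
    ("D6: Freedom Calibration", 15),
    ("D7: Pattern Recognition", 10),
    ("D8: Practical Usability", 15),
]


def numbered(items):
    """Format a list of strings as '1. ...', '2. ...' lines."""
    return [f"{i}. {item}" for i, item in enumerate(items, start=1)]


def render_report(name, skill_type, grade, scores, notes,
                  critical, suggestions, strengths):
    """Build the report text; scores and notes are lists aligned with DIMENSIONS."""
    total = sum(scores)
    max_total = sum(points for _, points in DIMENSIONS)
    lines = [
        "# Skill Evaluation Report",
        "## Overview",
        f"- **Skill**: {name}",
        f"- **Type**: {skill_type}",
        f"- **Total Score**: {total}/{max_total} ({100 * total / max_total:.0f}%)",
        f"- **Grade**: {grade}",
        "## Dimension Scores",
        "| Dimension | Score | Max | Notes |",
        "|-----------|-------|-----|-------|",
    ]
    for (label, points), score, note in zip(DIMENSIONS, scores, notes):
        lines.append(f"| {label} | {score} | {points} | {note} |")
    lines += ["## Critical Issues (Must Fix)", *numbered(critical)]
    lines += ["## Improvement Suggestions (Should Fix)", *numbered(suggestions)]
    lines += ["## Strengths (Keep)", *numbered(strengths)]
    return "\n".join(lines)
```

Passing the notes as a list parallel to the scores keeps every table row paired with an explanation, in line with the rule against scores without explanations.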
NEVER Do When Evaluating
Scoring Mistakes
- NEVER give high scores for "professional formatting" alone - content matters most
- NEVER ignore token waste - every redundant paragraph = deduction
- NEVER let length impress you - a 43-line skill can outperform a 500-line skill
- NEVER assume all procedures are valuable - distinguish domain-specific from generic
Evaluation Mistakes
- NEVER skip mentally testing decision trees - do they lead to correct choices?
- NEVER forgive explaining basics with "but it provides helpful context"
- NEVER overlook missing anti-patterns - no NEVER list = significant gap
- NEVER undervalue the description field - poor description = skill never used
Reporting Mistakes
- NEVER give vague feedback like "improve quality" - be specific
- NEVER suggest changes without explaining WHY
- NEVER provide scores without actionable improvement suggestions
Quick Reference
Knowledge Delta Red Flags (D1)
- "What is [basic concept]" sections
- Step-by-step tutorials for standard operations
- Explaining how to use common libraries
- Generic best practices ("write clean code")
- Definitions of industry-standard terms
Knowledge Delta Green Flags (D1)
- Decision trees for non-obvious choices
- Trade-offs only experts know
- Edge cases from real-world experience
- "NEVER do X because [non-obvious reason]"
- Domain-specific thinking frameworks
Anti-Pattern Quality (D3)
- Score 0-3: No anti-patterns mentioned
- Score 4-7: Generic warnings ("avoid errors")
- Score 8-11: Specific NEVER list with some reasoning
- Score 12-15: Expert-grade anti-patterns with WHY
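The D3 bands above amount to a simple lookup from score to interpretation. A minimal sketch, assuming integer scores; `describe_d3` is a hypothetical helper name, and the band labels are copied from the list:

```python
# D3 score bands, copied from the quick reference list.
D3_BANDS = [
    (range(0, 4), "No anti-patterns mentioned"),
    (range(4, 8), 'Generic warnings ("avoid errors")'),
    (range(8, 12), "Specific NEVER list with some reasoning"),
    (range(12, 16), "Expert-grade anti-patterns with WHY"),
]


def describe_d3(score):
    """Map an integer D3 score (0-15) to its band description."""
    for band, label in D3_BANDS:
        if score in band:
            return label
    raise ValueError("D3 score must be an integer from 0 to 15")


print(describe_d3(9))  # Specific NEVER list with some reasoning
```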
Description Quality (D4)
- Must answer: WHAT (functionality), WHEN (trigger scenarios), KEYWORDS (searchable terms)
- Poor: "Handles document-related functions" (vague, no triggers, no keywords)
- Excellent: "Comprehensive document creation, editing, and analysis. Use when Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying content, (3) Working with tracked changes"
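A crude first-pass screen for the WHAT / WHEN / KEYWORDS check might look like the sketch below. Everything in it is an assumption made for illustration: the trigger phrases, the five-word threshold for WHAT, and the use of file extensions or a `keywords:` label as a proxy for searchable terms are not prescribed by the source. It can flag obviously thin descriptions but does not replace reading them.

```python
import re

# Hypothetical trigger phrases that suggest a WHEN clause is present.
WHEN_MARKERS = ("use when", "invoke when", "should be used when", "when ")
# Hypothetical keyword proxy: a file extension such as ".docx".
EXTENSION = re.compile(r"\.[a-z0-9]{2,5}\b")


def check_description(description):
    """Return rough WHAT / WHEN / KEYWORDS flags for a description string."""
    text = " ".join(description.lower().split())
    positions = [text.find(m) for m in WHEN_MARKERS if m in text]
    has_when = bool(positions)
    # Text before the first trigger phrase is treated as the WHAT part.
    lead = text[: min(positions)] if has_when else text
    return {
        "what": len(lead.split()) >= 5,  # assumed threshold, not from the source
        "when": has_when,
        "keywords": "keywords:" in text or bool(EXTENSION.search(text)),
    }


excellent = (
    "Comprehensive document creation, editing, and analysis. Use when Claude "
    "needs to work with professional documents (.docx files) for: (1) Creating "
    "new documents, (2) Modifying content, (3) Working with tracked changes"
)
vague = "Helps with documents"  # hypothetical poor description

print(check_description(excellent))  # {'what': True, 'when': True, 'keywords': True}
print(check_description(vague))      # {'what': False, 'when': False, 'keywords': False}
```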
Pattern Recognition (D7)
| Pattern | ~Lines | When to Use |
|---|---|---|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct sub-scenarios |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on specific formats |
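The approximate line counts in the table can serve as a sanity check during the quick scan. The sketch below flags a skill whose length deviates strongly from the typical figure for its pattern. The 50% tolerance and the name `flag_length` are assumptions made for this example. The source gives only approximate targets, so a flag is a prompt to look closer, not a deduction in itself.

```python
# Typical line counts per pattern, taken from the table above.
TYPICAL_LINES = {
    "Mindset": 50,
    "Navigation": 30,
    "Philosophy": 150,
    "Process": 200,
    "Tool": 300,
}


def flag_length(skill_type, line_count, tolerance=0.5):
    """Compare a skill's line count with the typical length for its pattern.

    `tolerance` is an assumed threshold: 0.5 means more than 50% above or
    below the typical figure is flagged.
    """
    ratio = line_count / TYPICAL_LINES[skill_type]
    if ratio > 1 + tolerance:
        return "longer than typical"
    if ratio < 1 - tolerance:
        return "shorter than typical"
    return "within typical range"


print(flag_length("Mindset", 43))      # within typical range
print(flag_length("Navigation", 500))  # longer than typical
```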
Freedom Calibration (D6)
- High freedom: Creative/Design tasks (frontend-design)
- Medium freedom: Code review, judgment-based tasks
- Low freedom: File format operations (docx, pdf, xlsx)
Output Format
Always output the evaluation report in the structured format shown in Step 5. Include:
- Overview with total score and grade
- Dimension scores table with notes
- Critical issues (must fix)
- Improvement suggestions (should fix)
- Strengths (keep)
Do NOT output:
- Unstructured feedback
- Scores without explanations
- Generic comments without specific examples