Skill Judge

Evaluate Agent Skills against official specifications and best practices using an 8-dimensional scoring framework (120 points total).

Decision Tree

Choose evaluation approach based on context:

| Evaluation Context | Primary Focus | When to Use |
|--------------------|---------------|-------------|
| Quick review (5-10 min) | Description + Knowledge Delta | Initial screening, triage |
| Full audit (20-30 min) | All 8 dimensions | Comprehensive quality assessment |
| Compliance check (5 min) | Frontmatter + Description | Format validation only |
| Improvement guidance (30+ min) | All dimensions + detailed feedback | Skill optimization |

Workflow

Step 1: Load Evaluation Framework

MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read evaluation-guide.md completely from start to finish. NEVER set any range limits when reading this file.

Step 2: Quick Scan (5 minutes)

Read SKILL.md completely and identify:

  • Skill type: Mindset (~50 lines), Navigation (~30 lines), Philosophy (~150 lines), Process (~200 lines), Tool (~300 lines)
  • Line count: Is it appropriate for the type?
  • Description quality: Does it have WHAT, WHEN, and keywords?
  • Knowledge delta: Any obvious "explaining basics" sections?
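The line-count check above can be sketched as a small helper. This is a minimal illustration, not part of the framework: the function names and the 50% tolerance are assumptions chosen for the example, and only the typical lengths come from the list above.

```python
# Typical lengths per skill type, copied from the Quick Scan list.
TYPICAL_LINES = {
    "Mindset": 50,
    "Navigation": 30,
    "Philosophy": 150,
    "Process": 200,
    "Tool": 300,
}


def line_count_ratio(skill_type: str, line_count: int) -> float:
    """Return the actual length divided by the typical length for the type."""
    return line_count / TYPICAL_LINES[skill_type]


def is_length_plausible(skill_type: str, line_count: int,
                        tolerance: float = 0.5) -> bool:
    """True when the length is within +/- tolerance of the typical length.

    The default tolerance is an assumption for illustration only.
    """
    ratio = line_count_ratio(skill_type, line_count)
    return (1 - tolerance) <= ratio <= (1 + tolerance)
```

A result of `False` is only a prompt to look closer; the evaluator still judges whether the extra (or missing) lines carry real knowledge.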

Step 3: Dimension Evaluation (15-20 minutes)

Evaluate each dimension in order:

| Priority | Dimension | Points | Why This Order |
|----------|-----------|--------|----------------|
| 1 | D4: Specification Compliance (Description) | 15 | Poor description = skill never used |
| 2 | D1: Knowledge Delta | 20 | Core dimension - determines value |
| 3 | D7: Pattern Recognition | 10 | Sets expectations for structure |
| 4 | D5: Progressive Disclosure | 15 | Checks if references are used properly |
| 5 | D2: Mindset + Procedures | 15 | Evaluates thinking patterns |
| 6 | D3: Anti-Pattern Quality | 15 | Checks for NEVER lists |
| 7 | D6: Freedom Calibration | 15 | Matches freedom to task fragility |
| 8 | D8: Practical Usability | 15 | Can Agent actually use it? |

Step 4: Score Calculation

Sum all dimension scores (max 120 points). Calculate percentage and assign grade:

| Score Range | Grade | Interpretation |
|-------------|-------|----------------|
| 96-120 | A | Excellent - Production ready |
| 84-95 | B | Good - Minor improvements needed |
| 72-83 | C | Acceptable - Moderate improvements needed |
| 60-71 | D | Poor - Significant improvements needed |
| <60 | F | Fail - Major redesign required |
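The arithmetic in this step can be sketched as follows. The per-dimension maxima and grade thresholds are taken from the tables in this document; the function names and the validation behavior are assumptions made for the example.

```python
# Maximum points per dimension, from the Dimension Evaluation table (sums to 120).
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}

# Lowest total that earns each grade, from the grade table; below 60 is F.
GRADE_FLOORS = [(96, "A"), (84, "B"), (72, "C"), (60, "D")]


def total_score(scores: dict) -> int:
    """Sum dimension scores after checking that each one is present and in range."""
    missing = set(MAX_POINTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim, points in scores.items():
        if not 0 <= points <= MAX_POINTS[dim]:
            raise ValueError(f"{dim} must be between 0 and {MAX_POINTS[dim]}")
    return sum(scores.values())


def percentage(total: int) -> float:
    """Express the total as a percentage of the 120-point maximum."""
    return round(100 * total / sum(MAX_POINTS.values()), 1)


def grade(total: int) -> str:
    """Map a point total to its letter grade."""
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    return "F"
```

Note that the grade boundaries are defined in points, so a total of 96 corresponds to 80% and a total of 60 to 50%.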

Step 5: Generate Report

MANDATORY - READ ENTIRE FILE: Before generating report, you MUST read scoring-guide.md completely.

Output structured report in this format:

# Skill Evaluation Report

## Overview
- **Skill**: [skill-name]
- **Type**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Total Score**: [X]/120 ([X]%)
- **Grade**: [A/B/C/D/F]

## Dimension Scores

| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | [X] | 20 | [brief notes] |
| D2: Mindset + Procedures | [X] | 15 | [brief notes] |
| D3: Anti-Pattern Quality | [X] | 15 | [brief notes] |
| D4: Specification Compliance | [X] | 15 | [brief notes] |
| D5: Progressive Disclosure | [X] | 15 | [brief notes] |
| D6: Freedom Calibration | [X] | 15 | [brief notes] |
| D7: Pattern Recognition | [X] | 10 | [brief notes] |
| D8: Practical Usability | [X] | 15 | [brief notes] |

## Critical Issues (Must Fix)
1. [Issue 1]
2. [Issue 2]

## Improvement Suggestions (Should Fix)
1. [Suggestion 1]
2. [Suggestion 2]

## Strengths (Keep)
1. [Strength 1]
2. [Strength 2]
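If the report is assembled programmatically, the Dimension Scores table from the template above could be rendered along these lines. This is a hedged sketch: `render_dimension_table` and its `scores`/`notes` arguments are hypothetical names, and only the row labels and maxima come from the template.

```python
# Row labels and maxima in the order used by the report template.
DIMENSIONS = [
    ("D1: Knowledge Delta", 20),
    ("D2: Mindset + Procedures", 15),
    ("D3: Anti-Pattern Quality", 15),
    ("D4: Specification Compliance", 15),
    ("D5: Progressive Disclosure", 15),
    ("D6: Freedom Calibration", 15),
    ("D7: Pattern Recognition", 10),
    ("D8: Practical Usability", 15),
]


def render_dimension_table(scores: dict, notes: dict) -> str:
    """Build the markdown table; scores and notes are keyed by 'D1'..'D8'."""
    lines = [
        "| Dimension | Score | Max | Notes |",
        "|-----------|-------|-----|-------|",
    ]
    for name, max_points in DIMENSIONS:
        key = name.split(":")[0]  # "D1: Knowledge Delta" -> "D1"
        note = notes.get(key, "")  # a missing note leaves the cell empty
        lines.append(f"| {name} | {scores[key]} | {max_points} | {note} |")
    return "\n".join(lines)
```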

NEVER Do When Evaluating

Scoring Mistakes

  • NEVER give high scores for "professional formatting" alone - content matters most
  • NEVER ignore token waste - every redundant paragraph = deduction
  • NEVER let length impress you - a 43-line skill can outperform a 500-line skill
  • NEVER assume all procedures are valuable - distinguish domain-specific from generic

Evaluation Mistakes

  • NEVER skip mentally testing decision trees - do they lead to correct choices?
  • NEVER forgive explaining basics with "but it provides helpful context"
  • NEVER overlook missing anti-patterns - no NEVER list = significant gap
  • NEVER undervalue description field - poor description = skill never used

Reporting Mistakes

  • NEVER give vague feedback like "improve quality" - be specific
  • NEVER suggest changes without explaining WHY
  • NEVER provide scores without actionable improvement suggestions

Quick Reference

Knowledge Delta Red Flags (D1)

  • "What is [basic concept]" sections
  • Step-by-step tutorials for standard operations
  • Explaining how to use common libraries
  • Generic best practices ("write clean code")
  • Definitions of industry-standard terms

Knowledge Delta Green Flags (D1)

  • Decision trees for non-obvious choices
  • Trade-offs only experts know
  • Edge cases from real-world experience
  • "NEVER do X because [non-obvious reason]"
  • Domain-specific thinking frameworks

Anti-Pattern Quality (D3)

  • Score 0-3: No anti-patterns mentioned
  • Score 4-7: Generic warnings ("avoid errors")
  • Score 8-11: Specific NEVER list with some reasoning
  • Score 12-15: Expert-grade anti-patterns with WHY

Description Quality (D4)

  • Must answer: WHAT (functionality), WHEN (trigger scenarios), KEYWORDS (searchable terms)
  • Poor: "Handles document-related functions" (vague, no triggers, no keywords)
  • Excellent: "Comprehensive document creation, editing, and analysis. Use when Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying content, (3) Working with tracked changes"
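A rough first-pass screen for the WHAT / WHEN / KEYWORDS criteria might look like the sketch below. The heuristics are assumptions invented for this example (a "when" trigger word, a file extension or numbered scenario as a keyword signal, at least three words of functionality text); they are not defined by any specification and cannot replace reading the description.

```python
import re


def check_description(description: str) -> dict:
    """Heuristically flag which of WHAT, WHEN, and KEYWORDS appear to be present."""
    text = description.strip()

    # WHEN: the description names a trigger scenario (assumed marker: "when").
    has_when = bool(re.search(r"\bwhen\b", text, re.IGNORECASE))

    # WHAT: some functionality text precedes the trigger phrase (assumed: >= 3 words).
    before_trigger = re.split(r"\buse when\b", text, flags=re.IGNORECASE)[0]
    has_what = len(before_trigger.split()) >= 3

    # KEYWORDS: a standalone file extension such as ".docx", or numbered scenarios "(1)".
    has_keywords = bool(
        re.search(r"(?<!\w)\.[a-z0-9]{2,5}\b", text) or re.search(r"\(\d+\)", text)
    )

    return {"what": has_what, "when": has_when, "keywords": has_keywords}
```

Applied to the two examples above, the excellent description passes all three checks, while the poor one is flagged for missing triggers and keywords.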

Pattern Recognition (D7)

| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct sub-scenarios |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on specific formats |

Freedom Calibration (D6)

  • High freedom: Creative/Design tasks (frontend-design)
  • Medium freedom: Code review, judgment-based tasks
  • Low freedom: File format operations (docx, pdf, xlsx)

Output Format

Always output the evaluation report in the structured format shown in Step 5. Include:

  • Overview with total score and grade
  • Dimension scores table with notes
  • Critical issues (must fix)
  • Improvement suggestions (should fix)
  • Strengths (keep)

Do NOT output:

  • Unstructured feedback
  • Scores without explanations
  • Generic comments without specific examples