---
name: skill-evaluator
description: Evaluate Agent Skills against the agentskills.io specification and best practices.
---

# Skill Evaluator
## Core Philosophy
**Good Skill = Expert-only Knowledge − What the LLM Already Knows**
| Knowledge Type | Treatment | Example |
|---|---|---|
| Expert (LLM doesn't know) | Keep — this is value | "mediabox not cropbox for PDF size" |
| Activation (LLM may forget) | Keep if brief | "Always validate XML before packing" |
| Redundant (LLM knows) | Delete — token waste | "What is a PDF file" |
## Before Evaluating, Ask Yourself
- "Does this skill capture knowledge that took someone years to learn?" — If no, low D1 score.
- "Would an expert read this and nod, or roll their eyes?" — Eye-roll = redundant content.
- "After reading, can the LLM do something it couldn't before?" — If just faster/reminded, marginal value.
## Mode Selection
| Mode | When to Use | Time | Input | Output |
|---|---|---|---|---|
| Static | Quick check, bulk screening, first pass | ~2 min | SKILL.md only | /60 |
| Semi-Static | Install decision, fit check | ~5 min | + Environment/User info | /100 |
| Full | Production deploy, security audit | ~15 min | + Complete package | /130 |
Default workflow: Static first → if score >60% AND installation is being considered → Semi-Static → if deploying to production → Full.
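This workflow is mechanical enough to encode. A minimal sketch in Python; the function and argument names are illustrative, not part of any spec:

```python
def select_mode(static_score: int, considering_install: bool,
                deploying_to_prod: bool, static_max: int = 60) -> str:
    """Walk the default workflow: Static -> Semi-Static -> Full."""
    if deploying_to_prod:
        return "full"         # production deploys always get the /130 audit
    if static_score / static_max > 0.60 and considering_install:
        return "semi-static"  # add environment + user fit, /100
    return "static"           # the quick /60 screen is enough
```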
## Mode 1: Static Analysis
Input: SKILL.md only
Output: Gate check (Pass/Fail) + Quality Score /60
### Gate Check (All Must Pass)
Any gate failure = immediate reject. Do not proceed to scoring.
| Gate | Requirement | Common Failures |
|---|---|---|
| G1: YAML | Valid frontmatter with name + description | Missing ---, no description |
| G2: Name | 1-64 chars, [a-z0-9-] only, no reserved words | Uppercase, claude-, anthropic- |
| G3: Description | 1-1024 chars, third-person, no placeholders | Uses "I/you/my", contains "TODO" |
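A minimal sketch of the three gates, assuming the frontmatter has already been parsed; the regex and length bounds mirror the table, and the reserved-prefix list comes from the common-failures column:

```python
import re

RESERVED_PREFIXES = ("claude-", "anthropic-")  # reserved words (G2)

def gate_check(name: str | None, description: str | None) -> list[str]:
    """Return gate failures; an empty list means all gates pass."""
    failures = []
    # G1: frontmatter must carry both fields
    if not name or not description:
        failures.append("G1: missing name or description in YAML frontmatter")
    # G2: 1-64 chars of [a-z0-9-], no reserved prefixes
    if name and not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        failures.append("G2: name must be 1-64 chars of [a-z0-9-]")
    if name and name.startswith(RESERVED_PREFIXES):
        failures.append("G2: name uses a reserved prefix")
    # G3: 1-1024 chars, no placeholders (third-person voice needs a manual read)
    if description and not 1 <= len(description) <= 1024:
        failures.append("G3: description must be 1-1024 chars")
    if description and "TODO" in description:
        failures.append("G3: description contains a placeholder")
    return failures
```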
### Quality Dimensions (60 points)
#### D1: Knowledge Delta (20 pts) — MOST IMPORTANT
"If I delete this, would the LLM perform noticeably worse?"
| Score | Indicator |
|---|---|
| 16-20 | Pure expert knowledge: trade-offs, decision trees, non-obvious sequences, "NEVER X because [surprising reason]" |
| 11-15 | Mostly expert: some activation knowledge mixed in |
| 6-10 | Mixed: useful bits buried in tutorials |
| 0-5 | Redundant: docs the LLM already knows, "What is X" sections |
Red flags (subtract points):
- Library tutorials ("How to use pandas")
- Generic best practices ("Write clean code")
- Definitions the LLM knows ("PDF is Portable Document Format")
Green flags (add points):
- "When X and Y, choose Z because..."
- "NEVER do X — it causes [non-obvious problem]"
- Specific numbers/thresholds ("Scale UP not down for text clarity")
- Domain-specific sequences the LLM would get wrong
#### D2: Mindset + Procedures (15 pts)
| Score | Indicator |
|---|---|
| 12-15 | Shapes thinking: "Before X, ask yourself...", expert decision frameworks |
| 8-11 | Domain workflows + some thinking patterns |
| 4-7 | Procedures only, no mental models |
| 0-3 | Generic steps ("1. Open file 2. Process 3. Save") |
Look for: Diagnostic questions, priority rules, "The expert's first question is always..."
#### D3: Anti-Patterns (10 pts)
| Score | Indicator |
|---|---|
| 9-10 | Comprehensive NEVER list with non-obvious reasons |
| 6-8 | Specific warnings, some reasons |
| 3-5 | Vague warnings ("Be careful with...") |
| 0-2 | No anti-patterns mentioned |
What counts: Specific, actionable, with surprising consequences. "NEVER use Inter font — dead giveaway of AI-generated" beats "Choose fonts carefully."
#### D4: Structure & Economy (15 pts)
| Score | Indicator |
|---|---|
| 12-15 | <300 lines, excellent progressive disclosure, clear {baseDir} references |
| 8-11 | 300-500 lines, uses references/, has loading triggers |
| 4-7 | 500-800 lines, some structure |
| 0-3 | >800 lines, monolithic, no disclosure |
Check for:
- {baseDir}/references/ paths with clear "when to load" instructions
- "MANDATORY if [condition]: Read..." triggers
- "Do NOT preload" markers for large references
Static report template: See {baseDir}/references/templates.md#static-report
## Mode 2: Semi-Static Analysis
Input: SKILL.md + Target Environment + User Level
Output: Static score + Fit scores = /100
### Step 1: Complete Static Analysis
Run Mode 1 first. If gates fail, stop.
### Step 2: Environment Fit (20 pts)
| Environment | Shell | Files | Network | scripts/ |
|---|---|---|---|---|
| claude.ai/Web | ❌ | Upload only | ❌ | ❌ |
| Claude Desktop | ⚠️ | ⚠️ | ❌ | ⚠️ |
| Coding Agent/CLI | ✅ | ✅ Workspace | ✅ | ✅ |
| IDE Extension | ✅ | ✅ Workspace | ⚠️ | ✅ |
| Enterprise | Policy | Policy | Policy | Policy |
| Score | Criteria |
|---|---|
| 16-20 | Fully compatible, all features work |
| 11-15 | Mostly works, minor limitations |
| 6-10 | Partial, requires workarounds or degraded mode |
| 0-5 | Incompatible, core features won't work |
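The matrix above can be encoded as data so the fit check becomes a lookup. A sketch; capability names mirror the table columns, with True/None/False standing for works/limited/blocked, and the policy-driven Enterprise row omitted:

```python
# True = works (✅), None = limited (⚠️), False = blocked (❌)
ENV_CAPABILITIES = {
    "claude.ai-web":  {"shell": False, "files": None, "network": False, "scripts": False},
    "claude-desktop": {"shell": None,  "files": None, "network": False, "scripts": None},
    "coding-agent":   {"shell": True,  "files": True, "network": True,  "scripts": True},
    "ide-extension":  {"shell": True,  "files": True, "network": None,  "scripts": True},
}

def environment_blockers(env: str, needs: set[str]) -> list[str]:
    """Capabilities the skill needs that the target environment blocks outright."""
    caps = ENV_CAPABILITIES[env]
    return [need for need in needs if caps.get(need) is False]
```

A non-empty result means core features won't work, which puts the environment-fit score in the 0-5 band.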
### Step 3: User Fit (20 pts)
| User Level | Signals | Skill Should Provide |
|---|---|---|
| Novice | "I'm new to...", asks basics | Guided workflows, examples, guardrails |
| Intermediate | Uses terms correctly, asks "how to optimize" | Efficiency, concepts, options |
| Expert | Asks about internals, wants customization | Control, extensibility, raw power |
| Score | Criteria |
|---|---|
| 16-20 | Perfect audience fit |
| 11-15 | Good fit, acceptable learning curve |
| 6-10 | Partial fit, friction expected |
| 0-5 | Wrong audience entirely |
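Semi-static simply stacks the two fit scores on the static result. A trivial sketch for completeness:

```python
def semi_static_total(static_total: int, env_fit: int, user_fit: int) -> int:
    """Static /60 + environment fit /20 + user fit /20 = /100."""
    assert 0 <= static_total <= 60 and 0 <= env_fit <= 20 and 0 <= user_fit <= 20
    return static_total + env_fit + user_fit
```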
Semi-static report template: See {baseDir}/references/templates.md#semi-static-report
## Mode 3: Full Analysis
Input: Complete skill package + Test environment
Output: Comprehensive score /130
### Step 1: Complete Semi-Static
Run Modes 1 and 2 first.
### Step 2: Security Scan (Gate — Must Pass)
MANDATORY: Read {baseDir}/references/security-scan-spec.md before scanning.
Run the static scan:

```bash
python {baseDir}/scripts/security_scan.py /path/to/skill
# or: node {baseDir}/scripts/security_scan.js /path/to/skill
# or: bash {baseDir}/scripts/security_scan.sh /path/to/skill
```
Any HIGH severity finding = instant fail. Do not proceed.
For semantic analysis (obfuscation, prompt injection, data flow): Read {baseDir}/references/security-scan-llm.md and use the LLM scan prompt.
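A sketch of the fail-fast rule. The JSON output format here is an assumption for illustration, not the actual contract of security_scan.py; defer to security-scan-spec.md for the real interface:

```python
import json
import subprocess
import sys

def run_security_gate(skill_path: str, scanner: str) -> bool:
    """Run the static scanner and enforce 'any HIGH finding = instant fail'."""
    # ASSUMPTION: the scanner prints a JSON array of findings like
    # {"severity": "HIGH", "message": "..."} and exits 0.
    result = subprocess.run(["python", scanner, skill_path],
                            capture_output=True, text=True)
    findings = json.loads(result.stdout)
    high = [f for f in findings if f.get("severity") == "HIGH"]
    for finding in high:
        print(f"SECURITY FAIL: {finding.get('message')}", file=sys.stderr)
    return not high  # True = gate passed, safe to continue scoring
```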
### Step 3: Trigger Testing (10 pts)
Test 5 prompt types:
| Type | Example | Expected |
|---|---|---|
| Direct | "Use [skill-name] to..." | Trigger |
| Keyword | "[feature word] my file" | Trigger |
| Indirect | "[Problem the skill solves]" | Trigger |
| Ambiguous | Vague related request | Maybe trigger |
| Negative | Unrelated task | NOT trigger |
| Score | Criteria |
|---|---|
| 9-10 | Reliable triggers, zero false positives |
| 7-8 | Usually triggers, rare false positives |
| 4-6 | Sometimes triggers, some false positives |
| 0-3 | Unreliable or excessive false positives |
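The five rows above translate directly into a test table. A sketch; ask_agent is a hypothetical harness that returns whether the skill triggered:

```python
# (prompt type, example prompt, expected trigger; None = "maybe", never a failure)
TRIGGER_CASES = [
    ("direct",    "Use skill-evaluator to check this skill", True),
    ("keyword",   "evaluate my SKILL.md",                    True),
    ("indirect",  "Is this skill worth installing?",         True),
    ("ambiguous", "Can you take a look at this file?",       None),
    ("negative",  "Write me a haiku about autumn",           False),
]

def trigger_failures(ask_agent) -> int:
    """Count hard failures: missed triggers plus false positives."""
    failures = 0
    for kind, prompt, expected in TRIGGER_CASES:
        triggered = ask_agent(prompt)  # hypothetical harness call
        if expected is not None and triggered != expected:
            failures += 1
            print(f"FAIL [{kind}]: expected trigger={expected}, got {triggered}")
    return failures
```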
### Step 4: Functional Tests (20 pts)
| Category | Points | What to Test |
|---|---|---|
| Happy path | /8 | Core use cases work correctly |
| Edge cases | /6 | Unusual inputs, boundary conditions |
| Error handling | /3 | Graceful failures, helpful messages |
| Output quality | /3 | Results match expert expectations |
Full report template: See {baseDir}/references/templates.md#full-report
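The full score stacks trigger and functional results on the semi-static total; security stays a gate, not points. A sketch with illustrative names:

```python
def full_total(semi_static: int, trigger: int, functional: int,
               security_passed: bool) -> int | None:
    """Semi-static /100 + triggers /10 + functional tests /20 = /130.
    Returns None when the security gate fails: no score is issued at all."""
    if not security_passed:
        return None
    assert 0 <= semi_static <= 100 and 0 <= trigger <= 10 and 0 <= functional <= 20
    return semi_static + trigger + functional
```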
## NEVER Do (Evaluator Anti-Patterns)
- NEVER score D1 high for tutorials — "How to use library X" is not expert knowledge, even if well-written.
- NEVER ignore token cost — an 800-line skill that could be 200 lines wastes 75% of the tokens it consumes.
- NEVER pass security for "educational" shell=True — legitimate purposes don't justify vulnerabilities.
- NEVER assume environment — a skill perfect for a CLI is worthless in claude.ai web.
- NEVER conflate "comprehensive" with "good" — more content ≠ more value. Density matters.
- NEVER skip gate checks — a skill with invalid YAML shouldn't get quality scores.
## Common Failure Patterns → Fixes
| Pattern | Symptom | Root Cause | Fix |
|---|---|---|---|
| Tutorial | Low D1, "What is X" sections | Author wrote for humans, not LLMs | Delete all content the LLM already knows |
| Dump | >800 lines, no structure | No progressive disclosure | Split to references/, add loading triggers |
| Orphan References | references/ exists but never loaded | Missing "when to read" instructions | Add explicit "MANDATORY if [X]: Read..." |
| Invisible | Never triggers | Bad description | Move ALL trigger info to description, add keywords |
| Wrong Location | "When to Use" in body | Body loads AFTER trigger decision | Description = when, Body = how |
| Vague Warnings | "Be careful with X" | No actionable anti-patterns | Specific NEVER + surprising consequence |
## Quick Reference: Scoring Cheatsheet
- D1 (Knowledge Delta): "Would deleting this make the LLM worse?"
- D2 (Mindset): "Does it shape HOW to think, not just WHAT to do?"
- D3 (Anti-Patterns): "Specific NEVERs with surprising reasons?"
- D4 (Structure): "<300 lines? Progressive disclosure? Loading triggers?"
- Environment: "Will core features actually work?"
- User: "Right audience? Right complexity level?"
## The Meta-Question
After every evaluation, ask:
"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"
- Yes → The skill has value.
- No → The skill is compressing what the LLM already knows.
## Bento-Ready Report Output
For visual dashboard integration, generate JSON reports where evaluation logic directly determines visual weight.
### Core Principles
Amplify the abnormal, compress the normal — information density scales with deviation severity.
Speak human, not framework — Users haven't read our evaluation docs. No D1/D2/G1 jargon.
| Status | Display Strategy |
|---|---|
| Normal score (≥80%) | Compact block, headline only |
| Notable issue (<60%) | Expanded block with detail + action |
| Critical issue (<30% or security) | Prominent block with full evidence |
### Output Formats
| Request | Output |
|---|---|
| Standard evaluation | Markdown report (see {baseDir}/references/templates.md) |
| "bento report" / "visual report" / "JSON report" | Bento-ready JSON |
### Bento Report Structure

```json
{
  "meta": { "skillName": "...", "totalScore": 48, "maxScore": 60 },
  "summary": { "verdict": "good", "oneLiner": "Ready to use with minor improvements." },
  "blocks": [
    {
      "id": "expert-knowledge",
      "type": "score",
      "importance": { "level": "normal", "reason": "Good expert content with minor redundancy" },
      "layout": { "size": "default" },
      "content": {
        "headline": "Expert Knowledge — Good ✓",
        "detail": "Contains valuable professional insights. Some basic tutorials could be trimmed."
      }
    }
  ]
}
```
### Importance → Layout Mapping
| Importance | Layout | Trigger |
|---|---|---|
| critical | prominent | Gate fail, security issue, score <30% |
| notable | expanded | Score <60%, outlier disparity |
| normal | default | Score 60-80% |
| minor | compact | Score ≥80%, all gates pass |
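The mapping is mechanical enough to encode. A sketch, assuming a block's percentage is its score divided by its maximum:

```python
def importance_level(pct: float, gate_failed: bool = False,
                     security_issue: bool = False) -> tuple[str, str]:
    """Map a block's score percentage to (importance, layout)."""
    if gate_failed or security_issue or pct < 0.30:
        return "critical", "prominent"
    if pct < 0.60:
        return "notable", "expanded"
    if pct < 0.80:
        return "normal", "default"
    return "minor", "compact"
```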
### Visual Assets
Add visual_asset for critical/notable blocks:
- Charts: gauge (header), radar (quality overview)
- Illustrations: warning style for security issues
- Badges: verdict display
Always include fallback.text for image generation failures.
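A sketch of what such an entry might look like, written as a Python dict; field names other than visual_asset and fallback.text are guesses, so defer to bento-report-schema.md:

```python
# Illustrative visual_asset for a critical security block. Only "fallback.text"
# is required by the rule above; the other field names are assumptions.
visual_asset = {
    "kind": "illustration",
    "style": "warning",
    "prompt": "Cracked shield, flat vector, muted red",
    "fallback": {
        "text": "Security scan failed: 1 HIGH severity finding"  # shown if image generation fails
    },
}
```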
MANDATORY for bento reports: Read {baseDir}/references/bento-report-schema.md for complete JSON schema.
MANDATORY for bento generation: Read {baseDir}/references/bento-report-instruction.md for generation rules.