skill_evaluator
Skill Evaluator (WIP)
Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.
Quick Start
- Read the skill's SKILL.md and understand its purpose
- Run automated validation:
scripts/validate_skill.py <skill-path> - Perform manual evaluation against criteria below
- Generate evaluation report with scores and recommendations
Evaluation Workflow
Step 1: Automated Validation
Run the validation script first:
scripts/validate_skill.py <path/to/skill>
This checks:
- SKILL.md exists with valid YAML frontmatter
- Name follows conventions (lowercase, hyphens, max 64 chars)
- Description is present and under 1024 chars
- Body is under 500 lines
- File references are one-level deep
Step 2: Manual Evaluation
Evaluate each dimension and assign a score (1-5):
A. Naming (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |
Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.
Good: processing-pdfs, analyzing-spreadsheets, building-dashboards
Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
B. Description (Weight: 20%)
| Score | Criteria |
|---|---|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |
Must include: What the skill does AND when to use it. Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis." Bad: "A skill for PDFs." or "Helps with documents."
C. Content Quality (Weight: 30%)
| Score | Criteria |
|---|---|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |
Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.
D. Structure & Organization (Weight: 25%)
| Score | Criteria |
|---|---|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |
Check:
- SKILL.md under 500 lines
- References are one-level deep (no nested chains)
- Long reference files (>100 lines) have table of contents
- Uses forward slashes in all paths
E. Degrees of Freedom (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |
Guideline:
- High freedom (text): Multiple valid approaches, context-dependent
- Medium freedom (parameterized): Preferred pattern exists, some variation OK
- Low freedom (specific scripts): Fragile operations, exact sequence required
F. Anti-Pattern Check (Weight: 5%)
Deduct points for each anti-pattern found:
- Too many options without clear recommendation (-1)
- Time-sensitive information with date conditionals (-1)
- Inconsistent terminology (-1)
- Windows-style paths (backslashes) (-1)
- Deeply nested references (more than one level) (-2)
- Scripts that punt error handling to Claude (-1)
- Magic numbers without justification (-1)
Step 3: Generate Report
Use this template:
# Skill Evaluation Report: [skill-name]
## Summary
- **Overall Score**: X.X/5.0
- **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]
## Dimension Scores
| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| **Total** | | 100% | **X.XX** |
## Strengths
- [List 2-3 things done well]
## Areas for Improvement
- [List specific issues with actionable fixes]
## Anti-Patterns Found
- [List any anti-patterns detected]
## Recommendations
1. [Priority 1 fix]
2. [Priority 2 fix]
3. [Priority 3 fix]
## Pre-Publication Checklist
- [ ] Description is specific with activation triggers
- [ ] SKILL.md under 500 lines
- [ ] One-level-deep file references
- [ ] Forward slashes in all paths
- [ ] No time-sensitive information
- [ ] Consistent terminology
- [ ] Concrete examples provided
- [ ] Scripts handle errors explicitly
- [ ] All configuration values justified
- [ ] Required packages listed
- [ ] Tested with Haiku, Sonnet, Opus
Score Interpretation
| Score Range | Rating | Action |
|---|---|---|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |
References
- references/evaluation-criteria.md - Detailed evaluation criteria with examples
- references/scoring-rubric.md - Complete scoring rubric and edge cases
Skill Evaluator v1.1 - Enhanced
🔄 Workflow
Kaynak: Google Engineering Practices - Code Review & Anthropic System Prompts
Aşama 1: Structural Analysis
- Compliance: Dosya yapısı (
scripts/,references/) standarta uyuyor mu? - Metadata: YAML frontmatter (
name,description) eksiksiz ve valid mi? - Modularity: Skill çok mu büyük? Bölünmesi gerekiyor mu? (Single Responsibility Principle).
Aşama 2: Content & Semantic Review
- Clarity: Talimatlar emir kipiyle (Imperative) ve net yazılmış mı? Belirsizlik var mı?
- Context Efficiency: "Gereksiz nezaket" veya "aşırı açıklama" var mı? Token israfı önlenmeli.
- Safety: Skill tehlikeli bir işlem (dosya silme, yetkisiz erişim) öneriyor mu?
Aşama 3: Functionality Verification
- Script Audit:
scripts/içindeki Python/Bash kodları güvenli ve çalışır durumda mı? - Reference Check:
references/dosyaları gerçekten gerekli mi? YoksaSKILL.mdiçine mi gömülmeli? - Usability: Bir kullanıcı (veya ajan) bu skill'i okuyup hemen kullanabilir mi?
Kontrol Noktaları
| Aşama | Doğrulama |
|---|---|
| 1 | Skill adı ve açıklaması birbiriyle tutarlı mı? |
| 2 | Anti-pattern (örn: Hardcoded path) tespit edildi mi? |
| 3 | Puanlama rubriğine göre objektif bir skor (1-5) verildi mi? |
More from vuralserhat86/antigravity-agentic-skills
skill_creator
Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
37huggingface_transformers
Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference.
22responsive_design
Build responsive, mobile-first layouts using fluid containers, flexible units, media queries, and touch-friendly design that works across all screen sizes. Use this skill when creating or modifying UI layouts, responsive grids, breakpoint styles, mobile navigation, or any interface that needs to adapt to different screen sizes. Apply when working with responsive CSS, media queries, viewport settings, flexbox/grid layouts, mobile-first styling, breakpoint definitions (mobile, tablet, desktop), touch target sizing, relative units (rem, em, %), image optimization for different screens, or testing layouts across multiple devices. Use for any task involving multi-device support, responsive design patterns, or adaptive layouts.
20cache_patterns
Instruction set for enabling and operating the Spring Cache abstraction in Spring Boot when implementing application-level caching for performance-sensitive workloads.
16zustand_state
Production-tested setup for Zustand state management in React. Includes patterns for persistence, devtools, and TypeScript patterns. Prevents hydration mismatches and render loops.
14vitest_runner
Modern JavaScript/TypeScript testing with Vitest including mocking and coverage.
13