# Agent Evaluation Skill
Objective, evidence-based quality assessment for agents and skills. Implements a 6-phase rubric: Identify, Structural, Content, Code, Integration, Report. Every finding must cite a file path and line number — no subjective "looks good" verdicts.
## Reference Loading Table
| Signal | Load These Files | Why |
|---|---|---|
| Evaluating many agents or skills in one pass | batch-evaluation.md | Batch evaluation procedures and collection summary format. |
| Diagnosing frequently seen quality problems | common-issues.md | Frequently found issues with fix templates. |
| Producing the final quality report | report-templates.md | Standard report format templates (single, batch, comparison). |
| Assigning full/partial/no credit per category | scoring-rubric.md | Full/partial/no credit breakdowns per rubric category. |
## Instructions

### Phase 1: Identify Evaluation Targets
Goal: Determine what to evaluate and confirm targets exist.
Read the repository CLAUDE.md first to understand current standards before evaluating anything. Only evaluate what was explicitly requested — do not speculatively analyze additional agents or skills.
```bash
# List all agents
ls agents/*.md | wc -l
# List all skills
ls -d skills/*/ | wc -l
# Verify a specific target
ls agents/{name}.md
ls -la skills/{name}/
```
Gate: All targets confirmed to exist on disk. Proceed only when gate passes.
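A tiny gate sketch (the two paths are illustrative placeholders for whatever targets were requested):

```bash
# Proceed only if every requested target exists on disk
for t in agents/{name}.md skills/{name}/; do
  [ -e "$t" ] || { echo "MISSING TARGET: $t"; exit 1; }
done
```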
### Phase 2: Structural Validation
Goal: Check that required components exist and are well-formed.
Score every rubric category — never skip a category even if it "looks fine." Parse each required field explicitly rather than eyeballing YAML. Record PASS/FAIL with the line number for each check.
Run score-component.py to get deterministic PASS/FAIL for all structural checks. The script implements the full ADR-031 rubric (frontmatter, operator context, error handling, referenced files, anti-patterns) and outputs per-check results with line references. Do not re-implement these checks inline — read the JSON output and move directly to scoring.
```bash
# Deterministic structural checks via score-component.py
python3 scripts/score-component.py agents/{name}.md --json
# or for a skill:
python3 scripts/score-component.py skills/{name}/SKILL.md --json
```
The JSON output includes `results[0].checks` (per-check status, `earned_points`, `max_points`, `detail`) and `results[0].total` (aggregate score). Record each check status from the JSON; do not re-run `grep -c` for sections the script already covers.
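A minimal extraction sketch with `jq`, assuming the field names described above (your script version may shape the JSON differently):

```bash
# Pull per-check results and the aggregate score out of the --json output
out=$(python3 scripts/score-component.py skills/{name}/SKILL.md --json)
echo "$out" | jq -r '.results[0].checks[] | "\(.status)\t\(.earned_points)/\(.max_points)\t\(.detail)"'
echo "$out" | jq '.results[0].total'
```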
What score-component.py covers (do not duplicate):
- YAML frontmatter fields
- Operator Context section presence
- Error Handling section presence
- Anti-Patterns section presence
- Referenced file existence
- Inline constraint presence
What requires LLM judgment in Phase 3+ (not covered by the script):
- Operator Context item counts (Hardcoded 5-8, Default 5-8, Optional 3-5)
- `allowed-tools` list format vs comma-separated string
- `description` pipe format with WHAT + WHEN + negative constraint
- `version` set to `2.0.0`
- Gate presence in Instructions section
- CAN/CANNOT boundaries section
- Anti-rationalization table in References section
Structural Scoring (60 points):
| Component | Points | Requirement |
|---|---|---|
| YAML front matter | 10 | All required fields, list format, pipe description |
| Operator Context | 20 | All 3 behavior types with correct item counts |
| Error Handling | 10 | Section present with documented errors |
| Examples (agents) / References (skills) | 10 | 3+ examples or 2+ reference files |
| CAN/CANNOT | 5 | Both sections present with concrete items |
| Anti-Patterns | 5 | 3-5 domain-specific patterns with 3-part structure |
Integration Scoring (10 points):
| Component | Points | Requirement |
|---|---|---|
| References and cross-references | 5 | Shared patterns linked, all refs resolve |
| Tool and link consistency | 5 | allowed-tools matches usage, anti-rationalization table present |
See `references/scoring-rubric.md` for full/partial/no credit breakdowns. Structural (60) + Depth (30, scored in Phase 3) + Integration (10) sum to the 100-point overall score.
Gate: All structural checks scored with evidence. Proceed only when gate passes.
### Phase 3: Content Depth Analysis
Goal: Measure content quality and volume.
Do not estimate length by impression — count lines and calculate the score. "Content is long enough" is not a measurement.
```bash
# Skill total lines (SKILL.md + references)
skill_lines=$(wc -l < skills/{name}/SKILL.md)
ref_lines=$(cat skills/{name}/references/*.md 2>/dev/null | wc -l)
total=$((skill_lines + ref_lines))
# Agent total lines
agent_lines=$(wc -l < agents/{name}.md)
```
Depth Scoring (30 points max):
| Total Lines | Score | Grade |
|---|---|---|
| >1500 (skills) / >2000 (agents) | 30 | EXCELLENT |
| 500-1500 / 1000-2000 | 22 | GOOD |
| 300-500 / 500-1000 | 15 | ADEQUATE |
| 150-300 / 200-500 | 8 | THIN |
| <150 / <200 | 0 | INSUFFICIENT |
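A small sketch turning `total` from the block above into a depth score, using the skill thresholds from the table (boundary values fall into the higher band here; agents would swap in the 2000/1000/500/200 cut-offs):

```bash
# Depth score for a skill, per the table above
if   [ "$total" -gt 1500 ]; then depth=30  # EXCELLENT
elif [ "$total" -ge 500  ]; then depth=22  # GOOD
elif [ "$total" -ge 300  ]; then depth=15  # ADEQUATE
elif [ "$total" -ge 150  ]; then depth=8   # THIN
else                             depth=0   # INSUFFICIENT
fi
echo "Depth score: $depth/30"
```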
Gate: Depth score calculated. Proceed only when gate passes.
### Phase 4: Code Quality Checks
Goal: Validate that code examples and scripts are functional.
A script existing on disk does not mean it works: run `python3 -m py_compile` on every `.py` file. Search for placeholder text in every file, not just files that "look incomplete."
- Script syntax: Run `python3 -m py_compile` on all `.py` files
- Placeholder detection: Search for `[TODO]`, `[TBD]`, `[PLACEHOLDER]`, `[INSERT]`
- Code block tagging: Count untagged (bare `` ``` ``) vs tagged (`` ```language ``) blocks
```bash
# Python syntax check: compile each .py script in the skill's scripts/ directory.
# Do not discard stderr -- the specific error is needed for the report.
for f in skills/{name}/scripts/*.py; do
  [ -e "$f" ] || continue
  python3 -m py_compile "$f" || echo "SYNTAX ERROR: $f"
done
# Placeholder search
grep -nE '\[TODO\]|\[TBD\]|\[PLACEHOLDER\]|\[INSERT\]' {file}
# Untagged code blocks (closing fences also match, so inspect the hits)
grep -c '```$' {file}
```
Gate: All code checks complete. Proceed only when gate passes.
### Phase 5: Integration Verification
Goal: Confirm cross-references and tool declarations are consistent.
Reference Resolution:
- Extract all referenced files from SKILL.md (grep for `references/`)
- Verify each reference exists on disk
- Check shared pattern links resolve (`../shared-patterns/`)

Tool Consistency:
- Parse `allowed-tools` from YAML front matter
- Scan instructions for tool usage (Read, Write, Edit, Bash, Grep, Glob, Task, WebSearch)
- Flag any tool used in instructions but not declared in `allowed-tools`
- Flag any tool declared but never used in instructions

Anti-Rationalization Table:
- Check that the References section links to `anti-rationalization-core.md`
- Verify a domain-specific anti-rationalization table is present
- Table should have 3-5 rows specific to the skill's domain
```bash
# Check referenced files exist
grep -oE 'references/[a-z-]+\.md' skills/{name}/SKILL.md | while read ref; do
  ls "skills/{name}/$ref" 2>/dev/null || echo "MISSING: $ref"
done
# Check tool consistency
grep "allowed-tools:" skills/{name}/SKILL.md
grep -oE '(Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch)' skills/{name}/SKILL.md | sort -u
# Check anti-rationalization reference
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
```
Gate: All integration checks complete. Proceed only when gate passes.
### Phase 6: Generate Quality Report
Goal: Compile all findings into the standard report format.
Show all test results with individual scores — never summarize as "all tests pass." Sort findings by impact (HIGH / MEDIUM / LOW). Include specific, actionable recommendations with file paths and line numbers. When batch evaluating, show how each item compares to collection averages; do not report "most are good quality" without quantitative data.
This phase is read-only: report findings but never modify agents or skills. Use skill-creator for fixes. Clean up any intermediate analysis files created during evaluation.
Use the report template from `references/report-templates.md`. The report MUST include:
- Header: Name, type, date, overall score and grade
- Structural Validation: Table with check, status, score, and evidence (line numbers)
- Content Depth: Line counts for main file and references, grade, depth score
- Code Quality: Script syntax results, placeholder count, untagged block count
- Issues Found: Grouped by HIGH / MEDIUM / LOW priority
- Recommendations: Specific, actionable improvements with file paths and line numbers
- Comparison: Score vs collection average (if batch evaluating)
Issue Priority Classification:
| Priority | Criteria | Examples |
|---|---|---|
| HIGH | Missing required section or broken functionality | No Operator Context, syntax errors in scripts |
| MEDIUM | Section present but incomplete or non-compliant | Wrong item counts, old allowed-tools format |
| LOW | Cosmetic or minor quality issues | Untagged code blocks, missing changelog |
Grade Boundaries:
| Score | Grade | Interpretation |
|---|---|---|
| 90-100 | A | Production ready, exemplary |
| 80-89 | B | Good, minor improvements needed |
| 70-79 | C | Adequate, some gaps to address |
| 60-69 | D | Below standard, significant work needed |
| <60 | F | Major overhaul required |
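The boundaries translate directly into a lookup, sketched here in shell:

```bash
# Map an overall score (0-100) to a letter grade per the table above
grade() {
  if   [ "$1" -ge 90 ]; then echo A
  elif [ "$1" -ge 80 ]; then echo B
  elif [ "$1" -ge 70 ]; then echo C
  elif [ "$1" -ge 60 ]; then echo D
  else                       echo F
  fi
}
grade 84  # -> B
```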
Gate: Report generated with all sections populated and evidence cited. Evaluation complete.
## Examples

### Example 1: Single Skill Evaluation
User says: "Evaluate the test-driven-development skill" Actions:
- Confirm
skills/test-driven-development/exists (IDENTIFY) - Check YAML, Operator Context, Error Handling sections (STRUCTURAL)
- Count lines in SKILL.md + references (CONTENT)
- Syntax-check any scripts, find placeholders (CODE)
- Verify all referenced files exist (INTEGRATION)
- Generate scored report (REPORT) Result: Structured report with score, grade, and prioritized findings
### Example 2: Collection Batch Evaluation
User says: "Audit all agents and skills" Actions:
- List all agents/.md and skills//SKILL.md (IDENTIFY)
- Run Steps 2-5 for each target (EVALUATE)
- Generate individual reports + collection summary (REPORT) Result: Per-item scores plus distribution, top performers, and improvement areas
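A batch-loop sketch, reusing the `--json` shape from Phase 2. This collects structural totals only; depth and integration scores still come from Phases 3-5:

```bash
# Score every agent and skill, then report the collection average
scores=()
for f in agents/*.md skills/*/SKILL.md; do
  s=$(python3 scripts/score-component.py "$f" --json | jq '.results[0].total')
  echo "$f: $s"
  scores+=("$s")
done
printf '%s\n' "${scores[@]}" | awk '{sum+=$1} END{if (NR) printf "average: %.1f\n", sum/NR}'
```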
### Example 3: V2 Migration Compliance Check
User says: "Check if systematic-refactoring skill meets v2 standards" Actions:
- Confirm
skills/systematic-refactoring/exists (IDENTIFY) - Check YAML uses list
allowed-tools, pipe description, version 2.0.0 (STRUCTURAL) - Verify Operator Context has correct item counts: Hardcoded 5-8, Default 5-8, Optional 3-5 (STRUCTURAL)
- Confirm CAN/CANNOT sections, gates in Instructions, anti-rationalization table (STRUCTURAL)
- Count total lines, run code checks (CONTENT + CODE)
- Generate scored report highlighting v2 gaps (REPORT) Result: Report with specific v2 compliance gaps and required actions
## Error Handling
Error: "File Not Found"
Cause: Agent or skill path incorrect, or item was deleted
Solution: Verify path exists with ls before evaluation. If truly missing, exclude from batch and note in report.
Error: "Cannot Parse YAML Front Matter"
Cause: Malformed YAML — missing --- delimiters, bad indentation, or invalid syntax
Solution: Flag as HIGH priority structural failure. Score YAML section as 0/10. Include the specific parse error in the report.
Error: "Python Syntax Error in Script"
Cause: Validation script has syntax issues
Solution: Run python3 -m py_compile and capture the specific error. Score validation script as 0/10. Include error output in report.
Error: "Operator Context Item Counts Out of Range"
Cause: v2 standard requires Hardcoded 5-8, Default 5-8, Optional 3-5 items. Skill has too few or too many. Solution:
- Count actual items per behavior type (bold items starting with
- **) - If too few: flag as MEDIUM priority — behaviors likely need to be split or added
- If too many: flag as LOW priority — behaviors may need consolidation
- Score Operator Context at partial credit (10/20) if counts are wrong
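A quick counting aid for the first step. It counts every bold bullet in the file, so attribute hits to Hardcoded / Default / Optional by reading the Operator Context section:

```bash
# Rough item count: bold bullets ("- **") across the whole file
grep -c '^- \*\*' skills/{name}/SKILL.md
```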
## References

### Reference Files
- `${CLAUDE_SKILL_DIR}/references/scoring-rubric.md` - Full/partial/no credit breakdowns per rubric category
- `${CLAUDE_SKILL_DIR}/references/report-templates.md` - Standard report format templates (single, batch, comparison)
- `${CLAUDE_SKILL_DIR}/references/common-issues.md` - Frequently found issues with fix templates
- `${CLAUDE_SKILL_DIR}/references/batch-evaluation.md` - Batch evaluation procedures and collection summary format