audit-agents-skills
Audit Agents/Skills/Commands (Advanced Skill)
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
Purpose
Problem: Manual validation of agents and skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, and 18% of teams cite "agent bugs" as their top challenge.
Solution: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
Key Features:
- Quantitative scoring (32 points for agents/skills, 20 for commands)
- Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
- Production readiness grading (A-F scale with 80% threshold)
- Comparative analysis vs reference templates
- JSON/Markdown dual output for programmatic integration
- Fix suggestions for failing criteria
Modes
| Mode | Usage | Output |
|---|---|---|
| Quick Audit | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| Full Audit | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| Comparative | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |
Default: Full Audit (recommended for first run)
Methodology
Why These Criteria?
The 16-criteria framework is derived from:
- Claude Code Best Practices (Ultimate Guide line 4921: Agent Validation Checklist)
- Industry Data (LangChain Agent Report 2026: evaluation gaps)
- Production Failures (Community feedback on hardcoded paths, missing error handling)
- Composition Patterns (Skills should reference other skills, agents should be modular)
Scoring Philosophy
Weight Rationale:
- Identity (3x): If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
- Prompt (2x): Determines reliability and accuracy of outputs
- Validation (1x): Improves robustness but is secondary to core functionality
- Design (2x): Impacts long-term maintainability and scalability
Grade Standards:
- A (90-100%): Production-ready, minimal risk
- B (80-89%): Good, meets production threshold
- C (70-79%): Needs improvement before production
- D (60-69%): Significant gaps, not production-ready
- F (<60%): Critical issues, requires major refactoring
Industry Alignment: The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).
Workflow
Phase 1: Discovery
1. Scan directories:
   - .claude/agents/
   - .claude/skills/
   - .claude/commands/
   - examples/agents/ (if exists)
   - examples/skills/ (if exists)
   - examples/commands/ (if exists)
2. Classify files by type (agent/skill/command)
3. Load reference templates (for Comparative mode):
   - guide/examples/agents/ (benchmark files)
   - guide/examples/skills/ (benchmark files)
   - guide/examples/commands/ (benchmark files)
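A minimal sketch of this discovery step, assuming the directory layout above (the helper name and classification rule are illustrative):

```python
from pathlib import Path

# Directories scanned in Phase 1 (optional ones are skipped if absent)
SCAN_DIRS = [".claude/agents", ".claude/skills", ".claude/commands",
             "examples/agents", "examples/skills", "examples/commands"]

def discover_files(project_root):
    """Collect agent/skill/command files, classified by their parent directory."""
    discovered = []
    for rel_dir in SCAN_DIRS:
        base = Path(project_root) / rel_dir
        if not base.exists():
            continue
        file_type = base.name.rstrip("s")  # agents -> agent, skills -> skill, commands -> command
        for path in base.rglob("*.md"):
            discovered.append({"path": str(path), "type": file_type})
    return discovered
```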
Phase 2: Scoring Engine
Load scoring criteria from scoring/criteria.yaml:
```yaml
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)
```
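A minimal sketch of loading and flattening this file, assuming the structure shown above (PyYAML is the only dependency; the helper name is illustrative):

```python
import yaml

def load_criteria(path="scoring/criteria.yaml"):
    """Return a flat list of (file_type, category, weight, criterion) tuples."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    flat = []
    for file_type, type_spec in spec.items():
        for category, cat_spec in type_spec["categories"].items():
            weight = cat_spec.get("weight", 1)
            for criterion in cat_spec.get("criteria", []):
                flat.append((file_type, category, weight, criterion))
    return flat
```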
For each file:
- Parse frontmatter (YAML)
- Extract content sections
- Run detection patterns (regex, keyword search)
- Calculate score: (points / max_points) × 100
- Assign grade (A-F)
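As a short sketch of that score-to-grade mapping (thresholds taken from the Grade Standards section above; the function name is illustrative):

```python
def score_and_grade(points_obtained, points_max):
    """Convert raw points into a percentage score and a letter grade."""
    score = (points_obtained / points_max) * 100
    for threshold, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= threshold:
            return round(score, 1), grade
    return round(score, 1), "F"

# Example: 25 of 32 points -> (78.1, "C"), as in the sample report below
```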
Phase 3: Comparative Analysis (Comparative Mode Only)
For each project file:
- Find closest matching template (by description similarity)
- Compare scores per criterion
- Identify gaps: template_score - project_score
- Flag significant gaps (>10 points difference)
Example:
```
Project file:     .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)

Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)

Total gap: 16 points (explains C vs A difference)
```
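A sketch of the per-criterion comparison behind such a gap report, assuming both files have already been scored into {criterion_id: points} dictionaries (the helper name and threshold default are illustrative):

```python
def compare_to_template(project_scores, template_scores, flag_threshold=10):
    """Return per-criterion gaps and whether the total gap is significant."""
    gaps = {}
    for criterion_id, template_points in template_scores.items():
        diff = template_points - project_scores.get(criterion_id, 0)
        if diff > 0:
            gaps[criterion_id] = diff
    total_gap = sum(gaps.values())
    return {"gaps": gaps, "total_gap": total_gap, "significant": total_gap > flag_threshold}
```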
Phase 4: Report Generation
Markdown Report (audit-report.md):
- Summary table (overall + by type)
- Individual scores with top issues
- Detailed breakdown per file (collapsible)
- Prioritized recommendations
JSON Output (audit-report.json):
```json
{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}
```
Phase 5: Fix Suggestions (Optional)
For each failing criterion, generate an actionable fix:
```markdown
### File: .claude/agents/debugging-specialist.md

**Issue**: Missing anti-hallucination measures (2 points lost)

**Fix**: Add this section after "Methodology":

## Source Verification
- Always cite sources for technical claims
- Use phrases: "According to [documentation]...", "Based on [tool output]..."
- If uncertain, state: "I don't have verified information on..."
- Never invent: statistics, version numbers, API signatures, stack traces

**Detection**: Grep for keywords: "verify", "cite", "source", "evidence"
```
Scoring Criteria
See scoring/criteria.yaml for complete definitions. Summary:
Agents (32 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key Criteria:
- Clear name (3 pts): Not generic like "agent1"
- Description with triggers (3 pts): Contains "when"/"use"
- Role defined (2 pts): "You are..." statement
- 3+ examples (1 pt): Usage scenarios documented
- Single responsibility (2 pts): Focused, not "general purpose"
Skills (32 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key Criteria:
- Valid SKILL.md (3 pts): Proper naming
- Name valid (3 pts): Lowercase, 1-64 chars, no spaces
- Methodology described (2 pts): Workflow section exists
- No hardcoded paths (1 pt): No /Users/ or /home/ paths
- Clear triggers (2 pts): "When to use" section
Commands (20 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |
Key Criteria:
- Valid frontmatter (3 pts): name + description
- Argument hint (3 pts): Present if the command uses $ARGUMENTS
- Step-by-step workflow (3 pts): Numbered sections
- Error handling (2 pts): Mentions failure modes
Detection Patterns
Frontmatter Parsing
```python
import re
import yaml

def parse_frontmatter(content):
    """Extract and parse the YAML frontmatter delimited by --- markers."""
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None
```
Keyword Detection
```python
def has_keywords(text, keywords):
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)

# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```
Overlap Detection (Duplication Check)
```python
def jaccard_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")
```
Token Counting (Approximate)
```python
def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 0.75 words, i.e. ≈ 1.3 tokens per word
    word_count = len(text.split())
    return int(word_count * 1.3)

# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")
```
Industry Context
Source: LangChain Agent Report 2026 (public report, pages 14-22)
Key Findings:
- 29.5% of organizations deploy agents without systematic evaluation
- 18% cite "agent bugs" as their primary challenge
- Only 12% use automated quality checks (88% manual or none)
- 43% report difficulty maintaining agent quality over time
- Top issues: Hallucinations (31%), poor error handling (28%), unclear triggers (22%)
Implications:
- Automation gap: Most teams rely on manual checklists (error-prone at scale)
- Quality debt: Agents deployed without validation accumulate technical debt
- Maintenance burden: 43% struggle with quality over time (no tracking system)
This skill addresses:
- Automation: Replaces manual checklists with quantitative scoring
- Tracking: JSON output enables trend analysis over time
- Standards: 80% threshold provides clear production gate
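As an illustration of the trend tracking this enables, two audit-report.json snapshots could be diffed with a helper like the following sketch (file paths and the helper name are assumptions; the keys follow the Phase 4 structure):

```python
import json

def score_trend(old_report_path, new_report_path):
    """Compare overall and per-file scores between two audit runs."""
    with open(old_report_path) as f:
        old = json.load(f)
    with open(new_report_path) as f:
        new = json.load(f)
    overall_delta = new["summary"]["overall_score"] - old["summary"]["overall_score"]
    old_scores = {item["path"]: item["score"] for item in old["files"]}
    regressions = [item["path"] for item in new["files"]
                   if item["path"] in old_scores and item["score"] < old_scores[item["path"]]]
    return {"overall_delta": round(overall_delta, 1), "regressions": regressions}
```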
Output Examples
Quick Audit (Top-5 Criteria)
```markdown
# Quick Audit: Agents/Skills/Commands

**Files**: 15 (5 agents, 8 skills, 2 commands)
**Critical Issues**: 3 files fail top-5 criteria

## Top-5 Criteria (Pass/Fail)

| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|------|------------|--------------|----------------|--------------------|----------|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |

## Action Required

1. **Add error handling**: 5 files
2. **Remove hardcoded paths**: 3 files
3. **Add usage examples**: 4 files
```
Full Audit
See Phase 4: Report Generation above for full structure.
Comparative (Full + Benchmarks)
```markdown
# Comparative Audit

## Project vs Templates

| File | Project Score | Template Score | Gap | Top Missing |
|------|---------------|----------------|-----|-------------|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |

## Recommendations

Focus on these gaps to reach template quality:
1. **Anti-hallucination measures** (8 files): Add source verification sections
2. **Edge case documentation** (5 files): Add failure scenario examples
3. **Integration documentation** (4 files): List compatible agents/skills
```
Usage
Basic (Full Audit)
```
# In Claude Code
Use skill: audit-agents-skills

# Specify path
Use skill: audit-agents-skills for ~/projects/my-app
```
With Options
```
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```
JSON Output Only
```
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```
Integration with CI/CD
Pre-commit Hook
```bash
#!/bin/bash
# .git/hooks/pre-commit
# Run a quick audit on changed agent/skill/command files

changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")

if [ -n "$changed_files" ]; then
  echo "Running quick audit on changed files..."
  # Run audit (requires Claude Code CLI wrapper)
  # Exit with 1 if any file scores <80%
fi
```
GitHub Actions
```yaml
name: Audit Agents/Skills
on: [pull_request]

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quality audit
        run: |
          # Run audit skill
          # Parse JSON output
          # Fail if overall_score < 80
```
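The parse-and-fail step above could be implemented with a short check against the Phase 4 JSON output; a minimal sketch, assuming the report is written to audit-report.json in the working directory:

```python
import json
import sys

THRESHOLD = 80.0  # production gate from the Grade Standards section

with open("audit-report.json") as f:
    report = json.load(f)

score = report["summary"]["overall_score"]
if score < THRESHOLD:
    print(f"Audit failed: overall score {score} is below the {THRESHOLD}% threshold")
    sys.exit(1)
print(f"Audit passed: overall score {score}")
```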
Comparison: Command vs Skill
| Aspect | Command (/audit-agents-skills) | Skill (this file) |
|---|---|---|
| Scope | Current project only | Multi-project, comparative |
| Output | Markdown report | Markdown + JSON |
| Speed | Fast (5-10 min) | Slower (10-20 min with comparative) |
| Depth | Standard 16 criteria | Same + benchmark analysis |
| Fix suggestions | Via --fix flag | Built-in with recommendations |
| Programmatic | Terminal output | JSON for CI/CD integration |
| Best for | Quick checks, dev workflow | Deep audits, quality tracking |
Recommendation: Use command for daily checks, skill for release gates and quality tracking.
Maintenance
Updating Criteria
Edit scoring/criteria.yaml:
```yaml
agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"
```
Version bump: Increment version in frontmatter when criteria change.
Adding File Types
To support new file types (e.g., "workflows"):
- Add the new type to scoring/criteria.yaml:

  ```yaml
  workflows:
    max_points: 24
    categories: [...]
  ```

- Update detection logic (file path patterns)
- Update report templates
Related
- Command version: .claude/commands/audit-agents-skills.md
- Agent Validation Checklist: guide line 4921 (manual 16 criteria)
- Skill Validation: guide line 5491 (spec documentation)
- Reference templates: examples/agents/, examples/skills/, examples/commands/
Changelog
v1.0.0 (2026-02-07):
- Initial release
- 16-criteria framework (agents/skills/commands)
- 3 audit modes (quick/full/comparative)
- JSON + Markdown output
- Fix suggestions
- Industry context (LangChain 2026 report)
Skill ready for use: audit-agents-skills