Evaluate

Comprehensive quality evaluation for any AI-generated artifact. Produces its report as a visualization.

How It Works

┌──────────────────────────────────────────────┐
│                                              │
│  Phase 1: SPEC GENERATION                    │
│  Analyze the artifact type                   │
│  Generate tailored evaluation criteria       │
│  Define scoring dimensions + weights         │
│  Set quality gates                           │
│           │                                  │
│           ▼                                  │
│  Phase 2: EVALUATION                         │
│  Run automated checks (when possible)        │
│  Visual/manual inspection                    │
│  Score each dimension with evidence          │
│  Identify systemic vs local issues           │
│           │                                  │
│           ▼                                  │
│  Phase 3: REPORT (via /visualize)            │
│  Generate a beautiful HTML eval report       │
│  Scores, charts, screenshots, fix list       │
│  Radar chart of dimensions                   │
│  Before/after tracking                       │
│                                              │
└──────────────────────────────────────────────┘

Phase 1: Spec Generation

For any artifact, generate evaluation specs by analyzing:

1. Identify Artifact Type

HTML Visualization → visual design, interactivity, technical, content, shareability
Code/Project → correctness, readability, architecture, test coverage, performance
Document/Report → clarity, structure, accuracy, completeness, tone
Conversation/Agent → helpfulness, accuracy, tone, efficiency, safety
Slide Deck → all visualization dims + narrative flow, persuasion, pacing
Dashboard → data accuracy, information density, scannability, actionability
Custom → derive dimensions from the skill's SKILL.md and stated goals

2. Generate Dimensions

For each artifact type, produce 6-10 evaluation dimensions. Each dimension needs:

Name — short, clear label
Description — what this dimension measures
Weight — percentage (all weights sum to 100%)
Scoring anchors — what does a 10, 8, 6, 4 look like?
Automated checks — any programmatic tests (if applicable)
Deductions — specific issues and their point costs

3. Set Quality Gates

Define gates based on the artifact's purpose:

Gate	Criteria	Meaning
🚀 EXCEPTIONAL	Overall ≥ 9.5, all ≥ 9	Best-in-class. Share everywhere.
✅ SHIP	Overall ≥ 9.0, all ≥ 8	Production-ready.
⚠️ ACCEPTABLE	Overall ≥ 8.0, all ≥ 7	Usable but not impressive.
🔧 NEEDS WORK	Overall ≥ 7.0 or any < 7	Fix before releasing.
❌ FAIL	Overall < 7.0 or any < 5	Major rework.

4. Output Spec Document

Write the spec to eval-spec-[artifact-name].md for reference and reuse.

Phase 2: Evaluation

For HTML Visualizations

Open in browser at 3 viewports (1280×720, 768×1024, 375×667).

Automated audit (run in browser console):

(function() {
  const audit = {};
  const style = [...document.querySelectorAll('style')].map(s => s.textContent).join(' ');
  const html = document.documentElement.outerHTML;

  // Structure
  audit.hasDoctype = /^<!doctype html>/i.test(html);
  audit.hasLangAttr = !!document.documentElement.lang;
  audit.hasCharset = !!document.querySelector('meta[charset]');
  audit.hasViewport = !!document.querySelector('meta[name="viewport"]');
  audit.hasTitle = document.title.length > 0;

  // Menu system
  audit.menuExists = !!document.querySelector('.viz-menu');
  audit.menuHasTheme = !!html.match(/cycleTheme|themeLabel/i);
  audit.menuHasDownload = !!html.match(/htmlToImage|html-to-image/i);
  audit.menuHasPrint = !!html.match(/window\.print/i);

  // Theme system
  audit.hasCSSVars = !!style.match(/--bg\s*:/);
  audit.hasDarkTheme = !!style.match(/(\.theme-dark|:root)[\s\S]*?--bg/);
  audit.hasLightTheme = !!style.match(/\.theme-light/);
  audit.themePersistedToStorage = !!html.match(/localStorage.*theme/i);

  // Typography
  audit.hasInterFont = !!html.match(/fonts\.googleapis.*Inter|font-family.*Inter/i);
  audit.hasFontFallback = !!style.match(/-apple-system|system-ui/);
  audit.bodyFontSize = parseFloat(getComputedStyle(document.body).fontSize);
  audit.bodyFontOK = audit.bodyFontSize >= 14;

  // Layout
  audit.usesFlexOrGrid = !!(style.match(/display\s*:\s*(flex|grid)/));
  audit.hasMaxWidth = !!style.match(/max-width/);
  audit.hasResponsiveBreakpoints = !!style.match(/@media.*max-width|@media.*min-width|sm:|md:|lg:/);

  // Print & Accessibility
  audit.hasPrintStyles = !!style.match(/@media\s*print/);
  audit.hasPrintColorAdjust = !!style.match(/print-color-adjust/);
  audit.hasReducedMotion = !!style.match(/prefers-reduced-motion/);
  audit.hasAriaLabels = !!html.match(/aria-label/);
  audit.hasSemanticHTML = !!html.match(/<(header|main|nav|section|article|footer)/);

  // Animations
  audit.hasKeyframes = !!style.match(/@keyframes/);
  audit.hasTransitions = !!style.match(/transition\s*:/);

  // Performance
  audit.fileSizeKB = Math.round(new Blob([html]).size / 1024);
  audit.fileSizeOK = audit.fileSizeKB < 200;
  audit.noExternalImages = document.querySelectorAll('img[src^="http"]').length === 0;
  audit.htmlToImageLoaded = typeof htmlToImage !== 'undefined';

  // Summary
  const bools = Object.entries(audit).filter(([k,v]) => typeof v === 'boolean');
  const passed = bools.filter(([k,v]) => v).length;
  audit._passed = passed;
  audit._total = bools.length;
  audit._percent = Math.round(passed / bools.length * 100);
  audit._failures = bools.filter(([k,v]) => !v).map(([k]) => k);

  console.table(audit);
  return audit;
})();

Visual scoring — 8 dimensions for visualizations:

#	Dimension	Weight	10 =	6 =
D1	First Impression	15%	Apple keynote quality	Generic template feel
D2	Typography	15%	Perfect hierarchy, Inter font, fluid sizing	All same size, no hierarchy
D3	Color & Contrast	10%	Harmonious, WCAG AA, both themes beautiful	Clashing, low contrast
D4	Layout & Spacing	15%	Consistent rhythm, responsive, generous space	Cramped, broken at mobile
D5	Content Quality	15%	Clear message in 5 seconds, zero filler	Confusing, placeholder text
D6	Interactivity	10%	Menu + theme + download + print all flawless	Missing features, broken
D7	Technical	10%	Zero errors, semantic, accessible, print-ready	Console errors, broken layout
D8	Shareability	10%	Would tweet this unprompted	Worse than Canva

For Code/Projects

Dimensions: Correctness, Readability, Architecture, Error Handling, Performance, Testing, Documentation, Security

For Documents

Dimensions: Clarity, Structure, Accuracy, Completeness, Tone, Formatting, Actionability, Brevity

For Agent Conversations

Dimensions: Helpfulness, Accuracy, Tone, Efficiency, Safety, Context Awareness, Tool Usage, Follow-through

Phase 3: Visual Report (via /visualize)

After scoring, generate the eval report as a beautiful HTML dashboard using the visualize skill:

Report Structure

Hero — artifact name, overall score (big number), quality gate badge
Radar Chart — all dimensions plotted on a radar/spider chart (Chart.js)
Dimension Cards — each dimension as a card with score, bar, key notes
Automated Audit — pass/fail checklist with percentages
Screenshots — key views embedded (if HTML artifact)
Fix List — prioritized fixes as a kanban-style layout (critical / high / medium / low)
Systemic Issues — patterns that affect all outputs (flagged for SKILL.md fixes)
History — if re-evaluating, show before/after score comparison chart

Report Filename

eval-report-[artifact-name]-[date].html

The report itself must score ≥ 9.0 on the visualize eval criteria.

This is the ultimate dogfood test — our evaluation tool produces evaluations using our visualization tool.

The Improvement Loop

Generate artifact (any skill)
       ↓
/evaluate → Spec + Score + Visual Report
       ↓
Review report → identify fixes
       ↓
Fix (systemic → SKILL.md, local → artifact)
       ↓
/evaluate again → compare scores
       ↓
Ship when gate = SHIP or EXCEPTIONAL

Max 3 loops per artifact. If it can't reach SHIP in 3 loops, the problem is in the skill — update the skill's instructions, not the artifact.

Quick Start

# Evaluate a visualization
/evaluate path/to/visualization.html

# Evaluate with custom context
/evaluate path/to/code-project --type code

# Re-evaluate after fixes (tracks improvement)
/evaluate path/to/visualization.html --loop 2

# Generate specs only (no scoring)
/evaluate --specs-only --type dashboard

evaluate