Evaluation Skill
Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
Core Insight: The 95% Variance Finding
Research shows 95% of output variance comes from just two sources:
- 80% from prompt tokens (wording, structure, examples)
- 15% from random seed/sampling
Temperature, model version, and other factors account for only 5%.
Implication: Focus evaluation on prompt quality, not model tweaking.
What's Included
Examples (examples/)
- Prompt comparison - A/B testing prompts with rubrics
- Model evaluation - Comparing outputs across models
- Regression testing - Detecting output degradation
Reference Guides (reference/)
- Rubric design - Multi-dimensional evaluation criteria
- LLM-as-judge - Using LLMs to evaluate LLM outputs
- Statistical methods - Handling non-determinism
Templates (templates/)
- Rubric templates - Ready-to-use evaluation criteria
- Judge prompts - LLM-as-judge prompt templates
- Test case format - Structured test case templates
Checklists (checklists/)
- Evaluation setup - Before running evaluations
- Rubric validation - Ensuring rubric quality
Key Concepts
1. Multi-Dimensional Rubrics
Don't use single scores. Break down evaluation into dimensions:
| Dimension | Weight | Criteria |
|---|---|---|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
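A minimal sketch of turning per-dimension scores into a weighted overall score (Python; the weights mirror the table above, and the helper is illustrative rather than part of any shipped tooling):

# Weighted rubric scoring sketch. Weights mirror the table above and are
# assumed to sum to 1.0; scores use the 1-5 scale defined in Step 1.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "conciseness": 0.15,
    "format": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (1-5) into a single weighted score."""
    return sum(WEIGHTS[name] * score for name, score in dimension_scores.items())

# Example: strong on accuracy, weaker on conciseness -> 4.35 overall
print(round(overall_score({
    "accuracy": 5, "completeness": 4, "clarity": 5, "conciseness": 3, "format": 4,
}), 2))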
2. Handling Non-Determinism
LLMs are non-deterministic. Handle with:
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases
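A minimal sketch of Strategy 1, assuming a score_fn(prompt) callable that generates an output and scores it against the rubric (a placeholder, not a real API):

from statistics import mean, stdev

VARIANCE_THRESHOLD = 0.5  # assumed cutoff on the 1-5 scale; tune for your rubric

def evaluate_with_repeats(prompt: str, score_fn, runs: int = 5) -> dict:
    """Run the same prompt several times and summarize score stability.

    score_fn(prompt) -> float stands in for "generate an output and score
    it against the rubric"; it is an assumption, not a real SDK call.
    """
    scores = [score_fn(prompt) for _ in range(runs)]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "scores": scores,
        "mean": mean(scores),
        "stdev": spread,
        "high_variance": spread > VARIANCE_THRESHOLD,  # flag for manual review
    }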
Strategy 2: Seed Control
- Set temperature=0 to minimize sampling variation
- Record the seed (where the API supports one) for debugging
- Accept that some variation is normal even so
Strategy 3: Statistical Significance
- Use paired comparisons
- Require 70%+ win rate for "better"
- Report confidence intervals
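A minimal sketch of the paired-comparison summary from Strategy 3; the 70% threshold, the minimum sample size, and the normal-approximation interval are assumptions:

from math import sqrt

def paired_win_rate(wins: int, losses: int, ties: int = 0) -> dict:
    """Summarize A-vs-B paired comparisons on the same test cases.

    Ties are excluded from the win rate; the interval is a normal
    approximation, which is rough for small n.
    """
    n = wins + losses
    rate = wins / n if n else 0.0
    margin = 1.96 * sqrt(rate * (1 - rate) / n) if n else 0.0  # 95% CI
    return {
        "win_rate": rate,
        "ci_95": (max(0.0, rate - margin), min(1.0, rate + margin)),
        "better": rate >= 0.70 and n >= 20,  # assumed minimum sample size
        "n": n,
        "ties": ties,
    }

# Example: prompt B wins 18 of 25 decisive comparisons (72%)
print(paired_win_rate(wins=18, losses=7, ties=5))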
3. LLM-as-Judge Pattern
Use a judge LLM to evaluate outputs:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│  Judge LLM  │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │    Score    │
                                        └─────────────┘
Best Practice: Use a stronger model as the judge (e.g., Opus judging Sonnet outputs).
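A sketch of the judge step. call_llm is a placeholder for whatever client you use to reach the judge model, and the prompt layout is illustrative; see reference/llm-as-judge-guide.md and templates/ for the maintained versions:

import json

JUDGE_PROMPT = """You are an evaluation judge. Score the output against the rubric.

Rubric:
{rubric}

Output to evaluate:
{output}

Respond with JSON: {{"scores": {{"<dimension>": 1-5, ...}}, "rationale": "..."}}"""

def judge(output: str, rubric: str, call_llm) -> dict:
    """Ask a (stronger) judge model to score one output against the rubric.

    call_llm(prompt: str) -> str is a placeholder for your own client call
    to the judge model; it is not a real SDK function.
    """
    response = call_llm(JUDGE_PROMPT.format(rubric=rubric, output=output))
    return json.loads(response)  # in practice, validate and retry on malformed JSON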
4. Test Case Design
Structure test cases with:
interface TestCase {
  id: string
  input: string              // User message or context
  expectedBehavior: string   // What output should do
  rubric: RubricItem[]       // Evaluation criteria
  groundTruth?: string       // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}
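For a Python harness, a dataclass mirror of the same structure (snake_case to match the YAML test cases below; Python 3.10+ type syntax) might look like:

from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str
    weight: float
    criteria: dict[int, str]          # score (1-5) -> description

@dataclass
class TestCase:
    id: str
    input: str                        # User message or context
    expected_behavior: str            # What output should do
    rubric: list[RubricItem]          # Evaluation criteria
    metadata: dict                    # category, difficulty, createdAt
    ground_truth: str | None = None   # Optional gold standard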
Evaluation Workflow
Step 1: Define Rubric
rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors, doesn't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
Step 2: Create Test Cases
test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }
Step 3: Run Evaluation
# Run test suite
python evaluate.py --suite code-generation --runs 3
# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation                 │
# │ Total: 50 | Pass: 47 | Fail: 3              │
# │ Accuracy: 94% (±2.1%)                       │
# │ Avg Score: 4.2/5.0                          │
# └─────────────────────────────────────────────┘
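evaluate.py's internals aren't shown here; a skeleton of the loop behind --suite and --runs might look like the following, with generate and score as placeholders for the model call and rubric/judge scoring (PyYAML assumed):

import statistics
import yaml  # PyYAML, assumed to be installed

def run_suite(suite_path: str, runs: int, generate, score) -> dict:
    """Skeleton of a suite run: each case is generated and scored `runs` times.

    generate(input) -> output and score(output, case) -> float are
    placeholders for your model call and scoring step.
    """
    with open(suite_path) as f:
        cases = yaml.safe_load(f)["test_cases"]
    results = []
    for case in cases:
        scores = [score(generate(case["input"]), case) for _ in range(runs)]
        results.append({"id": case["id"], "mean": statistics.mean(scores)})
    passed = sum(r["mean"] >= 4.0 for r in results)  # assumed pass bar
    return {
        "total": len(results),
        "pass": passed,
        "avg_score": statistics.mean(r["mean"] for r in results),
    }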
Step 4: Analyze Results
Look for:
- Low-scoring dimensions - Target for improvement
- High-variance cases - Prompt needs clarification
- Regression from baseline - Investigate changes
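A small sketch of the regression check: compare per-dimension means against a stored baseline and flag anything that dropped by more than an assumed tolerance:

TOLERANCE = 0.2  # assumed allowable drop on the 1-5 scale

def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return dimensions whose mean score dropped by more than TOLERANCE."""
    return [
        dim for dim, base in baseline.items()
        if current.get(dim, 0.0) < base - TOLERANCE
    ]

# Example: clarity regressed, accuracy held
print(find_regressions(
    baseline={"accuracy": 4.6, "clarity": 4.4},
    current={"accuracy": 4.7, "clarity": 4.0},
))  # -> ['clarity']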
Grey Haven Integration
With TDD Workflow
1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
With Pipeline Architecture
acquire → prepare → process → parse → render → EVALUATE
                                                   │
                                           ┌───────┴───────┐
                                           │  Compare to   │
                                           │  ground truth │
                                           │  or rubric    │
                                           └───────────────┘
With Prompt Engineering
Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓
Use This Skill When
- Testing new prompts before production
- Comparing prompt variations (A/B testing)
- Validating model outputs meet quality bar
- Detecting regressions after changes
- Building evaluation datasets
- Implementing automated quality gates
Related Skills
- prompt-engineering - Improve prompts based on evaluation
- testing-strategy - Overall testing approaches
- llm-project-development - Pipeline with evaluation stage
Quick Start
# Design your rubric
cat templates/rubric-template.yaml
# Create test cases
cat templates/test-case-template.yaml
# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md
# Review the evaluation setup checklist
cat checklists/evaluation-setup-checklist.md
Skill Version: 1.0
Key Finding: 95% variance from prompts (80%) + sampling (15%)
Last Updated: 2025-01-15