# Advanced Evaluation

Production-grade techniques for evaluating LLM outputs using LLMs as judges.
## Evaluation Taxonomy

### Direct Scoring

A single LLM rates one response on a defined scale.

- Best for: objective criteria (factual accuracy, instruction following)
- Reliability: moderate to high for well-defined criteria
- Failure mode: score calibration drift

### Pairwise Comparison

An LLM compares two responses and selects the better one.

- Best for: subjective preferences (tone, style, persuasiveness)
- Reliability: higher than direct scoring for preferences
- Failure modes: position bias, length bias
## The Bias Landscape

| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position responses favored | Swap positions, majority vote |
| Length | Longer = higher rating | Explicit prompting to ignore length |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Detailed explanations favored | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
## Direct Scoring Implementation

```
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{criteria with descriptions and weights}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and a summary.
```
**Critical:** always require the justification *before* the score. This ordering improves scoring reliability by roughly 15-25%.
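The scoring flow above can be sketched as a small harness. Here `judge` is a stand-in for any LLM client call (an assumption; the skill does not fix an interface), and the JSON schema is a simplified version of the template's output format:

```python
import json


def direct_score(judge, prompt, response, criteria, max_score=5):
    """Run a direct-scoring evaluation and return a weighted score.

    `judge` is any callable taking a prompt string and returning the
    model's raw text (placeholder for a real LLM client).
    `criteria` maps criterion name -> (description, weight).
    """
    criteria_text = "\n".join(
        f"- {name} (weight {w}): {desc}" for name, (desc, w) in criteria.items()
    )
    eval_prompt = (
        "You are an expert evaluator assessing response quality.\n"
        f"## Original Prompt\n{prompt}\n"
        f"## Response to Evaluate\n{response}\n"
        f"## Criteria\n{criteria_text}\n"
        # Justification FIRST, then score -- the ordering the skill requires.
        f"For each criterion, give a justification, then a score (1-{max_score}). "
        'Respond with JSON: {"scores": {criterion: {"justification": str, "score": int}}}'
    )
    result = json.loads(judge(eval_prompt))
    total_weight = sum(w for _, w in criteria.values())
    weighted = sum(
        result["scores"][name]["score"] * w for name, (_, w) in criteria.items()
    ) / total_weight
    return weighted, result


# Usage with a stubbed judge (a real client would call an LLM API):
stub = lambda p: json.dumps(
    {"scores": {"accuracy": {"justification": "Cites the spec.", "score": 4},
                "clarity": {"justification": "Well structured.", "score": 5}}}
)
score, detail = direct_score(
    stub, "Explain HTTP caching", "...",
    {"accuracy": ("Factually correct", 2), "clarity": ("Easy to follow", 1)},
)
# score == (4*2 + 5*1) / 3, i.e. about 4.33
```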
## Pairwise Comparison Implementation

**Position bias mitigation protocol:**

- First pass: A in first position, B in second
- Second pass: B in first position, A in second
- Consistency check: if the passes disagree, return TIE
- Final verdict: the consistent winner, with averaged confidence

Include explicit anti-bias instructions in the comparison prompt:

```
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs. second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
```
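The swap protocol can be implemented as a thin wrapper around any comparison call. In this sketch, `judge` is an assumed callable (not a real API) that sees two responses in order and returns a positional verdict:

```python
def pairwise_verdict(judge, prompt, response_a, response_b):
    """Two-pass pairwise comparison with position-swap debiasing.

    `judge(prompt, first, second)` returns "FIRST", "SECOND", or "TIE",
    judging purely by the order in which responses are presented.
    """
    first_pass = judge(prompt, response_a, response_b)   # A first, B second
    second_pass = judge(prompt, response_b, response_a)  # positions swapped

    # Map positional verdicts back to the underlying labels A/B.
    map1 = {"FIRST": "A", "SECOND": "B", "TIE": "TIE"}
    map2 = {"FIRST": "B", "SECOND": "A", "TIE": "TIE"}
    v1, v2 = map1[first_pass], map2[second_pass]

    # Consistency check: disagreement means the judge is position-sensitive
    # on this pair, so fall back to TIE rather than trust either pass.
    return v1 if v1 == v2 else "TIE"


# A stub judge that always favors whatever it sees first is fully neutralized:
always_first = lambda prompt, first, second: "FIRST"
assert pairwise_verdict(always_first, "q", "a", "b") == "TIE"

# A stub judge keyed on content (here, crudely, on length) survives the swap:
prefers_longer = lambda prompt, first, second: (
    "FIRST" if len(first) > len(second) else "SECOND"
)
assert pairwise_verdict(prefers_longer, "q", "long answer", "short") == "A"
```

Note that the protocol only detects position bias; a content-level bias (like the length preference in the second stub) passes through consistently, which is why the anti-bias prompt instructions are still needed.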
## Rubric Generation

Components:

- Level descriptions with clear boundaries
- Observable characteristics for each level
- Examples for each level
- Edge case guidance
- General scoring principles

Strictness levels:

- Lenient: lower bar, encourages iteration
- Balanced: typical production use
- Strict: high-stakes or safety-critical
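A rubric with these components might be represented as plain data; the field names below are illustrative, not a prescribed schema:

```python
# Minimal rubric structure covering the components above (hypothetical
# field names -- adapt to your evaluation harness).
rubric = {
    "criterion": "factual_accuracy",
    "strictness": "balanced",
    "levels": {
        1: {"description": "Multiple major factual errors",
            "observable": "Claims contradicted by the source"},
        3: {"description": "Mostly accurate with minor slips",
            "observable": "Small errors that do not change conclusions"},
        5: {"description": "Fully accurate",
            "observable": "Every claim verifiable against the source"},
    },
    "edge_cases": "Unverifiable claims score no higher than 3.",
    "principles": "Score the response as written, not its potential.",
}
```

Keeping rubrics as data rather than prose makes it easy to swap strictness levels or inject the same rubric into both scoring and pairwise prompts.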
## Decision Framework

```
Is there objective ground truth?
├── Yes → Direct Scoring
│         (factual accuracy, instruction following)
└── No → Is it a preference judgment?
         ├── Yes → Pairwise Comparison
         │         (tone, style, persuasiveness)
         └── No → Reference-based evaluation
                  (summarization, translation)
```
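The tree reduces to a two-question routing function (method names are illustrative labels, not an API):

```python
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    """Route an evaluation task per the decision tree above."""
    if has_ground_truth:
        return "direct_scoring"       # objective criteria
    if is_preference:
        return "pairwise_comparison"  # subjective preferences
    return "reference_based"          # compare against a gold reference


assert choose_method(True, False) == "direct_scoring"
assert choose_method(False, True) == "pairwise_comparison"
assert choose_method(False, False) == "reference_based"
```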
## Scaling Evaluation

| Approach | Use Case | Trade-off |
|---|---|---|
| Panel of LLMs | High-stakes decisions | More expensive, more reliable |
| Hierarchical | Large volumes | Fast screening + careful edge cases |
| Human-in-loop | Critical applications | Best reliability, feedback loop |
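The panel approach from the table can be sketched as a majority vote over independent verdicts. Each element of `judges` stands in for a distinct judge model (an assumed interface, mirroring the pairwise sketch):

```python
from collections import Counter


def panel_verdict(judges, prompt, response_a, response_b):
    """Panel-of-LLMs: aggregate independent verdicts by majority vote.

    `judges` is a list of callables returning "A", "B", or "TIE"
    (placeholders for distinct, ideally diverse, judge models).
    """
    votes = Counter(j(prompt, response_a, response_b) for j in judges)
    winner, count = votes.most_common(1)[0]
    # Require a strict majority; otherwise report no consensus.
    return winner if count > len(judges) / 2 else "TIE"


# Stub panel: two judges pick A, one picks B -> strict majority for A.
panel = [lambda *args: "A", lambda *args: "A", lambda *args: "B"]
assert panel_verdict(panel, "q", "x", "y") == "A"
```

Using an odd number of judges avoids most no-consensus outcomes, at the cost of one more model call per comparison.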
## Guidelines

- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores calibrated to consistency
- Define edge cases explicitly
- Validate against human judgments