llm-as-judge
LLM-as-Judge
Overview
Some quality criteria are inherently subjective — tone of voice, visual aesthetics, UX feel, documentation clarity, code readability. These cannot be verified by deterministic tests. The LLM-as-judge pattern provides structured, repeatable evaluation using an LLM reviewer with defined rubrics, ensuring subjective quality is measured consistently.
Announce at start: "I'm using the llm-as-judge skill to evaluate subjective quality."
Phase 1: Determine Evaluation Method
Goal: Decide whether LLM-as-judge is the right tool.
Decision Table: LLM-as-Judge vs Deterministic Tests
| Criterion Type | Method | Example |
|---|---|---|
| Objective, measurable | Deterministic test | "Response time < 200ms" |
| Structural, verifiable | Deterministic test | "Returns valid JSON" |
| Subjective, qualitative | LLM-as-judge | "Error messages are friendly and helpful" |
| Aesthetic, perceptual | LLM-as-judge | "UI feels clean and modern" |
| Linguistic, tonal | LLM-as-judge | "Documentation is clear for beginners" |
| Holistic, experiential | LLM-as-judge | "The onboarding flow feels intuitive" |
Rule of thumb: If you can write a boolean assertion, use a deterministic test. If evaluation requires judgment, use LLM-as-judge.
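As a rough illustration of the rule of thumb, the first two rows of the decision table reduce to boolean assertions. The helper names below are hypothetical and exist only for this sketch:

```python
import json


def returns_valid_json(body: str) -> bool:
    """Structural criterion: the response body parses as JSON."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


def within_latency_budget(elapsed_ms: float, budget_ms: float = 200.0) -> bool:
    """Objective criterion: response time is under the budget."""
    return elapsed_ms < budget_ms


# Both criteria reduce to a boolean, so no judge is needed.
assert returns_valid_json('{"status": "ok"}')
assert within_latency_budget(150.0)
```

A criterion such as "error messages are friendly and helpful" has no equivalent assertion, which is the signal to move to LLM-as-judge.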
STOP — Do NOT proceed to Phase 2 until:
- Confirmed that criteria are genuinely subjective
- Deterministic testing has been ruled out
- Specific artifacts to evaluate are identified
Phase 2: Define Rubric
Goal: Create evaluation dimensions with weights and anchor points.
Actions
- Define 3-5 evaluation dimensions
- Assign weights (must sum to 1.0)
- Define anchor points for each dimension (1=worst, 5=adequate, 10=best)
- Set passing threshold
Rubric Structure
| Dimension | Weight | Scale | Anchor: 1 | Anchor: 5 | Anchor: 10 |
|---|---|---|---|---|---|
| [Name] | 0.X | 1-10 | [Worst case] | [Adequate] | [Excellent] |
Threshold Selection Table
| Quality Level | Threshold | Use For |
|---|---|---|
| Minimum viable | 5.0 | Internal docs, draft content |
| Production quality | 7.0 | User-facing content, public APIs |
| Excellence | 8.5 | Marketing, critical UX flows |
STOP — Do NOT proceed to Phase 3 until:
- 3-5 dimensions are defined with clear descriptions
- Weights sum to exactly 1.0
- Anchor points are specific (not vague)
- Passing threshold is set before evaluation
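Most of the Phase 2 checklist can be checked mechanically before any evaluation runs. The sketch below assumes a hypothetical `Dimension` type and `rubric_problems` helper (neither is part of the skill itself). It cannot judge whether anchors are specific rather than vague, only whether they are present:

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float
    anchors: Dict[int, str]  # anchor descriptions keyed by score (1, 5, 10)


def rubric_problems(dimensions: List[Dimension], threshold: Optional[float]) -> List[str]:
    """Return a list of checklist violations; an empty list means the rubric is ready."""
    problems = []
    if not 3 <= len(dimensions) <= 5:
        problems.append(f"expected 3-5 dimensions, got {len(dimensions)}")
    total = sum(d.weight for d in dimensions)
    # Compare with a tolerance because float weights rarely sum to exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")
    for d in dimensions:
        missing = {1, 5, 10} - set(d.anchors)
        if missing:
            problems.append(f"{d.name}: missing anchors for {sorted(missing)}")
    if threshold is None:
        problems.append("passing threshold must be set before evaluation")
    return problems
```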
Phase 3: Evaluate
Goal: Submit the artifact with rubric to an LLM reviewer.
Review Request Structure
{
  criteria: "Description of what to evaluate and the quality standard",
  artifact: "The content to be evaluated (code, text, UI markup, etc.)",
  rubric: [
    { dimension: "Clarity", weight: 0.3, description: "Is the content easy to understand?" },
    { dimension: "Tone", weight: 0.3, description: "Is the tone appropriate for the audience?" },
    { dimension: "Completeness", weight: 0.2, description: "Does it cover all necessary points?" },
    { dimension: "Engagement", weight: 0.2, description: "Does it hold the reader's interest?" }
  ],
  passing_threshold: 7.0,
  intelligence: "opus"
}
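One way such a request might be turned into reviewer instructions is sketched below. The function name `build_review_prompt` and the prompt wording are assumptions made for illustration. The skill does not prescribe a prompt format, and the call to the reviewer model is deliberately left out:

```python
def build_review_prompt(request: dict) -> str:
    """Render a review request (same fields as the structure above) as plain text."""
    rubric_lines = [
        f"- {d['dimension']} (weight {d['weight']}): {d['description']}"
        for d in request["rubric"]
    ]
    return "\n".join([
        "You are reviewing an artifact against a rubric.",
        f"Criteria: {request['criteria']}",
        "Score each dimension independently from 1 to 10 and explain each score.",
        "Rubric:",
        *rubric_lines,
        f"Passing threshold: {request['passing_threshold']}",
        "Artifact:",
        request["artifact"],
    ])
```

Placing the artifact last keeps the rubric and scoring instructions together at the top, which is a layout choice of this sketch rather than a requirement.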
Review Response Structure
{
  scores: [
    { dimension: "Clarity", score: 8, reasoning: "Well-structured with clear headings..." },
    { dimension: "Tone", score: 7, reasoning: "Professional but occasionally too formal..." },
    { dimension: "Completeness", score: 9, reasoning: "Covers all key topics..." },
    { dimension: "Engagement", score: 6, reasoning: "Could use more examples..." }
  ],
  weighted_score: 7.5,
  pass: true,
  summary: "Overall good quality with minor tone and engagement improvements suggested.",
  suggestions: [
    "Add a real-world example in section 3",
    "Use more conversational language in the introduction"
  ]
}
STOP — Do NOT proceed to Phase 4 until:
- Artifact has been submitted with full rubric
- Each dimension has been scored independently
- Reasoning is provided for every score
- Weighted total is calculated
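The weighted total is the sum of each dimension's score multiplied by its weight. A minimal sketch, using the example rubric weights and per-dimension scores from this phase (the function name `weighted_score` is hypothetical):

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of score * weight over all dimensions, rounded to two decimals."""
    if scores.keys() != weights.keys():
        raise ValueError("every rubric dimension needs exactly one score")
    return round(sum(scores[name] * weights[name] for name in weights), 2)


weights = {"Clarity": 0.3, "Tone": 0.3, "Completeness": 0.2, "Engagement": 0.2}
scores = {"Clarity": 8, "Tone": 7, "Completeness": 9, "Engagement": 6}

# 8*0.3 + 7*0.3 + 9*0.2 + 6*0.2 = 2.4 + 2.1 + 1.8 + 1.2 = 7.5
total = weighted_score(scores, weights)
passed = total >= 7.0  # compare against the threshold fixed in Phase 2
```

Computing the total in code rather than trusting the reviewer's own arithmetic guards against a model reporting a weighted score that does not match its per-dimension scores.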
Phase 4: Iterate or Accept
Goal: Act on the evaluation results.
Result Action Table
| Result | Action | Max Iterations |
|---|---|---|
| Pass (score >= threshold) | Accept the artifact, proceed | Done |
| Marginal fail (within 1 point of threshold) | Apply suggestions, re-evaluate once | 1 |
| Clear fail (more than 1 point below threshold) | Revise significantly, applying all suggestions | 2 |
| Repeated fail (3+ attempts) | Escalate — rubric or approach may need adjustment | Escalate |
STOP — Evaluation complete when:
- Artifact passes threshold, OR
- 3 iterations completed and escalation decision made
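The result action table can be read as a small decision function. This sketch assumes that a pass always wins and that the escalation check comes before the fail-size check. The table does not state an ordering, so treat that as one interpretation. `Action` and `next_action` are hypothetical names:

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept the artifact"
    REVISE_MINOR = "apply suggestions, re-evaluate once"
    REVISE_MAJOR = "significant revision, apply all suggestions"
    ESCALATE = "escalate: rubric or approach may need adjustment"


def next_action(score: float, threshold: float, attempts: int) -> Action:
    """attempts counts evaluations completed so far, including this one."""
    if score >= threshold:
        return Action.ACCEPT
    if attempts >= 3:
        return Action.ESCALATE
    if threshold - score <= 1.0:  # marginal fail: within 1 point of threshold
        return Action.REVISE_MINOR
    return Action.REVISE_MAJOR  # clear fail: more than 1 point below
```

For example, a score of 6.5 against a threshold of 7.0 on the first attempt yields `Action.REVISE_MINOR`, while the same score on the third attempt yields `Action.ESCALATE`.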
Common Rubric Templates
Documentation Quality
Clarity (0.3): Is the content easy to understand for the target audience?
1=incomprehensible 5=adequate but requires re-reading 10=crystal clear
Accuracy (0.3): Is the information technically correct?
1=factually wrong 5=mostly correct 10=perfectly accurate
Completeness (0.2): Does it cover all necessary topics?
1=missing critical info 5=covers basics 10=comprehensive
Examples (0.2): Are there sufficient, relevant code examples?
1=no examples 5=some examples 10=rich, varied examples
Threshold: 7.0
Error Message Quality
Helpfulness (0.4): Does the message help the user fix the problem?
1=no help at all 5=vague direction 10=exact fix steps
Clarity (0.3): Is the message easy to understand?
1=cryptic 5=understandable 10=immediately clear
Tone (0.2): Is the tone empathetic and non-blaming?
1=hostile/blaming 5=neutral 10=empathetic and supportive
Actionability (0.1): Does it suggest a concrete next step?
1=no suggestion 5=vague suggestion 10=specific actionable step
Threshold: 7.5
Code Readability
Naming (0.3): Are variable/function names descriptive and consistent?
1=single letters everywhere 5=adequate 10=self-documenting
Structure (0.3): Is the code logically organized?
1=spaghetti 5=functional 10=elegant and clear
Simplicity (0.2): Is the code as simple as possible (but not simpler)?
1=over-engineered 5=reasonable 10=minimal and clear
Documentation (0.2): Are complex sections adequately commented?
1=no comments where needed 5=some comments 10=comments explain the why
Threshold: 7.0
UX Copy
Clarity (0.3): Is the copy easy to understand?
1=confusing 5=understandable 10=immediately clear
Brevity (0.2): Is it concise without losing meaning?
1=verbose 5=adequate length 10=perfectly concise
Tone (0.2): Does it match the brand voice?
1=off-brand 5=neutral 10=perfectly on-brand
Actionability (0.2): Do CTAs clearly communicate what happens next?
1=unclear 5=adequate 10=crystal clear action
Accessibility (0.1): Is the language inclusive and jargon-free?
1=exclusionary 5=neutral 10=fully inclusive
Threshold: 7.5
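The four templates above can be kept as plain data so their weights are checked whenever they are edited. The layout below (a dict of weights plus a threshold) is an assumed representation chosen for this sketch, with anchor text omitted for brevity:

```python
import math
from typing import Dict, List, Tuple

# name -> (dimension weights, passing threshold), copied from the templates above
TEMPLATES: Dict[str, Tuple[Dict[str, float], float]] = {
    "Documentation Quality": (
        {"Clarity": 0.3, "Accuracy": 0.3, "Completeness": 0.2, "Examples": 0.2}, 7.0),
    "Error Message Quality": (
        {"Helpfulness": 0.4, "Clarity": 0.3, "Tone": 0.2, "Actionability": 0.1}, 7.5),
    "Code Readability": (
        {"Naming": 0.3, "Structure": 0.3, "Simplicity": 0.2, "Documentation": 0.2}, 7.0),
    "UX Copy": (
        {"Clarity": 0.3, "Brevity": 0.2, "Tone": 0.2, "Actionability": 0.2,
         "Accessibility": 0.1}, 7.5),
}


def invalid_templates(templates: Dict[str, Tuple[Dict[str, float], float]]) -> List[str]:
    """Names of templates whose weights do not sum to 1.0 or that lack 3-5 dimensions."""
    return [
        name
        for name, (weights, _threshold) in templates.items()
        if not math.isclose(sum(weights.values()), 1.0) or not 3 <= len(weights) <= 5
    ]
```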
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Using LLM-as-judge for measurable criteria | Wastes tokens, less reliable than assertions | Use deterministic tests for anything quantifiable |
| Vague rubric dimensions ("is it good?") | Produces unreliable, inconsistent scores | Specific dimensions with anchored examples |
| No passing threshold defined | No way to determine pass/fail objectively | Always set threshold before evaluation |
| Adjusting rubric to pass failing content | Defeats the purpose of quality gates | Fix the content, not the rubric |
| Single evaluation without reasoning | Cannot improve without understanding why | Always require per-dimension reasoning |
| Using a weaker model for evaluation | Produces lower-quality judgments | Use the strongest available model (Opus) |
| Skipping re-evaluation after changes | No verification that changes improved quality | Always re-evaluate after revisions |
Integration Points
| Skill | Relationship |
|---|---|
| acceptance-testing | LLM-as-judge handles subjective acceptance criteria |
| spec-writing | Specs may include subjective quality criteria |
| code-review | Readability evaluation during code review |
| verification-before-completion | Subjective validation gate before completion |
| senior-prompt-engineer | Prompt quality evaluation uses LLM-as-judge |
| tech-docs-generator | Documentation quality evaluation |
Downstream Steering Pattern
+----------+     +----------+     +----------+     +----------+
|  SPECS   |---->|   CODE   |---->|  TESTS   |---->| LLM-AS-  |
|          |     |          |     |(determin)|     |  JUDGE   |
|          |     |          |     |          |     |(subject) |
+----------+     +----------+     +----------+     +----+-----+
                     ^                                  |
                     |           backpressure           |
                     +----------------------------------+
Deterministic tests validate objective criteria. LLM-as-judge validates subjective criteria. Both must pass.
Skill Type
FLEXIBLE — Adapt rubric dimensions and thresholds to context. The pattern structure (define rubric, evaluate, score, iterate) is fixed. Always set the threshold before evaluation, never after.