Evaluation Rubrics
Table of Contents
- Example
- Workflow
- Common Patterns
- Guardrails
- Quick Reference
Example
Scenario: Evaluating technical blog posts (1-5 scale)
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|---|---|---|---|
| Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous |
| Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized |
| Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable |
| Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring (criterion order: Accuracy, Clarity, Practical Value, Originality): Post A [4, 5, 3, 2] = 3.5 avg. Post B [5, 4, 5, 4] = 4.5 avg. Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2)."
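As a quick sketch of how these scores roll up, the snippet below recomputes the averages for Post A and Post B. It is a minimal illustration: the criterion list mirrors the table above, and the function name is made up for this example.

```python
# Minimal scoring sketch for the blog-post rubric above.
# Criterion order matches the table; each score list maps to that order.
CRITERIA = ["Technical Accuracy", "Clarity", "Practical Value", "Originality"]

def summarize(scores, criteria=CRITERIA):
    """Return the unweighted average and a per-criterion breakdown."""
    assert len(scores) == len(criteria), "one score per criterion"
    average = sum(scores) / len(scores)
    breakdown = ", ".join(f"{name}: {score}" for name, score in zip(criteria, scores))
    return average, breakdown

for label, scores in [("Post A", [4, 5, 3, 2]), ("Post B", [5, 4, 5, 4])]:
    average, breakdown = summarize(scores)
    print(f"{label}: {average:.1f} avg ({breakdown})")
# Post A: 3.5 avg (...)   Post B: 4.5 avg (...)
```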
Workflow
Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write observable description of what that performance looks like. See resources/template.md for writing guidelines.
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
Common Patterns
Pattern 1: Analytic Rubric (Most Common)
- Structure: Multiple criteria (rows), multiple levels (columns), descriptor for each cell
- Use case: Detailed feedback needed, want to see performance across dimensions, diagnostic assessment
- Pros: Specific feedback, identifies strengths/weaknesses by criterion, high reliability
- Cons: Time-consuming to create and use, can feel reductive
- Example: Code review rubric (Correctness, Efficiency, Readability, Maintainability × 1-5 scale); a data-structure sketch follows below
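One convenient way to keep an analytic rubric is one descriptor per criterion × level cell. A minimal sketch, with hypothetical criteria and descriptors loosely following the code review example (only two criteria and three anchor levels shown for brevity):

```python
# Hypothetical analytic rubric: criterion -> level -> observable descriptor.
code_review_rubric = {
    "Correctness": {
        1: "Fails core requirements; multiple reproducible bugs",
        3: "Handles main cases; edge cases missed or untested",
        5: "Meets all requirements; edge cases handled and tested",
    },
    "Readability": {
        1: "Unclear names, no structure, no comments",
        3: "Mostly clear; naming and structure inconsistent in places",
        5: "Clear names, small focused functions, helpful comments",
    },
}

def descriptor(rubric, criterion, level):
    """Look up the cell a reviewer compares the work against."""
    return rubric[criterion][level]

print(descriptor(code_review_rubric, "Readability", 3))
```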
Pattern 2: Holistic Rubric
- Structure: Single overall score, descriptors integrate multiple criteria
- Use case: Quick overall judgment, summative assessment, criteria hard to separate
- Pros: Fast, intuitive, captures gestalt quality
- Cons: Less actionable feedback, lower reliability, can't diagnose specific weaknesses
- Example: Essay holistic scoring (1=poor essay, 3=adequate essay, 5=excellent essay with detailed descriptors)
Pattern 3: Single-Point Rubric
- Structure: Criteria listed with only "meets standard" descriptor, space to note above/below
- Use case: Growth mindset feedback, encourage self-assessment, less punitive feel
- Pros: Emphasizes improvement not deficit, simpler to create, encourages dialogue
- Cons: Less precision, requires written feedback to supplement
- Example: Design critique (list criteria like "Visual hierarchy", "Accessibility", note "+Clear focal point, -Poor contrast")
Pattern 4: Checklist (Binary)
- Structure: List of yes/no items, must-haves for acceptance
- Use case: Compliance checks, minimum quality gates, pass/fail decisions
- Pros: Very clear, objective, easy to use
- Cons: No gradations, misses quality beyond basics, can feel rigid
- Example: Pull request checklist (Tests pass? Code linted? Documentation updated? Security review?); a small gate sketch follows below
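A binary checklist reduces to an all-items-must-pass gate. A small sketch, with hypothetical pull-request items:

```python
# Hypothetical binary checklist: every item must be true to pass the gate.
CHECKLIST = ["tests pass", "code linted", "documentation updated", "security reviewed"]

def gate(results):
    """Return (passed, missing): pass only if every checklist item is satisfied."""
    missing = [item for item in CHECKLIST if not results.get(item, False)]
    return len(missing) == 0, missing

passed, missing = gate({"tests pass": True, "code linted": True})
print(passed, missing)  # False ['documentation updated', 'security reviewed']
```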
Pattern 5: Standards-Based Rubric
- Structure: Criteria tied to learning objectives/competencies, levels = degree of mastery
- Use case: Educational assessment, skill certification, training evaluation, criterion-referenced
- Pros: Aligned to standards, shows progress toward mastery, diagnostic
- Cons: Requires clear standards, can be complex to design
- Example: Data science skills (Proficiency in: Data cleaning, Modeling, Visualization, Communication × Novice/Competent/Expert)
Guardrails
- Criteria should be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Test: Can two independent reviewers score this criterion consistently?
- Descriptors should distinguish levels clearly: Each level needs concrete differences from adjacent levels. Avoid "5=very good, 4=good, 3=okay". Better: "5=zero bugs, meets all requirements; 4=1-2 minor bugs, meets 90% of requirements."
- Use appropriate scale granularity: 1-3 is usually too coarse, 1-10 usually too fine. Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows a neutral middle). Match granularity to actual observable differences.
- Balance comprehensiveness with simplicity: Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
- Calibrate for inter-rater reliability: Have multiple reviewers score the same work and measure agreement (Kappa, ICC). If agreement is <70%, refine descriptors.
- Provide examples at each level: Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
- Share the rubric before evaluation: If evaluatees see the rubric only after being scored, it is grading, not guidance. Share it upfront so people know expectations and can self-assess.
- Weight criteria appropriately: If "Security" matters more than "Code style", weight it (Security ×3, Style ×1), or use thresholds (score ≥4 on Security to pass, regardless of other scores); see the sketch after this list.
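A sketch of the weighting-plus-threshold idea above; the weights and the Security ≥ 4 gate are illustrative assumptions, not recommended values:

```python
# Weighted average with a hard threshold on one criterion (values are assumptions).
WEIGHTS = {"Security": 3, "Correctness": 2, "Code style": 1}

def weighted_score(scores, weights=WEIGHTS, gate=("Security", 4)):
    """Return (overall, passed): weighted average plus a single pass/fail gate."""
    total_weight = sum(weights.values())
    overall = sum(scores[c] * w for c, w in weights.items()) / total_weight
    criterion, minimum = gate
    return overall, scores[criterion] >= minimum

overall, passed = weighted_score({"Security": 3, "Correctness": 5, "Code style": 4})
print(f"overall {overall:.2f}, gate passed: {passed}")  # overall 3.83, gate passed: False
```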
Common pitfalls:
- ❌ Subjective language: "Shows effort", "creative", "professional" - not observable without concrete descriptors
- ❌ Overlapping criteria: "Clarity" and "Organization" often conflated - define boundaries clearly
- ❌ Hidden expectations: Rubric doesn't mention X, but evaluators penalize for missing X - document all criteria
- ❌ Central tendency bias: Reviewers avoid extremes (always score 3/5) - use even-number scales (1-4) to force choice
- ❌ Halo effect: High score on one criterion biases other scores up - score each criterion independently before looking at others
- ❌ Rubric drift: Descriptors erode over time, reviewers interpret differently - periodic re-calibration required
Quick Reference
Key resources:
- resources/template.md: Purpose definition, criteria brainstorming, scale selection, descriptor templates, rubric formats
- resources/methodology.md: Scale design principles, descriptor writing techniques, inter-rater reliability testing, bias mitigation
- resources/evaluators/rubric_evaluation_rubrics.json: Quality criteria for rubric design (criteria clarity, scale appropriateness, descriptor specificity)
Scale Selection Guide:
| Scale | Use When | Pros | Cons |
|---|---|---|---|
| 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback |
| 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary |
| 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) |
| 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels |
| Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate |
| Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
- Product criteria: Evaluate the artifact itself (correctness, clarity, completeness, aesthetics, performance)
- Process criteria: How work was done (methodology followed, collaboration, iteration, time management)
- Impact criteria: Outcomes/effects (user satisfaction, business value, learning achieved)
- Meta criteria: Quality of quality (documentation, testability, maintainability, scalability)
Inter-Rater Reliability Benchmarks:
- <50% agreement: Rubric unreliable, needs major revision
- 50-70% agreement: Marginal, refine descriptors and calibrate reviewers
- 70-85% agreement: Good, acceptable for most uses
- >85% agreement: Excellent, highly reliable scoring
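The percentages above are agreement rates between reviewers scoring the same work. Below is a minimal sketch for computing exact agreement and Cohen's kappa (which corrects for chance agreement); the sample ratings are made up.

```python
from collections import Counter

# Two reviewers' scores on the same eight artifacts (made-up data).
rater_a = [3, 4, 4, 2, 5, 3, 4, 3]
rater_b = [3, 4, 3, 2, 5, 4, 4, 3]

def percent_agreement(a, b):
    """Share of items where both reviewers gave exactly the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"agreement {percent_agreement(rater_a, rater_b):.0%}, "
      f"kappa {cohens_kappa(rater_a, rater_b):.2f}")
# agreement 75%, kappa 0.64
```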
Typical Rubric Development Time:
- Simple rubric (3-5 criteria, 1-4 scale, known domain): 2-4 hours
- Standard rubric (5-7 criteria, 1-5 scale, some complexity): 6-10 hours + calibration session
- Complex rubric (8+ criteria, multiple scales, novel domain): 15-25 hours + multiple calibration rounds
When to escalate beyond rubrics:
- High-stakes decisions (hiring, admissions, awards) → Add structured interviews, portfolios, multi-method assessment
- Subjective/creative work (art, poetry, design) → Supplement rubric with critique, discourse, expert judgment
- Complex holistic judgment (leadership, cultural fit) → Rubrics help but don't capture everything; use them thoughtfully
Rubrics are tools, not replacements for human judgment. Use them to structure thinking, not to mechanize decisions.
Inputs required:
- Artifact type (what are we evaluating? essays, code, designs, proposals?)
- Criteria (quality dimensions to assess, 4-8 most common)
- Scale (1-5 default, or specify 1-4, 1-10, qualitative labels)
Outputs produced:
evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notes