validate-evaluator

Validate Evaluator

Calibrate an LLM judge against human judgment.

Overview

1. Split human-labeled data into train (10-20%), dev (40-45%), and test (40-45%) sets.
2. Run the judge on the dev set and measure its true positive rate (TPR) and true negative rate (TNR).
3. Iterate on the judge until both TPR and TNR exceed 90% on the dev set.
4. Run the judge once on the held-out test set for the final TPR/TNR.
5. Apply the bias correction formula to production data.
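
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the function names, the default split fractions (picked from inside the ranges above), and the choice of Pass as the positive class are all assumptions. The overview does not spell out the bias correction formula, so the sketch assumes the standard Rogan-Gladen estimator, which recovers a true pass rate from an observed one given the judge's TPR and TNR.

```python
import random


def split_labeled(traces, train_frac=0.15, dev_frac=0.425, seed=0):
    """Shuffle once, then cut into train/dev/test (fractions are assumed defaults)."""
    items = list(traces)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # everything left over is held out
    return train, dev, test


def tpr_tnr(human, judge):
    """Agreement rates of the judge with human labels (True = Pass, assumed positive)."""
    pairs = list(zip(human, judge))
    pos = sum(1 for h, _ in pairs if h)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one human Pass and one human Fail")
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    return tp / pos, tn / neg


def corrected_pass_rate(observed, tpr, tnr):
    """Rogan-Gladen estimate (assumed formula): (p_obs + TNR - 1) / (TPR + TNR - 1)."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    estimate = (observed + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion
```

For example, `tpr_tnr` would be called on the dev set while iterating on the judge, then once on the test set, and the resulting test-set rates passed to `corrected_pass_rate` together with the judge's observed pass rate on production traces.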

Prerequisites

- A built LLM judge prompt (from write-judge-prompt)
- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
  - Aim for ~50 Pass and ~50 Fail (balanced, even if the real distribution is skewed)
  - Labels must come from a domain expert, not outsourced annotators
- Candidate few-shot examples from your labeled data
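
Before splitting, it can help to sanity-check the labeled set against these prerequisites. The following is a hypothetical helper, not part of the skill: the function name, the `"Pass"`/`"Fail"` string encoding, and the `min_per_class` threshold of 40 are assumptions chosen only to illustrate the "roughly 50/50, strictly binary" requirement.

```python
from collections import Counter


def check_labels(labels, min_per_class=40):
    """Return a list of problems found in a list of 'Pass'/'Fail' labels."""
    counts = Counter(labels)
    problems = []
    # Labels must be strictly binary.
    unexpected = sorted(str(label) for label in set(counts) - {"Pass", "Fail"})
    if unexpected:
        problems.append(f"non-binary labels present: {unexpected}")
    # Each class needs enough examples to estimate TPR and TNR.
    for cls in ("Pass", "Fail"):
        if counts[cls] < min_per_class:
            problems.append(f"only {counts[cls]} {cls} labels, want >= {min_per_class}")
    return problems
```

An empty list means the set looks usable; a skewed set such as 90 Pass and 10 Fail would be flagged so that more Fail examples can be labeled first.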

Repository

hamelsmu/evals-skills

First Seen

Mar 3, 2026
