Write LLM-as-Judge Prompt

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

Prerequisites

Error analysis is complete. The failure mode is identified.
You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge: many failure modes that seem subjective reduce to keyword checks, regex, or API calls once you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" can work quite well.
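As a rough illustration of the interviewing-coach example, a keyword check might look like the sketch below. The function name `check_general_question`, the marker list, and the choice to treat any match as a Fail are assumptions made for illustration, not a prescribed implementation. The marker words should be tuned against your own labeled traces.

```python
import re

# Hypothetical marker words based on the example above ("usually," "typical,"
# "normally"); "typically" is added as an assumed variant. Extend or trim this
# list after reviewing your labeled traces.
GENERAL_MARKERS = ("usually", "typical", "typically", "normally")

# Whole-word, case-insensitive match so "Typical" is caught but words that
# merely contain a marker as a substring are not.
_GENERAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in GENERAL_MARKERS) + r")\b",
    re.IGNORECASE,
)


def check_general_question(suggested_question: str) -> str:
    """Return "Fail" if the question contains a general-behavior marker, else "Pass"."""
    return "Fail" if _GENERAL_PATTERN.search(suggested_question) else "Pass"


if __name__ == "__main__":
    print(check_general_question("What do you usually do when a deadline slips?"))
    print(check_general_question("Tell me about the last time a deadline slipped."))
```

If a check like this agrees with the human labels on most of the Pass and Fail traces, a judge may not be needed for this failure mode.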

Every judge prompt requires exactly four components:

State what the judge evaluates. One failure mode per judge.