# Create Judge

Guide users through designing and creating an automated judge that evaluates LLM outputs in ZeroEval.
## When To Use
- Creating a new judge from scratch for any evaluation goal.
- Deciding between binary (pass/fail) and scored (numeric rubric) evaluation.
- Writing judge templates (the evaluation prompt the judge model runs).
- Designing structured criteria for multi-dimensional scored judges.
- Linking a judge to a specific prompt for automatic feedback.
- Troubleshooting judges that aren't producing expected evaluations.
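The bullets above distinguish binary (pass/fail) judges from scored (numeric rubric) judges, and mention judge templates and structured criteria. As a minimal sketch of those concepts, here is what a binary judge template and a multi-dimensional scoring rubric might look like. The `{output}` and `{criteria}` placeholder names and the `render_judge_prompt` helper are illustrative assumptions, not confirmed ZeroEval API:

```python
# Hypothetical binary (pass/fail) judge template. The {output} and
# {criteria} placeholders are illustrative; the exact variables a
# ZeroEval judge template expects are an assumption.
BINARY_JUDGE_TEMPLATE = """\
You are an evaluation judge. Review the model output below against the
criterion and answer with exactly PASS or FAIL.

Criterion: {criteria}

Model output:
{output}

Answer (PASS or FAIL):"""

# Hypothetical structured criteria for a multi-dimensional scored judge:
# each dimension gets its own description and a numeric scale.
SCORED_CRITERIA = {
    "accuracy": "Factual correctness of the answer (score 1-5)",
    "relevance": "How directly the answer addresses the question (score 1-5)",
    "tone": "Appropriateness of tone for the target audience (score 1-5)",
}

def render_judge_prompt(output: str, criteria: str) -> str:
    """Fill the template so the judge model can evaluate one output."""
    return BINARY_JUDGE_TEMPLATE.format(output=output, criteria=criteria)
```

A binary template like this suits clear-cut checks (policy compliance, format validity), while the scored rubric fits quality dimensions that vary by degree; the actual template syntax to use is whatever the judge-creation flow in ZeroEval specifies.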
## Execution Sequence
Follow these steps in order. Load reference files only when needed for the current step.
### Step 1: Understand the Evaluation Goal