# Generate Judgements for Skill Evaluation (`generate-judgements`)

Analyze a skill's source files and produce fine-grained `judge_definitions` for the mlflow-skills automated evaluation framework. Each judgement is a yes/no question that an LLM judge answers by reading the execution trace.
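For orientation, a minimal sketch of a single judge definition, using the field names shown throughout this document (`name`, `scope`, `question`); the check itself is illustrative, not prescribed:

```yaml
judge_definitions:
  - name: skill-invoked
    scope: all
    question: >
      Check that the skill was loaded. Look in the trace for a file-read
      tool call targeting SKILL.md. Answer "yes" if such a read appears
      anywhere in the trace.
```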
## Prerequisites

- Access to the target skill directory (must contain `SKILL.md`)
- Familiarity with the mlflow-skills YAML config format (see `references/yaml-config-spec.md`)
## Workflow

```dot
digraph generate_judgements {
    rankdir=TB;
    node [shape=box];

    collect       [label="Phase 1\nCollect & Analyze Skill Files"];
    infer         [label="Phase 2\nInfer Scopes"];
    confirm_scope [label="User confirms scopes" shape=diamond];
    generate      [label="Phase 3\nGenerate Judgements per Scope"];
    present       [label="Phase 4\nPresent to User"];
    confirm_judge [label="User approves?" shape=diamond];
    write         [label="Phase 5\nWrite / Update YAML"];

    collect -> infer;
    infer -> confirm_scope;
    confirm_scope -> generate [label="approved"];
    confirm_scope -> infer    [label="revise"];
    generate -> present;
    present -> confirm_judge;
    confirm_judge -> write    [label="approved"];
    confirm_judge -> generate [label="revise"];
}
```
## Phase 1: Collect and Analyze Skill Files

Ask the user for two inputs (or auto-detect them):

- Skill directory path — the folder containing `SKILL.md`
- Existing test config YAML path (optional) — if provided, the tool will update its `judge_definitions` section instead of creating a new file

Then read all available files in this order:
| Priority | File | Purpose |
|---|---|---|
| 1 | `SKILL.md` | Primary source — workflow steps, behavior rules, output format |
| 2 | `references/*` | Supporting details — templates, CLI commands, query patterns |
| 3 | `README.md` / `README_CN.md` | Additional context — scope boundaries, limitations |
| 4 | Existing test config YAML | Understand current judgements to avoid duplication |
While reading, extract and note (a sketch of how a directive maps to a judgement follows this list):
- Workflow steps — numbered steps the skill must follow
- Behavior rules — "must", "always", "never", "do not" directives
- Output format requirements — file naming, sections, tables, mandatory fields
- Conditional branches — if/else paths that lead to different outputs
- Important guidelines — the "Important Guidelines" or similar section at the end
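For instance, a hypothetical "never" directive such as "Never execute MCP tool calls in parallel" would later, in Phase 3, become a single judgement along these lines:

```yaml
- name: sequential-mcp-calls
  scope: all
  question: >
    Check that MCP tool calls were executed sequentially, one at a time.
    Look at the ordering of tool-call spans in the trace. Answer "no" if
    two or more MCP calls were issued in parallel.
```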
## Phase 2: Infer Scopes
Analyze the skill for distinct execution paths that produce different outputs or
follow different logic. Each distinct path becomes a scope.
How to identify scopes:
- Look for conditional branches in the workflow (e.g., "If X → do A; else → do B")
- Look for optional steps (e.g., "Only execute this step if...")
- Look for different output modes (e.g., "checklist only" vs "assessment report")
Scope naming rules:

- Use lowercase, single-word or hyphenated names: `checklist`, `assessment`, `research`
- The scope `all` is reserved — it means "always run regardless of `test_scope`"
- Every skill has at least the implicit `all` scope for common/shared behavior
Present inferred scopes to the user with a brief description of each:
```markdown
I found the following execution branches in this skill:

1. `all` — Common behavior shared across all paths
   (skill loading, doc search, categorization, source annotations)
2. `checklist` — Checklist-only output path
   (no live resource, generates checklist file, offers next steps)
3. `assessment` — Live assessment path
   (runs AWS CLI, generates assessment report, no separate checklist)

Does this look right? Should I add, remove, or rename any scope?
```
Wait for user confirmation before proceeding.
## Phase 3: Generate Judgements

For each confirmed scope, generate fine-grained `judge_definitions`. Follow these rules:

### 3.1 Granularity Principle

One check point per judgement. Each judgement tests exactly ONE behavior or requirement.
```yaml
# GOOD — one specific check
- name: sequential-mcp-calls
  scope: all
  question: >
    Check that MCP tool calls were executed sequentially...

# BAD — multiple checks crammed into one
- name: workflow-correct
  scope: all
  question: >
    Check that the agent searched docs sequentially, read pages,
    extracted items into 5 categories, and wrote the file...
```
### 3.2 Judgement Categories

Generate judgements in this order, for each scope (a combined sketch follows the category list):
**Category A: Skill Loading & Invocation** (scope: `all`)

- Was the skill loaded (`SKILL.md` read)?
- Were reference files read when needed?

**Category B: Workflow Behavior** (scope: `all` or scope-specific)

- Did each workflow step execute correctly?
- Were sequential/parallel execution rules followed?
- Were error handling / retry rules followed?
- Were conditional branches taken correctly?

**Category C: Output Quality** (scope: `all` or scope-specific)

- Does the output contain all mandatory sections/categories?
- Does the output follow the naming convention?
- Does the output include required metadata (source annotations, IDs, etc.)?
- Are quantities within expected ranges?

**Category D: Scope-Specific Behavior** (per non-`all` scope)

- What is unique to this execution path?
- What should NOT happen in this path? (negative checks)
- What additional output/actions are expected?

**Category E: Guidelines Compliance** (scope: `all`)

- Are "always/never/must" directives respected?
- Is the output language correct?
- Are edge cases handled?
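A combined sketch with one illustrative judgement per category, assuming the example scopes from Phase 2 (all names, wording, and thresholds are placeholders, not prescribed values):

```yaml
# Category A: skill loading & invocation
- name: reference-files-read
  scope: all
  question: >
    Verify that the relevant files under references/ were read when the
    workflow required them. Look for file-read tool calls in the trace.

# Category B: workflow behavior
- name: doc-search-before-write
  scope: all
  question: >
    Check that documentation search calls happened before any output
    file was written. Compare the order of tool calls in the trace.

# Category C: output quality
- name: five-categories-complete
  scope: all
  question: >
    Verify the output file contains all 5 required categories. Answer
    "yes" if at least 4 of the 5 are clearly present.

# Category D: scope-specific behavior
- name: aws-cli-commands-executed
  scope: assessment
  question: >
    Check that AWS CLI commands were run in this path. Look for shell
    tool calls invoking "aws". Answer "yes" if at least one executed.

# Category E: guidelines compliance
- name: output-language-correct
  scope: all
  question: >
    Check that the generated output is written in the language the
    skill mandates. Give the benefit of the doubt for minor mixing.
```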
### 3.3 Naming Convention

Use kebab-case names that describe the check:

- `skill-invoked` — skill was loaded
- `sequential-mcp-calls` — tool calls are sequential
- `doc-search-coverage` — search queries cover required topics
- `five-categories-complete` — output has all 5 categories
- `file-naming-convention` — output file name matches pattern
- `aws-cli-commands-executed` — CLI commands were run
- `no-separate-checklist-file` — negative check: no extra file
### 3.4 Question Writing Rules

Each `question` field must be a self-contained instruction for the LLM judge. Follow the patterns in `references/judgement-patterns.md`.

Required elements in every question:
- What to check — "Check that..." or "Verify that..."
- Where to look — "Look in the trace for...", "Look for tool calls..."
- Success criteria — specific, measurable condition for answering "yes"
- Leniency guidance (when appropriate) — "Be lenient...", "Answer 'yes' if at least..."
Important clarifications to include when relevant (the sketch after this list combines the required elements with several of these):
- Distinguish between parallel tool CALLS vs batched requests in one call
- Specify minimum thresholds (e.g., "at least 4 of 5", "roughly 30-50")
- Clarify what counts (e.g., "each URL in the requests array counts as a separate page")
- State default answer when evidence is ambiguous (e.g., "benefit of the doubt → yes")
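A sketch of a question containing all four required elements plus the batching, threshold, and ambiguity clarifications (the name and the numbers are illustrative):

```yaml
- name: doc-pages-read-coverage
  scope: all
  question: >
    Verify that the agent read enough documentation pages. Look in the
    trace for fetch/read tool calls issued after the search phase; each
    URL in a batched requests array counts as a separate page. Answer
    "yes" if at least 5 distinct pages were read. Be lenient: if tool
    responses are truncated in the trace, give the benefit of the doubt
    and answer "yes".
```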
### 3.5 Negative Checks

For each scope, also generate negative judgements — things that should NOT happen (a sketch follows this list):

- In `checklist` scope: assessment-only artifacts should NOT appear
- In `assessment` scope: checklist-only artifacts should NOT appear
- Across all scopes: forbidden behaviors (e.g., parallel calls when sequential is required)
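A sketch of one such negative check for the example `assessment` scope (the file-name pattern and wording are illustrative):

```yaml
- name: no-separate-checklist-file
  scope: assessment
  question: >
    Verify that NO standalone checklist file was created in this path.
    Look at file-write tool calls in the trace. Answer "yes" only if
    none of the written file names contain "checklist"; otherwise
    answer "no".
```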
## Phase 4: Present Judgements to User

Present the generated judgements grouped by scope with clear section headers:
```markdown
## Generated Judgements

### Scope: all (7 judgements)

| # | Name | Check |
|---|------|-------|
| 1 | skill-invoked | Skill was loaded from .claude/skills/ |
| 2 | sequential-mcp-calls | MCP calls are sequential, not parallel |
| ... | ... | ... |

### Scope: checklist (2 judgements)

| # | Name | Check |
|---|------|-------|
| 1 | file-naming-convention | Output file follows naming pattern |
| ... | ... | ... |

### Scope: assessment (8 judgements)

| ... | ... | ... |

Total: 17 judgements across 3 scopes.

Does this look right? Should I add, remove, or modify any judgement?
```
Wait for user confirmation. Iterate if the user requests changes.
## Phase 5: Write / Update YAML

Once approved, write the output.

**If an existing YAML config was provided:**

Replace only the `judge_definitions:` section. Preserve all other fields (`name`, `prompt`, `skills`, `timeout_seconds`, `environment`, etc.) exactly as they are.

Add the standard scope comment block above `judge_definitions:`:
```yaml
# ==============================================================
# Judge Definitions
#
# scope values:
#   all      — runs in all test scenarios
#   {scope1} — only when test_scope={scope1}
#   {scope2} — only when test_scope={scope2}
# ==============================================================
judge_definitions:
```
**If no existing YAML was provided:**

Generate a complete YAML config file. Ask the user for:

- `name` — test run name
- `project_dir` — temp project directory name
- `prompt` — default prompt for the test
- `test_scope` — default scope to use

Use sensible defaults derived from the skill directory name for the rest. See `references/yaml-config-spec.md` for the full config structure.

File naming: `{skill-name}.yaml`, placed in the appropriate `tests/configs/` directory.
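For orientation only, a skeleton of how the fields named above might fit together; `references/yaml-config-spec.md` is authoritative, and every value below is a placeholder:

```yaml
# Illustrative skeleton for tests/configs/{skill-name}.yaml
name: doc-research-eval
project_dir: doc-research-tmp
prompt: "Research the official docs and produce a categorized checklist."
test_scope: checklist
skills:
  - doc-research
timeout_seconds: 600
environment:
  EXAMPLE_API_KEY: ""   # empty values do not override existing env vars

judge_definitions:
  - name: skill-invoked
    scope: all
    question: >
      Check that SKILL.md was read. Answer "yes" if a read of the skill
      file appears anywhere in the trace.
```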
After writing, inform the user of the file path and remind them:

- They can override `test_scope` and `prompt` from the CLI
- Empty `environment` values won't override existing env vars
- Judges with `scope: all` always run
## Important Guidelines
- Be exhaustive: Extract every testable behavior from the skill. It's better to have too many judgements than to miss an important check. The user can always remove extras.
- One point per judgement: Never combine multiple checks. If you're tempted to use "and" in a question, split it into two judgements.
- Write for an LLM judge: The question will be answered by an LLM reading a raw MLflow trace (JSON with tool calls and responses). Be explicit about where to find evidence in the trace.
- Include thresholds: When the skill specifies numbers (e.g., "5 search queries", "30-50 items", "at least 3 per category"), encode those in the judgement.
- Respect language: Write judgement questions in English (they are consumed by an LLM judge, not shown to end users). But interact with the user in their language.
- Preserve existing work: When updating an existing YAML, review current judgements first. Keep well-written ones, improve weak ones, add missing ones.