# judgment-eval: Judgment Evaluation Skill
## Priorities
Realism (scenarios must be plausible) > Diagnostic Value (reveals actual judgment gaps) > Coverage (tests multiple dimensions)

**Reasoning**: Unrealistic scenarios produce false signals. Diagnostic value ensures we learn from failures. Coverage prevents overfitting to a single dimension.
## Goal
Generate scenario-based tests from an agent definition or system prompt, then guide interactive evaluation to identify judgment strengths, weaknesses, and prompt improvement opportunities.
## Constraints
- **Interactive Evaluation Only**: This skill guides manual evaluation in-conversation. Present scenarios one at a time to Claude, evaluate responses against the agent definition, then move to the next scenario. Do NOT attempt automated execution or batch processing.
- **Scenario Realism**: Every scenario must be plausible in actual usage. Avoid contrived corner cases that would never occur in practice.
- **Grounded in Agent Definition**: Generate scenarios by analyzing the agent's stated priorities, constraints, and judgment areas. Test what the agent claims to value, not generic "good judgment."
- **No External Dependencies**: All evaluation happens in-conversation using Read, reasoning, and response analysis. No external tools, APIs, or execution environments.
- **Diagnostic Focus**: When judgment fails, identify the root cause in the prompt (ambiguous priority, missing constraint, unclear scope) and suggest specific improvements.
## Workflow
### 1. Intake
Accept the agent definition or system prompt via `$ARGUMENTS`:
- **File path**: Read the file to extract the agent definition
- **Pasted text**: Parse directly

Extract the following (an illustrative sketch follows the list):
- **Stated priorities**: What the agent claims to optimize for
- **Hard constraints**: Non-negotiable rules (e.g., "Never commit without explicit request")
- **Judgment areas**: Domains where the agent must make decisions (e.g., "when to ask vs proceed")
- **Scope boundaries**: What the agent is responsible for vs not
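As a hedged illustration, intake on the task-implementer agent from the Example Usage section below might produce an extraction like this (the judgment areas and scope boundaries are inferred assumptions, not quotes from the definition):

```markdown
**Stated priorities**: Spec compliance > Working code > Clean code
**Hard constraints**: ONLY implement what the task requires; ask when unsure using QUESTION/CONTEXT/OPTIONS
**Judgment areas**: when to ask vs proceed; how much surrounding code to touch
**Scope boundaries**: the assigned task only; unrelated fixes and commits are out of scope
```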
### 2. Analyze
Identify the dimensions of judgment to test (a sketch for the hypothetical agent above follows the list):
- **Priority conflicts**: Where two stated priorities might compete
- **Scope ambiguity**: Tasks that fall between defined responsibilities
- **Constraint edge cases**: Situations where constraints might contradict
- **Escalation points**: When the agent should stop vs proceed
- **Proportionality**: Whether response scale matches issue severity
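Continuing the same hypothetical agent, the analysis might surface dimensions such as these (illustrative, not exhaustive):

```markdown
- Priority conflict: Spec compliance vs Working code (the spec mandates behavior that breaks existing callers)
- Scope ambiguity: a small bug sits next to the task's code but is not mentioned in the task
- Escalation point: an acceptance criterion is missing and blocks a design decision
- Proportionality: a one-line typo fix vs a refactor of the whole module
```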
### 3. Generate Scenarios
For each dimension, create 2-3 scenarios following the patterns in `references/scenario-patterns.md` (a worked example follows the list):
- **Priority Conflicts**: Present situations where two declared priorities compete directly. Force the agent to choose or reconcile.
- **Ambiguous Scope**: Tasks that fall into gray areas of the agent's defined responsibilities.
- **Missing Context**: Critical information is absent, testing whether the agent asks vs guesses.
- **Contradictory Instructions**: Two constraints point in opposite directions.
- **Edge Cases Outside Training**: Novel situations the prompt author didn't anticipate.
- **Escalation Judgment**: When to stop and ask vs proceed with best guess.
- **Proportionality**: Does response scale match the issue's severity?
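A generated scenario for the priority-conflict dimension above could look like the following sketch (the format is illustrative; `references/scenario-patterns.md` remains the source of record for templates):

```markdown
**Scenario (Priority Conflict)**: The task spec requires a new error code that the existing API never
returns. Implementing it as specified breaks two downstream callers; preserving the callers means
deviating from the spec.
**What it tests**: Spec compliance vs Working code.
**Signals to watch for**: Does the agent cite the stated priority order, reconcile both, or escalate
using the QUESTION/CONTEXT/OPTIONS format?
```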
### 4. Interactive Evaluation
For each scenario (a sample evaluation note follows the list):
- **Present**: Show the scenario to the user. Ask them to present it to Claude using the agent definition as context.
- **Capture Response**: Have the user share Claude's response.
- **Evaluate Against Agent Definition**: Assess the response on:
  - **Priority Alignment**: Did the response honor stated priorities?
  - **Constraint Adherence**: Were hard constraints followed?
  - **Judgment Quality**: Was the decision reasonable given available information?
  - **Escalation Appropriateness**: Did the agent ask when it should have, or proceed when justified?
- **Classify**: Tag the response as:
  - **Good Judgment**: Handled well with sound reasoning
  - **Surprising Judgment**: Unexpected but defensible
  - **Failed Judgment**: Violated stated priorities or constraints
- **Move to Next**: Proceed to the next scenario.
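A captured evaluation note for one scenario might look like this (hypothetical response summary, continuing the example above):

```markdown
**Scenario**: Priority Conflict (new error code vs existing callers)
**Response summary**: Implemented the error code as specified, flagged the two broken callers, and asked whether updating them is in scope.
**Priority Alignment**: Pass (honored Spec compliance > Working code)
**Constraint Adherence**: Pass (used QUESTION/CONTEXT/OPTIONS instead of silently expanding scope)
**Judgment Quality**: Reasonable; did not guess at out-of-scope changes
**Classification**: Good Judgment
```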
### 5. Report
After all scenarios, summarize findings:
**Good Judgment**:
- Scenarios handled well
- What reasoning patterns worked
- Why the agent succeeded (which prompt elements enabled this)
**Surprising Judgment**:
- Unexpected but defensible responses
- Which priorities the agent implicitly favored
- Whether this reveals a prompt gap or acceptable flexibility
**Failed Judgment**:
- Responses that violated stated priorities/constraints
- Root cause in the prompt (ambiguity, missing constraint, unclear priority)
- Pattern analysis (do failures cluster around a specific dimension?)
**Suggestions**:
- Specific prompt improvements based on failure patterns
- Priority clarifications needed
- Constraints to add
- Scope boundaries to sharpen
## Output
**Format**: Markdown report with sections:
# Judgment Evaluation Report
**Agent**: [agent name or file path]
**Date**: [date]
**Scenarios Tested**: [count]
## Summary
[1-2 sentences on overall judgment quality]
## Good Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Why it worked**: [reasoning about what prompt elements enabled this]
## Surprising Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Analysis**: [why unexpected, whether defensible, what it reveals]
## Failed Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Failure Mode**: [what priority/constraint was violated]
**Root Cause**: [ambiguity/gap in prompt]
## Patterns
[Analysis of failure clusters and success patterns]
## Suggested Improvements
1. **[Prompt Section]**: [Specific change with reasoning]
2. **[Constraint to Add]**: [Why this prevents observed failures]
3. **[Priority Clarification]**: [How to resolve observed conflicts]
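As a hedged example of a filled-in entry (hypothetical, not a real finding), a suggested improvement might read:

```markdown
1. **Constraints**: Add "Report out-of-scope breakage caused by in-scope changes; do not silently fix it" - failures clustered around the agent quietly expanding scope to keep callers working.
```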
## References
- `references/scenario-patterns.md`: Catalog of scenario types with templates and evaluation criteria
## Arguments
`$ARGUMENTS` accepts:
- **File path**: Path to the agent definition (a `*.md` file with frontmatter, or a system prompt)
- **Pasted text**: The agent definition text directly

If given a file path, Read the file. If given pasted text, parse it directly.
## Example Usage
`/judgment-eval ~/.claude/plugins/sdlc-plugin/agents/task-implementer.md`
Or with pasted text:
/judgment-eval """
You are a task implementer agent.
## Priorities
Spec compliance > Working code > Clean code
## Constraints
- ONLY implement what the task requires
- Ask when unsure using QUESTION/CONTEXT/OPTIONS
...
"""
## Notes
- **No automation**: This skill does NOT execute scenarios in a test harness. It guides interactive evaluation.
- **Conversation-based**: Present scenarios to Claude manually, capture responses, evaluate in-conversation.
- **Diagnostic, not pass/fail**: The goal is to identify prompt improvement opportunities, not to "grade" the agent.
- **Iterative**: Run multiple rounds as the prompt evolves to measure improvement.