# judgment-eval: Judgment Evaluation Skill
## Priorities
Realism (scenarios must be plausible) > Diagnostic Value (reveals actual judgment gaps) > Coverage (tests multiple dimensions)

**Reasoning**: Unrealistic scenarios produce false signals. Diagnostic value ensures we learn from failures. Coverage prevents overfitting to a single dimension.
## Goal
Generate scenario-based tests from an agent definition or system prompt, then guide interactive evaluation to identify judgment strengths, weaknesses, and prompt improvement opportunities.
## Constraints
- **Interactive Evaluation Only**: This skill guides manual evaluation in-conversation. Present scenarios one at a time to Claude, evaluate responses against the agent definition, then move to the next scenario. Do NOT attempt automated execution or batch processing.
- **Scenario Realism**: Every scenario must be plausible in actual usage. Avoid contrived corner cases that would never occur in practice.
- **Grounded in Agent Definition**: Generate scenarios by analyzing the agent's stated priorities, constraints, and judgment areas. Test what the agent claims to value, not generic "good judgment."
- **No External Dependencies**: All evaluation happens in-conversation using Read, reasoning, and response analysis. No external tools, APIs, or execution environments.
- **Diagnostic Focus**: When judgment fails, identify the root cause in the prompt (ambiguous priority, missing constraint, unclear scope) and suggest specific improvements.
## Workflow
### 1. Intake
Accept the agent definition or system prompt via `$ARGUMENTS`:
- **File path**: Read the file to extract the agent definition
- **Pasted text**: Parse directly

Extract the following (an illustrative sketch follows the list):
- **Stated priorities**: What the agent claims to optimize for
- **Hard constraints**: Non-negotiable rules (e.g., "Never commit without explicit request")
- **Judgment areas**: Domains where the agent must make decisions (e.g., "when to ask vs proceed")
- **Scope boundaries**: What the agent is responsible for vs not
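As a hedged illustration, intake on the task-implementer agent from the Example Usage section below might produce an extraction like this (the judgment areas and scope boundaries are inferred assumptions, not quotes from the definition):

```markdown
**Stated priorities**: Spec compliance > Working code > Clean code
**Hard constraints**: ONLY implement what the task requires; ask when unsure using QUESTION/CONTEXT/OPTIONS
**Judgment areas**: when to ask vs proceed; how much surrounding code to touch
**Scope boundaries**: the assigned task only; unrelated fixes and commits are out of scope
```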
### 2. Analyze
Identify the dimensions of judgment to test (a sketch for the hypothetical agent above follows the list):
- **Priority conflicts**: Where two stated priorities might compete
- **Scope ambiguity**: Tasks that fall between defined responsibilities
- **Constraint edge cases**: Situations where constraints might contradict
- **Escalation points**: When the agent should stop vs proceed
- **Proportionality**: Whether response scale matches issue severity
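Continuing the same hypothetical agent, the analysis might surface dimensions such as these (illustrative, not exhaustive):

```markdown
- Priority conflict: Spec compliance vs Working code (the spec mandates behavior that breaks existing callers)
- Scope ambiguity: a small bug sits next to the task's code but is not mentioned in the task
- Escalation point: an acceptance criterion is missing and blocks a design decision
- Proportionality: a one-line typo fix vs a refactor of the whole module
```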
### 3. Generate Scenarios
For each dimension, create 2-3 scenarios following the patterns in `references/scenario-patterns.md` (a worked example follows the list):
- **Priority Conflicts**: Present situations where two declared priorities compete directly. Force the agent to choose or reconcile.
- **Ambiguous Scope**: Tasks that fall into gray areas of the agent's defined responsibilities.
- **Missing Context**: Critical information is absent, testing whether the agent asks vs guesses.
- **Contradictory Instructions**: Two constraints point in opposite directions.
- **Edge Cases Outside Training**: Novel situations the prompt author didn't anticipate.
- **Escalation Judgment**: When to stop and ask vs proceed with best guess.
- **Proportionality**: Does response scale match the issue's severity?
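A generated scenario for the priority-conflict dimension above could look like the following sketch (the format is illustrative; `references/scenario-patterns.md` remains the source of record for templates):

```markdown
**Scenario (Priority Conflict)**: The task spec requires a new error code that the existing API never
returns. Implementing it as specified breaks two downstream callers; preserving the callers means
deviating from the spec.
**What it tests**: Spec compliance vs Working code.
**Signals to watch for**: Does the agent cite the stated priority order, reconcile both, or escalate
using the QUESTION/CONTEXT/OPTIONS format?
```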
### 4. Interactive Evaluation
For each scenario (a sample evaluation note follows the list):
- **Present**: Show the scenario to the user. Ask them to present it to Claude using the agent definition as context.
- **Capture Response**: Have the user share Claude's response.
- **Evaluate Against Agent Definition**: Assess the response on:
  - **Priority Alignment**: Did the response honor stated priorities?
  - **Constraint Adherence**: Were hard constraints followed?
  - **Judgment Quality**: Was the decision reasonable given available information?
  - **Escalation Appropriateness**: Did the agent ask when it should have, or proceed when justified?
- **Classify**: Tag the response as:
  - **Good Judgment**: Handled well with sound reasoning
  - **Surprising Judgment**: Unexpected but defensible
  - **Failed Judgment**: Violated stated priorities or constraints
- **Move to Next**: Proceed to the next scenario.
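A captured evaluation note for one scenario might look like this (hypothetical response summary, continuing the example above):

```markdown
**Scenario**: Priority Conflict (new error code vs existing callers)
**Response summary**: Implemented the error code as specified, flagged the two broken callers, and asked whether updating them is in scope.
**Priority Alignment**: Pass (honored Spec compliance > Working code)
**Constraint Adherence**: Pass (used QUESTION/CONTEXT/OPTIONS instead of silently expanding scope)
**Judgment Quality**: Reasonable; did not guess at out-of-scope changes
**Classification**: Good Judgment
```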
### 5. Report
After all scenarios, summarize findings:
**Good Judgment**:
- Scenarios handled well
- What reasoning patterns worked
- Why the agent succeeded (which prompt elements enabled this)
**Surprising Judgment**:
- Unexpected but defensible responses
- Which priorities the agent implicitly favored
- Whether this reveals a prompt gap or acceptable flexibility
**Failed Judgment**:
- Responses that violated stated priorities/constraints
- Root cause in the prompt (ambiguity, missing constraint, unclear priority)
- Pattern analysis (do failures cluster around a specific dimension?)
**Suggestions**:
- Specific prompt improvements based on failure patterns
- Priority clarifications needed
- Constraints to add
- Scope boundaries to sharpen
## Output
**Format**: Markdown report with sections:
# Judgment Evaluation Report
**Agent**: [agent name or file path]
**Date**: [date]
**Scenarios Tested**: [count]
## Summary
[1-2 sentences on overall judgment quality]
## Good Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Why it worked**: [reasoning about what prompt elements enabled this]
## Surprising Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Analysis**: [why unexpected, whether defensible, what it reveals]
## Failed Judgment (X scenarios)
### Scenario: [name]
**Response**: [brief summary]
**Failure Mode**: [what priority/constraint was violated]
**Root Cause**: [ambiguity/gap in prompt]
## Patterns
[Analysis of failure clusters and success patterns]
## Suggested Improvements
1. **[Prompt Section]**: [Specific change with reasoning]
2. **[Constraint to Add]**: [Why this prevents observed failures]
3. **[Priority Clarification]**: [How to resolve observed conflicts]
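As a hedged example of a filled-in entry (hypothetical, not a real finding), a suggested improvement might read:

```markdown
1. **Constraints**: Add "Report out-of-scope breakage caused by in-scope changes; do not silently fix it" - failures clustered around the agent quietly expanding scope to keep callers working.
```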
## References
- `references/scenario-patterns.md`: Catalog of scenario types with templates and evaluation criteria
## Arguments
`$ARGUMENTS` accepts:
- **File path**: Path to the agent definition (a `*.md` file with frontmatter, or a system prompt)
- **Pasted text**: The agent definition text directly

If given a file path, Read the file. If given pasted text, parse it directly.
## Example Usage
`/judgment-eval ~/.claude/plugins/sdlc-plugin/agents/task-implementer.md`
Or with pasted text:
/judgment-eval """
You are a task implementer agent.
## Priorities
Spec compliance > Working code > Clean code
## Constraints
- ONLY implement what the task requires
- Ask when unsure using QUESTION/CONTEXT/OPTIONS
...
"""
## Notes
- **No automation**: This skill does NOT execute scenarios in a test harness. It guides interactive evaluation.
- **Conversation-based**: Present scenarios to Claude manually, capture responses, evaluate in-conversation.
- **Diagnostic, not pass/fail**: The goal is to identify prompt improvement opportunities, not to "grade" the agent.
- **Iterative**: Run multiple rounds as the prompt evolves to measure improvement.