Prompt Evaluator

A comprehensive framework for evaluating, comparing, and debugging system prompts for AI assistants.

CAPABILITIES:

Single prompt scoring: Evaluate a prompt across 15 dimensions with detailed scores, issues, and specific fix recommendations
Multi-prompt comparison: Compare 2-5 prompts side-by-side with ranking table and improvement roadmap
Feedback analysis: Analyze user feedback, bug reports, or improvement suggestions to determine if issues are prompt-related and provide actionable fixes

TRIGGERS - Use this skill when:

User asks to "evaluate", "score", "rate", or "review" a prompt
User asks to "compare prompts", "which prompt is better", or "A/B test prompts"
User provides feedback/complaints about AI behavior and asks if it's a prompt issue
User asks "what's wrong with this prompt", "how to improve this prompt"
User mentions "system prompt", "inner prompt", "built-in prompt", "assistant prompt"
User asks about prompt quality, prompt optimization, or prompt debugging
Keywords: prompt evaluation, prompt scoring, prompt comparison, prompt analysis, prompt review

OUTPUT: Structured evaluation report with scores, issues, fixes, and priority recommendations.

Quick Start

Mode 1: Single Prompt Evaluation

Input: One prompt (text or file)
Output: 15-dimension score card + prioritized fix recommendations

Mode 2: Multi-Prompt Comparison

Input: 2-5 prompts to compare
Output: Comparison table + winner analysis + improvement roadmap

Mode 3: Feedback Analysis

Input: User feedback/complaints + current prompt
Output: Root cause analysis + prompt-specific fixes (if applicable)

Evaluation Framework

15 Evaluation Dimensions

Load references/dimensions.md for the complete scoring rubric. Summary:

#	Dimension	Weight	Key Question
1	Role Definition	6%	Is the AI's identity and persona clearly defined?
2	Task Clarity	10%	Is the primary task unambiguous?
3	Constraint Completeness	10%	Are DO/DON'T rules comprehensive?
4	Output Format	8%	Is the expected output format specified?
5	Example Quality	8%	Are examples concrete and representative?
6	Edge Case Handling	10%	Are boundary conditions addressed?
7	Business Alignment	10%	Does it serve business goals?
8	User Experience	8%	Does it create good UX?
9	Safety & Ethics	8%	Are safety guardrails in place?
10	Maintainability	5%	Is it modular and easy to update?
11	Token Efficiency	4%	Is context used efficiently?
12	Robustness	5%	Is it resistant to misuse/injection?
13	Consistency	3%	Are rules internally consistent?
14	Internationalization	2%	Does it handle multiple languages?
15	Measurability	3%	Can outcomes be measured?

Mode 1: Single Prompt Evaluation

Input Requirements

The prompt to evaluate (as text, file, or pasted content)
Optional: Context about the product/use case
Optional: Known issues or user complaints

Evaluation Process

STEP 1: Initial Scan

Count total lines and estimate tokens
Identify structural elements (sections, headers, examples)
Detect language(s) used

STEP 2: Dimension-by-Dimension Scoring For each of the 15 dimensions:

Score 1-10 based on rubric in references/dimensions.md
Identify specific issues (quote line numbers)
Classify severity: 🔴 Critical / 🟡 Warning / 🟢 Minor
Provide concrete fix recommendation

STEP 3: Calculate Weighted Score

Total Score = Σ (dimension_score × weight) × 10

STEP 4: Generate Report

Output format:

# Prompt Evaluation Report

## Summary
| Metric | Value |
|--------|-------|
| Total Score | XX/100 (Grade) |
| Lines | XXX |
| Est. Tokens | ~X,XXX |
| Critical Issues | X |
| Warnings | X |

## Score Breakdown
[15-dimension table with scores and brief notes]

## Critical Issues (🔴)
[Detailed issues with line references and fixes]

## Warnings (🟡)
[Detailed issues with line references and fixes]

## Top 3 Priority Fixes
1. [Most impactful fix with before/after example]
2. [Second priority fix]
3. [Third priority fix]

## Improvement Roadmap
[Ordered list of all recommended changes]

Mode 2: Multi-Prompt Comparison

Input Requirements

2-5 prompts to compare (labeled A, B, C, etc.)
Optional: Evaluation focus (e.g., "focus on safety" or "focus on UX")

Comparison Process

STEP 1: Individual Evaluation Evaluate each prompt using Mode 1 (abbreviated)

STEP 2: Head-to-Head Comparison For each dimension, compare all prompts and identify:

Winner for that dimension
Key differentiator

STEP 3: Generate Comparison Report

Output format:

# Prompt Comparison Report

## Overall Ranking
| Rank | Prompt | Score | Strengths | Weaknesses |
|------|--------|-------|-----------|------------|
| 1 | [Name] | XX/100 | ... | ... |
| 2 | [Name] | XX/100 | ... | ... |

## Dimension-by-Dimension Comparison
| Dimension | Prompt A | Prompt B | Prompt C | Winner |
|-----------|----------|----------|----------|--------|
| Role Definition | 7 | 8 | 6 | B |
| Task Clarity | 9 | 7 | 8 | A |
| ... | ... | ... | ... | ... |

## Key Differentiators
[What makes the winner better in specific areas]

## Synthesis Recommendation
[How to combine the best elements of each prompt]

## Next Version Roadmap
[Prioritized improvements for the winning prompt]

Mode 3: Feedback Analysis

Input Requirements

User feedback, complaints, or bug reports
Current prompt being used
Optional: Conversation logs showing the issue

Analysis Process

STEP 1: Classify Feedback Determine if the issue is:

🎯 Prompt Issue: Fixable by modifying the prompt
⚙️ Backend Issue: Requires code/data/infrastructure changes
🔄 Hybrid Issue: Needs both prompt and backend fixes
❌ Not an Issue: User misunderstanding or expected behavior

STEP 2: Root Cause Analysis If prompt-related:

Identify which dimension(s) are failing
Locate specific rule gaps or conflicts
Trace the failure path

STEP 3: Generate Analysis Report

Output format:

# Feedback Analysis Report

## Feedback Summary
| Item | Details |
|------|---------|
| Feedback Type | [Complaint/Bug/Suggestion] |
| Issue Classification | [Prompt/Backend/Hybrid/Not an Issue] |
| Confidence | [High/Medium/Low] |

## Root Cause Analysis
[Detailed explanation of why the issue occurs]

## Is This a Prompt Issue?
[YES/NO/PARTIAL with reasoning]

## Affected Dimensions
| Dimension | Current Score | Impact |
|-----------|---------------|--------|
| [Dimension] | X/10 | [How feedback relates] |

## Recommended Prompt Fixes
[If prompt-related, provide specific fixes with before/after]

## Backend Recommendations
[If backend-related, describe what needs to change]

## Validation Criteria
[How to verify the fix works]

Grading Scale

Score Range	Grade	Description
90-100	A	Production-ready, minimal issues
80-89	B+	Good quality, minor improvements needed
70-79	B	Functional, several improvements recommended
60-69	C	Needs significant work
50-59	D	Major issues, not recommended for production
<50	F	Fundamental problems, requires rewrite

Common Anti-Patterns

When evaluating, watch for these red flags:

Vague Role: "You are a helpful assistant" (no specificity)
Missing Constraints: No DO NOT rules
No Examples: Abstract rules without concrete demonstrations
Contradictory Rules: Conflicting instructions
Token Bloat: Unnecessary repetition or verbose explanations
No Safety Rails: Missing content/behavior boundaries
Hardcoded Values: Values that should be dynamic
No Fallback: Missing error/edge case handling
Monolithic Structure: No modular sections
Injection Vulnerability: No protection against prompt attacks

See references/anti-patterns.md for detailed examples and fixes.

Output Quality Standards

All evaluation outputs must:

Be Actionable: Every issue must have a specific fix recommendation
Include Evidence: Quote specific lines/sections from the prompt
Prioritize: Rank issues by impact (Critical > Warning > Minor)
Provide Before/After: Show concrete examples of recommended changes
Be Measurable: Include expected improvement metrics where possible

prompt-evaluator

Prompt Evaluator

Quick Start

Mode 1: Single Prompt Evaluation

Mode 2: Multi-Prompt Comparison

Mode 3: Feedback Analysis

Evaluation Framework

15 Evaluation Dimensions

Mode 1: Single Prompt Evaluation

Input Requirements

Evaluation Process

Mode 2: Multi-Prompt Comparison

Input Requirements

Comparison Process

Mode 3: Feedback Analysis

Input Requirements

Analysis Process

Grading Scale

Common Anti-Patterns

Output Quality Standards