prompt-evaluator
Prompt Evaluator
A comprehensive framework for evaluating, comparing, and debugging system prompts for AI assistants.
CAPABILITIES:
- Single prompt scoring: Evaluate a prompt across 15 dimensions with detailed scores, issues, and specific fix recommendations
- Multi-prompt comparison: Compare 2-5 prompts side-by-side with ranking table and improvement roadmap
- Feedback analysis: Analyze user feedback, bug reports, or improvement suggestions to determine if issues are prompt-related and provide actionable fixes
TRIGGERS - Use this skill when:
- User asks to "evaluate", "score", "rate", or "review" a prompt
- User asks to "compare prompts", "which prompt is better", or "A/B test prompts"
- User provides feedback/complaints about AI behavior and asks if it's a prompt issue
- User asks "what's wrong with this prompt", "how to improve this prompt"
- User mentions "system prompt", "inner prompt", "built-in prompt", "assistant prompt"
- User asks about prompt quality, prompt optimization, or prompt debugging
- Keywords: prompt evaluation, prompt scoring, prompt comparison, prompt analysis, prompt review
OUTPUT: Structured evaluation report with scores, issues, fixes, and priority recommendations.
Quick Start
Mode 1: Single Prompt Evaluation
Input: One prompt (text or file)
Output: 15-dimension score card + prioritized fix recommendations
Mode 2: Multi-Prompt Comparison
Input: 2-5 prompts to compare
Output: Comparison table + winner analysis + improvement roadmap
Mode 3: Feedback Analysis
Input: User feedback/complaints + current prompt
Output: Root cause analysis + prompt-specific fixes (if applicable)
Evaluation Framework
15 Evaluation Dimensions
Load references/dimensions.md for the complete scoring rubric. Summary:
| # | Dimension | Weight | Key Question |
|---|---|---|---|
| 1 | Role Definition | 6% | Is the AI's identity and persona clearly defined? |
| 2 | Task Clarity | 10% | Is the primary task unambiguous? |
| 3 | Constraint Completeness | 10% | Are DO/DON'T rules comprehensive? |
| 4 | Output Format | 8% | Is the expected output format specified? |
| 5 | Example Quality | 8% | Are examples concrete and representative? |
| 6 | Edge Case Handling | 10% | Are boundary conditions addressed? |
| 7 | Business Alignment | 10% | Does it serve business goals? |
| 8 | User Experience | 8% | Does it create good UX? |
| 9 | Safety & Ethics | 8% | Are safety guardrails in place? |
| 10 | Maintainability | 5% | Is it modular and easy to update? |
| 11 | Token Efficiency | 4% | Is context used efficiently? |
| 12 | Robustness | 5% | Is it resistant to misuse/injection? |
| 13 | Consistency | 3% | Are rules internally consistent? |
| 14 | Internationalization | 2% | Does it handle multiple languages? |
| 15 | Measurability | 3% | Can outcomes be measured? |
Mode 1: Single Prompt Evaluation
Input Requirements
- The prompt to evaluate (as text, file, or pasted content)
- Optional: Context about the product/use case
- Optional: Known issues or user complaints
Evaluation Process
STEP 1: Initial Scan
- Count total lines and estimate tokens
- Identify structural elements (sections, headers, examples)
- Detect language(s) used
STEP 2: Dimension-by-Dimension Scoring For each of the 15 dimensions:
- Score 1-10 based on rubric in
references/dimensions.md - Identify specific issues (quote line numbers)
- Classify severity: 🔴 Critical / 🟡 Warning / 🟢 Minor
- Provide concrete fix recommendation
STEP 3: Calculate Weighted Score
Total Score = Σ (dimension_score × weight) × 10
STEP 4: Generate Report
Output format:
# Prompt Evaluation Report
## Summary
| Metric | Value |
|--------|-------|
| Total Score | XX/100 (Grade) |
| Lines | XXX |
| Est. Tokens | ~X,XXX |
| Critical Issues | X |
| Warnings | X |
## Score Breakdown
[15-dimension table with scores and brief notes]
## Critical Issues (🔴)
[Detailed issues with line references and fixes]
## Warnings (🟡)
[Detailed issues with line references and fixes]
## Top 3 Priority Fixes
1. [Most impactful fix with before/after example]
2. [Second priority fix]
3. [Third priority fix]
## Improvement Roadmap
[Ordered list of all recommended changes]
Mode 2: Multi-Prompt Comparison
Input Requirements
- 2-5 prompts to compare (labeled A, B, C, etc.)
- Optional: Evaluation focus (e.g., "focus on safety" or "focus on UX")
Comparison Process
STEP 1: Individual Evaluation Evaluate each prompt using Mode 1 (abbreviated)
STEP 2: Head-to-Head Comparison For each dimension, compare all prompts and identify:
- Winner for that dimension
- Key differentiator
STEP 3: Generate Comparison Report
Output format:
# Prompt Comparison Report
## Overall Ranking
| Rank | Prompt | Score | Strengths | Weaknesses |
|------|--------|-------|-----------|------------|
| 1 | [Name] | XX/100 | ... | ... |
| 2 | [Name] | XX/100 | ... | ... |
## Dimension-by-Dimension Comparison
| Dimension | Prompt A | Prompt B | Prompt C | Winner |
|-----------|----------|----------|----------|--------|
| Role Definition | 7 | 8 | 6 | B |
| Task Clarity | 9 | 7 | 8 | A |
| ... | ... | ... | ... | ... |
## Key Differentiators
[What makes the winner better in specific areas]
## Synthesis Recommendation
[How to combine the best elements of each prompt]
## Next Version Roadmap
[Prioritized improvements for the winning prompt]
Mode 3: Feedback Analysis
Input Requirements
- User feedback, complaints, or bug reports
- Current prompt being used
- Optional: Conversation logs showing the issue
Analysis Process
STEP 1: Classify Feedback Determine if the issue is:
- 🎯 Prompt Issue: Fixable by modifying the prompt
- ⚙️ Backend Issue: Requires code/data/infrastructure changes
- 🔄 Hybrid Issue: Needs both prompt and backend fixes
- ❌ Not an Issue: User misunderstanding or expected behavior
STEP 2: Root Cause Analysis If prompt-related:
- Identify which dimension(s) are failing
- Locate specific rule gaps or conflicts
- Trace the failure path
STEP 3: Generate Analysis Report
Output format:
# Feedback Analysis Report
## Feedback Summary
| Item | Details |
|------|---------|
| Feedback Type | [Complaint/Bug/Suggestion] |
| Issue Classification | [Prompt/Backend/Hybrid/Not an Issue] |
| Confidence | [High/Medium/Low] |
## Root Cause Analysis
[Detailed explanation of why the issue occurs]
## Is This a Prompt Issue?
[YES/NO/PARTIAL with reasoning]
## Affected Dimensions
| Dimension | Current Score | Impact |
|-----------|---------------|--------|
| [Dimension] | X/10 | [How feedback relates] |
## Recommended Prompt Fixes
[If prompt-related, provide specific fixes with before/after]
## Backend Recommendations
[If backend-related, describe what needs to change]
## Validation Criteria
[How to verify the fix works]
Grading Scale
| Score Range | Grade | Description |
|---|---|---|
| 90-100 | A | Production-ready, minimal issues |
| 80-89 | B+ | Good quality, minor improvements needed |
| 70-79 | B | Functional, several improvements recommended |
| 60-69 | C | Needs significant work |
| 50-59 | D | Major issues, not recommended for production |
| <50 | F | Fundamental problems, requires rewrite |
Common Anti-Patterns
When evaluating, watch for these red flags:
- Vague Role: "You are a helpful assistant" (no specificity)
- Missing Constraints: No DO NOT rules
- No Examples: Abstract rules without concrete demonstrations
- Contradictory Rules: Conflicting instructions
- Token Bloat: Unnecessary repetition or verbose explanations
- No Safety Rails: Missing content/behavior boundaries
- Hardcoded Values: Values that should be dynamic
- No Fallback: Missing error/edge case handling
- Monolithic Structure: No modular sections
- Injection Vulnerability: No protection against prompt attacks
See references/anti-patterns.md for detailed examples and fixes.
Output Quality Standards
All evaluation outputs must:
- Be Actionable: Every issue must have a specific fix recommendation
- Include Evidence: Quote specific lines/sections from the prompt
- Prioritize: Rank issues by impact (Critical > Warning > Minor)
- Provide Before/After: Show concrete examples of recommended changes
- Be Measurable: Include expected improvement metrics where possible