# Prompt Evaluator
Evaluate LLM prompts on a 100-point scale, based on research findings from Thorgeirsson et al. (2026) showing that writing quality (specifically coherence, instructional clarity, and information content) significantly predicts LLM-assisted programming performance.
## Key Research Insights
- Information content > vocabulary: Adding missing information improves results; rewording without adding information rarely helps (Lucchetti et al.)
- Structure matters: Unorganized, vague prompts lead to failure cycles
- Declarative > interrogative: Declarative statements outperform questions (Chen et al.)
- Ambiguity kills: Unclear pronouns, implicit assumptions, and missing constraints are top failure causes
## Evaluation Workflow
1. Receive the user's prompt.
2. Read `references/evaluation-rubric.md` for detailed scoring criteria.
3. Score each of the 5 axes (4 sub-items × 5 pt = 20 pt per axis, 100 pt total).
4. For common issues, consult `references/improvement-patterns.md` for Before/After examples.
5. Output the evaluation result using the template below.
6. Provide a revised prompt.
## The 5 Evaluation Axes
| # | Axis | Points | Focus |
|---|---|---|---|
| 1 | Clarity (明確性) | 20 | Unambiguous intent, no unclear references |
| 2 | Structure (構造) | 20 | Logical organization, appropriate segmentation |
| 3 | Information Content (情報量) | 20 | Sufficient detail for task completion |
| 4 | Specificity (特定性) | 20 | Concrete requirements, constraints, formats |
| 5 | Context (文脈提供) | 20 | Background, audience, purpose clearly stated |
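The scoring arithmetic implied by the table (5 axes × 4 sub-items × 5 points = 100) can be sketched as follows. This is an illustrative model only: the axis names come from the table above, but the helper functions and the example sub-scores are hypothetical, not part of the rubric.

```python
# Hypothetical sketch of the scoring model: five axes, each with four
# sub-items worth 0-5 points, giving a 100-point total.

AXES = ["Clarity", "Structure", "Information Content", "Specificity", "Context"]
SUB_ITEMS_PER_AXIS = 4
MAX_SUB_ITEM_POINTS = 5

def axis_score(sub_scores):
    """Sum one axis's sub-item scores, validating count and 0-5 range."""
    if len(sub_scores) != SUB_ITEMS_PER_AXIS:
        raise ValueError(f"expected {SUB_ITEMS_PER_AXIS} sub-item scores")
    if any(not 0 <= s <= MAX_SUB_ITEM_POINTS for s in sub_scores):
        raise ValueError("sub-item scores must be between 0 and 5")
    return sum(sub_scores)  # 0-20 per axis

def total_score(scores_by_axis):
    """Total across all five axes (0-100)."""
    return sum(axis_score(scores_by_axis[a]) for a in AXES)

# Illustrative evaluation (the numbers here are made up):
example = {
    "Clarity": [4, 5, 3, 4],
    "Structure": [5, 4, 4, 4],
    "Information Content": [3, 3, 4, 2],
    "Specificity": [4, 4, 5, 3],
    "Context": [2, 3, 3, 4],
}
print(total_score(example))  # 73
```

Validating each sub-item against the 0-5 range before summing mirrors the rubric's constraint that no axis can exceed 20 points.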
## Output Template
Use this exact template; a consistent format lets the user compare evaluations and understand the scoring breakdown:
```markdown
## プロンプト評価結果 / Prompt Evaluation
### 対象プロンプト / Evaluated Prompt
> [quote the evaluated prompt here]
### スコア / Scores
| 軸 / Axis | スコア / Score | 主な所見 / Key Findings |
|-----------|--------|----------|
| 明確性 (Clarity) | __/20 | ... |
| 構造 (Structure) | __/20 | ... |
| 情報量 (Info Content) | __/20 | ... |
| 特定性 (Specificity) | __/20 | ... |
| 文脈提供 (Context) | __/20 | ... |
| **合計 / Total** | **__/100** | |
### 評価の概要 / Summary
[1-2 paragraph summary of strengths and weaknesses]
### 改善提案 / Improvement Suggestions
1. [specific, actionable suggestion]
2. [specific, actionable suggestion]
...
### 改善版プロンプト / Improved Prompt
[the improved prompt in a code block]
```
## Scoring Guidelines
- 0pt: Sub-item is completely absent or counterproductive
- 1pt: Minimal attempt, mostly insufficient
- 2-3pt: Partially addressed, room for improvement
- 4pt: Well addressed with minor gaps
- 5pt: Excellent, no meaningful improvement needed
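The band definitions above can be expressed as a small lookup helper. This is an illustrative sketch; the function name and the shortened band labels are hypothetical paraphrases of the guideline text, not canonical wording.

```python
# Map a sub-item score (0-5) to its guideline band.
# Band labels paraphrase the scoring guidelines above.

def score_band(points: int) -> str:
    if not 0 <= points <= 5:
        raise ValueError("sub-item scores range from 0 to 5")
    if points == 0:
        return "absent or counterproductive"
    if points == 1:
        return "minimal attempt, mostly insufficient"
    if points in (2, 3):
        return "partially addressed, room for improvement"
    if points == 4:
        return "well addressed with minor gaps"
    return "excellent, no meaningful improvement needed"

print(score_band(3))  # partially addressed, room for improvement
```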
## Language Handling
- Evaluate prompts in any language (Japanese, English, etc.)
- Output the evaluation in the same language as the user's prompt
- Scoring criteria apply universally regardless of language