prompt-reviewer
Prompt Reviewer
Evaluate user prompting quality across AI coding tools. Supports Claude Code, Codex, AMP, OpenCode, and any other tool.
Modes
Modes customize scoring and review behavior for specific teams or projects — axis weights, review cadence, output format preferences, and team context that affects expectations. Stored in modes/ (gitignored, never committed).
How Modes Work
Each mode is a markdown file: modes/{project-name}.md. It contains team-specific configuration: scoring weight adjustments (e.g., weight Autonomy higher for junior developers), review cadence, custom session source paths, and output format preferences.
Mode Selection (Step 0)
- List
.mdfiles inmodes/(if directory exists) - Each mode file has a
cwd_matchfield — a path prefix to match against cwd - If cwd matches exactly one mode, use it automatically
- If cwd matches multiple or none, ask the user which mode (or use default scoring)
- If
modes/doesn't exist, use standard scoring with no adjustments
Creating a Mode
Copy references/mode-template.md to modes/{project-name}.md and fill in scoring adjustments, team context, and output preferences. When a user runs the skill with no matching mode, offer to create one.
Modes are gitignored — they contain team-specific preferences that should not be committed to the skill repo.
Workflow Overview
Review flow (default):
- Gather parameters (time range, provider, scope)
- Extract sessions (script for Claude/Codex, manual for others)
- Score prompts (9 axes, 23 points)
- Output quick summary (default) → offer full scorecard
- Save scores to history (with provider/model metadata)
- Show trend if history exists
Trend-only flow (when user asks for "trend", "progress", "history"):
- Run show_trend.py and display results (optionally filter by provider)
- Skip review steps
Backfill flow (when user asks to "backfill", "backfill next", "list weeks", "catch up"):
- Run
list_weeks.pyto see available weeks and backfill status - If user said "backfill next [provider]" → immediately run the full review for that provider's next unreviewed week (no need to ask)
- If user just said "list weeks" → show the table and stop
- After review, ask about purging, then offer to continue with next week
Step 1: Gather Parameters
Ask the user with AskUserQuestion:
Provider (determines extraction method)
- Claude Code (auto-extract from ~/.claude sessions)
- Codex (auto-extract from ~/.codex sessions)
- OpenCode (auto-extract from ~/.local/state/opencode — single batch, no timestamps)
- AMP (manual — score current/pasted conversation)
- Other (manual)
Time range (default: today, for auto-extract providers only)
- Today (recommended)
- Past week
- Past month
- All time
Source (when provider is auto-extractable)
- All three (Claude Code + Codex + OpenCode)
- Both Claude Code and Codex (recommended)
- Claude Code only
- Codex only
- OpenCode only
Model (optional, for trend metadata)
- Ask what model they were using, or infer from context
- Examples: opus, sonnet, gpt-4o, o3, gemini-2.5-pro
Scope (Claude Code only)
- Current project only
- All projects
Step 2: Extract Sessions
For Claude Code / Codex / OpenCode (auto-extract):
python3 {skill_dir}/scripts/extract_sessions.py \
--source {claude|codex|opencode|both|all} \
--since {date} \
--project {project_path_if_filtered} \
--limit 20
Source options: both = claude + codex, all = claude + codex + opencode
OpenCode limitation: OpenCode stores prompts without timestamps or session boundaries. All prompts are returned as a single batch using the file's mtime for date filtering. Backfill-by-week is not meaningful for OpenCode.
For AMP / Other (manual):
No extraction script needed. Instead:
- Ask the user to paste conversation excerpts, share a session file, or point to a log
- If the user says "review this session" or "review my last conversation", score the prompts visible in the current conversation context
- Score whatever user prompts are available — the rubric works the same regardless of source
Step 3: Score Prompts
Read references/scoring-rubric.md for detailed criteria.
Score each session on 9 axes:
| Axis | Max | What to look for |
|---|---|---|
| Clarity | 3 | Named files, explicit goals, validation steps |
| Context & Inputs | 3 | Pasted errors, file paths, reproduction steps |
| Autonomy & Scope | 2 | "Only touch X", "Don't modify Y", stop conditions |
| Constraint Handling | 2 | "Run tests first", "Ask before deploying" |
| Iterative Checkpoints | 2 | "Pause after X", "Show me before continuing" |
| Follow-up Efficiency | 3 | New context vs bare "continue" or "yes" |
| Collaboration Traits | 3 | Constructive feedback, respectful redirects |
| Adaptability | 2 | Pivots based on discoveries |
| Outcome Alignment | 3 | Clear conclusion, explicit handoff |
Composite = sum(scores) / 23
Step 4: Output
Quick summary first → then offer full scorecard.
Quick Summary Template (default)
## Prompt Review - Quick Score
**Composite: X.XX / 1.00**
Sessions: N | Prompts: M
### Top 3 Improvements
1. **[Axis]** (X/max): [Coaching tip]
> Example: "[quoted prompt]"
> Try instead: "[improved version]"
2. ...
3. ...
### What's Working Well
- **[Axis]** (X/max): [Positive observation]
Full Scorecard Template (on request)
## Prompt Review - Full Scorecard
**Composite: X.XX / 1.00**
Sessions: N | Prompts: M | Source: Claude/Codex/Both
### Axis Breakdown
| Axis | Score | Evidence |
|------|-------|----------|
| Clarity | X/3 | "[quoted prompt]" |
| Context & Inputs | X/3 | "[quoted prompt]" |
| Autonomy & Scope | X/2 | "[quoted prompt]" |
| Constraint Handling | X/2 | "[quoted prompt]" |
| Iterative Checkpoints | X/2 | "[quoted prompt]" |
| Follow-up Efficiency | X/3 | "[quoted prompt]" |
| Collaboration Traits | X/3 | "[quoted prompt]" |
| Adaptability | X/2 | "[quoted prompt]" |
| Outcome Alignment | X/3 | "[quoted prompt]" |
### Superlatives
- **Prompt MVP**: "[best prompt]" - [why it worked]
- **Best Save**: "[course correction]" - [what it prevented]
- **Facepalm**: "[worst follow-up]" - [what to do instead]
- **Lowest Hanging Fruit**: [easiest fix with biggest impact]
### Coaching Plan
1. **[Priority improvement]**
> Before: "[current pattern]"
> After: "[improved pattern]"
2. ...
Step 5: Save Scores to History
After outputting the review, persist scores AND qualitative insights:
python3 {skill_dir}/scripts/save_review.py \
--composite {composite} --sessions {N} --prompts {M} \
--clarity {score} --context {score} --autonomy {score} \
--constraints {score} --checkpoints {score} --followup {score} \
--collaboration {score} --adaptability {score} --outcome {score} \
--source {source} --provider {provider} [--model {model}] [--project {path}] \
[--week YYYY-WNN] \
--improvements '[{"axis":"...","score":X.X,"tip":"...","example":"..."},...]' \
--strengths '[{"axis":"...","score":X.X,"observation":"..."},...]'
Improvements (Top 3): Each entry has axis, score, tip (coaching advice), example (quoted prompt).
Strengths (What's Working Well): Each entry has axis, score, observation.
--week (for backfills): Override the week recorded. Without this, saves use today's week.
Always run this after scoring. Scores accumulate in ~/.claude/prompt-review-history.jsonl.
Provider values: claude, codex, amp, opencode, other
Step 6: Show Trend
After saving, if history has 2+ reviews, show the trend:
python3 {skill_dir}/scripts/show_trend.py --weeks 8
Filter to a single provider:
python3 {skill_dir}/scripts/show_trend.py --weeks 8 --provider codex
CSV export (for spreadsheet charting):
python3 {skill_dir}/scripts/show_trend.py --csv --weeks 12
Trend Output Template
## Prompt Score Trend (2025-W01 to 2025-W04)
**Composite:** 0.78 (+0.09) ▃▄▅▆
| Week | Composite | Sessions | Prompts | Trend |
|------|-----------|----------|---------|-------|
| 2025-W01 | 0.69 | 8 | 42 | -- |
| 2025-W02 | 0.72 | 6 | 35 | +0.03 |
| 2025-W03 | 0.74 | 10 | 51 | +0.02 |
| 2025-W04 | 0.78 | 7 | 38 | +0.04 |
### Per-Axis Trends
| Axis | Max | Current | Trend | Spark |
|------|-----|---------|-------|-------|
| Clarity | 3 | 2.5 | +0.3 | ▄▅▅▆ |
| Context | 3 | 2.0 | +0.2 | ▃▃▄▅ |
| ... | ... | ... | ... | ... |
### Movers
- **Most improved:** Clarity (+0.3/3)
- **Needs attention:** Constraints (-0.2/2)
### By Provider
| Provider | Reviews | Avg Score | Spark |
|----------|---------|-----------|-------|
| claude | 12 | 0.76 | ▅▅▆▆ |
| codex | 5 | 0.68 | ▄▅▅ |
| amp | 3 | 0.72 | ▅▅▆ |
The "By Provider" section appears automatically when history contains multiple providers.
Backfill Workflow
To build trend history from past sessions, use the backfill flow.
Quick Backfill (recommended)
When user says "backfill next codex" or "backfill next claude":
- Run
list_weeks.pyto find the next unreviewed week for that provider - Immediately run extract_sessions.py for that week
- Score all prompts, output review, save with
--weekoverride - Ask about purging
- Offer: "Continue with next week?"
No questions needed — just do the next week.
Manual Backfill
If user wants to see status first or pick a specific week:
Step 1: List Available Weeks
python3 {skill_dir}/scripts/list_weeks.py
Or get the full backfill prompt directly:
python3 {skill_dir}/scripts/list_weeks.py --provider codex --prompt
Output shows all weeks with session data, which providers have data, and which have been reviewed:
## Session Weeks Available
| Week | Claude | Codex | OpenCode | Reviewed |
|------|--------|-------|----------|----------|
| 2025-W36 | — | 10 | — | — |
| 2025-W37 | — | 173 | — | — |
| 2025-W38 | — | 131 | — | codex |
| 2026-W05 | 656 | 3 | — | claude, codex |
### Next to Backfill
- **codex**: 2025-W36 (10 sessions)
### Backfill Command
python3 ~/.claude/skills/prompt-reviewer/scripts/extract_sessions.py \
--source codex \
--since 2025-09-01 \
--until 2025-09-07 \
--limit 100
Step 2: Review the Week
Run the suggested extract command, then score ALL prompts (not a sample). Use the standard review flow:
- Extract sessions for the week
- Score every user prompt on all 9 axes
- Output the review with week label (e.g., "2025-W36 Backfill")
- Save scores with
save_review.py
Step 3: Repeat
Run list_weeks.py again to see updated status and get the next week to review.
Backfill Prompt Template
For handing off a single week to another agent:
You are doing a prompt quality backfill review for week {WEEK} ({PROVIDER} sessions).
## Step 1: Extract sessions
Run:
python3 ~/.claude/skills/prompt-reviewer/scripts/extract_sessions.py \
--source {provider} \
--since {start_date} \
--until {end_date} \
--limit 100
## Step 2: Score all user prompts
Read ~/.claude/skills/prompt-reviewer/references/scoring-rubric.md.
Score EVERY user prompt (not a sample) on all 9 axes. Compute averages.
Composite = sum(axis averages) / 23
## Step 3: Output the review
## Prompt Review - {WEEK} ({PROVIDER} Backfill)
**Composite: X.XX / 1.00**
Sessions: N | Prompts: M | Provider: {provider}
### Top 3 Improvements
1. **[Axis]** (X.X/max): [Coaching tip]
> Example: "[quoted prompt]"
2. ...
3. ...
### What's Working Well
- **[Axis]** (X.X/max): [Positive observation]
## Step 4: Save scores (include qualitative insights!)
python3 ~/.claude/skills/prompt-reviewer/scripts/save_review.py \
--composite {composite} --sessions {N} --prompts {M} \
--clarity {avg} --context {avg} --autonomy {avg} \
--constraints {avg} --checkpoints {avg} --followup {avg} \
--collaboration {avg} --adaptability {avg} --outcome {avg} \
--source {provider} --provider {provider} --week {WEEK} \
--improvements '[{"axis":"...","score":X.X,"tip":"...","example":"..."},...]' \
--strengths '[{"axis":"...","score":X.X,"observation":"..."},...]'
IMPORTANT: Use --week {WEEK} to record the original week, not today's date.
## Step 5: Ask about purging old sessions (optional)
After saving, ask the user if they want to delete the reviewed session files to free disk space.
First preview (always use --dry-run first):
```bash
python3 {skill_dir}/scripts/purge_sessions.py \
--provider {provider} --week {WEEK} --dry-run
If user confirms, run without --dry-run:
python3 {skill_dir}/scripts/purge_sessions.py \
--provider {provider} --week {WEEK}
NEVER delete without asking first.
## Scoring Examples
### Clarity
**3/3** - Goal + file + validation:
> "Open `src/auth/login.ts`, confirm the OAuth callback matches the README spec, and stop before making changes."
**1/3** - Vague request:
> "make the transition look better"
**0/3** - No actionable request:
> "continue"
**Coaching**: Instead of "make it better", try "fade the transition over 0.5s with ease-out, and show me before applying"
### Context & Inputs
**3/3** - Error + file + steps:
> "Getting `TypeError: undefined is not a function` at line 42 of `utils.ts` when I run `npm test`. Here's the failing test output: [pasted]"
**1/3** - Missing details:
> "there's a bug in the auth flow"
**Coaching**: Paste the error message, specify which file, include reproduction steps
### Follow-up Efficiency
**3/3** - Adds new context:
> "Good, but the animation is too fast. Slow it to 300ms and add a subtle bounce at the end."
**0/3** - Filler:
> "yes"
**Coaching**: Instead of bare "yes", try "yes, and also check that it works on mobile"
## Scoring Philosophy
**Be constructive, not punitive.** The goal is improvement.
- Quote specific prompts as evidence
- Highlight what's working, not just problems
- Make coaching actionable ("Instead of X, try Y")
- Acknowledge context (quick questions don't need full briefs)