sc-evaluate
LLM Evaluation Skill
Run LLM pipeline evaluation against gold standard datasets using oracle LLM-as-judge scoring. Measures output quality across weighted dimensions, identifies weak steps, and suggests prompt improvements.
Quick Start
# Full evaluation (all test cases, all steps)
/sc:evaluate
# Quick spot check
/sc:evaluate --cases=case_1,case_2 --steps=1,2,3
# Re-evaluate existing results without re-running pipeline
/sc:evaluate --skip-pipeline
# Generate outputs only (no evaluation)
/sc:evaluate --skip-eval
# Specify judge model
/sc:evaluate --judge-model=gpt-4o
# Dry run to preview plan
/sc:evaluate --dry-run
Behavioral Flow
- Discover - Find evaluation script, gold standards, and prompt files
- Configure - Parse scope (cases, steps, model overrides)
- Execute - Run pipeline on gold standard inputs
- Evaluate - Score outputs against gold standards via LLM-as-judge
- Analyze - Identify weak steps, dimension breakdowns, patterns
- Recommend - Suggest specific prompt improvements for low-scoring steps
- Report - Generate JSON + Markdown evaluation reports
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --cases | string | all | Comma-separated test case IDs to evaluate |
| --steps | string | all | Comma-separated step numbers to evaluate |
| --model | string | env default | Override pipeline model |
| --judge-model | string | env default | Override judge/oracle model |
| --skip-pipeline | bool | false | Skip pipeline execution, evaluate existing results |
| --skip-eval | bool | false | Run pipeline only, skip evaluation |
| --dry-run | bool | false | Preview execution plan without API calls |
| --output | string | eval_runs/YYYYMMDD_HHMMSS/ | Output directory |
| --concurrency | int | 5 | Parallel judge calls |
| --threshold | int | 70 | Score threshold for "needs improvement" |
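As a rough illustration of how these flags and defaults could be parsed, here is a minimal sketch using Python's `argparse`. The helper names `csv_list` and `build_parser` are hypothetical, and using `None` to stand for "all" or "env default" is an assumption, not something the skill prescribes.

```python
import argparse


def csv_list(value: str) -> list[str]:
    """Split a comma-separated flag value into a list of non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flag table (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="sc-evaluate")
    # Assumption: None stands for "all" (cases/steps) or "env default" (models).
    parser.add_argument("--cases", type=csv_list, default=None)
    parser.add_argument("--steps", type=csv_list, default=None)
    parser.add_argument("--model", default=None)
    parser.add_argument("--judge-model", default=None)
    parser.add_argument("--skip-pipeline", action="store_true")
    parser.add_argument("--skip-eval", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    # Assumption: None means "use a timestamped directory under eval_runs/".
    parser.add_argument("--output", default=None)
    parser.add_argument("--concurrency", type=int, default=5)
    parser.add_argument("--threshold", type=int, default=70)
    return parser


# Parse the "quick spot check" invocation from Quick Start.
args = build_parser().parse_args(["--cases=case_1,case_2", "--steps=1,2,3"])
print(args.cases, args.steps, args.threshold)
```

Step numbers stay as strings here so they can be forwarded unchanged to the evaluation script.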
Phase 1: Discover Project Structure
Locate evaluation components:
| Component | Common Locations | Purpose |
|---|---|---|
| Evaluation script | scripts/run_eval.py, eval/run.py | Orchestrates pipeline + scoring |
| Gold standards | gold_standards/, test_data/, fixtures/ | Expected outputs |
| Prompts | prompts/, templates/ | Pipeline prompt templates |
| Rubrics | eval/rubrics.py, config/rubrics.yaml | Scoring dimensions and weights |
If no standard structure is found, ask the user to specify paths.
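The discovery step can be sketched as a lookup over the common locations listed above. This is only an illustration: `CANDIDATES`, `discover`, and `missing_components` are hypothetical names, and a real project may need additional paths.

```python
from pathlib import Path
from typing import Optional

# Candidate locations copied from the component table; adjust per project.
CANDIDATES = {
    "eval_script": ["scripts/run_eval.py", "eval/run.py"],
    "gold_standards": ["gold_standards", "test_data", "fixtures"],
    "prompts": ["prompts", "templates"],
    "rubrics": ["eval/rubrics.py", "config/rubrics.yaml"],
}


def discover(root: str = ".") -> dict[str, Optional[Path]]:
    """Return the first existing path for each component, or None if missing."""
    base = Path(root)
    found: dict[str, Optional[Path]] = {}
    for component, locations in CANDIDATES.items():
        found[component] = next(
            (base / loc for loc in locations if (base / loc).exists()), None
        )
    return found


def missing_components(found: dict[str, Optional[Path]]) -> list[str]:
    """List components that were not located and need a user-supplied path."""
    return [name for name, path in found.items() if path is None]
```

Anything returned by `missing_components` is what the user would be asked to specify.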
Phase 2: Configure Scope
Parse arguments to determine:
- Which test cases to run (default: all discovered)
- Which pipeline steps to evaluate (default: all)
- Model overrides for pipeline and judge
- Output directory (default: timestamped)
Create output directory:
OUTPUT_DIR="${output:-eval_runs/$(date +%Y%m%d_%H%M%S)}"
mkdir -p "$OUTPUT_DIR"
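For projects that drive the run from Python rather than the shell, the same step might look like the sketch below. It also folds in the "append timestamp suffix" rule from Error Handling. The function name `resolve_output_dir` is hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_output_dir(output: Optional[str] = None, root: str = "eval_runs") -> Path:
    """Create and return the run directory without reusing an existing one."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(output) if output else Path(root) / stamp
    if path.exists():
        # Existing directory: append a timestamp suffix instead of overwriting.
        path = path.with_name(f"{path.name}_{stamp}")
    # exist_ok=False raises FileExistsError if the suffixed name is also taken,
    # so earlier results are never silently mixed with new ones.
    path.mkdir(parents=True, exist_ok=False)
    return path
```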
Phase 3: Execute Pipeline
Run the pipeline on gold standard inputs:
python <eval_script> \
--output "$OUTPUT_DIR" \
--verbose \
[--cases CASES] \
[--steps STEPS] \
[--model MODEL] \
[--skip-pipeline] \
[--skip-eval]
API call estimation:
- Pipeline: steps x cases API calls
- Evaluation: scored_dimensions x cases judge calls
For quick validation, suggest running on 1-2 cases with 2-3 steps first.
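The estimate above is simple arithmetic, and a small helper makes it easy to warn before a full run. In this sketch, `estimate_api_calls` is a hypothetical name, and `scored_dimensions` is assumed to be the total number of dimensions scored across the selected steps.

```python
def estimate_api_calls(num_cases: int, num_steps: int, scored_dimensions: int) -> dict[str, int]:
    """Estimate API calls using the two formulas from the list above."""
    pipeline = num_steps * num_cases            # steps x cases
    evaluation = scored_dimensions * num_cases  # scored_dimensions x cases
    return {
        "pipeline": pipeline,
        "evaluation": evaluation,
        "total": pipeline + evaluation,
    }


# Example: 10 cases, 5 steps, 4 dimensions per step (20 scored dimensions).
print(estimate_api_calls(num_cases=10, num_steps=5, scored_dimensions=20))
# {'pipeline': 50, 'evaluation': 200, 'total': 250}
```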
Phase 4: Evaluate with LLM-as-Judge
For each step output, compare against the gold standard using an oracle LLM-as-judge.
Evaluation dimensions (customizable per project):
| Dimension | What It Measures |
|---|---|
| Content Agreement | Do outputs cover the same key points? |
| Structure Match | Is the organization/format similar? |
| Detail Accuracy | Are specific claims and data correct? |
| Completeness | Are all expected elements present? |
Each dimension has a weight between 0.0 and 1.0, and the weights for a step must sum to 1.0.
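A minimal sketch of how the per-dimension scores could be combined into a step score, assuming each dimension is scored on a 0-100 scale. The function name and the example weights are hypothetical; real weights come from the project's rubrics.

```python
import math


def weighted_step_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores into one step score via a weighted sum."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each weight must be between 0.0 and 1.0")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0 for a step")
    return sum(scores[dim] * weights[dim] for dim in weights)


# Hypothetical weights and judge scores for a single step.
weights = {
    "content_agreement": 0.4,
    "structure_match": 0.2,
    "detail_accuracy": 0.3,
    "completeness": 0.1,
}
scores = {
    "content_agreement": 80,
    "structure_match": 90,
    "detail_accuracy": 60,
    "completeness": 70,
}
step_score = weighted_step_score(scores, weights)  # roughly 75 on a 0-100 scale
```

Validating the weights up front catches rubric typos before any judge calls are spent.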
Phase 5: Analyze Results
Read and analyze the evaluation report:
- Overall similarity score across all cases and steps
- Per-step scores — highlight any below threshold (default: 70/100)
- Per-case scores — identify consistently weak test cases
- Dimension breakdowns for weak steps
Score interpretation:
| Score Range | Assessment | Action |
|---|---|---|
| 85-100 | Excellent | No changes needed |
| 70-84 | Good | Minor tuning possible |
| 60-69 | Needs improvement | Prompt revision recommended |
| Below 60 | Poor | Prompt likely needs rewrite |
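The interpretation table maps directly onto a small lookup. The sketch below uses the hypothetical names `assess` and `weak_steps`. Because the table lists integer ranges, it assumes a fractional score such as 84.5 falls into the lower band.

```python
def assess(score: float) -> tuple[str, str]:
    """Map a 0-100 score to (assessment, action) per the interpretation table."""
    if score >= 85:
        return ("Excellent", "No changes needed")
    if score >= 70:
        return ("Good", "Minor tuning possible")
    if score >= 60:
        return ("Needs improvement", "Prompt revision recommended")
    return ("Poor", "Prompt likely needs rewrite")


def weak_steps(step_scores: dict[str, float], threshold: float = 70) -> list[str]:
    """Return steps scoring below the threshold, weakest first."""
    below = [step for step, score in step_scores.items() if score < threshold]
    return sorted(below, key=lambda step: step_scores[step])


# Hypothetical per-step scores.
step_scores = {"step_01": 88.0, "step_02": 64.5, "step_03": 52.0}
for step in weak_steps(step_scores):
    print(step, *assess(step_scores[step]))
```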
Phase 6: Recommend Improvements
For each step scoring below threshold:
- Read the current prompt template
- Read the gold standard output (expected)
- Read the pipeline output (actual)
- Compare and identify gaps:
  - Missing instructions that gold standard captures
  - Overly broad instructions causing divergent output
  - Format/structure differences
  - Specificity gaps
Present actionable suggestions:
### Step N: <step_name> (Score: XX/100)
**Weakest Dimension**: <dimension> (XX/100)
**Gap Analysis**:
- Gold standard includes <X> but prompt doesn't instruct it
- Output format diverges: gold uses <format>, output uses <other>
**Suggested Prompt Changes**:
1. Add instruction: "<specific instruction>"
2. Clarify format: "<format guidance>"
3. Add example: "<example output snippet>"
Output Structure
eval_runs/YYYYMMDD_HHMMSS/
  results/                    # Pipeline outputs
    case_1/
      step_01_<name>.md
      step_02_<name>.md
      ...
    case_2/
      ...
  evaluation/                 # Judge scores
    evaluation_report.json
    evaluation_report.md
    per_step_scores.csv
    per_case_scores.csv
MCP Integration
PAL MCP (Optional)
| Tool | When | Purpose |
|---|---|---|
| mcp__pal__thinkdeep | Low-scoring steps | Deep analysis of why outputs diverge |
| mcp__pal__consensus | Prompt revision | Multi-model validation of proposed changes |
| mcp__pal__codereview | Eval script | Review evaluation pipeline code |
Rube MCP (Optional)
| Tool | When | Purpose |
|---|---|---|
| mcp__rube__RUBE_REMOTE_WORKBENCH | Large eval runs | Process results in Python sandbox |
| mcp__rube__RUBE_MULTI_EXECUTE_TOOL | Notifications | Report results to Slack/email |
Error Handling
| Scenario | Action |
|---|---|
| No eval script found | Ask user for script path |
| No gold standards found | Ask user for gold standard directory |
| API rate limit | Reduce concurrency, add delays |
| Pipeline step fails | Log error, continue with remaining steps |
| Judge returns invalid score | Retry once, then flag for manual review |
| Output directory exists | Append timestamp suffix |
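The "retry once, then flag for manual review" rule for invalid judge scores could be sketched as follows. Both function names are hypothetical, and the judge is assumed to return something convertible to a number on a 0-100 scale. A `None` result marks the item for manual review.

```python
from typing import Callable, Optional


def parse_score(raw: object) -> Optional[float]:
    """Return the score as a float if it is a valid 0-100 value, else None."""
    try:
        value = float(raw)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None
    # NaN fails this comparison, so it is rejected as well.
    return value if 0.0 <= value <= 100.0 else None


def score_with_retry(judge: Callable[[], object], max_retries: int = 1) -> Optional[float]:
    """Call the judge, retrying on invalid output; None means manual review."""
    for _ in range(max_retries + 1):
        score = parse_score(judge())
        if score is not None:
            return score
    return None


# Usage with a stand-in judge that fails once and then returns a valid score.
responses = iter(["n/a", "82"])
print(score_with_retry(lambda: next(responses)))  # 82.0
```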
Guardrails
- Always pass --verbose for progress visibility
- Warn about API call counts before full runs
- Suggest quick validation on subset before full evaluation
- Preserve all intermediate outputs for debugging
- Never modify gold standard files
Tool Coordination
- Bash - Run evaluation scripts
- Read - Inspect prompts, gold standards, outputs, reports
- Write - Generate reports
- Grep - Search for patterns in outputs
- PAL MCP - Deep analysis of score gaps