sc-evaluate
LLM Evaluation Skill
Run LLM pipeline evaluation against gold standard datasets using oracle LLM-as-judge scoring. Measures output quality across weighted dimensions, identifies weak steps, and suggests prompt improvements.
Quick Start
```
# Full evaluation (all test cases, all steps)
/sc:evaluate

# Quick spot check
/sc:evaluate --cases=case_1,case_2 --steps=1,2,3

# Re-evaluate existing results without re-running pipeline
/sc:evaluate --skip-pipeline

# Generate outputs only (no evaluation)
/sc:evaluate --skip-eval

# Specify judge model
/sc:evaluate --judge-model=gpt-4o

# Dry run to preview plan
/sc:evaluate --dry-run
```
Behavioral Flow
- Discover - Find evaluation script, gold standards, and prompt files
- Configure - Parse scope (cases, steps, model overrides)
- Execute - Run pipeline on gold standard inputs
- Evaluate - Score outputs against gold standards via LLM-as-judge
- Analyze - Identify weak steps, dimension breakdowns, patterns
- Recommend - Suggest specific prompt improvements for low-scoring steps
- Report - Generate JSON + Markdown evaluation reports
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--cases` | string | all | Comma-separated test case IDs to evaluate |
| `--steps` | string | all | Comma-separated step numbers to evaluate |
| `--model` | string | env default | Override pipeline model |
| `--judge-model` | string | env default | Override judge/oracle model |
| `--skip-pipeline` | bool | false | Skip pipeline execution, evaluate existing results |
| `--skip-eval` | bool | false | Run pipeline only, skip evaluation |
| `--dry-run` | bool | false | Preview execution plan without API calls |
| `--output` | string | `eval_runs/YYYYMMDD_HHMMSS/` | Output directory |
| `--concurrency` | int | 5 | Parallel judge calls |
| `--threshold` | int | 70 | Score threshold for "needs improvement" |
Phase 1: Discover Project Structure
Locate evaluation components:
| Component | Common Locations | Purpose |
|---|---|---|
| Evaluation script | `scripts/run_eval.py`, `eval/run.py` | Orchestrates pipeline + scoring |
| Gold standards | `gold_standards/`, `test_data/`, `fixtures/` | Expected outputs |
| Prompts | `prompts/`, `templates/` | Pipeline prompt templates |
| Rubrics | `eval/rubrics.py`, `config/rubrics.yaml` | Scoring dimensions and weights |
If no standard structure is found, ask the user to specify paths.
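A minimal discovery sketch, assuming the common locations from the table above (the helper name and return shape are illustrative, not part of the skill):

```python
from pathlib import Path

# Candidate locations mirroring the table above; these paths are assumptions,
# not guaranteed to exist in any given project.
CANDIDATES = {
    "eval_script": ["scripts/run_eval.py", "eval/run.py"],
    "gold_standards": ["gold_standards", "test_data", "fixtures"],
    "prompts": ["prompts", "templates"],
    "rubrics": ["eval/rubrics.py", "config/rubrics.yaml"],
}

def discover(root: str = ".") -> dict:
    """Return the first existing path per component, or None if nothing matches."""
    return {
        component: next((p for p in paths if (Path(root) / p).exists()), None)
        for component, paths in CANDIDATES.items()
    }
```

Any component that resolves to None falls back to asking the user for an explicit path.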
Phase 2: Configure Scope
Parse arguments to determine:
- Which test cases to run (default: all discovered)
- Which pipeline steps to evaluate (default: all)
- Model overrides for pipeline and judge
- Output directory (default: timestamped)
Create output directory:
```bash
OUTPUT_DIR="${output:-eval_runs/$(date +%Y%m%d_%H%M%S)}"
mkdir -p "$OUTPUT_DIR"
```
Phase 3: Execute Pipeline
Run the pipeline on gold standard inputs:
```bash
python <eval_script> \
  --output "$OUTPUT_DIR" \
  --verbose \
  [--cases CASES] \
  [--steps STEPS] \
  [--model MODEL] \
  [--skip-pipeline] \
  [--skip-eval]
```
API call estimation:
- Pipeline: steps x cases API calls
- Evaluation: scored_dimensions x cases judge calls
For quick validation, suggest running on 1-2 cases with 2-3 steps first.
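A rough call-count preview following the estimation rules above helps make the warning concrete; a sketch (scored_dimensions is the total number of dimension scores across the selected steps):

```python
def estimate_calls(n_cases: int, n_steps: int, scored_dimensions: int) -> dict:
    """Estimate API call volume before committing to a full run."""
    pipeline_calls = n_steps * n_cases         # one pipeline call per step per case
    judge_calls = scored_dimensions * n_cases  # one judge call per scored dimension per case
    return {
        "pipeline": pipeline_calls,
        "judge": judge_calls,
        "total": pipeline_calls + judge_calls,
    }

# Example: 10 cases, 6 steps, 4 dimensions per step (24 scored dimensions)
print(estimate_calls(n_cases=10, n_steps=6, scored_dimensions=24))
# {'pipeline': 60, 'judge': 240, 'total': 300}
```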
Phase 4: Evaluate with LLM-as-Judge
For each step output, compare against gold standard using oracle LLM-as-judge:
Evaluation dimensions (customizable per project):
| Dimension | What It Measures |
|---|---|
| Content Agreement | Do outputs cover the same key points? |
| Structure Match | Is the organization/format similar? |
| Detail Accuracy | Are specific claims and data correct? |
| Completeness | Are all expected elements present? |
Each dimension has a weight between 0.0 and 1.0; the weights for a step sum to 1.0.
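A sketch of how weighted dimension scores roll up into a per-step score (the weight split below is an assumption for illustration; real projects define their own rubric in `eval/rubrics.py` or `rubrics.yaml`):

```python
# Assumed weights for one step; they must sum to 1.0.
WEIGHTS = {
    "content_agreement": 0.40,
    "structure_match": 0.20,
    "detail_accuracy": 0.25,
    "completeness": 0.15,
}

def step_score(dimension_scores: dict) -> float:
    """Combine 0-100 judge scores per dimension into a weighted step score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# Example judge output for one step of one case
print(step_score({
    "content_agreement": 82,
    "structure_match": 90,
    "detail_accuracy": 74,
    "completeness": 65,
}))  # 79.05
```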
Phase 5: Analyze Results
Read and analyze evaluation report:
- Overall similarity score across all cases and steps
- Per-step scores — highlight any below threshold (default: 70/100)
- Per-case scores — identify consistently weak test cases
- Dimension breakdowns for weak steps
Score interpretation:
| Score Range | Assessment | Action |
|---|---|---|
| 85-100 | Excellent | No changes needed |
| 70-84 | Good | Minor tuning possible |
| 60-69 | Needs improvement | Prompt revision recommended |
| Below 60 | Poor | Prompt likely needs rewrite |
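The bands above can be applied mechanically when flagging steps; a small sketch using the default threshold of 70:

```python
def assess(score: float) -> str:
    """Map a 0-100 step score to the assessment bands above."""
    if score >= 85:
        return "Excellent - no changes needed"
    if score >= 70:
        return "Good - minor tuning possible"
    if score >= 60:
        return "Needs improvement - prompt revision recommended"
    return "Poor - prompt likely needs rewrite"

def weak_steps(step_scores: dict, threshold: float = 70) -> list:
    """Return the names of steps scoring below the improvement threshold."""
    return [name for name, score in step_scores.items() if score < threshold]
```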
Phase 6: Recommend Improvements
For each step scoring below threshold:
- Read the current prompt template
- Read the gold standard output (expected)
- Read the pipeline output (actual)
- Compare and identify gaps:
  - Missing instructions that the gold standard captures
  - Overly broad instructions causing divergent output
  - Format/structure differences
  - Specificity gaps
Present actionable suggestions:
```markdown
### Step N: <step_name> (Score: XX/100)

**Weakest Dimension**: <dimension> (XX/100)

**Gap Analysis**:
- Gold standard includes <X> but prompt doesn't instruct it
- Output format diverges: gold uses <format>, output uses <other>

**Suggested Prompt Changes**:
1. Add instruction: "<specific instruction>"
2. Clarify format: "<format guidance>"
3. Add example: "<example output snippet>"
```
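A sketch of gathering the three artifacts compared during gap analysis (the results layout follows the Output Structure below; the gold-standard path and file naming are assumptions):

```python
from pathlib import Path

def gap_inputs(run_dir: str, case: str, step_file: str, prompt_file: str) -> dict:
    """Collect prompt, expected, and actual text for one weak step."""
    return {
        "prompt": Path(prompt_file).read_text(),                          # current prompt template
        "gold": Path("gold_standards", case, step_file).read_text(),      # assumed gold layout
        "actual": Path(run_dir, "results", case, step_file).read_text(),  # see Output Structure
    }
```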
Output Structure
```
eval_runs/YYYYMMDD_HHMMSS/
  results/                    # Pipeline outputs
    case_1/
      step_01_<name>.md
      step_02_<name>.md
      ...
    case_2/
      ...
  evaluation/                 # Judge scores
    evaluation_report.json
    evaluation_report.md
    per_step_scores.csv
    per_case_scores.csv
```
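A sketch of consuming `per_step_scores.csv` to feed Phase 6 (the `step` and `score` column names are assumptions about the CSV layout):

```python
import csv
from pathlib import Path

def steps_below_threshold(run_dir: str, threshold: float = 70) -> list:
    """Read per-step scores from an eval run and return rows under threshold."""
    csv_path = Path(run_dir) / "evaluation" / "per_step_scores.csv"
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    return [row for row in rows if float(row["score"]) < threshold]

# Example (hypothetical run directory)
# for row in steps_below_threshold("eval_runs/20250101_120000"):
#     print(row["step"], row["score"])
```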
MCP Integration
PAL MCP (Optional)
| Tool | When | Purpose |
|---|---|---|
| `mcp__pal__thinkdeep` | Low-scoring steps | Deep analysis of why outputs diverge |
| `mcp__pal__consensus` | Prompt revision | Multi-model validation of proposed changes |
| `mcp__pal__codereview` | Eval script | Review evaluation pipeline code |
Rube MCP (Optional)
| Tool | When | Purpose |
|---|---|---|
| `mcp__rube__RUBE_REMOTE_WORKBENCH` | Large eval runs | Process results in Python sandbox |
| `mcp__rube__RUBE_MULTI_EXECUTE_TOOL` | Notifications | Report results to Slack/email |
Error Handling
| Scenario | Action |
|---|---|
| No eval script found | Ask user for script path |
| No gold standards found | Ask user for gold standard directory |
| API rate limit | Reduce concurrency, add delays |
| Pipeline step fails | Log error, continue with remaining steps |
| Judge returns invalid score | Retry once, then flag for manual review |
| Output directory exists | Append timestamp suffix |
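A sketch of the retry-then-flag policy for invalid judge scores (`call_judge` and its return shape are hypothetical placeholders, not a real API):

```python
def score_with_retry(call_judge, payload, retries: int = 1) -> dict:
    """Call the judge; retry once on an unparseable score, then flag for review."""
    for _ in range(retries + 1):
        raw = call_judge(payload)  # hypothetical judge call returning e.g. {"score": "83"}
        try:
            score = float(raw["score"])
        except (KeyError, TypeError, ValueError):
            continue
        if 0 <= score <= 100:
            return {"score": score, "needs_review": False}
    return {"score": None, "needs_review": True}  # flag for manual review
```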
Guardrails
- Always pass `--verbose` for progress visibility
- Warn about API call counts before full runs
- Suggest quick validation on subset before full evaluation
- Preserve all intermediate outputs for debugging
- Never modify gold standard files
Tool Coordination
- Bash - Run evaluation scripts
- Read - Inspect prompts, gold standards, outputs, reports
- Write - Generate reports
- Grep - Search for patterns in outputs
- PAL MCP - Deep analysis of score gaps