# Evals
## Customization

Before executing, check for user customizations at:

`~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/`

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
## 🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

```bash
curl -s -X POST http://localhost:31337/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
```

Output text notification:

Running the **WorkflowName** workflow in the **Evals** skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.
## Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

**Key differentiator:** Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
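As a rough illustration of what "evaluating the workflow" means, the sketch below shows the kind of record a transcript-level grader consumes. These shapes are hypothetical, written for this example only; the real definitions live in `Types/index.ts` and may differ.

```typescript
// Hypothetical shapes for illustration only; see Types/index.ts for the real definitions.
interface ToolCall {
  name: string;                 // e.g. "read_file", "edit_file"
  input: Record<string, unknown>;
  output?: string;
}

interface Transcript {
  taskId: string;
  turns: { role: "user" | "assistant"; content: string }[];
  toolCalls: ToolCall[];        // inspected by tool_calls / state_check graders
  finalOutput: string;          // inspected by string_match / llm_rubric graders
}
```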
## When to Activate
- "run evals", "test this agent", "evaluate", "check quality", "benchmark"
- "regression test", "capability test"
- "run scenario", "multi-turn eval", "simulated user test"
- "create scenario", "simulate conversation"
- Compare agent behaviors across changes
- Validate agent workflows before deployment
- Verify ALGORITHM ISC rows
- Create new evaluation tasks from failures
## Core Concepts

### Three Grader Types
| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
### Evaluation Types
| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |
### Key Metrics

- `pass@k`: Probability of at least 1 success in k trials (measures capability); see the sketch below
- `pass^k`: Probability that all k trials succeed (measures consistency/reliability)
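Assuming independent trials with a common per-trial success rate p, these reduce to pass@k = 1 - (1 - p)^k and pass^k = p^k. A minimal sketch of estimating both from observed trial outcomes:

```typescript
// Estimate pass@k and pass^k from n observed trial outcomes, assuming
// trials are independent with a common success probability p.
function passMetrics(outcomes: boolean[], k: number) {
  const n = outcomes.length;
  const p = outcomes.filter(Boolean).length / n;  // empirical success rate
  return {
    passAtK: 1 - Math.pow(1 - p, k),  // P(at least one of k trials succeeds)
    passHatK: Math.pow(p, k),         // P(all k trials succeed)
  };
}

// Example: 2 successes out of 3 trials, projected to k = 3:
// passAtK ≈ 0.96 (capability looks strong), passHatK ≈ 0.30 (consistency is weak)
console.log(passMetrics([true, true, false], 3));
```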
## Workflow Routing
| Request Pattern | Route To |
|---|---|
| Run eval, evaluate suite, run tests, benchmark | Workflows/RunEval.md |
| Compare models, model comparison, A/B test models | Workflows/CompareModels.md |
| Compare prompts, prompt comparison, test prompts | Workflows/ComparePrompts.md |
| Create judge, model grader, evaluation judge | Workflows/CreateJudge.md |
| Create use case, new eval, test case, create suite | Workflows/CreateUseCase.md |
| Run scenario, multi-turn eval, simulated user test | Workflows/RunScenario.md |
| Create scenario, new multi-turn eval, simulate conversation | Workflows/CreateScenario.md |
| View results, eval results, scores, pass rate | Workflows/ViewResults.md |
## CLI Quick Reference
| Trigger | Tool |
|---|---|
| Run suite | Tools/AlgorithmBridge.ts |
| Log failure | Tools/FailureToTask.ts log |
| Convert failures | Tools/FailureToTask.ts convert-all |
| Create suite | Tools/SuiteManager.ts create |
| Check saturation | Tools/SuiteManager.ts check-saturation |
| Run scenario | Tools/ScenarioRunner.ts --scenario <path> |
## Quick Reference

### CLI Commands

```bash
# Run an eval suite
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts list
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts check-saturation <name>
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts graduate <name>
```
## ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

```bash
# Run eval and update ISC row
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
```

ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
## Available Graders

### Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| `string_match` | Exact substring matching |
| `regex_match` | Pattern matching |
| `binary_tests` | Run test files |
| `static_analysis` | Lint, type-check, security scan |
| `state_check` | Verify system state after execution |
| `tool_calls` | Verify specific tools were called |
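As a rough illustration, two of these graders might look like the sketch below. The `GraderResult` shape is an assumption made for this example; the real implementations live in `Graders/CodeBased/` and may differ.

```typescript
// Illustrative sketch only; the real grader interface lives in Graders/CodeBased/.
interface GraderResult {
  grader: string;
  score: number;   // 0 = fail, 1 = pass for binary graders
  detail?: string;
}

// string_match: pass only if every required substring appears in the output.
function stringMatch(output: string, required: string[]): GraderResult {
  const missing = required.filter((s) => !output.includes(s));
  return {
    grader: "string_match",
    score: missing.length === 0 ? 1 : 0,
    detail: missing.length ? `missing: ${missing.join(", ")}` : "all substrings found",
  };
}

// tool_calls: pass if the required tools appear in order (other calls may interleave).
function toolCallsGrader(calls: string[], sequence: string[]): GraderResult {
  let i = 0;
  for (const call of calls) if (call === sequence[i]) i++;
  return { grader: "tool_calls", score: i === sequence.length ? 1 : 0 };
}
```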
### Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| `llm_rubric` | Score against detailed rubric |
| `natural_language_assert` | Check assertions are true |
| `pairwise_comparison` | Compare to reference with position swap |
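The position swap in `pairwise_comparison` guards against LLM judges favoring whichever answer appears first. A hedged sketch of that pattern, where `judge` is an injected stand-in for a model call (not an API from this repo):

```typescript
// Sketch of pairwise comparison with position swap. The judge function is a
// hypothetical model call that returns which position it preferred.
type Verdict = "A" | "B" | "tie";
type Judge = (first: string, second: string) => Promise<Verdict>;

async function pairwiseCompare(
  candidate: string,
  reference: string,
  judge: Judge,
): Promise<Verdict> {
  const forward = await judge(candidate, reference);   // candidate in position A
  const reversed = await judge(reference, candidate);  // candidate in position B
  if (forward === "tie" || reversed === "tie") return "tie";
  const candWinsForward = forward === "A";
  const candWinsReversed = reversed === "B";
  if (candWinsForward && candWinsReversed) return "A";   // consistent win for candidate
  if (!candWinsForward && !candWinsReversed) return "B"; // consistent win for reference
  return "tie";  // verdict flipped with position: treat as position bias, score a tie
}
```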
## Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|---|---|
| `coding` | binary_tests + static_analysis + tool_calls + llm_rubric |
| `conversational` | llm_rubric + natural_language_assert + state_check |
| `research` | llm_rubric + natural_language_assert + tool_calls |
| `computer_use` | state_check + tool_calls + llm_rubric |

See `Data/DomainPatterns.yaml` for full configurations.
## Task Schema (YAML)

```yaml
task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression # or capability
  domain: coding
  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30
    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]
    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md
  trials: 3
  pass_threshold: 0.75
```
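One plausible reading of how these fields combine: each grader's score is multiplied by its `weight`, and a trial passes when the weighted sum clears `pass_threshold`. The actual aggregation lives in `Tools/TrialRunner.ts` and may differ; this is only a sketch:

```typescript
// Sketch: weight-average grader scores for one trial, pass when the blend
// clears pass_threshold. Assumes scores are normalized to [0, 1].
interface GraderScore { weight: number; score: number }

function trialPasses(scores: GraderScore[], passThreshold: number): boolean {
  const total = scores.reduce((sum, g) => sum + g.weight * g.score, 0);
  return total >= passThreshold;
}

// For the example task: binary_tests pass (1.0), tool sequence half-matched (0.5),
// rubric scored 0.8 -> 0.30*1 + 0.20*0.5 + 0.50*0.8 = 0.80 >= 0.75 -> pass.
console.log(trialPasses(
  [{ weight: 0.30, score: 1 }, { weight: 0.20, score: 0.5 }, { weight: 0.50, score: 0.8 }],
  0.75,
));
```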
## Resource Index

| Resource | Purpose |
|---|---|
| `Types/index.ts` | Core type definitions |
| `Graders/CodeBased/` | Deterministic graders |
| `Graders/ModelBased/` | LLM-powered graders |
| `Tools/TranscriptCapture.ts` | Capture agent trajectories |
| `Tools/TrialRunner.ts` | Multi-trial execution with pass@k |
| `Tools/SuiteManager.ts` | Suite management and saturation |
| `Tools/FailureToTask.ts` | Convert failures to test tasks |
| `Tools/AlgorithmBridge.ts` | ALGORITHM integration |
| `Tools/ScenarioRunner.ts` | Multi-turn scenario runner (langwatch/scenario) |
| `Tools/PAIAgentAdapter.ts` | Wraps PAI Inference.ts as a scenario AgentAdapter |
| `Tools/ScenarioToTranscript.ts` | Converts scenario results to Evals Transcript/Trial/GraderResult |
| `Scenarios/` | Authored multi-turn scenarios (.scenario.ts) |
| `Data/DomainPatterns.yaml` | Domain-specific grader configs |
## Key Principles (from Anthropic)

- **Start with 20-50 real failures** - Don't overthink, capture what actually broke
- **Unambiguous tasks** - Two experts should reach identical verdicts
- **Balanced problem sets** - Test both "should do" AND "should NOT do"
- **Grade outputs, not paths** - Don't penalize valid creative solutions
- **Calibrate LLM judges** - Against human expert judgment
- **Check transcripts regularly** - Verify graders work correctly
- **Monitor saturation** - Graduate to regression when hitting 95%+
- **Build infrastructure early** - Evals shape how quickly you can adopt new models
## Related
- ALGORITHM: Evals is a verification method
- Science: Evals implements scientific method
- Browser: For visual verification graders
## Gotchas
- Choose the right grader type: Code-based for deterministic checks (fast, cheap). Model-based for nuanced quality (flexible, expensive). Human for calibration (gold standard, slow).
- pass@k scoring requires multiple runs. A single run doesn't give statistical significance. Default to pass@3 minimum.
- Transcript capture must be enabled BEFORE the test run. Can't retroactively capture transcripts.
- Eval results go to the current work directory — not a global location. Tie evals to the work item.
- Don't evaluate skills with trivial prompts. Simple one-liners may not trigger skill usage. Test prompts must be substantive.
## Examples

### Example 1: Compare two prompts

User: "evaluate which prompt produces better summaries"
→ Creates eval suite with 3+ test cases
→ Runs both prompts against test cases
→ Model-based grader scores quality
→ Reports pass@k and comparative analysis
### Example 2: Regression test a skill change

User: "run evals on the Research skill after the update"
→ Uses existing test fixtures for Research
→ Before/after comparison
→ Reports any quality regressions
## Execution Log

After completing any workflow, append a single JSONL entry:

```bash
echo '{"ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","skill":"Evals","workflow":"WORKFLOW_USED","input":"8_WORD_SUMMARY","status":"ok|error","duration_s":SECONDS}' >> ~/.claude/PAI/MEMORY/SKILLS/execution.jsonl
```

Replace WORKFLOW_USED with the workflow executed, 8_WORD_SUMMARY with a brief input description, and SECONDS with approximate wall-clock time. Set status to "error" if the workflow failed.