Skill-Selection Evals
This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.
Purpose
Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.
Eval Types
- Direct selection: Given a prompt, does the agent pick the correct skill?
- Negative selection: Given a prompt that sounds like skill X but is not, does the agent avoid the false positive?
- Context-dependent: Same verb, different context, different correct skill.
- Cascade ordering: Multi-skill tasks requiring correct invocation order.
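As a rough illustration of how the four eval types could be checked, here is a minimal Python sketch. The `SelectionEval` record, its field names, and the `passes` helper are hypothetical and are not the actual schema of evals/evals.json. It assumes the agent's decision is available as an ordered list of invoked skill names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SelectionEval:
    """Hypothetical eval record; not the real evals.json schema."""
    prompt: str
    kind: str  # "direct", "negative", "context", or "cascade"
    expected: list[str] = field(default_factory=list)   # skills that should be invoked
    forbidden: list[str] = field(default_factory=list)  # skills that must not be invoked


def passes(case: SelectionEval, invoked: list[str]) -> bool:
    """Return True if the invoked skills satisfy the eval case."""
    # Negative selection: any forbidden skill is a false positive.
    if any(skill in case.forbidden for skill in invoked):
        return False
    # Cascade ordering: expected skills must appear in order
    # (other skills may be interleaved between them).
    if case.kind == "cascade":
        remaining = iter(invoked)
        return all(skill in remaining for skill in case.expected)
    # A negative case with no expected skill passes once nothing forbidden ran.
    if not case.expected:
        return True
    # Direct and context-dependent: the first invoked skill must match.
    return invoked[:1] == case.expected[:1]


# Illustrative case only; the verify-then-finish order is an assumption.
cascade = SelectionEval(
    prompt="Confirm the fix works, then wrap up the branch.",
    kind="cascade",
    expected=["verify", "finish"],
)
print(passes(cascade, ["verify", "finish"]))
print(passes(cascade, ["finish", "verify"]))
```

Treating cascade checks as an ordered-subsequence match is one design choice among several; a stricter grader could require an exact sequence instead.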
Boundaries Tested
- test-methodology — TDD vs test-coverage vs adversarial-tester
- review-direction — code-review vs review-feedback
- adversarial-scope — red-team vs inquisitor vs audit vs siege
- completion-claims — verify vs finish
- bug-handling — debugging vs verify vs audit
Difficulty Ratings
Each eval is rated easy, medium, or hard based on routing ambiguity. This enables stratified baseline measurement: it separates improvements that lift hard cases (high value) from results that merely confirm easy cases already work (low signal).
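One way to picture stratified measurement is to compute accuracy per difficulty tier rather than as a single number. The sketch below assumes each graded result is a `(difficulty, passed)` pair; that shape and the `accuracy_by_difficulty` name are hypothetical, and the actual protocol is defined in GRADING.md.

```python
from collections import defaultdict


def accuracy_by_difficulty(results):
    """Compute pass rate per difficulty tier.

    `results` is an iterable of (difficulty, passed) pairs, e.g. ("hard", False).
    This input shape is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        correct[difficulty] += int(passed)
    return {tier: correct[tier] / totals[tier] for tier in totals}


# Made-up results: easy cases all pass, hard cases are split.
sample = [("easy", True), ("easy", True), ("hard", False), ("hard", True)]
print(accuracy_by_difficulty(sample))
```

Comparing the per-tier numbers from a baseline run against a later run shows whether a change moved the hard tier or only reconfirmed the easy one.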
See Also
- evals/evals.json: the eval data
- GRADING.md: grading criteria and baseline measurement protocol