skill-selection-evals

Skill-Selection Evals

This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.

Purpose

Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.

Eval Types

  • Direct selection: Given a prompt, does the agent pick the correct skill?
  • Negative selection: Given a prompt that sounds like skill X but is not, does the agent avoid the false positive?
  • Context-dependent: The same verb appears in different contexts, and the correct skill changes with the context.
  • Cascade ordering: Given a task that needs several skills, does the agent invoke them in the correct order?
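
One way to picture how the four eval types could be graded is the sketch below. The record shape (`SelectionEval`, `expected`, `forbidden`) and the `passes` helper are assumptions made for illustration; the actual schema is defined in evals/evals.json and the actual grading rules in GRADING.md.

```python
from dataclasses import dataclass, field


@dataclass
class SelectionEval:
    # Hypothetical record shape; the real schema lives in evals/evals.json.
    prompt: str
    eval_type: str  # "direct", "negative", "context", or "cascade"
    expected: list[str] = field(default_factory=list)   # skills to invoke, in order
    forbidden: list[str] = field(default_factory=list)  # skills that must not be invoked


def passes(ev: SelectionEval, invoked: list[str]) -> bool:
    """Return True if the skills the agent invoked satisfy the eval."""
    # Any forbidden skill is a false positive, whatever the eval type.
    if any(skill in ev.forbidden for skill in invoked):
        return False
    if ev.eval_type == "cascade":
        # Order matters: the invoked sequence must match exactly.
        return invoked == ev.expected
    if ev.eval_type == "negative" and not ev.expected:
        # Nothing specific is required; avoiding the forbidden skill is enough.
        return True
    # Direct and context-dependent: the first invoked skill must be the expected one.
    return bool(invoked) and bool(ev.expected) and invoked[0] == ev.expected[0]


# Hypothetical prompt; the skill names come from the boundaries listed in this document.
cascade = SelectionEval(
    prompt="Confirm the fix works, then wrap up the branch",
    eval_type="cascade",
    expected=["verify", "finish"],
)
print(passes(cascade, ["verify", "finish"]))  # True
print(passes(cascade, ["finish", "verify"]))  # False: right skills, wrong order
```

Treating a cascade as an exact sequence match is the strictest reading of "correct invocation order"; a looser grader might only check relative order.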

Boundaries Tested

  1. test-methodology — TDD vs test-coverage vs adversarial-tester
  2. review-direction — code-review vs review-feedback
  3. adversarial-scope — red-team vs inquisitor vs audit vs siege
  4. completion-claims — verify vs finish
  5. bug-handling — debugging vs verify vs audit
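
The boundaries above can be read as sets of neighboring skills, which makes it possible to label a wrong pick with the boundary it crossed. The sketch below assumes that reading: the boundary and skill names are copied from the list, while the dictionary layout and the `confused_boundaries` helper are hypothetical.

```python
# Boundary and skill names are copied from the list above;
# the data layout and the helper are illustrative assumptions.
BOUNDARIES = {
    "test-methodology": {"TDD", "test-coverage", "adversarial-tester"},
    "review-direction": {"code-review", "review-feedback"},
    "adversarial-scope": {"red-team", "inquisitor", "audit", "siege"},
    "completion-claims": {"verify", "finish"},
    "bug-handling": {"debugging", "verify", "audit"},
}


def confused_boundaries(expected: str, picked: str) -> list[str]:
    """List the boundaries in which `picked` is a neighbor of `expected`."""
    if expected == picked:
        return []  # correct selection, nothing was confused
    return sorted(
        name
        for name, skills in BOUNDARIES.items()
        if expected in skills and picked in skills
    )


print(confused_boundaries("verify", "audit"))      # ['bug-handling']
print(confused_boundaries("code-review", "siege")) # []
```

Note that verify and audit each appear in two boundaries, so a single misroute can in principle count against more than one of them.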

Difficulty Ratings

Each eval is rated easy, medium, or hard based on routing ambiguity. This enables stratified baseline measurement: accuracy is reported per difficulty level, which separates improvements that lift hard cases (high value) from results that merely confirm easy cases already work (low signal).
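
A minimal sketch of stratified measurement follows, assuming each graded eval reduces to a (difficulty, passed) pair. The function name and the sample results are made up for illustration and are not measured data; the actual protocol is described in GRADING.md.

```python
from collections import defaultdict


def accuracy_by_difficulty(results):
    """Map each difficulty level to its pass rate.

    `results` is an iterable of (difficulty, passed) pairs.
    """
    totals = defaultdict(int)
    passed_counts = defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        passed_counts[difficulty] += int(passed)
    return {d: passed_counts[d] / totals[d] for d in totals}


# Made-up results, for illustration only.
sample = [
    ("easy", True),
    ("easy", True),
    ("hard", False),
    ("hard", True),
]
print(accuracy_by_difficulty(sample))  # {'easy': 1.0, 'hard': 0.5}
```

In this toy sample the overall pass rate is 0.75, which hides the fact that all of the failures sit in the hard stratum.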

See Also

  • evals/evals.json — the eval data
  • GRADING.md — grading criteria and baseline measurement protocol
Repository
raddue/crucible
First Seen
Apr 20, 2026