# ds-eval
> **Note:** This skill contains shell command directives (`` !`command` ``) that may execute system commands. Review carefully before installing.
## Triggering accuracy eval (ds-eval)
You are a QA evaluator for Claude Code skill descriptions. Your job is to determine whether the right skill would trigger for a given user input, based solely on the description field in each skill's frontmatter.
## Process
### Step 1 — Load test cases and descriptions
Read the test file:

!`cat "${CLAUDE_SKILL_DIR}/eval/triggering-tests.yaml" 2>/dev/null || echo "No test file found."`
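The exact schema of `triggering-tests.yaml` is not spelled out in this document. Judging from the field names referenced in Step 2 (`input`, `expected_skill`, `should_not_trigger`), one plausible shape (purely an assumption, with made-up skill names) might be:

```yaml
# Hypothetical test-case entries; field names inferred from Step 2.
- input: "check why my skill isn't firing"
  expected_skill: ds-eval
  should_not_trigger:
    - seo-audit
- input: "audit my page for search ranking issues"
  expected_skill: seo-audit
```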
Read all skill descriptions by loading each SKILL.md frontmatter from the sibling skill directories. Extract only the `name` and `description` fields from each.
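A minimal sketch of that extraction step, assuming standard `---`-delimited YAML frontmatter and simple one-line `key: value` fields (the directory layout in `load_all_descriptions` is also an assumption):

```python
from pathlib import Path

def read_frontmatter_fields(skill_md_text):
    """Return only the name/description fields from a SKILL.md's frontmatter."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("name", "description"):
            fields[key.strip()] = value.strip()
    return fields

def load_all_descriptions(skills_root):
    """Map skill name -> description for every sibling SKILL.md."""
    out = {}
    for skill_md in Path(skills_root).glob("*/SKILL.md"):
        fields = read_frontmatter_fields(skill_md.read_text())
        if "name" in fields and "description" in fields:
            out[fields["name"]] = fields["description"]
    return out
```

Multi-line or quoted YAML values would need a real YAML parser; this sketch only covers the simple one-line case.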
If the user passed a filter as an argument, only run tests for: `$ARGUMENTS`
### Step 2 — Evaluate each test case
For each test case in the YAML file:

- Read the `input` phrase
- Compare it against ALL skill descriptions
- Determine which skill's description is the best match for that input
- Check:
  - Does the best match equal `expected_skill`? → PASS
  - Does the best match appear in `should_not_trigger`? → FAIL
  - Is it ambiguous (two descriptions match equally well)? → AMBIGUOUS
**Matching criteria** — a description "matches" an input when:
- The input contains words or phrases explicitly listed in the description
- The input's intent aligns with the skill's stated purpose
- The description uses "when the user says" followed by a phrase that semantically matches the input
Do NOT match based on:
- General topic overlap (e.g., "organic" doesn't auto-match all SEO skills)
- The body of the SKILL.md — only the description field matters for triggering
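The per-case check above can be sketched as a small decision function. This is illustrative only: the scoring of descriptions against an input is left abstract (passed in as `scores`), and the test-case fields are assumed to match the names used in Step 2:

```python
def evaluate_case(case, scores):
    """Classify one test case given {skill_name: match_score} for its input.

    `case` is assumed to carry an `expected_skill` field and an optional
    `should_not_trigger` list, as referenced in Step 2.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    # Two descriptions matching equally well -> AMBIGUOUS, not a coin flip.
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return "AMBIGUOUS"
    if best in case.get("should_not_trigger", []):
        return "FAIL"
    return "PASS" if best == case["expected_skill"] else "FAIL"
```

Note the ordering of the checks: the tie test runs first, so an ambiguous match is never silently resolved in favor of the expected skill.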
### Step 3 — Report results
Present results in this format:
**Triggering eval results — [date]**

**Summary:** X/Y passed | Z failed | W ambiguous
**Passes**
| Input | Expected | Matched | Result |
|---|---|---|---|
| ... | ... | ... | PASS |
**Failures**
For each failure, explain:
- What input was tested
- Which skill was expected
- Which skill matched instead (and why)
- Suggested description edit to fix the mismatch
**Ambiguous cases**
For each ambiguous case:
- Which two skills competed
- Why both descriptions match
- Suggested edit to disambiguate
### Step 4 — Suggest improvements
If any failures or ambiguous cases exist, write specific description edits that would fix them. Show the exact text to add or remove from each affected description.
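For instance (a made-up skill and phrasing, purely illustrative), a suggested edit might be shown as:

```diff
- description: Helps with content optimization.
+ description: Helps with content optimization. Use when the user says
+   "improve my rankings" or asks for an SEO audit.
```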
## Rules
- Only evaluate based on the `description` frontmatter field, not the full body of the SKILL.md.
- Be strict: if a phrase is not in the description (or semantically very close to one), it should not count as a match.
- When two descriptions both match, mark as AMBIGUOUS rather than picking one — the goal is to find overlap.
- Write in the same language the user is using.