skill-test
Skill Test — skills-for-fabric Evaluation Framework
Manage the end-to-end evaluation framework for skills-for-fabric. This skill routes requests to the correct workflow based on user intent — adding tests, listing tests, running tests, viewing results, generating data, or checking coverage.
When to Use
- When a contributor wants to add evaluation test cases for a new or existing skill
- When someone asks to see what tests exist or what results look like
- When a user wants to run the test suite
- When reviewing eval metrics or checking which skills lack test coverage
Intent Routing
Parse the user request and route to the appropriate workflow:
| User Intent | Trigger Phrases | Action |
|---|---|---|
| Add evals | "add tests", "add evals", "add evals for missing skills", "create eval plan" | → Workflow: Add Evals |
| List tests | "list tests", "list evals", "show me the list of tests", "what tests exist", "show eval plans" | → Workflow: List Tests |
| Run tests | "run tests", "run evals", "execute tests", "run the eval suite" | → Workflow: Run Tests |
| View results | "show eval results", "test results", "eval results", "executive summary" | → Workflow: View Results |
| Generate data | "generate eval data", "generate test data", "create eval datasets" | → Workflow: Generate Data |
| View metrics | "eval metrics", "test metrics", "what metrics", "how are tests scored" | → Workflow: View Metrics |
| Check coverage | "test coverage", "which skills have tests", "missing tests", "skills without evals" | → Workflow: Check Coverage |
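The trigger phrases above drive the routing. Purely as an illustration of how this kind of phrase matching could be approximated in code (the helper and the fallback below are hypothetical, not part of the skill):

```python
# Hypothetical sketch of trigger-phrase routing; the actual skill matches intent
# against the table above rather than running code like this.
TRIGGERS = {
    "Add Evals": ["add tests", "add evals", "create eval plan"],
    "List Tests": ["list tests", "list evals", "what tests exist", "show eval plans"],
    "Run Tests": ["run tests", "run evals", "execute tests", "run the eval suite"],
    "View Results": ["eval results", "test results", "executive summary"],
    "Generate Data": ["generate eval data", "generate test data", "create eval datasets"],
    "View Metrics": ["eval metrics", "test metrics", "how are tests scored"],
    "Check Coverage": ["test coverage", "which skills have tests", "missing tests"],
}

def route(request: str) -> str:
    """Return the first workflow whose trigger phrase appears in the request."""
    text = request.lower()
    for workflow, phrases in TRIGGERS.items():
        if any(phrase in text for phrase in phrases):
            return workflow
    return "Ask the user to clarify"  # no trigger matched

print(route("Please run the eval suite"))  # -> Run Tests
```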
Workflow: Add Evals
Follow the instructions in tests/full-eval-tests/README.md § "Adding Evals for New Skills".
Automated Path (Recommended)
Give the agent the prompt:
Add evals for the missing skills
The agent will:
- Detect missing skills by comparing installed skills against existing eval plans in tests/full-eval-tests/plan/03-individual-skills/
- Generate individual eval plans (plan/03-individual-skills/eval-<skill-name>.md) with 10–12 test cases
- Generate combined eval plans (plan/04-combined-skills/eval-<skill>-authoring-plus-consumption.md)
- Create golden data in tests/full-eval-tests/evalsets/expected-results/
- Update tracking files: plan/00-overview.md, README.md, and plan/04-combined-skills/eval-full-pipeline.md
Manual Path
To add evals for a specific skill <new-skill>:
- Create tests/full-eval-tests/plan/03-individual-skills/eval-<new-skill>.md using the template in the README
- Each test case needs: Case ID (unique prefix), Prompt, Expected result, Pass criteria, and at least one negative/ambiguous test
- If the skill has an authoring+consumption pair, create tests/full-eval-tests/plan/04-combined-skills/eval-<new-skill>-authoring-plus-consumption.md
- Add golden data to tests/full-eval-tests/evalsets/expected-results/
- Update plan/00-overview.md, the README.md directory tree, and plan/04-combined-skills/eval-full-pipeline.md
Eval Plan Template
Use the template from tests/full-eval-tests/README.md § "Eval Plan Template". Every eval plan must include:
- Skill overview (name, category, R/W, purpose)
- Pre-requisites
- Numbered test cases (XX-01 through XX-10+) with Prompt / Expected / Pass criteria
- At least one negative/ambiguous test case as the last case
- Write Operations table (if the skill writes data)
- Expected Token Range
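Before committing a new plan, a quick sanity check along these lines can confirm the required elements are present. This is only a sketch: the section names come from the checklist above, and the Case ID pattern is an assumption.

```python
# Illustrative lint for a new eval plan; section names are taken from the
# checklist above, and the Case ID pattern (e.g. SP-01) is an assumption.
import re
from pathlib import Path

REQUIRED_SECTIONS = ["Skill overview", "Pre-requisites", "Pass criteria", "Expected Token Range"]

def lint_eval_plan(path: str) -> list[str]:
    text = Path(path).read_text(encoding="utf-8")
    problems = [s for s in REQUIRED_SECTIONS if s.lower() not in text.lower()]
    case_ids = set(re.findall(r"\b[A-Z]{2,}-\d{2}\b", text))  # e.g. XX-01 .. XX-10
    if len(case_ids) < 10:
        problems.append(f"only {len(case_ids)} numbered test cases found (template expects 10+)")
    if "negative" not in text.lower() and "ambiguous" not in text.lower():
        problems.append("no negative/ambiguous test case found")
    return problems

print(lint_eval_plan("tests/full-eval-tests/plan/03-individual-skills/eval-spark-authoring.md"))
```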
Workflow: List Tests
Show the user what eval plans and test cases exist.
Individual Skill Evals
List files in tests/full-eval-tests/plan/03-individual-skills/:
ls tests/full-eval-tests/plan/03-individual-skills/
Combined Skill Evals
List files in tests/full-eval-tests/plan/04-combined-skills/:
ls tests/full-eval-tests/plan/04-combined-skills/
Quick Tests (tests.json)
Show the test cases defined in tests/tests.json — these are the prompt-based tests run by the test runner.
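If a single consolidated listing is preferred over separate ls calls, a sketch like the following gathers both plan folders and the quick tests in one pass (it assumes tests/tests.json is plain JSON; its exact schema is not described here):

```python
# Sketch that gathers all eval plans and the quick tests in one pass.
# Assumes tests/tests.json is plain JSON; its exact schema is not described here.
import json
from pathlib import Path

plan_root = Path("tests/full-eval-tests/plan")
individual = sorted(p.name for p in (plan_root / "03-individual-skills").glob("eval-*.md"))
combined = sorted(p.name for p in (plan_root / "04-combined-skills").glob("eval-*.md"))
quick_tests = json.loads(Path("tests/tests.json").read_text(encoding="utf-8"))

print(f"{len(individual)} individual plans:", *individual, sep="\n  ")
print(f"{len(combined)} combined plans:", *combined, sep="\n  ")
print("Quick tests defined in tests/tests.json:", len(quick_tests))
```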
Recommended Execution Order
| Order | Eval Plan | Reason |
|---|---|---|
| 1 | eval-check-updates.md | Verify skills are installed |
| 2 | eval-spark-authoring.md | Create lakehouses and load data |
| 3 | eval-sqldw-authoring.md | Create warehouse tables and load data |
| 4 | eval-eventhouse-authoring.md | Create Eventhouse tables and ingest data |
| 5 | eval-spark-consumption.md | Read back lakehouse data |
| 6 | eval-sqldw-consumption.md | Read back warehouse data |
| 7 | eval-eventhouse-consumption.md | Read back Eventhouse data |
| 8 | eval-medallion.md | End-to-end medallion pipeline |
Workflow: Run Tests
⛔ DO NOT execute tests from this skill. The agent must NEVER run copilot, run-full-tests.ps1, or any eval prompt directly. Instead, tell the user the exact commands to run manually.
When the user asks to run tests, respond only with instructions. Do not execute any commands. Tell the user:
- Open a terminal and navigate to the tests/ directory at the repository root: cd tests
- Run the full test suite: .\run-full-tests.ps1
- To specify an output directory: .\run-full-tests.ps1 -TestFolder C:\temp\eval-run-01
Important
- The agent must NEVER run tests itself — only provide the user with instructions
- Tests must be run by the user from inside the tests/ folder
- The script copies the eval framework to a working folder and launches copilot there
Workflow: View Results
Show the user existing evaluation results.
Detailed Results
Read tests/full-eval-tests/eval-results.md — contains per-skill, per-test-case pass/fail with notes, consistency test results, failure analysis, and skip reasons.
Executive Summary
Read tests/full-eval-tests/executive-summary.md — contains the high-level summary: overall pass rate, results by skill, data consistency scores, failure analysis, and recommendations.
Key Metrics from Latest Run
| Metric | Value |
|---|---|
| Overall pass rate | 94.7% (54/57 executed) |
| Write/Read consistency | 100% (5/5 exact matches) |
| Total test cases | 74 |
| Skipped | 17 |
Workflow: Generate Data
Generate synthetic evaluation datasets using the specifications in tests/full-eval-tests/plan/01-data-generation.md.
Using the Generation Script
python tests/full-eval-tests/evalsets/data-generation/generate.py
Datasets
| Dataset | Rows | Format | Used By |
|---|---|---|---|
| sales_transactions | 100 / 1K / 10K | CSV | SQL DW, Spark |
| customers | 100 | CSV | Join testing |
| products | 50 | CSV | Join testing |
| sensor_readings | 500 | JSON | Spark semi-structured |
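The authoritative generator is the generate.py script above. Purely to illustrate the shape of the smallest sales_transactions tier, a hypothetical stand-in might look like this (column names and value ranges are invented, not taken from the real script):

```python
# Hypothetical illustration only; this is NOT the project's generate.py.
# Column names and value ranges are invented to show the shape of the data.
import csv
import random

random.seed(42)  # deterministic output keeps results comparable across runs

with open("sales_transactions_100.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["transaction_id", "customer_id", "product_id", "quantity", "unit_price"])
    for i in range(1, 101):                  # 100-row tier from the table above
        writer.writerow([
            i,
            random.randint(1, 100),          # keys into the 100-row customers dataset
            random.randint(1, 50),           # keys into the 50-row products dataset
            random.randint(1, 10),
            round(random.uniform(1.0, 500.0), 2),
        ])
```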
Golden Results
Pre-computed expected results are in tests/full-eval-tests/evalsets/expected-results/ and are used to verify consistency.
Workflow: View Metrics
Explain the evaluation metrics defined in tests/full-eval-tests/plan/02-metrics.md.
| Metric | Definition |
|---|---|
| Success Rate | passed / total × 100 — whether the skill executed correctly |
| Token Usage | Input + output tokens consumed per eval prompt |
| Read/Write Consistency | Data written by authoring skill must be exactly retrievable by consumption skill |
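Read/Write Consistency is an exact-match comparison against the golden data in evalsets/expected-results/. A minimal sketch of such a check, with hypothetical file names:

```python
# Minimal exact-match consistency check against a golden result file.
# Both file names below are hypothetical; real golden files live in
# tests/full-eval-tests/evalsets/expected-results/.
import csv

def rows_of(path: str) -> list[list[str]]:
    with open(path, newline="") as f:
        return list(csv.reader(f))

def exact_match(golden_path: str, actual_path: str) -> bool:
    """Consistency passes only if every row matches the golden data exactly."""
    return rows_of(golden_path) == rows_of(actual_path)

print(exact_match(
    "tests/full-eval-tests/evalsets/expected-results/sales_summary.csv",  # hypothetical name
    "eval-run-output/sales_summary.csv",                                  # hypothetical name
))
```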
Grading
| Grade | Criteria |
|---|---|
| PASS | Skill invoked correctly, output matches expected |
| FAIL_INVOCATION | Wrong skill invoked or not invoked |
| FAIL_EXECUTION | Skill invoked but errored |
| FAIL_RESULT | Skill completed but output mismatches |
Pass Thresholds
| Metric | Threshold |
|---|---|
| Success Rate | ≥ 90% per skill |
| Token Usage | Within 2× of baseline |
| Read/Write Consistency | 100% exact match |
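Taken together, scoring a skill could look roughly like the sketch below. The grade labels are the ones defined above; the per-case input structure and the helper name are assumptions.

```python
# Sketch of rolling per-case grades into the metrics and thresholds above.
# The input structure (a list of grade strings plus token counts) is an assumption.
from collections import Counter

def score_skill(grades: list[str], baseline_tokens: int, used_tokens: int) -> dict:
    counts = Counter(grades)
    executed = len(grades)
    success_rate = counts["PASS"] / executed * 100 if executed else 0.0
    return {
        "success_rate": round(success_rate, 1),
        "meets_success_threshold": success_rate >= 90.0,             # >= 90% per skill
        "within_token_budget": used_tokens <= 2 * baseline_tokens,   # within 2x of baseline
        "failures": {g: n for g, n in counts.items() if g != "PASS"},
    }

grades = ["PASS"] * 9 + ["FAIL_RESULT"]
print(score_skill(grades, baseline_tokens=10_000, used_tokens=14_500))
# success_rate 90.0 -> meets the threshold; one FAIL_RESULT is reported under failures
```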
Workflow: Check Coverage
Compare installed skills against existing eval plans to identify gaps.
Steps
- List all skills from the marketplace/plugin: check-updates, spark-authoring-cli, spark-consumption-cli, sqldw-authoring-cli, sqldw-consumption-cli, eventhouse-authoring-cli, eventhouse-consumption-cli, e2e-medallion-architecture
- List existing individual eval plans: ls tests/full-eval-tests/plan/03-individual-skills/
- Compare and report which skills have eval coverage and which are missing (see the sketch after this list).
- For missing skills, suggest running the Add Evals workflow.
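A mechanical version of that comparison might look like the sketch below. It assumes plans follow the eval-<skill-name>.md convention with the trailing -cli dropped, plus one alias for the medallion plan named in the execution order table.

```python
# Sketch of the coverage comparison described in the steps above.
# Assumes the eval-<skill-name>.md convention with the trailing "-cli" dropped;
# the medallion alias matches the eval-medallion.md plan from the execution order.
from pathlib import Path

INSTALLED = [
    "check-updates", "spark-authoring-cli", "spark-consumption-cli",
    "sqldw-authoring-cli", "sqldw-consumption-cli",
    "eventhouse-authoring-cli", "eventhouse-consumption-cli",
    "e2e-medallion-architecture",
]
ALIASES = {"e2e-medallion-architecture": "medallion"}

plan_dir = Path("tests/full-eval-tests/plan/03-individual-skills")
covered = {p.stem.removeprefix("eval-") for p in plan_dir.glob("eval-*.md")}

for skill in INSTALLED:
    name = ALIASES.get(skill, skill.removesuffix("-cli"))
    print(f"{skill:35} {'covered' if name in covered else 'MISSING'}")
```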
Must
- NEVER execute tests, eval prompts, or the test runner script — only provide instructions for the user to run manually
- Always route "run tests" to the tests/ folder — tell the user to navigate there and run run-full-tests.ps1
- Follow the eval plan template when creating new eval plans — every test case needs Case ID, Prompt, Expected, Pass criteria
- Include at least one negative/ambiguous test in every new eval plan
- Update tracking files when adding evals — plan/00-overview.md, README.md, eval-full-pipeline.md
- Reference golden data in evalsets/expected-results/ for consistency tests
Prefer
- Automated eval generation ("add evals for missing skills") over manual creation
- Reading existing eval plans as templates before creating new ones
- Running the full suite rather than individual tests for comprehensive coverage
- Checking the executive summary before diving into detailed results
Avoid
- Running eval prompts outside the tests/ directory
- Executing tests, copilot commands, or the run-full-tests.ps1 script — the agent must only tell the user how to run them
- Creating eval plans without a negative/ambiguous test case
- Duplicating content already in the README or plan documents — reference them instead
- Modifying golden result files unless the underlying data generation rules change
- Skipping the tracking file updates when adding new evals
Examples
Adding Tests
User: "Add tests for the powerbi-consumption-cli skill"
Agent: Creates tests/full-eval-tests/plan/03-individual-skills/eval-powerbi-consumption.md with 10–12 test cases following the template, adds golden data to evalsets/expected-results/, and updates tracking files.
Listing Tests
User: "What tests exist?"
Agent: Lists all eval plans in plan/03-individual-skills/ and plan/04-combined-skills/, plus the quick tests in tests/tests.json, with the recommended execution order.
Running Tests
User: "Run the tests"
Agent: Tests must be run from the tests/ folder. Navigate there and execute:
cd tests
.\run-full-tests.ps1
Viewing Results
User: "Show me the eval results"
Agent: Reads tests/full-eval-tests/eval-results.md and presents the summary table, highlighting pass rates and any failures.