# skill-test

Databricks Skills Testing Framework: offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.
## Quick References
- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis
## `/skill-test` Command

The `/skill-test` command provides an interactive CLI for testing skills, with real execution on Databricks.
### Basic Usage

```
/skill-test <skill-name> [subcommand]
```
### Subcommands

| Subcommand | Description |
|---|---|
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch-approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |
### Quick Examples

```
/skill-test databricks-spark-declarative-pipelines run
/skill-test databricks-spark-declarative-pipelines add --trace
/skill-test databricks-spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```
See Workflows for detailed examples of each subcommand.
## Execution Instructions

### Environment Setup

```
uv pip install -e .test/
```

Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: `DEFAULT`)
- `MLFLOW_TRACKING_URI` - set to `databricks` for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - experiment path (e.g., `/Users/{user}/skill-test`)
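If you prefer to configure these from Python rather than exporting them in the shell, a minimal sketch; the experiment path is a placeholder, not a required value:

```python
import os

# Set only if not already exported in the shell; values are defaults/placeholders.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
os.environ.setdefault("MLFLOW_TRACKING_URI", "databricks")
# Replace with your own experiment path.
os.environ.setdefault("MLFLOW_EXPERIMENT_NAME", "/Users/you@example.com/skill-test")
```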
### Running Scripts

All subcommands have corresponding scripts in `.test/scripts/`:

```
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```
| Subcommand | Script |
|---|---|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |
Use `--help` on any script for available options.
## Command Handler

When `/skill-test` is invoked, parse the arguments and execute the appropriate command.

### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: `"run"`)
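A minimal sketch of that parsing rule; the function name is illustrative and not part of the package:

```python
def parse_skill_test_args(args: list[str]) -> tuple[str, str]:
    """Split /skill-test arguments into (skill_name, subcommand)."""
    if not args:
        raise ValueError("skill_name is required: /skill-test <skill-name> [subcommand]")
    skill_name = args[0]
    # Multi-token subcommands such as "scorers update" arrive as separate
    # arguments, so join the remainder; default to "run" when absent.
    subcommand = " ".join(args[1:]) or "run"
    return skill_name, subcommand
```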
### Subcommand Routing

| Subcommand | Action |
|---|---|
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
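One way to implement this routing is a dispatch table keyed by subcommand. The import path below is an assumption based on the `src/skill_test/cli/` module in the directory layout; check the Python API reference for the real entry points:

```python
# Sketch only: assumes the handlers are importable from skill_test.cli
# (the "CLI commands module" in the directory layout below).
from skill_test import cli

HANDLERS = {
    "run": cli.run,
    "regression": cli.regression,
    "init": cli.init,
    "review": cli.review,
    "baseline": cli.baseline,
    "mlflow": cli.mlflow_eval,
    "scorers": cli.scorers,
    "scorers update": cli.scorers_update,
    # "add" is omitted here: it drives an interactive prompt/invoke/test loop.
}

def dispatch(skill_name: str, subcommand: str, ctx) -> None:
    """Route a parsed subcommand to its handler, as in the table above."""
    handler = HANDLERS.get(subcommand)
    if handler is None:
        raise ValueError(f"Unknown subcommand: {subcommand!r}")
    handler(skill_name, ctx)
```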
### init Behavior

When running `/skill-test <skill-name> init`:

- Read the skill's SKILL.md to understand its purpose
- Create `manifest.yaml` with appropriate scorers and trace_expectations
- Create empty `ground_truth.yaml` and `candidates.yaml` templates
- Recommend test prompts based on documentation examples

Follow up with `/skill-test <skill-name> add` using the recommended prompts.
### Context Setup

Create a `CLIContext` with MCP tools before calling any command. See Python API for details.
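A hedged sketch of that setup step. The import path and constructor arguments shown here are assumptions; the authoritative `CLIContext` signature is in the Python API reference:

```python
# Assumed import path and constructor shape -- consult the Python API
# reference for the real CLIContext signature and how MCP tools are passed.
from skill_test.cli import CLIContext, run

mcp_tools = []  # placeholder: register your MCP tool handles here
ctx = CLIContext(mcp_tools=mcp_tools)
run("databricks-spark-declarative-pipelines", ctx)
```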
## File Locations

**Important:** All test files are stored at the repository root level, not relative to this skill's directory.
| File Type | Path |
|---|---|
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |
For example, to test `databricks-spark-declarative-pipelines` in this repository:

```
/Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
```

Not relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...   # WRONG
```
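To make the convention concrete, a small helper that resolves the per-skill test files from the repository root. It is purely illustrative and not part of the package:

```python
from pathlib import Path

def skill_test_paths(repo_root: Path, skill_name: str) -> dict[str, Path]:
    """Resolve the test-file locations listed in the table above."""
    skill_dir = repo_root / ".test" / "skills" / skill_name
    return {
        "ground_truth": skill_dir / "ground_truth.yaml",
        "candidates": skill_dir / "candidates.yaml",
        "manifest": skill_dir / "manifest.yaml",
        "baseline": repo_root / ".test" / "baselines" / skill_name / "baseline.yaml",
    }
```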
## Directory Structure

```
.test/                        # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml            # Package config (pip install -e ".test/")
├── README.md                 # Contributor documentation
├── SKILL.md                  # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh     # Sync script
├── scripts/                  # Wrapper scripts
│   ├── _common.py            # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── review.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py         # Trace evaluation
│   ├── list_traces.py        # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/           # Python package
│       ├── cli/              # CLI commands module
│       ├── fixtures/         # Test fixture setup
│       ├── scorers/          # Evaluation scorers
│       ├── grp/              # Generate-Review-Promote pipeline
│       └── runners/          # Evaluation runners
├── skills/                   # Per-skill test definitions
│   ├── _routing/             # Routing test cases
│   └── {skill-name}/         # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                    # Unit tests
├── references/               # Documentation references
└── baselines/                # Regression baselines
```
## References
- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis