mlflow-evaluation
MLflow 3 GenAI Evaluation
Before Writing Any Code
- Read GOTCHAS.md - 15+ common mistakes that cause failures
- Read CRITICAL-interfaces.md - Exact API signatures and data schemas
End-to-End Workflows
Follow these workflows based on your goal. Each step indicates which reference files to read.
Workflow 1: First-Time Evaluation Setup
For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |
Workflow 2: Production Trace -> Evaluation Dataset
For building evaluation datasets from production traces.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
Workflow 3: Performance Optimization
For debugging slow or expensive agent execution.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |
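As a rough illustration of step 1, the sketch below totals latency per span name and ranks the slowest first. It assumes spans are available as plain dicts with `name`, `start_time_ns`, and `end_time_ns` keys (mirroring the attributes of MLflow span objects). The helper name `latency_by_span` and the sample data are hypothetical and are not taken from patterns-trace-analysis.md.

```python
from collections import defaultdict


def latency_by_span(spans):
    """Sum duration in milliseconds per span name, slowest first."""
    totals = defaultdict(float)
    for span in spans:
        # Timestamps are assumed to be nanoseconds since some common origin.
        duration_ms = (span["end_time_ns"] - span["start_time_ns"]) / 1e6
        totals[span["name"]] += duration_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical spans standing in for one agent trace.
sample_spans = [
    {"name": "retriever", "start_time_ns": 0, "end_time_ns": 120_000_000},
    {"name": "llm", "start_time_ns": 120_000_000, "end_time_ns": 900_000_000},
    {"name": "retriever", "start_time_ns": 900_000_000, "end_time_ns": 950_000_000},
]

ranked = latency_by_span(sample_spans)
```

Ranking by total time rather than per-call time highlights spans that are cheap individually but called often, which is usually where optimization pays off first.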
Workflow 4: Regression Detection
For comparing agent versions and finding regressions.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
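Step 3 can be sketched in plain Python once the aggregate metrics of both runs are in hand. The example below assumes each run's metrics are a flat dict of metric name to score where higher is better. The metric names, the `tolerance` threshold, and the helper `find_regressions` are hypothetical and only illustrate the comparison.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, base_score in baseline.items():
        if name not in current:
            continue  # metric was not computed for the current run
        delta = current[name] - base_score
        if delta < -tolerance:
            regressions[name] = round(delta, 4)
    return regressions


# Hypothetical aggregate metrics from a baseline run and a candidate run.
baseline_metrics = {"correctness/mean": 0.90, "safety/mean": 1.0}
current_metrics = {"correctness/mean": 0.82, "safety/mean": 1.0}

regressed = find_regressions(baseline_metrics, current_metrics)
```

A small tolerance keeps judge noise from being reported as a regression. The right value depends on dataset size and scorer variance.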
Workflow 5: Custom Scorer Development
For creating project-specific evaluation metrics.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
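A minimal sketch of a code-based custom scorer follows, assuming the `@scorer` decorator from `mlflow.genai.scorers` and an agent that returns a dict with a `response` key. The citation check, the regular expression, and the function names are hypothetical. CRITICAL-interfaces.md remains the reference for the exact scorer interface.

```python
import re

# Hypothetical rule: a response "cites" a source if it contains a marker like [1].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def contains_citation(outputs: dict) -> bool:
    """Plain-Python check, kept separate so it can be unit tested without MLflow."""
    text = outputs.get("response", "")
    return bool(CITATION_PATTERN.search(text))


def build_scorer():
    """Wrap the check as an MLflow scorer; imports MLflow only when called."""
    from mlflow.genai.scorers import scorer

    @scorer
    def has_citation(outputs) -> bool:
        # `outputs` is whatever predict_fn returned for the row.
        return contains_citation(outputs)

    return has_citation
```

Keeping the scoring logic in an ordinary function makes it easy to test against hand-written outputs before passing `build_scorer()` to an evaluation run.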
Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring
For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |
Reference Files Quick Lookup
| Reference | Purpose | When to Read |
|---|---|---|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| user-journeys.md | High-level workflows | When starting a new evaluation project |
Critical API Facts
- Use: `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Data format: `{"inputs": {"query": "..."}}` (nested structure required)
- predict_fn: Receives `**unpacked kwargs` (not a dict)
See GOTCHAS.md for complete list.
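The three facts above fit together as in the following minimal sketch. It assumes MLflow 3 with the `mlflow.genai` module installed and a configured tracking setup. The agent stub `predict_fn`, the scorer `non_empty_response`, and the sample queries are hypothetical placeholders. MLflow is imported only inside `run_evaluation()` so the data format and function signature can be checked without it.

```python
from typing import Any

# Each row nests the agent's arguments under "inputs".
eval_data: list[dict[str, Any]] = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]


def predict_fn(query: str) -> dict[str, str]:
    """Hypothetical agent stub; the keys of "inputs" arrive as keyword arguments."""
    return {"response": f"You asked: {query}"}


def run_evaluation():
    """Run the evaluation with mlflow.genai.evaluate(), not mlflow.evaluate()."""
    import mlflow
    from mlflow.genai.scorers import scorer

    @scorer
    def non_empty_response(outputs) -> bool:
        # Hypothetical check: passes when the agent returned any text.
        return bool(outputs.get("response"))

    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[non_empty_response],
    )
```

Because the keys of `inputs` are unpacked into keyword arguments, the parameter names of `predict_fn` must match those keys. Here `query` appears in both places.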