databricks-mlflow-evaluation
# MLflow 3 GenAI Evaluation

## Before Writing Any Code
- Read GOTCHAS.md - 15+ common mistakes that cause failures
- Read CRITICAL-interfaces.md - Exact API signatures and data schemas
## End-to-End Workflows
Follow these workflows based on your goal. Each step indicates which reference files to read.
### Workflow 1: First-Time Evaluation Setup
For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |
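Steps 3-5 above can be sketched end to end. This is a minimal, stdlib-only sketch: the dataset record follows the nested `{"inputs": {...}}` shape that `mlflow.genai.evaluate()` requires, the agent is a stub, and the actual evaluate call is left as a comment because it needs an MLflow/Databricks environment. The query text and expectation values are illustrative.

```python
# Step 3: build an initial dataset. The nested {"inputs": {...}} shape
# is required by mlflow.genai.evaluate(); flat records are rejected.
eval_data = [
    {
        "inputs": {"query": "What is MLflow Tracing?"},
        "expectations": {"expected_facts": ["Tracing records spans for each request"]},
    },
]

# Step 5 target: the agent under test. predict_fn receives the keys of
# "inputs" as keyword arguments (query=...), not a single dict.
def predict_fn(query: str) -> str:
    return f"Stub answer for: {query}"  # replace with a real agent call

# With MLflow 3 installed, the run would look roughly like:
# import mlflow
# results = mlflow.genai.evaluate(
#     data=eval_data,
#     predict_fn=predict_fn,
#     scorers=[...],  # step 4: built-in or custom scorers
# )

answer = predict_fn(**eval_data[0]["inputs"])
```

The `**` unpacking on the last line mirrors how the harness invokes `predict_fn`; see Critical API Facts below.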
### Workflow 2: Production Trace -> Evaluation Dataset
For building evaluation datasets from production traces.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
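Steps 3-5 above reduce to a filter-and-reshape pass over traces. A stdlib-only sketch under assumed field names (`request`, `response`, `tags`, and the `eval: include` tag are illustrative, not the real trace schema; real traces come from `mlflow.search_traces()` per patterns-trace-analysis.md):

```python
# Hypothetical minimal trace shape for illustration only.
raw_traces = [
    {"request": {"query": "reset my password"}, "response": "Use the portal.",
     "tags": {"eval": "include"}},
    {"request": {"query": "hi"}, "response": "Hello!", "tags": {}},
]

def traces_to_dataset(traces):
    """Keep only traces tagged for inclusion, reshape into eval records."""
    records = []
    for t in traces:
        if t.get("tags", {}).get("eval") != "include":
            continue  # step 3: only tagged traces enter the dataset
        records.append({
            "inputs": t["request"],  # already in the required nested shape
            # step 5: seed ground truth from the observed response
            "expectations": {"expected_response": t["response"]},
        })
    return records

dataset = traces_to_dataset(raw_traces)
```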
### Workflow 3: Performance Optimization
For debugging slow or expensive agent execution.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |
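Steps 1-2 above are, at their core, a group-by over span records. A stdlib-only sketch with an illustrative span shape (`name`, `duration_ms`, and `tokens` are assumed field names, not the real trace schema, which the patterns files document):

```python
from collections import defaultdict

# Hypothetical span records pulled from a trace, for illustration.
spans = [
    {"name": "retriever", "duration_ms": 120, "tokens": 0},
    {"name": "llm", "duration_ms": 900, "tokens": 1500},
    {"name": "llm", "duration_ms": 750, "tokens": 1100},
]

def profile(spans):
    """Aggregate latency, tokens, and call counts per span name."""
    totals = defaultdict(lambda: {"duration_ms": 0, "tokens": 0, "calls": 0})
    for s in spans:
        t = totals[s["name"]]
        t["duration_ms"] += s["duration_ms"]
        t["tokens"] += s["tokens"]
        t["calls"] += 1
    return dict(totals)

report = profile(spans)
```

A report like this makes it obvious which span type to target in steps 3-4 before re-evaluating.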
### Workflow 4: Regression Detection
For comparing agent versions and finding regressions.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
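Step 3's comparison can be sketched as a diff over per-run metric dicts. The metric names and the 0.05 tolerance below are illustrative, not actual MLflow result keys:

```python
# Hypothetical aggregate metrics from a baseline run and a candidate run.
baseline = {"correctness/mean": 0.82, "relevance/mean": 0.90}
candidate = {"correctness/mean": 0.75, "relevance/mean": 0.91}

def find_regressions(baseline, candidate, tolerance=0.05):
    """Return metrics where the candidate dropped beyond the tolerance."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }

regressions = find_regressions(baseline, candidate)
```

Any metric flagged here points at the traces to pull up in steps 4-5.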
### Workflow 5: Custom Scorer Development
For creating project-specific evaluation metrics.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
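The scorer contract in step 1 boils down to a function that takes evaluation fields and returns a score. A plain-Python sketch: the parameter name `outputs` follows common MLflow scorer conventions but should be verified against CRITICAL-interfaces.md, and the commented lines show where MLflow's scorer wiring would attach (an assumption about the real API, not a confirmed snippet).

```python
def response_is_concise(outputs, max_words: int = 50):
    """Pass/fail scorer: the response stays under a word budget."""
    text = outputs if isinstance(outputs, str) else str(outputs)
    return len(text.split()) <= max_words

# With MLflow installed, this would typically be registered via the
# scorer decorator and passed to evaluate() -- verify the exact
# signature in CRITICAL-interfaces.md before using:
# from mlflow.genai.scorers import scorer
# concise = scorer(response_is_concise)
# mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[concise])
```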
### Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring
For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |
### Workflow 7: Judge Alignment with MemAlign
For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Design base judge with make_judge (any feedback type) | patterns-judge-alignment.md (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | patterns-judge-alignment.md (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | patterns-judge-alignment.md (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | patterns-judge-alignment.md (Pattern 4) |
| 5 | Register aligned judge to experiment | patterns-judge-alignment.md (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | patterns-judge-alignment.md (Pattern 6) |
### Workflow 8: Automated Prompt Optimization with GEPA

For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md Journey 10.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Build optimization dataset (inputs + expectations) | patterns-prompt-optimization.md (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | patterns-prompt-optimization.md (Pattern 2) |
| 3 | Register new version, promote conditionally | patterns-prompt-optimization.md (Pattern 3) |
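Step 1's dataset shape differs from a plain eval dataset: every record needs both `inputs` AND `expectations` (see Critical API Facts below). A quick sketch with an illustrative record and a validity check:

```python
# Hypothetical optimization record; field values are illustrative.
opt_data = [
    {
        "inputs": {"question": "Summarize the refund policy."},
        "expectations": {"expected_response": "Refunds within 30 days."},
    },
]

def validate_opt_record(record):
    """optimize_prompts() needs both keys on every record."""
    return "inputs" in record and "expectations" in record

assert all(validate_opt_record(r) for r in opt_data)
```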
## Reference Files Quick Lookup
| Reference | Purpose | When to Read |
|---|---|---|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| patterns-judge-alignment.md | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| patterns-prompt-optimization.md | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |
## Critical API Facts
- Use `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Data format: `{"inputs": {"query": "..."}}` (nested structure required)
- predict_fn: receives **unpacked kwargs** (not a dict)
- MemAlign: scorer-agnostic (works with any `feedback_value_type` -- float, bool, categorical); token-heavy on the embedding model, so set `embedding_model` explicitly
- Label schema name matching: the label schema `name` in the labeling session MUST match the judge `name` used in `evaluate()` for `align()` to pair scores
- Aligned judge scores: may be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
- GEPA optimization dataset: must have both `inputs` AND `expectations` per record (different from an eval dataset)
- Episodic memory: lazily loaded -- `get_scorer()` results won't show episodic memory on print until the judge is first used
- optimize_prompts: requires MLflow >= 3.5.0
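The predict_fn fact above is the most common tripwire, so here is a minimal demonstration of the calling convention. The harness behaves like `predict_fn(**record["inputs"])`, so the function signature must name each input key rather than accept a single dict (the key names below are illustrative):

```python
record = {"inputs": {"query": "hello", "context": "docs"}}

# Correct: one parameter per key of "inputs".
def good_predict_fn(query: str, context: str) -> str:
    return f"{query} | {context}"

# Wrong: expects a single dict argument, so the harness's keyword
# unpacking raises TypeError.
def bad_predict_fn(inputs: dict) -> str:
    return inputs["query"]

# The evaluation harness effectively calls:
out = good_predict_fn(**record["inputs"])
```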
See GOTCHAS.md for the complete list.
## Related Skills
- databricks-docs - General Databricks documentation reference
- databricks-model-serving - Deploying models and agents to serving endpoints
- databricks-agent-bricks - Building agents that can be evaluated with this skill
- databricks-python-sdk - SDK patterns used alongside MLflow APIs
- databricks-unity-catalog - Unity Catalog tables for managed evaluation datasets