mlflow-genai-evaluation
MLflow GenAI Evaluation Patterns
Production-grade patterns for evaluating Databricks GenAI agents using MLflow 3.0+ mlflow.genai.evaluate() with LLM-as-judge scorers and custom evaluation metrics.
When to Use
- Implementing agent evaluation pipelines with LLM judges
- Creating custom domain-specific evaluation scorers
- Setting up evaluation datasets for agent testing
- Checking deployment thresholds before production deployment
- Troubleshooting evaluation errors (0.0 scores, metric name mismatches)
- Optimizing guidelines for better evaluation scores
- Querying evaluation results programmatically
- Aligning LLM judges with domain expert feedback via MemAlign
- Automated prompt optimization with GEPA (optimize_prompts())
- Setting up Unity Catalog trace ingestion for production monitoring
Upstream API Note
The upstream databricks-mlflow-evaluation skill in AI-Dev-Kit covers 8 end-to-end workflows: first-time evaluation setup, production trace-to-dataset, performance optimization, regression detection, custom scorer development, UC trace ingestion and production monitoring, judge alignment with MemAlign, and automated prompt optimization with GEPA.
Critical API facts:
- Use mlflow.genai.evaluate() (NOT mlflow.evaluate())
- Data format: {"inputs": {"query": "..."}} (nested structure required)
- MemAlign is scorer-agnostic (works with any feedback_value_type)
- GEPA optimization dataset must have both inputs AND expectations per record
- Requires MLflow >= 3.5.0 for optimize_prompts()
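The nested data format above can be illustrated with a small conversion step. This is a minimal sketch, assuming the raw rows are flat dicts with a request field and an optional expected_response field; to_eval_records() and both field names are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List


def to_eval_records(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Wrap flat rows in the nested {"inputs": {...}} structure."""
    records = []
    for row in rows:
        record: Dict[str, Any] = {"inputs": {"query": row["request"]}}
        # Expectations are optional for plain evaluation, but the facts above
        # note that GEPA optimization needs them on every record.
        if row.get("expected_response") is not None:
            record["expectations"] = {"expected_response": row["expected_response"]}
        records.append(record)
    return records


flat_rows = [
    {"request": "Why did costs spike last week?", "expected_response": "Cites DBU usage."},
    {"request": "Which jobs failed yesterday?"},
]
eval_records = to_eval_records(flat_rows)
```

Records without expectations remain usable for scoring with judges that do not need ground truth, but under the GEPA requirement above they would have to be filtered out or completed first.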
⚠️ CRITICAL: Response Extraction Helper
MANDATORY: _extract_response_text() must be included in ALL custom scorers.
mlflow.genai.evaluate() serializes ResponsesAgentResponse to a dict before passing to scorers. Without proper extraction, scorers receive serialized dicts and return 0.0 scores (silent failure).
Full implementation: See scripts/evaluation_helpers.py for complete function code.
Why this matters:
- ❌ Without helper: 9+ custom scorers return 0.0 for ALL responses
- ❌ Silent failure - no error messages, just 0.0 scores
- ❌ Took 5+ deployment iterations to discover root cause
- ✅ With helper: Scorers work correctly first time
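To make the failure mode concrete, here is a minimal sketch of what such an extraction helper might look like. It is not the implementation from scripts/evaluation_helpers.py: the function name extract_response_text_sketch() is hypothetical, and the dict layout (output items holding content parts with a text field) is an assumption about how a serialized response may look.

```python
from typing import Any


def extract_response_text_sketch(outputs: Any) -> str:
    """Best-effort text extraction from a (possibly serialized) agent response."""
    if outputs is None:
        return ""
    if isinstance(outputs, str):
        return outputs
    # Object form: fall back to its dict representation when available.
    if not isinstance(outputs, dict):
        if hasattr(outputs, "model_dump"):
            outputs = outputs.model_dump()
        else:
            return str(outputs)
    texts = []
    # Assumed shape: output -> items -> content -> parts carrying "text".
    for item in outputs.get("output") or []:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        if isinstance(content, str):
            texts.append(content)
            continue
        for part in content or []:
            if isinstance(part, dict) and isinstance(part.get("text"), str):
                texts.append(part["text"])
    if texts:
        return "\n".join(texts)
    # Assumed fallbacks for simpler payloads.
    for key in ("response", "content", "text"):
        if isinstance(outputs.get(key), str):
            return outputs[key]
    return ""


serialized = {
    "output": [
        {"type": "message", "content": [{"type": "output_text", "text": "Costs rose 12%."}]}
    ]
}
extracted = extract_response_text_sketch(serialized)
```

A scorer that passed the raw dict into a prompt or a string comparison would be judging the dict's representation rather than the answer, which is consistent with the silent 0.0 scores described above.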
⚠️ CRITICAL: Databricks SDK for LLM Calls
ALWAYS use Databricks SDK (NOT langchain_databricks) for LLM calls in custom scorers.
| Issue | langchain_databricks | Databricks SDK |
|---|---|---|
| Serverless Compute | ❌ Package install failures | ✅ No install needed |
| Authentication | ❌ Varies by environment | ✅ Automatic in notebooks |
| Deployment Jobs | ❌ Unreliable auth | ✅ Reliable auth |
| Support | ⚠️ Community package | ✅ Official Databricks SDK |
Full implementation: See scripts/evaluation_helpers.py for _call_llm_for_scoring() helper.
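As a rough illustration of the SDK-based approach, the sketch below separates the endpoint call from the verdict parsing so the parsing can be exercised offline. The names query_databricks_endpoint(), parse_judge_verdict(), and call_llm_for_scoring_sketch() are hypothetical and are not the helpers shipped in scripts/evaluation_helpers.py. It also assumes the judge replies with a JSON object containing score and rationale.

```python
import json
import re
from typing import Callable, Dict, Optional


def query_databricks_endpoint(prompt: str, endpoint: str) -> str:
    """Send one user message to a serving endpoint and return the reply text."""
    # Imported lazily so the parsing logic below can run without the SDK.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

    w = WorkspaceClient()  # resolves credentials from the environment
    response = w.serving_endpoints.query(
        name=endpoint,
        messages=[ChatMessage(role=ChatMessageRole.USER, content=prompt)],
        temperature=0.0,  # keep judge output as stable as possible
    )
    return response.choices[0].message.content


def parse_judge_verdict(raw: str) -> Dict:
    """Pull the JSON object out of an LLM reply and normalize its fields."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in judge reply: {raw!r}")
    verdict = json.loads(match.group(0))
    score = min(1.0, max(0.0, float(verdict["score"])))  # clamp to [0, 1]
    return {"score": score, "rationale": str(verdict.get("rationale", ""))}


def call_llm_for_scoring_sketch(
    prompt: str,
    endpoint: str,
    query_fn: Optional[Callable[[str, str], str]] = None,
) -> Dict:
    """Query the judge endpoint (or an injected stub) and parse its verdict."""
    query_fn = query_fn or query_databricks_endpoint
    return parse_judge_verdict(query_fn(prompt, endpoint))


def fake_llm(prompt: str, endpoint: str) -> str:
    # Stand-in for the real endpoint, used to test parsing locally.
    return 'Verdict: {"score": 0.8, "rationale": "Numbers match."}'


result = call_llm_for_scoring_sketch(
    "Evaluate cost accuracy...", "databricks-claude-3-7-sonnet", query_fn=fake_llm
)
```

Injecting query_fn is a design choice for testability: the same parsing path runs in unit tests and in deployment jobs, and only the transport differs.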
Guidelines Best Practice: 4-6 Sections
CRITICAL: Keep guidelines to 4-6 essential sections (NOT 8+).
❌ DON'T: Too Many Guidelines Sections
# BAD: 8 comprehensive guidelines = low scores
guidelines = [
    "Section 1: Response Structure (200 words)",
    "Section 2: Data Accuracy (150 words)",
    "Section 3: No Fabrication (180 words)",
    "Section 4: Actionability (160 words)",
    "Section 5: Domain Expertise (200 words)",
    "Section 6: Cross-Domain Intelligence (150 words)",
    "Section 7: Professional Tone (120 words)",
    "Section 8: Completeness (170 words)",
]
# Result: guidelines/mean = 0.20 (too strict!)
✅ DO: 4-6 Essential Guidelines
# GOOD: 4 focused, critical guidelines = higher, more meaningful scores
guidelines = [
"""Data Accuracy and Specificity:
- MUST include specific numbers (costs, DBUs, percentages)
- MUST include time context (when data is from)
- MUST include trend direction (increased/decreased)""",
"""No Data Fabrication (CRITICAL):
- MUST NEVER fabricate numbers
- If Genie errors, MUST state explicitly""",
"""Actionability and Recommendations:
- MUST provide specific, actionable next steps
- MUST include concrete implementation details""",
"""Professional Enterprise Tone:
- MUST maintain professional tone
- MUST use proper formatting (markdown, tables)""",
]
# Result: guidelines/mean = 0.5+ (achievable, meaningful)
Why this matters:
- 8+ sections = overly strict scoring (0.20 average)
- 4-6 sections = achievable, meaningful scores (0.50+ average)
- Focus on critical quality dimensions only
Run Naming Convention
ALWAYS use consistent run naming for querying latest evaluation results.
# ✅ CORRECT: Consistent prefix + timestamp
from datetime import datetime
import mlflow

run_name = f"eval_pre_deploy_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
mlflow.set_experiment("/Shared/health_monitor_agent_evaluation")
# mlflow.genai.evaluate() takes data, scorers, and predict_fn;
# the run name is set on the enclosing run.
with mlflow.start_run(run_name=run_name):  # ✅ Consistent naming
    mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=predict_fn,  # callable wrapping the agent under test
        scorers=scorers,  # built-in and custom scorers
    )
# Query latest evaluation
runs = mlflow.search_runs(
    filter_string="tags.mlflow.runName LIKE 'eval_pre_deploy_%'",  # ✅ Predictable
    order_by=["start_time DESC"],
    max_results=1,
)
Why this matters:
- Automated checks can find latest evaluation results
- Consistent naming enables programmatic threshold validation
- CI/CD pipelines can query recent evaluation metrics
Quick Example: @scorer Decorator Pattern
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Dict, Optional
# Import helpers (from scripts/evaluation_helpers.py)
from evaluation_helpers import _extract_response_text, _call_llm_for_scoring

# @scorer is applied last (outermost) so the decorated object stays a scorer
@scorer
@mlflow.trace(name="cost_accuracy_judge", span_type="JUDGE")
def cost_accuracy_judge(
    inputs: Dict,
    outputs: Dict,
    expectations: Optional[Dict] = None
) -> Feedback:
    """
    Custom judge evaluating cost accuracy in agent responses.
    CRITICAL: Must use _extract_response_text() helper.
    """
    # ✅ STEP 1: Extract response text (MANDATORY)
    response_text = _extract_response_text(outputs)
    # ✅ STEP 2: Get query (the key depends on your dataset's inputs schema)
    query = inputs.get("query") or inputs.get("request", "")
    # ✅ STEP 3: Build evaluation prompt
    judge_prompt = f"""Evaluate cost accuracy...
Query: {query}
Response: {response_text}
Return JSON: {{"score": 0.0-1.0, "rationale": "explanation"}}
"""
    # ✅ STEP 4: Call LLM via Databricks SDK (NOT langchain_databricks)
    result = _call_llm_for_scoring(judge_prompt, endpoint="databricks-claude-3-7-sonnet")
    # ✅ STEP 5: Return Feedback object
    return Feedback(
        value=result["score"],
        rationale=result["rationale"]
    )
See references/custom-scorer-patterns.md for complete examples.
CRITICAL: make_judge() Template Variable Constraints
make_judge() and the MLflow Prompt Registry use the same {{ }} template syntax but have different validation rules:
| System | Allowed Variable Names | Validation |
|---|---|---|
| Prompt Registry (register_prompt()) | Any {{ variable }} | No validation: any name accepted |
| make_judge(instructions=...) | Only 5 allowed (see below) | Strict: raises MlflowException on unknown variables |
make_judge() only permits these 5 template variables:
- {{ inputs }}: the eval record's inputs dict
- {{ outputs }}: the predict_fn return value
- {{ trace }}: the MLflow trace from predict_fn
- {{ expectations }}: the eval record's expectations dict
- {{ conversation }}: conversation data (for chat models)
Bidirectional constraint:
- MUST contain at least one of the 5 allowed variables (plain text is rejected)
- MUST NOT contain any other {{ variable }} names (custom variables are rejected)
# WRONG — custom variables crash make_judge()
"Question: {{question}}\nExpected SQL: {{expected_sql}}\nGenerated SQL: {{genie_sql}}"
# WRONG — no variables at all, also crashes make_judge()
"Evaluate the SQL quality and respond with yes or no."
# CORRECT — uses only allowed variables
"User question: {{ inputs }}\nGenerated SQL: {{ outputs }}\nExpected SQL: {{ expectations }}"
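Because both halves of the constraint are mechanical, they can be checked before make_judge() is ever called. The following is a hypothetical pre-flight check, not MLflow's own validation logic: find_template_problems() and its messages are illustrative, and the allowed set simply mirrors the five variables listed above.

```python
import re
from typing import List

# Mirrors the five variables listed above.
ALLOWED_JUDGE_VARIABLES = {"inputs", "outputs", "trace", "expectations", "conversation"}
_TEMPLATE_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_template_problems(instructions: str) -> List[str]:
    """Return human-readable problems; an empty list means the template looks valid."""
    found = set(_TEMPLATE_VAR.findall(instructions))
    problems = []
    # Rule 1: at least one allowed variable must be present.
    if not found & ALLOWED_JUDGE_VARIABLES:
        problems.append("no allowed template variable present")
    # Rule 2: no variable outside the allowed set may appear.
    for name in sorted(found - ALLOWED_JUDGE_VARIABLES):
        problems.append(f"unknown template variable: {name}")
    return problems


good = "User question: {{ inputs }}\nGenerated SQL: {{ outputs }}\nExpected SQL: {{ expectations }}"
plain = "Evaluate the SQL quality and respond with yes or no."
custom = "Question: {{question}}\nGenerated SQL: {{genie_sql}}"

good_problems = find_template_problems(good)
plain_problems = find_template_problems(plain)
custom_problems = find_template_problems(custom)
```

Running a check like this in a unit test surfaces template mistakes before an evaluation job is submitted, instead of as an exception at judge construction time.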
CRITICAL: predict_fn Keyword Argument Contract
mlflow.genai.evaluate() unpacks each record's inputs dict as keyword arguments when calling predict_fn. The function signature must match the keys in that dict.
# Given eval records with (other record fields elided):
eval_records = [{"inputs": {"question": "...", "space_id": "...", "expected_sql": "..."}}]
# WRONG: predict_fn receives keyword args, not a dict
def predict_fn(inputs: dict) -> dict:
    question = inputs["question"]  # TypeError or MlflowException
# CORRECT: signature matches input keys
def predict_fn(question: str, expected_sql: str = "", **kwargs) -> dict:
    # question and expected_sql are unpacked directly
    # space_id, catalog, etc. land in **kwargs (use closure for these)
    ...
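The contract can be reproduced in plain Python without MLflow. The sketch below mimics the keyword unpacking with predict_fn(**record["inputs"]) and shows the closure approach for fixed configuration. make_predict_fn(), dict_style_predict_fn(), and the sample values are hypothetical, and the stub body stands in for a real agent call.

```python
from typing import Any, Callable, Dict


def make_predict_fn(space_id: str) -> Callable[..., Dict[str, Any]]:
    """Bind fixed configuration in a closure; per-record fields arrive as kwargs."""

    def predict_fn(question: str, expected_sql: str = "", **kwargs: Any) -> Dict[str, Any]:
        # A real implementation would call the agent here; this stub echoes its inputs.
        return {
            "response": f"[{space_id}] answer to: {question}",
            "unused_keys": sorted(kwargs),
        }

    return predict_fn


def dict_style_predict_fn(inputs: dict) -> dict:
    """The mismatched signature: expects one dict, but receives keyword arguments."""
    return {"response": inputs["question"]}


record = {
    "inputs": {"question": "Top 5 costly jobs?", "space_id": "abc123", "expected_sql": "SELECT 1"}
}

predict_fn = make_predict_fn(space_id="abc123")
result = predict_fn(**record["inputs"])  # mirrors the keyword unpacking described above

try:
    dict_style_predict_fn(**record["inputs"])
    dict_style_error = None
except TypeError as exc:
    dict_style_error = str(exc)
```

Keeping **kwargs in the signature means new keys added to the dataset's inputs do not break the function, while the closure keeps workspace-level configuration out of every record.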
Metric Aliases Quick Reference
CRITICAL: Handle metric name variations across MLflow versions.
# Built-in scorers use different metric names across MLflow versions
METRIC_ALIASES = {
"relevance/mean": ["relevance_to_query/mean"], # MLflow 3.0 vs 3.1
"safety/mean": ["safety/mean"], # No alias needed
"guidelines/mean": ["guidelines/mean"], # No alias needed
}
Why aliases matter:
- MLflow 3.0 uses "relevance/mean"
- MLflow 3.1 uses "relevance_to_query/mean"
- Without aliases, threshold checks fail silently
- Deployment succeeds with failing scores (BAD!)
See references/threshold-checking.md for complete check_thresholds() function.
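An alias-aware check might look roughly like the sketch below. It is not the check_thresholds() function from references/threshold-checking.md: resolve_metric() and check_thresholds_sketch() are hypothetical names, and the metric values and thresholds are made up for illustration. The important property is that a metric that cannot be found counts as a failure instead of being skipped.

```python
from typing import Dict, List, Optional, Tuple

METRIC_ALIASES: Dict[str, List[str]] = {
    "relevance/mean": ["relevance_to_query/mean"],
}


def resolve_metric(
    metrics: Dict[str, float],
    name: str,
    aliases: Dict[str, List[str]] = METRIC_ALIASES,
) -> Optional[float]:
    """Look up a metric under its canonical name first, then under each alias."""
    for candidate in [name, *aliases.get(name, [])]:
        if candidate in metrics:
            return metrics[candidate]
    return None


def check_thresholds_sketch(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Return (passed, failures); a missing metric is a failure, never a silent pass."""
    failures = []
    for name, minimum in thresholds.items():
        value = resolve_metric(metrics, name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return (not failures, failures)


# Illustrative values only: the run logged the newer relevance metric name.
run_metrics = {"relevance_to_query/mean": 0.82, "safety/mean": 0.95, "guidelines/mean": 0.40}
thresholds = {"relevance/mean": 0.7, "safety/mean": 0.9, "guidelines/mean": 0.5}
passed, failures = check_thresholds_sketch(run_metrics, thresholds)
```

Here the relevance threshold is satisfied through the alias, and only the guidelines score blocks deployment.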
Foundation Model Endpoints Recommendation
ALWAYS use foundation model endpoints (NOT pay-per-token) for judges.
✅ Recommended Endpoints
- endpoints:/databricks-claude-sonnet-4-5 (recommended)
- endpoints:/databricks-meta-llama-3-1-405b-instruct
- endpoints:/databricks-claude-3-7-sonnet
❌ Avoid Pay-Per-Token Endpoints
- Evaluation is high-volume
- Pay-per-token gets expensive fast
- Foundation models included in workspace DBU consumption
See references/custom-scorer-patterns.md for complete endpoint list.
Validation Checklist
Before running agent evaluation:
Dataset & Configuration
- Evaluation dataset loaded with correct schema (request, response columns)
- 4-6 essential guidelines defined (not 8+)
- Run name follows convention: eval_pre_deploy_YYYYMMDD_HHMMSS
- Evaluation experiment set correctly
Custom Scorers (CRITICAL)
- ✅ _extract_response_text() helper included in ALL custom scorers
- ✅ Databricks SDK used for LLM calls (NOT langchain_databricks)
- ✅ _call_llm_for_scoring() helper defined and used
- Custom judges use @mlflow.trace and @scorer decorators
- Custom judges return a Feedback object with value and rationale
- Foundation model endpoints used (not pay-per-token)
- Temperature = 0.0 for judge consistency
Threshold Checking
- ✅ METRIC_ALIASES defined for backward compatibility
- ✅ check_thresholds() function used (handles aliases)
- Thresholds defined for all judges
- Results checked against thresholds before deployment
Reference Files
- references/custom-scorer-patterns.md - Complete custom scorer patterns with _call_llm_for_scoring() helper
- references/built-in-judges.md - Built-in judges (relevance, safety, guidelines) with thresholds
- references/threshold-checking.md - check_thresholds() function with metric aliases support
- references/evaluation-dataset-patterns.md - Evaluation dataset schema and loading patterns
- scripts/evaluation_helpers.py - Complete helper functions (_extract_response_text(), _call_llm_for_scoring(), check_thresholds())
References
Official Documentation
Related Skills
- ml-pipeline-setup - MLflow model patterns
- responses-agent-patterns - ResponsesAgent implementation patterns
Scorer vs Evaluator Semantics
make_judge() returns an InstructionsJudge scorer callable — an object intended for use inside mlflow.genai.evaluate(scorers=[...]). Scorers are not standalone evaluators:
- Scorers are callables that receive structured inputs from mlflow.genai.evaluate() and return Feedback objects. They have no .evaluate() method.
- mlflow.genai.evaluate() is the evaluator: it orchestrates the predict function, passes data through scorers, and logs results to MLflow.
For inline/conditional LLM calls outside the mlflow.genai.evaluate() harness (e.g., arbiter conditional scoring, ad-hoc quality checks), use direct LLM calls:
- _call_llm_for_scoring() via w.serving_endpoints.query() (Databricks SDK)
- Parse JSON verdicts from the LLM response manually
| Use Case | Correct Approach | Wrong Approach |
|---|---|---|
| Running all judges on benchmark suite | mlflow.genai.evaluate(scorers=[judge1, judge2]) | Calling each judge manually in a loop |
| Conditional scoring (arbiter fires only on disagreement) | _call_llm_for_scoring(prompt) inside a @scorer | make_judge().evaluate(data): there is no .evaluate() method |
| Quick ad-hoc LLM quality check | w.serving_endpoints.query(...) | make_judge()(inputs): wrong call signature |
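The conditional-scoring row can be sketched as plain control flow. This is an illustrative outline only: score_with_arbiter(), the max_gap value, and the stub arbiter are assumptions, and in practice arbiter_fn would wrap a direct LLM call such as _call_llm_for_scoring() inside a @scorer.

```python
from typing import Callable, Dict, List


def score_with_arbiter(
    score_a: float,
    score_b: float,
    arbiter_fn: Callable[[], float],
    max_gap: float = 0.3,
) -> Dict[str, object]:
    """Average two judge scores, calling the arbiter only when they disagree."""
    if abs(score_a - score_b) <= max_gap:
        return {"score": (score_a + score_b) / 2, "arbiter_used": False}
    # Disagreement: defer to the (more expensive) arbiter verdict.
    return {"score": arbiter_fn(), "arbiter_used": True}


arbiter_calls: List[int] = []


def stub_arbiter() -> float:
    # Stand-in for a direct LLM call; records that it was invoked.
    arbiter_calls.append(1)
    return 0.5


agreed = score_with_arbiter(0.5, 0.75, stub_arbiter)
disputed = score_with_arbiter(0.25, 1.0, stub_arbiter)
```

Because the arbiter is an ordinary callable, it is invoked once per disagreement and never when the two judges agree, which keeps the extra LLM cost proportional to the number of disputed records.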
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Calling make_judge().evaluate() for standalone scoring | AttributeError: 'InstructionsJudge' object has no attribute 'evaluate' | Use _call_llm_for_scoring() for inline/conditional LLM calls, or pass scorers to mlflow.genai.evaluate(scorers=[...]) |