agent-evaluation
Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
⛔ CRITICAL: Must Use MLflow APIs
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:
- Datasets: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- Scorers: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- Evaluation: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- Scripts: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories
Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
Table of Contents
Quick Start
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 5 steps (each uses MLflow APIs):
- Understand: Run agent, inspect traces, understand purpose
- Scorers: Select and register scorers for quality criteria
- Dataset: ALWAYS discover existing datasets first, only create new if needed
- 3.5 Dry Run: Run 3 questions first — catch broken tools and misconfigured scorers before full eval
- Evaluate: Run agent on dataset, apply scorers, analyze results
Command Conventions
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
Documentation Access Protocol
All MLflow documentation must be accessed through llms.txt:
- Start at: https://mlflow.org/docs/latest/llms.txt
- Query llms.txt for your topic with a specific prompt
- If llms.txt references another doc, use WebFetch with that URL
- Do not use WebSearch - use WebFetch with llms.txt first
This applies to all steps, especially:
- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand mlflow.genai.evaluate API)
Discovering Agent Structure
Each project has unique structure. Use dynamic exploration instead of assumptions:
Find Agent Entry Points
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
Understand Project Structure
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Setup Overview
Pre-check: Use Existing Environment
Before doing ANY setup, check if MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID are already set:
echo "MLFLOW_TRACKING_URI=$MLFLOW_TRACKING_URI"
echo "MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID"
If BOTH are already set, skip Steps 1-2 entirely. The environment is pre-configured. Do NOT run setup_mlflow.py, do NOT create a .env file, do NOT override these values. Go directly to Step 3 (tracing integration) and the evaluation workflow.
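The same pre-check can be done from Python, for example at the top of a setup script. This is a minimal sketch using the two variable names this guide relies on:

```python
import os

# Mirror of the shell pre-check above: both variables must be set and non-empty
required = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")
missing = [name for name in required if not os.environ.get(name)]

if missing:
    print(f"Not pre-configured; run setup to define: {', '.join(missing)}")
else:
    print("Environment pre-configured; skip Steps 1-2")
```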
Setup Steps (only if environment is NOT pre-configured)
- Install MLflow (version >=3.8.0)
- Configure environment (tracking URI and experiment)
  - Guide: Follow `references/setup-guide.md` Steps 1-2
- Integrate tracing (autolog and @mlflow.trace decorators)
  - ⚠️ MANDATORY: Use the `instrumenting-with-mlflow-tracing` skill for tracing setup
  - ✓ VERIFY: Run `scripts/validate_tracing_runtime.py` after implementing
⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and @mlflow.trace decorators added
- Test run creates a trace (verify trace ID is not None)
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
Evaluation Workflow
Step 1: Agent Interview (REQUIRED — do not skip)
Before doing anything else, ask the user these questions. Do NOT proceed until you have answers.
Required:
- "What does your agent do? Describe its purpose in 1-2 sentences."
- "What are the 2-3 most important things it needs to get right?"
- "Are there common failure modes you've already noticed?"
Use answers to:
- Derive scorer names and criteria (do not invent them)
- Write the `agent_description` parameter for `generate_evals_df`
- Set evaluation priorities
If running in automated mode: Read agent purpose from the codebase (SKILL.md, README, or main entry point docstring). Still surface what you found and confirm before proceeding.
Step 2: Define Quality Scorers
- Check registered scorers in your experiment:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
IMPORTANT: if there are registered scorers in the experiment then they must be used for evaluation.
- Select additional built-in scorers that apply to the agent
See references/scorers.md for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.
- Create additional custom scorers as needed
If needed, create additional scorers using the make_judge() API. See references/scorers.md on how to create custom scorers and references/scorers-constraints.md for best practices.
⚠️ CRITICAL — Scorer Return Values: Scorers MUST instruct the LLM judge to return `"yes"` or `"no"` (or booleans/numerics). Return values of `"pass"` or `"fail"` are silently cast to `None` by `_cast_assessment_value_to_float` and excluded from `results.metrics` with no error or warning — results simply disappear. See `references/scorers-constraints.md` Constraint 2 for the full list of safe vs. broken return values.
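To make the failure mode concrete, here is a pure-Python sketch of that casting rule. It illustrates the behavior to remember when writing judge prompts; it is not MLflow's actual `_cast_assessment_value_to_float` implementation:

```python
# Illustrative sketch of the casting rule (NOT MLflow's internal code):
# booleans and numerics survive, "yes"/"no" map to 1.0/0.0, and any other
# string (including "pass"/"fail") silently becomes None.
def cast_assessment_value(value):
    if isinstance(value, bool):  # check bool before int/float (bool is an int subclass)
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return {"yes": 1.0, "no": 0.0}.get(value.strip().lower())
    return None

assert cast_assessment_value("yes") == 1.0    # safe
assert cast_assessment_value(True) == 1.0     # safe
assert cast_assessment_value("pass") is None  # silently dropped from metrics
```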
- REQUIRED: Register new scorers before evaluation using the Python API:

from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import BuiltinScorerName

scorer = make_judge(...)  # Or: scorer = BuiltinScorerName()
scorer.register()
IMPORTANT: See references/scorers.md → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.
⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in mlflow scorers list and won't be reusable.
- Verify registration:
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID # Should show your scorers
Step 3: Prepare Evaluation Dataset
ALWAYS discover existing datasets first to prevent duplicate work:
- Run dataset discovery (mandatory):

uv run python scripts/list_datasets.py                 # Lists, compares, recommends datasets
uv run python scripts/list_datasets.py --format json   # Machine-readable output
uv run python scripts/list_datasets.py --help          # All options
- Present findings to user:
- Show all discovered datasets with their characteristics (size, topics covered)
- If datasets found, highlight most relevant options based on agent type
- Ask user about existing datasets:
- "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
- If yes: Ask which dataset to use and record the dataset name — skip to Step 3.5
- If no: Proceed to Phase A below
If creating a new dataset, use the two-phase approach below.
Phase A: Sanity Check (5 questions — always run first)
Create a minimal 5-question dataset manually from the Step 1 interview answers. The goal is to confirm the pipeline works end-to-end before investing in large-scale generation.
import mlflow
from mlflow.genai.datasets import create_dataset
# Derive 5 representative questions directly from the agent's stated purpose
# and known failure modes identified in Step 1
sanity_records = [
{"inputs": {"query": "<question 1 from interview>"}, "expected_response": "<expected answer>"},
{"inputs": {"query": "<question 2 from interview>"}, "expected_response": "<expected answer>"},
# ... 5 total
]
sanity_dataset = create_dataset(name="sanity-check-5q")
sanity_dataset.merge_records(sanity_records)
Run evaluation on this dataset (see Step 4), then present results to the user with this framing:
"This is a sanity check — 5 questions confirm the pipeline works but aren't statistically meaningful. Proceeding to Phase B to generate a proper evaluation set."
Only proceed to Phase B once Phase A completes without errors.
Phase B: Proper Evaluation Dataset (100+ questions — run after Phase A passes)
Generate questions from the agent's actual corpus rather than inventing them from scratch. The approach depends on whether the project uses Databricks or OSS MLflow.
On Databricks — use generate_evals_df to synthesize questions from the agent's document corpus:
from databricks.agents.evals import generate_evals_df, estimate_synthetic_num_evals
import mlflow
# agent_description comes from Step 1 interview answers
agent_description = "<agent purpose from interview>"
# docs_df: a Spark or pandas DataFrame with a "content" column containing
# the documents/chunks the agent retrieves from (e.g., your Vector Search index)
evals = generate_evals_df(
docs=docs_df,
num_evals=100,
agent_description=agent_description,
)
# Merge into MLflow dataset — don't create a separate dataset
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(evals)
To estimate the right num_evals before generating:
recommended = estimate_synthetic_num_evals(docs_df)
print(f"Recommended num_evals: {recommended}")
Dataset size guidance:
- <30 questions: not statistically meaningful — avoid drawing conclusions
- 50–100 questions: adequate for catching regressions, suitable for most agents
- 200+ questions: recommended when comparing model variants or scoring multiple dimensions
On OSS MLflow — use RAGAS TestsetGenerator to generate from your document corpus:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
generator = TestsetGenerator(
llm=LangchainLLMWrapper(your_llm),
embedding_model=LangchainEmbeddingsWrapper(your_embeddings),
)
testset = generator.generate_with_langchain_docs(docs, testset_size=100)
evals_df = testset.to_pandas()
# Convert to MLflow dataset schema and merge
import mlflow
records = [
{"inputs": {"query": row["user_input"]}, "expected_response": row["reference"]}
for _, row in evals_df.iterrows()
]
dataset = mlflow.genai.datasets.create_dataset(name="generated-evals-100q")
dataset.merge_records(records)
If no document corpus is available — ask the user to provide 50–100 representative queries from production logs or usage history. These are more realistic than synthetic questions and are preferable when available.
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
Checkpoint - verify before proceeding:
- Scorers have been registered
- Phase A sanity check passed (pipeline runs end-to-end)
- Phase B dataset created with 50+ questions (or existing dataset selected)
Step 3.5: Dry Run (REQUIRED before full eval)
Run evaluation on 3 questions from the dataset before committing to the full run. This catches broken tools, misconfigured scorers, and auth failures early — before they silently corrupt 100-question results.
If you completed Phase A above, the pipeline is already validated — focus the dry run on scorer output only.
import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
dry_run_records = dataset.df.head(3)
Run mlflow.genai.evaluate() on these 3 records using the same wrapper and scorers as the full eval.
For each response, check:
- Tool calls — Did the agent call any tools? If it called zero tools on questions that require retrieval, tools are likely broken (403s, rate limits, missing credentials).
- Response quality — Are responses empty or generic ("I don't know", "I can't help with that")? Empty responses score as irrelevant and will skew the full eval.
- Scorer output — Did all 3 scores come back as `0` or `None`? If so, the scorer is misconfigured (check return values — `"pass"`/`"fail"` are silently cast to `None`; use `"yes"`/`"no"` instead).
Decision gate:
- If dry run shows tool failures or empty responses: Stop. Fix the underlying issue (auth, tool config, retrieval) before proceeding. Do not run the full eval on broken infrastructure.
- If all 3 scorer outputs are 0 or None: Stop. Debug scorer return values and re-register before proceeding.
- If dry run passes: Report to the user: "Dry run passed (3/3 responses non-empty, tools called, scores non-zero). Proceeding to full eval." Then continue to Step 4.
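The gate above can be encoded as a small helper. This is a hypothetical sketch: the record fields (`response`, `tool_calls`, `score`) are assumptions about what your wrapper collects per question, not an MLflow schema.

```python
# Hypothetical dry-run gate; field names are assumptions, not an MLflow schema.
def dry_run_gate(records):
    if any(not r.get("response", "").strip() for r in records):
        return "STOP: empty responses, fix agent/auth before the full eval"
    if any(r.get("tool_calls", 0) == 0 for r in records):
        return "STOP: zero tool calls, check tool config and credentials"
    if all(r.get("score") in (0, None) for r in records):
        return "STOP: all scores are 0/None, debug scorer return values"
    return "PASS: proceed to full eval"

records = [
    {"response": "MLflow tracks...", "tool_calls": 2, "score": 1.0},
    {"response": "Tracing works by...", "tool_calls": 1, "score": 1.0},
    {"response": "Scorers judge...", "tool_calls": 3, "score": 0.0},
]
print(dry_run_gate(records))  # PASS: proceed to full eval
```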
Why this matters: Tool failures (403s from docs scraping, GitHub API rate limits) produce empty agent responses that score as 0. Running a 100-question eval only to discover all tools were failing wastes time and produces misleading results. The dry run catches this in under a minute.
Step 4: Run Evaluation
Large datasets (50+ questions)? See `references/throughput-guide.md` for throughput optimization — covers parallelism env vars, async predict functions, and dataset splitting for 200+ question evals.
4a. Estimate Runtime Before Starting
Before launching evaluation, tell the user how long it will take:
- Count the dataset questions:

import mlflow
dataset = mlflow.genai.datasets.get_dataset(name="<your-dataset-name>")
print(f"Dataset size: {len(dataset.df)} questions")
- Calculate the estimate — each question runs the agent once and the judge scorer once:
  - Opus-class judge models (e.g. `claude-opus-4`): ~45–90s per question
  - Sonnet-class judge models (e.g. `claude-sonnet-4`): ~20–45s per question
  - Multiple scorers per question add time proportionally
  - Estimated time = N questions × 30–60s per question ÷ parallelism factor (typically 4–8x)
- Tell the user before starting:

"This dataset has N questions. At ~30–60s per question with typical parallelism, evaluation will take approximately X–Y minutes. I'll run it as a background task so you can continue working — I'll summarize the results when it's done."
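The arithmetic above can be sketched as a small helper; the 30–60s per-question range and the parallelism factor are the rough assumptions stated in step 2, not measured values.

```python
# Rough estimate: N questions x 30-60s each, divided by the parallelism factor.
def estimate_eval_minutes(n_questions, secs_low=30, secs_high=60, parallelism=4):
    low = n_questions * secs_low / parallelism / 60
    high = n_questions * secs_high / parallelism / 60
    return low, high

low, high = estimate_eval_minutes(100)
print(f"100 questions: roughly {low:.0f}-{high:.0f} minutes")
```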
4b. Generate the Evaluation Script
# Generate evaluation script (specify module and entry point)
uv run python scripts/run_evaluation_template.py \
--module mlflow_agent.agent \
--entry-point run_agent
The generated script creates a wrapper function that:
- Accepts keyword arguments matching the dataset's input keys
- Provides any additional arguments the agent needs (like `llm_provider`)
- Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
- Saves results to `evaluation_results.csv`
⚠️ CRITICAL: wrapper Signature Must Match Dataset Input Keys
MLflow calls predict_fn(**inputs) - it unpacks the inputs dict as keyword arguments.
| Dataset Record | MLflow Calls | predict_fn Must Be |
|---|---|---|
| `{"inputs": {"query": "..."}}` | `predict_fn(query="...")` | `def wrapper(query):` |
| `{"inputs": {"question": "...", "context": "..."}}` | `predict_fn(question="...", context="...")` | `def wrapper(question, context):` |
Common Mistake (WRONG):
def wrapper(inputs): # ❌ WRONG - inputs is NOT a dict
return agent(inputs["query"])
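For contrast, a correct sketch, where `call_agent` is a placeholder for your real agent entry point:

```python
def call_agent(query):
    # Placeholder for your real agent entry point
    return f"answer to: {query}"

def wrapper(query):  # ✅ parameter name matches the dataset's input key
    return call_agent(query)

# MLflow calls predict_fn(**inputs), which is equivalent to:
inputs = {"query": "What does the agent do?"}
print(wrapper(**inputs))  # answer to: What does the agent do?
```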
4c. Launch as a Background Sub-Agent
Run the evaluation as a background sub-agent so the main session stays available. Use the Agent tool with run_in_background: true:
Sub-agent instructions (pass these verbatim):
Run the agent evaluation and write results to scratchpad.
Steps:
1. cd <project-directory>
2. Run: uv run python run_agent_evaluation.py
3. When complete, write a summary to scratchpad/eval-results.md with:
- Exit status (success or error message)
- Path to results file (evaluation_results.csv)
- Wall-clock time taken
4. Return only: "Evaluation complete. Results written to scratchpad/eval-results.md"
In the main session, poll for completion by checking for the scratchpad file rather than blocking:
# Poll every 30s using Glob
# Glob("scratchpad/eval-results.md")
# When the file appears, read it and proceed to analysis
Do NOT use TaskOutput to wait for the background agent — that dumps the full transcript (~10–20k tokens) into the main context.
4d. Analyze Results (after evaluation completes)
Once scratchpad/eval-results.md appears, run analysis:
# Pattern detection, failure analysis, recommendations
# Reads the CSV produced by mlflow.genai.evaluate() above
uv run python scripts/analyze_results.py evaluation_results.csv
Generates evaluation_report.md with per-scorer pass rates and improvement suggestions.
The script reads {scorer_name}/value and {scorer_name}/rationale columns from the CSV.
It also accepts the legacy JSON format from mlflow traces evaluate for backward compatibility:
uv run python scripts/analyze_results.py evaluation_results.json # legacy format
uv run python scripts/analyze_results.py evaluation_results.csv --output my_report.md # custom output
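As an illustration of how those columns are laid out, here is a small pandas sketch of a per-scorer pass-rate computation. The column names follow the `{scorer_name}/value` convention above; the data and the `correctness` scorer name are made up:

```python
import pandas as pd

# Made-up results using the "{scorer_name}/value" column convention,
# where judge verdicts were cast to 1.0 ("yes") / 0.0 ("no")
df = pd.DataFrame({
    "correctness/value": [1.0, 0.0, 1.0, 1.0],
    "correctness/rationale": ["ok", "wrong year", "ok", "ok"],
})

for col in [c for c in df.columns if c.endswith("/value")]:
    scorer = col.rsplit("/", 1)[0]
    print(f"{scorer}: {df[col].mean():.0%} pass rate")  # correctness: 75% pass rate
```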
References
Detailed guides in references/ (load as needed):
- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- Tracing: Use the `instrumenting-with-mlflow-tracing` skill (authoritative guide for autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions
- throughput-guide.md - Parallelism env vars, async predict_fn, dataset splitting for 200+ question evals
Scripts are self-documenting - run with --help for usage details.