agent-evaluation
Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
Table of Contents
- Quick Start
- Command Conventions
- Documentation Access Protocol
- Pre-Flight Validation
- Discovering Agent Structure
- Setup Overview
- Evaluation Workflow
- References
Quick Start
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 4 steps:
- Understand: Run agent, inspect traces, understand purpose
- Define: Select/create scorers for quality criteria
- Dataset: ALWAYS discover existing datasets first, only create new if needed
- Evaluate: Run agent on dataset, apply scorers, analyze results
Command Conventions
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# WRONG - mixes progress bars and logs with JSON output
uv run mlflow traces evaluate ... --output json > results.json
# CORRECT - separates stderr from JSON output
uv run mlflow traces evaluate ... --output json 2>/dev/null > results.json
# ALTERNATIVE - save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
When to separate streams:
- Any command with the --output json flag
- Commands that output structured data (CSV, JSON, XML)
- When piping output to parsing tools (jq, grep, etc.)
When NOT to separate:
- Interactive commands where you want to see progress
- Debugging scenarios where logs provide context
- Commands that only output unstructured text
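The same separation applies when driving the CLI from Python. A minimal sketch, assuming you capture stdout as JSON (the command arguments shown are illustrative, not a complete invocation):
import json
import subprocess

# Capture stdout only; send progress bars and log lines on stderr to DEVNULL.
proc = subprocess.run(
    ["uv", "run", "mlflow", "traces", "evaluate", "--output", "json"],  # illustrative args
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    check=True,
)
results = json.loads(proc.stdout)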
Documentation Access Protocol
All MLflow documentation must be accessed through llms.txt:
- Start at: https://mlflow.org/docs/latest/llms.txt
- Query llms.txt for your topic with a specific prompt
- If llms.txt references another doc, use WebFetch with that URL
- Do not use WebSearch - use WebFetch with llms.txt first
This applies to all steps, especially:
- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand mlflow.genai.evaluate API)
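If you need to follow this protocol from a script rather than with WebFetch, a rough sketch (the topic keyword is illustrative):
import urllib.request

# Step 1: fetch the llms.txt index.
index = urllib.request.urlopen("https://mlflow.org/docs/latest/llms.txt").read().decode()

# Step 2: find entries mentioning your topic to get the doc URL to fetch next.
scorer_links = [line for line in index.splitlines() if "scorer" in line.lower()]
print(scorer_links)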
Pre-Flight Validation
Validate environment before starting:
uv run mlflow --version # Should be >=3.8.0
uv run python -c "import mlflow; print(f'MLflow {mlflow.__version__} installed')"
If MLflow is missing or version is <3.8.0, see Setup Overview below.
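The same version gate can be enforced in Python, e.g. at the top of a setup script. A minimal sketch (assumes the packaging library is available in the environment):
import mlflow
from packaging.version import Version  # assumption: packaging is installed

assert Version(mlflow.__version__) >= Version("3.8.0"), (
    f"MLflow {mlflow.__version__} is too old; >=3.8.0 is required"
)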
Discovering Agent Structure
Each project has unique structure. Use dynamic exploration instead of assumptions:
Find Agent Entry Points
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -r "def (run|stream|handle|process)" . --include="*.py"
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -r "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
Find Tracing Integration
# Find autolog calls
grep -r "mlflow.*autolog" . --include="*.py"
# Find trace decorators
grep -r "@mlflow.trace" . --include="*.py"
# Check imports
grep -r "import mlflow" . --include="*.py"
Understand Project Structure
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
Setup Overview
Before evaluation, complete these three setup steps:
- Install MLflow (version >=3.8.0)
- Configure environment (tracking URI and experiment)
  - Guide: Follow references/setup-guide.md Steps 1-2
- Integrate tracing (autolog and @mlflow.trace decorators)
  - ⚠️ MANDATORY: Follow references/tracing-integration.md - the authoritative tracing guide
  - ✓ VERIFY: Run scripts/validate_agent_tracing.py after implementing
⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and @mlflow.trace decorators added
- Test run creates a trace (verify trace ID is not None)
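To verify the last checkpoint item end to end, a minimal sketch (the agent function is a stand-in, and mlflow.get_last_active_trace_id is assumed from the MLflow 3 tracing API):
import mlflow

mlflow.openai.autolog()  # or the autolog integration for your framework

@mlflow.trace
def run_agent(question: str) -> str:
    # Illustrative stand-in for your real agent entry point.
    return "stub answer"

run_agent("ping")
trace_id = mlflow.get_last_active_trace_id()  # assumption: MLflow 3 fluent API
assert trace_id is not None, "Tracing is not working - stop and troubleshoot"
print(f"Trace created: {trace_id}")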
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
For complete setup instructions: See references/setup-guide.md
Evaluation Workflow
Step 1: Understand Agent Purpose
- Invoke agent with sample input
- Inspect the MLflow trace (especially the LLM prompts, which describe the agent's purpose); see the sketch below
- Print your understanding and ask user for verification
- Wait for confirmation before proceeding
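A minimal sketch of this step (run_agent is a hypothetical entry point - substitute the one found under Discovering Agent Structure; the trace-reading calls are assumed from the MLflow 3 API):
import mlflow

response = run_agent("What can you help me with?")  # sample input

trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
for span in trace.data.spans:
    # LLM/chat spans usually carry the system prompt in their inputs,
    # which is the clearest statement of the agent's purpose.
    print(span.name, span.span_type)
    print(span.inputs)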
Step 2: Define Quality Scorers
1. Discover built-in scorers using the documentation protocol:
   - Query https://mlflow.org/docs/latest/llms.txt for "What built-in LLM judges or scorers are available?"
   - Read scorer documentation to understand their purpose and requirements
   - Note: Do NOT use mlflow scorers list -b - use the documentation instead for accurate information
2. Check registered scorers in your experiment:
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
3. Identify quality dimensions for your agent and select appropriate scorers
4. Register scorers and test on a sample trace before full evaluation (see the custom-scorer sketch after the references below)
For scorer selection and registration: See references/scorers.md
For CLI constraints (yes/no format, template variables): See references/scorers-constraints.md
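As referenced in item 4 above, a hedged sketch of a custom scorer (the name and pass criterion are illustrative; scorers used via the CLI have extra format constraints covered in references/scorers-constraints.md):
from mlflow.genai.scorers import scorer

@scorer
def concise_answer(outputs) -> bool:
    # Illustrative heuristic: pass when the agent's answer stays under ~200 words.
    return len(str(outputs).split()) < 200

# Try it on one sample record or trace before the full evaluation,
# then register it as described in references/scorers.md.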
Step 3: Prepare Evaluation Dataset
ALWAYS discover existing datasets first to prevent duplicate work:
1. Run dataset discovery (mandatory):
   uv run python scripts/list_datasets.py                 # Lists, compares, recommends datasets
   uv run python scripts/list_datasets.py --format json   # Machine-readable output
   uv run python scripts/list_datasets.py --help          # All options
2. Present findings to user:
   - Show all discovered datasets with their characteristics (size, topics covered)
   - If datasets are found, highlight the most relevant options based on agent type
3. Ask user about existing datasets:
   - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
   - If yes: Ask which dataset to use and record the dataset name
   - If no: Proceed to step 4
4. Create new dataset only if user declined existing ones:
   # Generates dataset creation script from test cases file
   uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
   uv run python scripts/create_dataset_template.py --help   # See all options
   Generated code uses mlflow.genai.datasets APIs - review and execute the script (a hedged record-format sketch follows the reference link below).
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
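As noted in item 4 above, a hedged sketch of the record format used with the mlflow.genai.datasets APIs (the dataset name and fields are illustrative, and the create_dataset parameters are an assumption that may differ by MLflow version - follow references/dataset-preparation.md for the authoritative API):
from mlflow.genai.datasets import create_dataset

# Assumption: OSS MLflow 3.x accepts a plain dataset name; on Databricks a
# Unity Catalog table name is used instead.
dataset = create_dataset(name="agent-eval-v1")
dataset.merge_records([
    {
        "inputs": {"question": "How do I enable MLflow autologging?"},
        "expectations": {"expected_facts": ["mentions mlflow autolog"]},
    },
])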
Step 4: Run Evaluation
1. Generate traces:
   # Generates evaluation script (auto-detects agent module, entry point, dataset)
   uv run python scripts/run_evaluation_template.py
   uv run python scripts/run_evaluation_template.py --help   # Override auto-detection
   Generated script uses mlflow.genai.evaluate() - review and execute it (a hedged sketch of the call appears after this list).
2. Apply scorers:
   # IMPORTANT: Redirect stderr to avoid mixing logs with JSON output
   uv run mlflow traces evaluate \
     --trace-ids <comma_separated_trace_ids> \
     --scorers <scorer1>,<scorer2>,... \
     --output json 2>/dev/null > evaluation_results.json
3. Analyze results:
   # Pattern detection, failure analysis, recommendations
   uv run python scripts/analyze_results.py evaluation_results.json
   Generates evaluation_report.md with pass rates and improvement suggestions.
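As referenced in item 1 above, a hedged sketch of what the generated script roughly does (the predict function, dataset variable, and scorer choice are illustrative):
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

def predict_fn(question: str) -> str:
    # Illustrative wrapper around your traced agent entry point;
    # keyword arguments match the keys of each record's "inputs".
    return run_agent(question)

results = mlflow.genai.evaluate(
    data=eval_dataset,             # an mlflow.genai dataset or a list of {"inputs": ...} records
    predict_fn=predict_fn,         # invoked once per record; its traces are scored
    scorers=[RelevanceToQuery()],  # built-in scorer; swap in the ones chosen in Step 2
)
print(results.metrics)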
References
Detailed guides in references/ (load as needed):
- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- tracing-integration.md - Authoritative tracing guide (autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions
Scripts are self-documenting - run with --help for usage details.