agent-evaluation
Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
Quick Start
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 4 steps:
- Understand: Run agent, inspect traces, understand purpose
- Define: Select/create scorers for quality criteria
- Dataset: ALWAYS discover existing datasets first, only create new if needed
- Evaluate: Run agent on dataset, apply scorers, analyze results
Command Conventions
Always use uv run for MLflow and Python commands:
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
# WRONG - mixes progress bars and logs with JSON output
uv run mlflow traces evaluate ... --output json > results.json
# CORRECT - separates stderr from JSON output
uv run mlflow traces evaluate ... --output json 2>/dev/null > results.json
# ALTERNATIVE - save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
When to separate streams:
- Any command with the --output json flag
- Commands that output structured data (CSV, JSON, XML)
- When piping output to parsing tools (jq, grep, etc.)
When NOT to separate:
- Interactive commands where you want to see progress
- Debugging scenarios where logs provide context
- Commands that only output unstructured text
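Even with streams separated, it is worth failing loudly if a log line does sneak into a results file. A minimal sketch of a defensive loader (the load_results helper and its error message are illustrative, not part of the MLflow CLI):

```python
import json

def load_results(path):
    """Load evaluation results, failing loudly if stderr leaked into the file."""
    with open(path) as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise SystemExit(
            f"{path} is not clean JSON (did a progress bar or log line leak in?): {err}"
        )
```

This turns a confusing downstream parse error into an immediate, actionable message.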
Documentation Access Protocol
All MLflow documentation must be accessed through llms.txt:
- Start at: https://mlflow.org/docs/latest/llms.txt
- Query llms.txt for your topic with a specific prompt
- If llms.txt references another doc, use WebFetch with that URL
- Do not use WebSearch - use WebFetch with llms.txt first
This applies to all steps, especially:
- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand mlflow.genai.evaluate API)
Pre-Flight Validation
Validate environment before starting:
uv run mlflow --version # Should be >=3.8.0
uv run python -c "import mlflow; print(f'MLflow {mlflow.__version__} installed')"
If MLflow is missing or version is <3.8.0, see Setup Overview below.
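The version check above can also be done programmatically without importing MLflow (useful in setup scripts where an import failure would abort the check). A minimal sketch; parse_version and mlflow_ready are illustrative helpers, not MLflow APIs:

```python
from importlib import metadata

def parse_version(ver: str) -> tuple:
    """Turn a version string like '3.8.0' into a comparable tuple of ints."""
    parts = []
    for piece in ver.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def mlflow_ready(minimum=(3, 8, 0)):
    """Return (ok, detail) without importing mlflow itself."""
    try:
        installed = metadata.version("mlflow")
    except metadata.PackageNotFoundError:
        return False, "mlflow is not installed"
    return parse_version(installed) >= minimum, installed
```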
Discovering Agent Structure
Each project has unique structure. Use dynamic exploration instead of assumptions:
Find Agent Entry Points
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"
# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null
# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
Find Tracing Integration
# Find autolog calls
grep -r "mlflow.*autolog" . --include="*.py"
# Find trace decorators
grep -r "@mlflow.trace" . --include="*.py"
# Check imports
grep -r "import mlflow" . --include="*.py"
Understand Project Structure
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"
# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100
# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
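The grep commands above can be combined into one pass over the project tree. A minimal sketch (the regexes are heuristics mirroring the searches above, not a definitive detector):

```python
import re
from pathlib import Path

# Heuristic patterns mirroring the grep commands above
ENTRY_POINT = re.compile(r"def\s+\w*(agent|run|stream|handle|process)\w*\s*\(")
TRACING = re.compile(r"mlflow\.[\w.]*autolog|@mlflow\.trace")

def scan_project(root="."):
    """Return {path: [(line_no, line), ...]} for entry-point and tracing hits."""
    hits = {}
    for py_file in Path(root).rglob("*.py"):
        matches = []
        for no, line in enumerate(py_file.read_text(errors="ignore").splitlines(), 1):
            if ENTRY_POINT.search(line) or TRACING.search(line):
                matches.append((no, line.strip()))
        if matches:
            hits[str(py_file)] = matches
    return hits
```

Treat the output as candidates to inspect, not ground truth - confirm the actual entry point by reading the code.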
Setup Overview
Before evaluation, complete these three setup steps:
- Install MLflow (version >=3.8.0)
- Configure environment (tracking URI and experiment)
  - Guide: Follow references/setup-guide.md, Steps 1-2
- Integrate tracing (autolog and @mlflow.trace decorators)
  - ⚠️ MANDATORY: Follow references/tracing-integration.md - the authoritative tracing guide
  - ✓ VERIFY: Run scripts/validate_agent_tracing.py after implementing
⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and @mlflow.trace decorators added
- Test run creates a trace (verify trace ID is not None)
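The environment-variable item in the checkpoint can be scripted; this complements scripts/validate_environment.py. A minimal sketch (missing_env is an illustrative helper, not an MLflow API):

```python
import os

REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")

def missing_env(required=REQUIRED_VARS, env=None):
    """Names from `required` that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

An empty return value means both variables are set; anything else names exactly what to fix before proceeding.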
Validation scripts:
uv run python scripts/validate_environment.py # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py # Test authentication before expensive operations
For complete setup instructions: See references/setup-guide.md
Evaluation Workflow
Step 1: Understand Agent Purpose
- Invoke agent with sample input
- Inspect MLflow trace (especially LLM prompts describing agent purpose)
- Print your understanding and ask user for verification
- Wait for confirmation before proceeding
Step 2: Define Quality Scorers
- Discover built-in scorers using the documentation protocol:
  - Query https://mlflow.org/docs/latest/llms.txt for "What built-in LLM judges or scorers are available?"
  - Read scorer documentation to understand each scorer's purpose and requirements
  - Note: Do NOT use mlflow scorers list -b - use the documentation instead for accurate information
- Check registered scorers in your experiment:
  uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
- Identify quality dimensions for your agent and select appropriate scorers
- Register scorers and test them on a sample trace before running a full evaluation
For scorer selection and registration: See references/scorers.md
For CLI constraints (yes/no format, template variables): See references/scorers-constraints.md
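For intuition, a custom scorer's core logic can be as small as a deterministic check returning the yes/no format that references/scorers-constraints.md describes. A minimal sketch only - keyword_coverage is a hypothetical example, and actual registration must follow the scorer APIs in the MLflow docs:

```python
def keyword_coverage(outputs: str, expected_keywords) -> str:
    """Toy deterministic scorer: 'yes' if every expected keyword appears in the output."""
    text = outputs.lower()
    return "yes" if all(kw.lower() in text for kw in expected_keywords) else "no"
```

Testing a function like this on one sample trace before full evaluation (as step 4 above instructs) catches format mismatches cheaply.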
Step 3: Prepare Evaluation Dataset
ALWAYS discover existing datasets first to prevent duplicate work:
- Run dataset discovery (mandatory):
  uv run python scripts/list_datasets.py                # Lists, compares, recommends datasets
  uv run python scripts/list_datasets.py --format json  # Machine-readable output
  uv run python scripts/list_datasets.py --help         # All options
- Present findings to the user:
  - Show all discovered datasets with their characteristics (size, topics covered)
  - If datasets are found, highlight the most relevant options based on the agent type
- Ask the user about existing datasets:
  - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
  - If yes: Ask which dataset to use and record the dataset name
  - If no: Proceed to step 4
- Create a new dataset only if the user declined the existing ones:
  # Generates dataset creation script from a test cases file
  uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
  uv run python scripts/create_dataset_template.py --help  # See all options
  Generated code uses mlflow.genai.datasets APIs - review and execute the script.
IMPORTANT: Do not skip dataset discovery. Always run list_datasets.py first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.
For complete dataset guide: See references/dataset-preparation.md
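When preparing a test cases file, a quick structural check on the records avoids discovering schema problems mid-evaluation. A minimal sketch - the record shape shown is a plausible assumption, and the authoritative schema should be taken from the GenAI dataset docs via llms.txt:

```python
# Hypothetical shape for test-case records; confirm against the GenAI dataset docs.
test_cases = [
    {
        "inputs": {"question": "What does the refund policy cover?"},
        "expectations": {"expected_facts": ["30-day window", "original payment method"]},
    },
]

def validate_cases(cases):
    """Fail fast on records missing the fields the evaluation expects."""
    problems = []
    for i, case in enumerate(cases):
        if "inputs" not in case or not case["inputs"]:
            problems.append(f"case {i}: missing or empty 'inputs'")
        if "expectations" not in case:
            problems.append(f"case {i}: missing 'expectations'")
    return problems
```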
Step 4: Run Evaluation
- Generate traces:
  # Generates evaluation script (auto-detects agent module, entry point, dataset)
  uv run python scripts/run_evaluation_template.py
  uv run python scripts/run_evaluation_template.py --help  # Override auto-detection
  Generated script uses mlflow.genai.evaluate() - review and execute it.
- Apply scorers:
  # IMPORTANT: Redirect stderr to avoid mixing logs with JSON output
  uv run mlflow traces evaluate \
    --trace-ids <comma_separated_trace_ids> \
    --scorers <scorer1>,<scorer2>,... \
    --output json 2>/dev/null > evaluation_results.json
- Analyze results:
  # Pattern detection, failure analysis, recommendations
  uv run python scripts/analyze_results.py evaluation_results.json
  Generates evaluation_report.md with pass rates and improvement suggestions.
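For ad-hoc checks beyond scripts/analyze_results.py, a pass rate can be computed directly from the results file. A minimal sketch - the assessments shape assumed here is illustrative, and the real output of mlflow traces evaluate may differ, so inspect the JSON first:

```python
def pass_rate(results, scorer_name):
    """Fraction of assessments named `scorer_name` whose value is 'yes'.

    Assumes a shape like {"assessments": [{"name": ..., "value": ...}]};
    the actual CLI output may differ - inspect it before relying on this.
    """
    assessments = [a for a in results.get("assessments", []) if a.get("name") == scorer_name]
    if not assessments:
        return None  # scorer never ran; distinguish from a 0.0 pass rate
    passed = sum(1 for a in assessments if str(a.get("value")).lower() == "yes")
    return passed / len(assessments)
```

Returning None for an absent scorer keeps "never evaluated" distinct from "evaluated and always failed".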
References
Detailed guides in references/ (load as needed):
- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- tracing-integration.md - Authoritative tracing guide (autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions
Scripts are self-documenting - run with --help for usage details.