agent-evaluation
Installation
SKILL.md
Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
⛔ CRITICAL: Must Use MLflow APIs
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:
- Datasets: Use
mlflow.genai.datasets.create_dataset()- NOT custom test case files - Scorers: Use
mlflow.genai.scorersandmlflow.genai.judges.make_judge()- NOT custom scorer functions - Evaluation: Use
mlflow.genai.evaluate()- NOT custom evaluation loops - Scripts: Use the provided
scripts/directory templates - NOT customevaluation/directories
Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.