# MLflow Metrics
Run `scripts/fetch_metrics.py` to query metrics from an MLflow tracking server.
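All examples below assume a tracking server at `http://localhost:5000`. If one is not already running there, a local server can be started with the standard MLflow CLI (assuming the `mlflow` package is installed):

```bash
# Start a local MLflow tracking server on port 5000
mlflow server --host 127.0.0.1 --port 5000
```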
## Examples

Token usage summary:

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG
```

Output: `AVG: 223.91  SUM: 7613`

Hourly token trend (last 24h):

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
  -t 3600 --start-time="-24h" --end-time=now
```

Output: time-bucketed token sums per hour.

Latency percentiles by trace:

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name
```

Error rate by status:

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status
```

Quality scores by evaluator (assessments):

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
  -m assessment_value -a AVG,P50 -d assessment_name
```

Output: average and median scores for each evaluator (e.g., correctness, relevance).

Assessment count by name:

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
  -m assessment_count -a COUNT -d assessment_name
```
JSON output: add `-o json` to any command.
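For example, the first query above with JSON output, pretty-printed with `jq` (the exact JSON schema is defined by the script, so no field names are assumed here):

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG -o json | jq .
```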
## Arguments

| Arg | Required | Description |
|---|---|---|
| `-s, --server` | Yes | MLflow server URL |
| `-x, --experiment-ids` | Yes | Experiment IDs (comma-separated) |
| `-m, --metric` | Yes | `trace_count`, `latency`, `input_tokens`, `output_tokens`, `total_tokens` |
| `-a, --aggregations` | Yes | `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `P50`, `P95`, `P99` |
| `-d, --dimensions` | No | Group by: `trace_name`, `trace_status` |
| `-t, --time-interval` | No | Bucket size in seconds (3600 = hourly, 86400 = daily) |
| `--start-time` | No | `-24h`, `-7d`, `now`, ISO 8601, or epoch ms (see the example below the table) |
| `--end-time` | No | Same formats as `--start-time` |
| `-o, --output` | No | `table` (default) or `json` |
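The epoch-millisecond form of `--start-time`/`--end-time` can be built in the shell. A hypothetical sketch assuming GNU `date` (Linux) and made-up dates:

```bash
# Build epoch-millisecond timestamps (GNU date prints epoch seconds; append 000 for ms)
START=$(date -u -d "2024-06-01T00:00:00Z" +%s)000
END=$(date -u -d "2024-06-02T00:00:00Z" +%s)000

# Query a fixed one-day window using those timestamps
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
  --start-time="$START" --end-time="$END"
```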
For SPANS metrics (`span_count`, `latency`), add `-v SPANS`.
For ASSESSMENTS metrics (`assessment_value`, `assessment_count`), add `-v ASSESSMENTS`.
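As a minimal sketch (same assumed local server and experiment as the examples above), a SPANS query for span counts could look like:

```bash
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v SPANS -m span_count -a COUNT
```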
See `references/api_reference.md` for filter syntax and full API details.