skills/mlflow/skills/analyzing-mlflow-session

analyzing-mlflow-session

SKILL.md

Analyzing an MLflow Chat Session

What is a Session?

A session groups multiple traces that belong to the same chat conversation or user interaction. Each trace in the session represents one turn: the user's input and the system's response. Traces within a session are linked by a shared session ID stored in trace metadata.

The session ID is stored in trace metadata under the key mlflow.trace.session. This key contains dots, which affects filter syntax (see below). All traces sharing the same value for this key belong to the same session.

Reconstructing the Conversation

Reconstructing a session's conversation is a multi-step process: discover the input/output schema from the first trace, extract those fields efficiently across all session traces, then inspect specific turns as needed. Do NOT fetch full traces for every turn — use --extract-fields on the search command instead.

Step 1: Discover the schema. First, find a trace ID from the session, then fetch its full JSON to inspect the schema:

# Get the first trace in the session
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id' \
  --output json \
  --max-results 1 > /tmp/first_trace.json

# Fetch the full trace (always outputs JSON, no --output flag needed)
mlflow traces get \
  --trace-id <TRACE_ID_FROM_ABOVE> > /tmp/trace_detail.json

Find the root span — the span with parent_span_id equal to null (i.e., it has no parent). This is the top-level operation in the trace:

# Find the root span
jq '.data.spans[] | select(.parent_span_id == null)' /tmp/trace_detail.json

Examine its attributes dict to identify which keys hold the user input and system output. These could be:

  • MLflow standard attributes: mlflow.spanInputs and mlflow.spanOutputs (set by the MLflow Python client)
  • Custom attributes: Application-specific keys set via @mlflow.trace or mlflow.start_span() with custom attribute logging
  • Third-party OTel attributes: Keys following GenAI Semantic Conventions, OpenInference, or other instrumentation conventions

The structure of these values also varies by application (e.g., a query string, a messages array, a dict with multiple fields). Inspect the actual attribute values to understand the format.

If the root span has empty or missing inputs/outputs, it may be a wrapper span (e.g., an orchestrator or middleware) that doesn't directly carry the chat turn data. In that case, look at its immediate children — find the closest span to the top of the hierarchy that has meaningful inputs and outputs corresponding to a chat turn:

The following example assumes the trace comes from the MLflow Python client (which stores inputs/outputs in mlflow.spanInputs/mlflow.spanOutputs) and that the relevant span is a direct child of root. In practice, the relevant span may be deeper in the hierarchy, and traces from other clients may use different attribute keys — explore the span tree as needed:

# Get the root span's ID
ROOT_ID=$(jq -r '.data.spans[] | select(.parent_span_id == null) | .span_id' /tmp/trace_detail.json)

# List immediate children of the root span with their inputs/outputs
jq --arg root "$ROOT_ID" '.data.spans[] | select(.parent_span_id == $root) | {name: .name, inputs: .attributes["mlflow.spanInputs"], outputs: .attributes["mlflow.spanOutputs"]}' /tmp/trace_detail.json

Also check the first trace's assessments. Session-level assessments are attached to the first trace in the session — these evaluate the session as a whole (e.g., overall conversation quality, multi-turn coherence) and can indicate the presence of issues somewhere across the entire session, not just the first turn. The first trace may also have per-turn assessments for that specific turn.

Both types appear in .info.assessments. Session-level assessments are identified by the presence of mlflow.trace.session in their metadata field:

# Show session-level assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"]) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

# Show per-turn assessments (exclude scorer errors)
jq '[.info.assessments[] | select(.feedback.error == null) | select(.metadata["mlflow.trace.session"] == null) | {name: .assessment_name, value: .feedback.value}]' /tmp/trace_detail.json

Assessment errors are not trace errors. If an assessment has a feedback.error field, it means the scorer or judge failed — not that the trace itself has a problem. Exclude these when using assessments to identify trace issues.

Always consult the rationale when interpreting assessment values. The value alone can be misleading — for example, a user_frustration assessment with value: "no" could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The .rationale field (a top-level assessment field, not nested under .feedback) explains what the value means in context. Include rationale when extracting assessments:

jq '[.info.assessments[] | select(.feedback.error == null) | {name: .assessment_name, value: .feedback.value, rationale: .rationale}]' /tmp/trace_detail.json

Step 2: Extract across all session traces. Once you know which attribute keys hold inputs and outputs, search for all traces in the session using --extract-fields to pull those fields along with assessments (see Handling CLI Output for why output is written to a file):

mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  --filter-string 'metadata.`mlflow.trace.session` = "<SESSION_ID>"' \
  --order-by "timestamp_ms ASC" \
  --extract-fields 'info.trace_id,info.state,info.request_time,info.assessments,info.trace_metadata.`mlflow.traceInputs`,info.trace_metadata.`mlflow.traceOutputs`' \
  --output json \
  --max-results 100 > /tmp/session_traces.json

Then use bash commands (e.g., jq, wc, head) on the file to analyze it.

The --extract-fields example above uses mlflow.traceInputs/mlflow.traceOutputs from trace metadata — adjust the field paths based on what you discovered in step 1.

Assessments contain quality judgments (e.g., correctness, relevance) that can pinpoint which turns had issues without needing to read every trace in detail. To identify which turns have assessment signals (excluding scorer errors):

# List turns with their valid assessments (scorer errors filtered out)
jq '.traces[] | {
  trace_id: .info.trace_id,
  time: .info.request_time,
  state: .info.state,
  assessments: [.info.assessments[]? | select(.feedback.error == null) | {
    name: .assessment_name,
    value: .feedback.value
  }]
}' /tmp/session_traces.json

CLI syntax notes:

  • --experiment-id is required for all mlflow traces search commands. The command will fail without it.
  • Metadata keys containing dots must be escaped with backticks in filter strings and extract-fields: metadata.`mlflow.trace.session`
  • Shell quoting: Backticks inside double quotes are interpreted by bash as command substitution (e.g., bash will try to run `mlflow.trace.session` as a command). Always use single quotes for the outer string when the value contains backticks. For example: --filter-string 'metadata.\mlflow.trace.session` = "value"'`
  • --max-results defaults to 100, which is sufficient for most sessions. Increase up to 500 (the maximum) for longer conversations. If 500 results are returned, use pagination to retrieve the rest.

Handling CLI Output

MLflow trace output can be large, and Claude Code's Bash tool has a ~30KB output limit for piped commands. When output exceeds this threshold, it gets saved to a file instead of being piped, causing silent failures.

Safe approach (always works):

# Step 1: Save to file
mlflow traces search \
  --experiment-id <EXPERIMENT_ID> \
  [...] \
  --output json > /tmp/output.json

# Step 2: Process the file
cat /tmp/output.json | jq '.traces[0].info.trace_id'
head -50 /tmp/output.json
wc -l /tmp/output.json

Never pipe MLflow CLI output directly (e.g., mlflow traces search ... | jq '.'). This can silently produce no output. Always redirect to a file first, then run commands on the file.

To inspect a specific turn in detail (e.g., after identifying a problematic turn), fetch its full trace:

mlflow traces get --trace-id <TRACE_ID> > /tmp/turn_detail.json

Codebase Correlation

  • Session ID assignment: Search the codebase for where mlflow.trace.session is set to understand how sessions are created — per user login, per browser tab, per explicit "new conversation" action, etc.
  • Context window management: Look for how the application constructs the message history passed to the LLM at each turn. Common patterns include sliding window (last N messages), summarization of older turns, or full history. This implementation determines what context the model sees and is a frequent source of multi-turn failures.
  • Memory and state: Some applications maintain state across turns beyond message history (e.g., extracted entities, user preferences, accumulated tool results). Search for how this state is stored and passed between turns.

Reference Scripts

The scripts/ subdirectory contains ready-to-run bash scripts for each analysis step. All scripts follow the output handling rules above (redirect to file, then process).

  • scripts/discover_schema.sh <EXPERIMENT_ID> <SESSION_ID> — Finds the first trace in the session, fetches its full detail, and prints the root span's attribute keys and input/output values.
  • scripts/inspect_turn.sh <TRACE_ID> — Fetches a specific trace, lists all spans, highlights error spans, and shows assessments.

Example: Wrong Answer on Chat Turn 5

A user reports that their chatbot gave an incorrect answer on the 5th message of a chat conversation.

1. Discover the schema and reconstruct the conversation.

Fetch the first trace in the session and inspect the root span's attributes to find which keys hold inputs and outputs. In this case, mlflow.spanInputs contains the user query and mlflow.spanOutputs contains the assistant response. Then search all session traces, extracting those fields in chronological order. Scanning the extracted inputs and outputs confirms that turn 5's response is wrong, and reveals whether earlier turns look correct.

2. Check if the error originated in an earlier turn.

Turn 3's response contains a factual error that the user didn't challenge. Turn 4 builds on that incorrect information, and turn 5 compounds it. The root cause is in turn 3, not turn 5.

3. Analyze the root-cause turn as a single trace.

Fetch the full trace for turn 3 and analyze it — examine assessments (if any), walk the span tree, check retriever results, and correlate with code. The retriever returned an outdated document, causing the wrong answer.

4. Recommendations.

  • Fix the retriever's data source to exclude or update outdated documents.
  • Add per-turn assessments to detect errors before they propagate across the conversation.
  • Consider implementing conversation-level error detection (e.g., checking consistency of answers across turns).
Weekly Installs
69
Repository
mlflow/skills
GitHub Stars
15
First Seen
Feb 5, 2026
Installed on
gemini-cli68
github-copilot68
codex67
opencode66
amp65
kimi-cli65