Production Monitoring Patterns

Patterns for continuous monitoring of GenAI agents in production using MLflow registered scorers, on-demand assessment, trace archival, and metric backfill.

When to Use

  • Setting up continuous production monitoring for GenAI agents
  • Creating registered scorers with sampling strategies
  • Implementing on-demand evaluation workflows
  • Archiving MLflow traces for analysis
  • Backfilling historical metrics
  • Troubleshooting production scorer failures

Two Approaches: Registered Scorers vs On-Demand assess()

| Approach | Use Case | Sampling | Cost |
|---|---|---|---|
| Registered Scorers | Continuous monitoring | Configurable (1-100%) | Higher (continuous) |
| On-Demand assess() | Periodic evaluation | 100% (when run) | Lower (on-demand) |

Registered Scorers (Continuous)

from mlflow.entities import Feedback
from mlflow.genai.scorers import ScorerSamplingConfig, scorer

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # Scorer logic goes here; the value below is a placeholder
    return Feedback(value=0.95, rationale="...")

# Register the scorer under a name
safety_registered = safety_scorer.register(name="safety_scorer")

# Start scorer (begins continuous monitoring on 10% of traces);
# start() returns a new instance, so keep the return value
scorer_instance = safety_registered.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.1)
)

For complete registered scorer patterns, see: references/registered-scorers.md

On-Demand assess()

# Evaluate specific traces on-demand
results = mlflow.genai.assess(
    traces=traces_df,
    evaluators=[safety_scorer, relevance_scorer]
)

Use when:

  • Periodic evaluation (weekly, monthly)
  • Ad-hoc analysis
  • Cost optimization (evaluate only when needed)

⚠️ CRITICAL: Immutable Scorer Pattern

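MANDATORY: Treat scorers as immutable. start() returns a new instance rather than modifying the scorer in place, so always capture its return value (scorer_instance = scorer.start()).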

❌ WRONG: Mutable Scorer

registered = safety_scorer.register(...)
registered.start()  # ❌ Return value discarded; `registered` itself is unchanged
registered.stop()   # ❌ May not work correctly

✅ CORRECT: Immutable Pattern

registered = safety_scorer.register(...)
scorer_instance = registered.start()  # ✅ Returns new instance
scorer_instance.stop()  # ✅ Works correctly

Why this matters:

  • start() returns a new instance and leaves the original object unchanged, so later lifecycle calls must go through the returned instance
  • Prevents state corruption
  • Enables proper start/stop/delete operations
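
The idea can be illustrated in plain Python. The sketch below is a toy model, not MLflow's implementation: `ToyScorer` and its fields are hypothetical names chosen for this example. A frozen dataclass forces lifecycle methods to return new instances instead of mutating in place, which is why a discarded return value has no effect.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyScorer:
    """Hypothetical stand-in for a registered scorer; not an MLflow class."""

    name: str
    sample_rate: float = 0.0
    running: bool = False

    def start(self, sample_rate: float) -> "ToyScorer":
        # Return a new, running copy; `self` is left untouched.
        return replace(self, sample_rate=sample_rate, running=True)

    def stop(self) -> "ToyScorer":
        # Return a new, stopped copy.
        return replace(self, sample_rate=0.0, running=False)


registered = ToyScorer(name="safety_scorer")

# Mutable-style usage: the return value is discarded, so nothing changes.
registered.start(sample_rate=0.1)
still_stopped = registered.running

# Immutable-style usage: capture the new instance and use it from here on.
instance = registered.start(sample_rate=0.1)
now_running = instance.running
```

Here `still_stopped` stays `False` while `now_running` is `True`, mirroring the wrong and correct patterns above.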

For complete immutable patterns, see: references/registered-scorers.md


Registered Scorer Quick Pattern with Sampling

from mlflow.entities import Feedback
from mlflow.genai.scorers import ScorerSamplingConfig, scorer

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    """Safety scorer; started below with 10% sampling for illustration."""
    # Import helpers inside the scorer (see "External Imports in Scorers")
    from evaluation_helpers import _extract_response_text

    response_text = _extract_response_text(outputs)
    # ... scoring logic
    return Feedback(value=0.95, rationale="Safe response")

# Register the scorer under a name
safety_scorer_registered = safety_scorer.register(name="safety_scorer")

# Start continuous monitoring; 10% of traces are evaluated
safety_scorer_instance = safety_scorer_registered.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.1)
)

Suggested sampling rates by scorer type (the 10% above only illustrates the parameter):

  • Safety: 100% (critical)
  • Relevance: 10-20% (moderate volume)
  • Guidelines: 5-10% (lower priority)
  • Custom: 1-5% (cost optimization); fast heuristic scorers can run at 100%
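
To see what these rates mean for cost, a rough estimate of scorer invocations per day can help. The sketch below is a back-of-the-envelope calculation: the daily trace volume is a made-up number, the upper bound of each range is used, and `expected_evaluations` is a hypothetical helper rather than part of any library.

```python
# Sampling rates from the list above (upper bound where a range is given).
SAMPLING_RATES = {
    "safety": 1.00,
    "relevance": 0.20,
    "guidelines": 0.10,
    "custom": 0.05,
}


def expected_evaluations(daily_traces: int, rates: dict[str, float]) -> dict[str, int]:
    """Expected number of scorer invocations per day for each scorer."""
    return {name: round(daily_traces * rate) for name, rate in rates.items()}


# Assumed volume of 10,000 traces per day, for illustration only.
per_scorer = expected_evaluations(daily_traces=10_000, rates=SAMPLING_RATES)
total = sum(per_scorer.values())
```

With these assumptions the safety scorer dominates the total, which is the trade-off for evaluating every trace.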

Custom Heuristic Scorer Pattern (Fast, 100% Sampling)

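from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def fast_heuristic_scorer(inputs, outputs, expectations=None):
    """
    Fast heuristic scorer - runs on 100% of traces.

    Use for lightweight checks (length, format, keywords).
    """
    # Import helpers inside the scorer (see "External Imports in Scorers")
    from evaluation_helpers import _extract_response_text

    response_text = _extract_response_text(outputs)

    # Fast checks: start from a perfect score and subtract a penalty per issue
    score = 1.0
    issues = []

    if len(response_text) < 10:
        score -= 0.5
        issues.append("Response too short")

    if "ERROR" in response_text.upper():
        score -= 0.3
        issues.append("Contains error keyword")

    return Feedback(
        value=max(0.0, score),  # clamp so the score never goes negative
        rationale="; ".join(issues) if issues else "Passed heuristic checks"
    )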

On-Demand assess() Pattern

from datetime import datetime

import mlflow

def evaluate_production_traces(
    start_time: datetime,
    end_time: datetime,
    evaluators: list,
):
    """
    Evaluate production traces on-demand.

    Useful for periodic evaluation or ad-hoc analysis.
    """
    # Query traces from the time range
    # (query_traces is a project-specific helper, not defined here)
    traces_df = query_traces(start_time, end_time)

    # Run evaluation
    results = mlflow.genai.assess(
        traces=traces_df,
        evaluators=evaluators,
        experiment_name="/Shared/health_monitor_agent/production_eval"
    )

    return results
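
For periodic runs, the `start_time` and `end_time` arguments usually come from a fixed schedule. The sketch below shows one way to split a date range into consecutive windows. `evaluation_windows` is a hypothetical helper written for this example, and the dates are arbitrary.

```python
from datetime import datetime, timedelta
from typing import Iterator


def evaluation_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    """Yield consecutive half-open [window_start, window_end) ranges covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    window_start = start
    while window_start < end:
        # The last window is truncated so it never runs past `end`.
        window_end = min(window_start + step, end)
        yield window_start, window_end
        window_start = window_end


windows = list(
    evaluation_windows(
        start=datetime(2026, 1, 1),
        end=datetime(2026, 1, 20),
        step=timedelta(days=7),
    )
)
```

Each `(window_start, window_end)` pair can then be passed to `evaluate_production_traces`. Half-open windows avoid scoring a trace on the boundary twice.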

Trace Archival Quick Setup

import mlflow

# Enable trace archival to Unity Catalog
mlflow.enable_databricks_trace_archival(
    catalog="catalog",
    schema="schema",
    table="agent_traces"
)

# Traces automatically archived to Delta table
# Query: SELECT * FROM catalog.schema.agent_traces

For complete trace archival patterns, see: references/trace-archival.md


Production vs Dev Evaluation Comparison

| Aspect | Development | Production |
|---|---|---|
| Method | mlflow.genai.evaluate() | Registered scorers + assess() |
| Sampling | 100% (full dataset) | Configurable (1-100%) |
| Frequency | On-demand | Continuous + periodic |
| Cost | Low (one-time) | Higher (continuous) |
| Dataset | Static eval dataset | Live production traces |
| Thresholds | Pre-deployment gates | Continuous monitoring |
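
The same threshold check can serve both rows of the Thresholds line: as a gate before deployment and as an alert condition in production. A minimal sketch follows, assuming scores have already been collected per scorer. `check_thresholds`, the scorer names, and the threshold values are all illustrative.

```python
from statistics import mean


def check_thresholds(
    scores: dict[str, list[float]], thresholds: dict[str, float]
) -> dict[str, bool]:
    """Return, per scorer, whether the mean score meets its threshold.

    Scorers with no recorded scores are reported as failing so that a
    silent scorer outage does not look like a pass.
    """
    results = {}
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        results[name] = bool(values) and mean(values) >= minimum
    return results


gate = check_thresholds(
    scores={"safety": [1.0, 0.9, 1.0], "relevance": [0.6, 0.7]},
    thresholds={"safety": 0.95, "relevance": 0.7, "guidelines": 0.8},
)
```

In this example only `safety` passes: `relevance` averages below its threshold and `guidelines` has no scores at all.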

Unified Scorer Definitions Concept

Define scorers once, use in both development and production:

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import ScorerSamplingConfig, scorer

# Define scorer (reusable)
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # ... scorer logic
    return Feedback(value=0.95, rationale="...")

# Use in development evaluation
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,  # callable that invokes the agent under test
    scorers=[safety_scorer],
)

# Use in production (register + start); keep the instance returned by start()
safety_registered = safety_scorer.register(name="safety_scorer")
safety_instance = safety_registered.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0)
)

❌/✅ Patterns

Immutable Pattern

# ✅ CORRECT: Immutable pattern
registered = safety_scorer.register(...)
scorer_instance = registered.start()  # Returns new instance
scorer_instance.stop()

# ❌ WRONG: Mutable pattern
registered = safety_scorer.register(...)
registered.start()  # Return value discarded; `registered` itself is unchanged
registered.stop()   # May not work

External Imports in Scorers

# ✅ CORRECT: Import helpers inside scorer
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    from evaluation_helpers import _extract_response_text
    response_text = _extract_response_text(outputs)
    # ...

# ❌ WRONG: External imports at module level
from evaluation_helpers import _extract_response_text  # ❌ May fail in production

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    response_text = _extract_response_text(outputs)
    # ...

Validation Checklist

Before deploying production monitoring:

Registered Scorers

  • Immutable pattern used (scorer = scorer.start())
  • Sampling rates configured appropriately
  • Scorer lifecycle managed (register/start/stop/delete)
  • External imports inside scorer functions

On-Demand Assessment

  • assess() function implemented for periodic evaluation
  • Trace querying patterns implemented
  • Results logged to experiments

Trace Archival

  • Trace archival enabled
  • Delta table schema configured
  • Query patterns for archived traces

Metric Backfill

  • Backfill workflow implemented (if needed)
  • Historical analysis patterns defined
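
As a starting point for the backfill workflow, it helps to compute which days still lack metrics so that re-running the job is idempotent. The sketch below assumes metrics are tracked per calendar day. `missing_days` is a hypothetical helper and the dates are arbitrary.

```python
from datetime import date, timedelta


def missing_days(start: date, end: date, already_scored: set[date]) -> list[date]:
    """Days in [start, end] (inclusive) that have no metrics yet."""
    days = []
    current = start
    while current <= end:
        if current not in already_scored:
            days.append(current)
        current += timedelta(days=1)
    return days


to_backfill = missing_days(
    start=date(2026, 2, 1),
    end=date(2026, 2, 5),
    already_scored={date(2026, 2, 2), date(2026, 2, 4)},
)
```

Only the days in `to_backfill` need to be evaluated, so a failed run can be retried without scoring the same traces twice.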

Reference Files

  • references/registered-scorers.md - Complete registered scorer patterns with lifecycle management
  • references/trace-archival.md - Trace archival setup and querying patterns
  • references/metric-backfill.md - Backfill patterns for historical metrics
  • references/monitoring-dashboard-queries.md - SQL queries for monitoring dashboards
  • scripts/register_production_scorers.py - Complete script to register all production scorers

References

Official Documentation

Related Skills

  • mlflow-genai-evaluation - Development evaluation patterns
  • deployment-automation - Deployment workflows
  • responses-agent-patterns - ResponsesAgent implementation

Version History

| Date | Changes |
|---|---|
| Feb 6, 2026 | Initial version: Production monitoring with immutable scorer pattern |