# Production Monitoring Patterns

Production-grade patterns for continuous monitoring of GenAI agents using MLflow registered scorers, on-demand assessment, trace archival, and metric backfill.

## When to Use
- Setting up continuous production monitoring for GenAI agents
- Creating registered scorers with sampling strategies
- Implementing on-demand evaluation workflows
- Archiving MLflow traces for analysis
- Backfilling historical metrics
- Troubleshooting production scorer failures
## Two Approaches: Registered Scorers vs On-Demand assess()
| Approach | Use Case | Sampling | Cost |
|---|---|---|---|
| Registered Scorers | Continuous monitoring | Configurable (1-100%) | Higher (continuous) |
| On-Demand assess() | Periodic evaluation | 100% (when run) | Lower (on-demand) |
### Registered Scorers (Continuous)

```python
import mlflow
from mlflow.models import scorer
from mlflow.metrics import Score

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # Scorer logic
    return Score(value=0.95, rationale="...")

# Register scorer
safety_registered = mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=0.1,  # 10% sampling
)

# Start scorer (begins continuous monitoring; capture the returned instance)
scorer_instance = safety_registered.start()
```
For complete registered scorer patterns, see: references/registered-scorers.md
### On-Demand assess()

```python
# Evaluate specific traces on-demand
results = mlflow.genai.assess(
    traces=traces_df,
    evaluators=[safety_scorer, relevance_scorer],
)
```
Use when:
- Periodic evaluation (weekly, monthly)
- Ad-hoc analysis
- Cost optimization (evaluate only when needed)
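The cost trade-off between the two approaches can be put in rough numbers. The sketch below is back-of-envelope arithmetic only; the traffic figures are illustrative, not from the source:

```python
def evaluated_traces_per_day(traces_per_day: int, sampling_rate: float) -> int:
    """Expected number of traces a registered scorer evaluates per day."""
    return int(traces_per_day * sampling_rate)

# Hypothetical traffic: 50,000 traces/day.
# A registered scorer at 10% sampling evaluates ~5,000 traces every day;
# a weekly on-demand assess() over the same traffic evaluates 350,000 traces,
# but only once per week and only when you choose to run it.
continuous_daily = evaluated_traces_per_day(50_000, 0.1)
weekly_batch = 50_000 * 7
print(continuous_daily, weekly_batch)  # 5000 350000
```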
## ⚠️ CRITICAL: Immutable Scorer Pattern

**MANDATORY:** Scorers are immutable - always capture the return value of lifecycle calls: `scorer_instance = scorer.start()`.
**❌ WRONG: Mutable Scorer**

```python
scorer = mlflow.models.register_scorer(...)
scorer.start()  # ❌ Modifies scorer object
scorer.stop()   # ❌ May not work correctly
```

**✅ CORRECT: Immutable Pattern**

```python
scorer = mlflow.models.register_scorer(...)
scorer_instance = scorer.start()  # ✅ Returns new instance
scorer_instance.stop()            # ✅ Works correctly
```
Why this matters:
- Scorer lifecycle operations require immutable pattern
- Prevents state corruption
- Enables proper start/stop/delete operations
For complete immutable patterns, see: references/registered-scorers.md
## Registered Scorer Quick Pattern with Sampling

```python
import mlflow
from mlflow.models import scorer
from mlflow.metrics import Score

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    """Safety scorer with 10% sampling."""
    response_text = _extract_response_text(outputs)
    # ... scoring logic
    return Score(value=0.95, rationale="Safe response")

# Register with sampling
safety_scorer_registered = mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=0.1,  # 10% of traces evaluated
)

# Start continuous monitoring
safety_scorer_instance = safety_scorer_registered.start()
```
Sampling rates by scorer type:
- Safety: 100% (critical)
- Relevance: 10-20% (moderate volume)
- Guidelines: 5-10% (lower priority)
- Custom: 1-5% (cost optimization)
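The tiering above can be centralized in a small config helper so every registration pulls its rate from one place. A minimal sketch; the tier names, prefix-matching convention, and exact rates are assumptions chosen from the list above:

```python
# Illustrative tier table (rates picked from the midpoints of the ranges above)
SAMPLING_TIERS = {
    "safety": 1.0,        # critical: evaluate every trace
    "relevance": 0.15,    # moderate volume
    "guidelines": 0.075,  # lower priority
    "custom": 0.03,       # cost optimization
}

def sampling_rate_for(scorer_name: str, default: float = 0.05) -> float:
    """Look up the sampling rate for a scorer by name prefix, with a safe default."""
    for tier, rate in SAMPLING_TIERS.items():
        if scorer_name.startswith(tier):
            return rate
    return default

print(sampling_rate_for("safety_scorer"))  # 1.0
```

A helper like this keeps dev and prod registrations from drifting apart when rates are tuned later.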
## Custom Heuristic Scorer Pattern (Fast, 100% Sampling)

```python
@scorer
def fast_heuristic_scorer(inputs, outputs, expectations=None):
    """
    Fast heuristic scorer - runs on 100% of traces.
    Use for lightweight checks (length, format, keywords).
    """
    from evaluation_helpers import _extract_response_text  # import inside the scorer
    response_text = _extract_response_text(outputs)

    # Fast checks
    score = 1.0
    issues = []
    if len(response_text) < 10:
        score -= 0.5
        issues.append("Response too short")
    if "ERROR" in response_text.upper():
        score -= 0.3
        issues.append("Contains error keyword")

    return Score(
        value=max(0.0, score),
        rationale="; ".join(issues) if issues else "Passed heuristic checks",
    )
```
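Because the heuristic makes no model calls, the scoring logic can live in a plain function and be unit-tested before it is wrapped in `@scorer`. A sketch of that split (the function name is hypothetical, not part of the source):

```python
def heuristic_checks(response_text: str) -> tuple[float, list[str]]:
    """Pure scoring logic, reusable inside the @scorer wrapper and in tests."""
    score = 1.0
    issues = []
    if len(response_text) < 10:
        score -= 0.5
        issues.append("Response too short")
    if "ERROR" in response_text.upper():
        score -= 0.3
        issues.append("Contains error keyword")
    return max(0.0, score), issues

# A well-formed response passes both checks
print(heuristic_checks("This is a well-formed answer."))  # (1.0, [])
```

The `@scorer`-decorated function then just extracts the text and delegates to this helper, so the checks stay testable without a tracking server.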
## On-Demand assess() Pattern

```python
from datetime import datetime

import mlflow

def evaluate_production_traces(
    start_time: datetime,
    end_time: datetime,
    evaluators: list,
):
    """
    Evaluate production traces on-demand.
    Useful for periodic evaluation or ad-hoc analysis.
    """
    # Query traces from time range (query_traces is a user-defined helper)
    traces_df = query_traces(start_time, end_time)

    # Run evaluation
    results = mlflow.genai.assess(
        traces=traces_df,
        evaluators=evaluators,
        experiment_name="/Shared/health_monitor_agent/production_eval",
    )
    return results
```
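`query_traces` is left to the caller. One hedged sketch of the filtering half: given trace records already fetched from your tracking backend, keep only those inside the window. The record shape and the `timestamp_ms` field name are assumptions, not an MLflow contract - adapt them to however your traces are exported:

```python
from datetime import datetime

def filter_traces_by_window(
    traces: list[dict], start_time: datetime, end_time: datetime
) -> list[dict]:
    """Keep traces whose timestamp_ms falls inside [start_time, end_time)."""
    start_ms = start_time.timestamp() * 1000
    end_ms = end_time.timestamp() * 1000
    return [t for t in traces if start_ms <= t["timestamp_ms"] < end_ms]

traces = [
    {"trace_id": "a", "timestamp_ms": datetime(2026, 2, 1, 12).timestamp() * 1000},
    {"trace_id": "b", "timestamp_ms": datetime(2026, 2, 8, 12).timestamp() * 1000},
]
recent = filter_traces_by_window(traces, datetime(2026, 2, 5), datetime(2026, 2, 10))
print([t["trace_id"] for t in recent])  # ['b']
```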
## Trace Archival Quick Setup

```python
import mlflow

# Enable trace archival to Unity Catalog
mlflow.enable_databricks_trace_archival(
    catalog="catalog",
    schema="schema",
    table="agent_traces",
)

# Traces are automatically archived to a Delta table
# Query: SELECT * FROM catalog.schema.agent_traces
```
For complete trace archival patterns, see: references/trace-archival.md
## Production vs Dev Evaluation Comparison

| Aspect | Development | Production |
|---|---|---|
| Method | `mlflow.genai.evaluate()` | Registered scorers + `assess()` |
| Sampling | 100% (full dataset) | Configurable (1-100%) |
| Frequency | On-demand | Continuous + periodic |
| Cost | Low (one-time) | Higher (continuous) |
| Dataset | Static eval dataset | Live production traces |
| Thresholds | Pre-deployment gates | Continuous monitoring |
## Unified Scorer Definitions Concept

Define scorers once, use them in both development and production:

```python
# Define scorer (reusable)
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    # ... scorer logic
    return Score(value=0.95, rationale="...")

# Use in development evaluation
results = mlflow.genai.evaluate(
    model=model_uri,
    data=eval_dataset,
    evaluators=[safety_scorer],
)

# Use in production (register + start)
safety_registered = mlflow.models.register_scorer(
    scorer=safety_scorer,
    name="safety_scorer",
    sampling_rate=1.0,
)
safety_instance = safety_registered.start()  # capture the returned instance
```
## ❌/✅ Patterns

### Immutable Pattern

```python
# ✅ CORRECT: Immutable pattern
scorer = mlflow.models.register_scorer(...)
scorer_instance = scorer.start()  # Returns new instance
scorer_instance.stop()

# ❌ WRONG: Mutable pattern
scorer = mlflow.models.register_scorer(...)
scorer.start()  # Modifies scorer object
scorer.stop()   # May not work
```

### External Imports in Scorers

```python
# ✅ CORRECT: Import helpers inside the scorer
@scorer
def safety_scorer(inputs, outputs, expectations=None):
    from evaluation_helpers import _extract_response_text
    response_text = _extract_response_text(outputs)
    # ...

# ❌ WRONG: External imports at module level
from evaluation_helpers import _extract_response_text  # ❌ May fail in production

@scorer
def safety_scorer(inputs, outputs, expectations=None):
    response_text = _extract_response_text(outputs)
    # ...
```
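When no importable helper package exists in the scoring environment, an alternative is to keep the extraction logic fully self-contained so the scorer body has no external dependencies at all. A sketch under that assumption; the output shapes handled here (plain string, or a dict keyed by `content`/`text`/`response`) are illustrative guesses, not a fixed MLflow schema:

```python
def extract_response_text(outputs) -> str:
    """Self-contained extraction logic, suitable for inlining into a scorer body."""
    # Plain string output: return it directly
    if isinstance(outputs, str):
        return outputs
    # Dict output: try the common text-bearing keys in order
    if isinstance(outputs, dict):
        for key in ("content", "text", "response"):
            value = outputs.get(key)
            if isinstance(value, str):
                return value
    # Fall back to the string form so the scorer never raises on odd shapes
    return str(outputs)

print(extract_response_text({"content": "All good."}))  # All good.
```

Defining (or importing) the helper inside the scorer keeps it serialization-safe either way; the fallback branch just guarantees heuristic checks always receive a string.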
## Validation Checklist

Before deploying production monitoring:

### Registered Scorers

- ✅ Immutable pattern used (`scorer_instance = scorer.start()`)
- ✅ Sampling rates configured appropriately
- ✅ Scorer lifecycle managed (register/start/stop/delete)
- ✅ External imports inside scorer functions

### On-Demand Assessment

- ✅ `assess()` function implemented for periodic evaluation
- ✅ Trace querying patterns implemented
- ✅ Results logged to experiments

### Trace Archival

- ✅ Trace archival enabled
- ✅ Delta table schema configured
- ✅ Query patterns for archived traces

### Metric Backfill

- ✅ Backfill workflow implemented (if needed)
- ✅ Historical analysis patterns defined
## Reference Files

- `references/registered-scorers.md` - Complete registered scorer patterns with lifecycle management
- `references/trace-archival.md` - Trace archival setup and querying patterns
- `references/metric-backfill.md` - Backfill patterns for historical metrics
- `references/monitoring-dashboard-queries.md` - SQL queries for monitoring dashboards
- `scripts/register_production_scorers.py` - Complete script to register all production scorers
## Related Skills

- `mlflow-genai-evaluation` - Development evaluation patterns
- `deployment-automation` - Deployment workflows
- `responses-agent-patterns` - ResponsesAgent implementation
## Version History

| Date | Changes |
|---|---|
| Feb 6, 2026 | Initial version: Production monitoring with immutable scorer pattern |