anomaly-detection by databricks-solutions/vibe-coding-workshop-template

Overview

Anomaly Detection is a schema-level monitoring capability that automatically assesses data quality by evaluating freshness (is the table up-to-date?) and completeness (did we get the expected number of rows?). It uses ML models built from historical patterns — no custom metric definitions needed.

API Status: Public Preview (POST /api/data-quality/v1/monitors)

SDK Module: databricks.sdk.service.dataquality

When to Use This Skill vs. Lakehouse Monitoring

Question	Use This Skill	Use Lakehouse Monitoring
"Did my pipeline break?"	Yes — detects stale/incomplete tables	No
"Is my revenue trending correctly?"	No	Yes — custom AGGREGATE/DERIVED/DRIFT metrics
"Which tables are unhealthy?"	Yes — schema-wide scan	No (per-table only)
"What's the average transaction value?"	No	Yes — custom business KPIs
"How quickly can I set up monitoring?"	Yes — minutes (enable on schema)	2+ hours (custom metrics per table)
"Do I need ML model monitoring?"	No	Yes — `InferenceLogConfig` support

Use both together for comprehensive monitoring: anomaly detection for baseline reliability + custom metrics for business KPIs.

Core Concepts

Freshness

Analyzes commit history to build a per-table model predicting next commit time. If a commit is unusually late, the table is marked stale.

Completeness

Analyzes historical row counts to predict expected rows in the last 24 hours. If actual rows fall below the lower bound, the table is marked incomplete.

Intelligent Scanning

Automatically aligns scan frequency with table update cadence. High-impact tables (by popularity and downstream usage) are prioritized; less critical tables are scanned less frequently.

Results Storage

All results are stored in the system table:

system.data_quality_monitoring.table_results

Access: Only account admins by default. Grant access to others as needed.

Quick Start

Enable via UI (Simplest)

Navigate to schema in Catalog Explorer
Click Details tab
Click Enable under Data Quality Monitoring
Click Save

Enable via SDK

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import Monitor, AnomalyDetectionConfig

w = WorkspaceClient()

# Get schema UUID (required — NOT the three-level name)
schema_info = w.schemas.get(full_name=f"{catalog}.{schema}")
schema_id = schema_info.schema_id

# Enable anomaly detection
monitor = w.data_quality.create_monitor(
    monitor=Monitor(
        object_type="schema",
        object_id=schema_id,
        anomaly_detection_config=AnomalyDetectionConfig(
            excluded_table_full_names=[
                f"{catalog}.{schema}.staging_table",  # Optional: tables to skip
            ]
        )
    )
)

Disable via SDK

w.data_quality.delete_monitor(
    object_type="schema",
    object_id=schema_id
)

Warning: Disabling deletes ALL anomaly detection tables and information. This cannot be undone.

Critical Rules

Rule 1: create_monitor() Takes a Monitor Object with UUID

# ✅ CORRECT: Wrap in Monitor object, use schema UUID
schema_info = w.schemas.get(full_name=f"{catalog}.{schema}")
w.data_quality.create_monitor(
    monitor=Monitor(object_type="schema", object_id=schema_info.schema_id, ...)
)

# ❌ WRONG: Three-level name, keyword args instead of Monitor object
w.data_quality.create_monitor(object_id=f"{catalog}.{schema}", ...)

Rule 2: Permissions Required

MANAGE SCHEMA or MANAGE CATALOG privileges on the target schema
To query results: SELECT on system.data_quality_monitoring.table_results

Rule 3: System Table Access Is Restricted

The results table contains data from ALL catalogs in the metastore. Use caution when granting access — it includes sample values from every monitored table.

Rule 4: Anomaly Detection Does NOT Monitor Views

Only tables are scanned. Views are silently skipped.

Rule 5: Completeness Only Checks Row Counts

Completeness does NOT evaluate null fractions, zero values, or NaN. For column-level quality monitoring, use the lakehouse-monitoring-comprehensive skill with custom metrics.

Configuration Options

Exclude Specific Tables

AnomalyDetectionConfig(
    excluded_table_full_names=[
        f"{catalog}.{schema}.staging_raw",
        f"{catalog}.{schema}.temp_processing",
    ]
)

Advanced Configuration (via REST API)

For freshness/completeness threshold overrides and table-specific settings, see references/configuration-guide.md.

Querying Results

-- Find all unhealthy tables in a schema
SELECT
  table_name,
  status,
  freshness.status as freshness_status,
  completeness.status as completeness_status,
  downstream_impact.impact_level,
  downstream_impact.num_downstream_tables
FROM system.data_quality_monitoring.table_results
WHERE catalog_name = '{catalog}'
  AND schema_name = '{schema}'
  AND status = 'Unhealthy'
ORDER BY event_time DESC;

For the complete results schema and query patterns, see references/results-schema.md.

Setting Up Alerts

-- Alert on unhealthy tables with downstream impact
SELECT
  CONCAT(catalog_name, '.', schema_name, '.', table_name) AS full_table_name,
  status,
  downstream_impact.num_queries_on_affected_tables AS impacted_queries
FROM system.data_quality_monitoring.table_results
WHERE event_time >= current_timestamp() - INTERVAL 6 HOURS
  AND status = 'Unhealthy'
  AND downstream_impact.num_queries_on_affected_tables > :min_tables_affected;

For complete alert templates including custom email notifications, see references/alert-patterns.md.

Integration with Silver & Gold Layer Setup

This skill is automatically invoked by silver-layer-setup and gold-layer-setup orchestrators. Every new Silver or Gold schema should have anomaly detection enabled to provide baseline freshness/completeness monitoring from day one.

Reusable Enable Function

Use this pattern in setup scripts to enable anomaly detection as part of schema creation:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dataquality import Monitor, AnomalyDetectionConfig


def enable_anomaly_detection_on_schema(
    w: WorkspaceClient,
    catalog: str,
    schema: str,
    excluded_tables: list = None,
):
    """
    Enable anomaly detection on a schema. Safe to call if already enabled.

    Called by Silver and Gold layer setup workflows to ensure every schema
    has baseline freshness/completeness monitoring.

    Args:
        w: WorkspaceClient
        catalog: Catalog name
        schema: Schema name
        excluded_tables: Optional list of table names to exclude
                         (e.g., staging, quarantine, temp tables)
    """
    schema_info = w.schemas.get(full_name=f"{catalog}.{schema}")
    schema_id = schema_info.schema_id

    excluded_full_names = None
    if excluded_tables:
        excluded_full_names = [
            f"{catalog}.{schema}.{t}" for t in excluded_tables
        ]

    try:
        monitor = w.data_quality.create_monitor(
            monitor=Monitor(
                object_type="schema",
                object_id=schema_id,
                anomaly_detection_config=AnomalyDetectionConfig(
                    excluded_table_full_names=excluded_full_names
                )
            )
        )
        print(f"✓ Anomaly detection enabled on {catalog}.{schema}")
        return monitor
    except Exception as e:
        if "already exists" in str(e).lower():
            print(f"✓ Anomaly detection already enabled on {catalog}.{schema} (skipping)")
        else:
            print(f"⚠️ Could not enable anomaly detection on {catalog}.{schema}: {e}")
            print("  (Non-blocking — schema setup continues)")

When Invoked by Silver Layer Setup

# After Silver schema creation and DLT pipeline deployment
enable_anomaly_detection_on_schema(
    w, catalog, silver_schema,
    excluded_tables=["dq_rules"]  # Exclude metadata tables
)

When Invoked by Gold Layer Setup

# After Gold tables are created via setup_tables.py
enable_anomaly_detection_on_schema(
    w, catalog, gold_schema,
    excluded_tables=None  # Monitor all Gold tables
)

Why Enable at Schema Setup Time?

Freshness baselines start building immediately — ML models need historical commit patterns
Completeness models train earlier — more scan cycles = more accurate predictions
Zero cost to enable — no custom metrics needed, runs automatically
Catches pipeline breaks early — stale/incomplete tables detected before anyone queries them
Complements custom metrics — anomaly detection for reliability, Lakehouse Monitoring for business KPIs

Workflow

Phase	Duration	Activities
Phase 1: Enable	5 min	Enable anomaly detection on schema via UI or SDK
Phase 2: Wait	15-30 min	First scan runs automatically
Phase 3: Review	10 min	Check results in Catalog Explorer or system table
Phase 4: Alerts	15 min	Set up SQL alerts on system table

Reference Files

configuration-guide.md

Complete configuration reference including:

REST API endpoint and request schema
SDK-based enable/disable patterns
excluded_table_full_names configuration
Advanced threshold overrides (freshness, completeness)
Table skip/scan lists
Event timestamp column configuration

results-schema.md

System results table documentation including:

Full schema of system.data_quality_monitoring.table_results
Nested struct documentation (freshness, completeness, downstream_impact, root_cause_analysis)
Query patterns for common scenarios
Metastore-level dashboard template reference

alert-patterns.md

SQL alert configuration including:

Alert query template with downstream impact filtering
Custom email notification template
Alert configuration steps
Integration with Databricks SQL Alerts

Scripts

enable_anomaly_detection.py

SDK-based anomaly detection management including:

Enable on schema with optional table exclusions
Disable (with warning about data deletion)
List monitored schemas
Check monitor status

query_results.py

Results querying utilities including:

Query unhealthy tables
Extract root cause analysis
Filter by freshness/completeness status
Generate summary reports

Assets

alert-query-template.sql

Ready-to-use SQL query for Databricks SQL Alerts with configurable thresholds and custom email template.

Limitations

Does NOT support views (only tables)
Completeness does not evaluate null fractions, zero values, or NaN
System table access is restricted to account admins by default
Disabling deletes all monitoring data (irreversible)
Event freshness (based on event time columns) is not supported in the current version

Troubleshooting

Monitor Not Detecting Anomalies

Check permissions: Ensure MANAGE SCHEMA or MANAGE CATALOG
Wait for scan: First scan may take 15-30 minutes
Check table activity: Tables that rarely change may be classified as "static"

Cannot Access Results Table

Default access: Only account admins can read system.data_quality_monitoring.table_results
Grant access: Account admin must grant SELECT on the system table

Tables Showing as "Training"

Normal behavior: New tables need historical data to build ML models
Wait: Models improve after several scan cycles

References

Summary

Anomaly Detection provides automated, schema-level monitoring for table freshness and completeness. It complements the lakehouse-monitoring-comprehensive skill's custom business metrics with baseline data reliability monitoring. Enable it in minutes via UI or SDK, query results from the system table, and set up SQL alerts for proactive notification.