---
name: observability-setup
description: End-to-end workflow for setting up Databricks observability — Lakehouse Monitoring, Anomaly Detection, AI/BI Dashboards, and SQL Alerts — on top of a completed Gold layer and semantic layer.
---

# Observability Setup Orchestrator

**Predecessor**: `semantic-layer-setup` skill (the semantic layer should be complete; Gold tables are the minimum requirement)

**Time Estimate**: 3-5 hours for initial setup, ~30 minutes per additional table/dashboard
## What You'll Create
- Lakehouse Monitors — data quality, drift, and custom business KPIs for Gold tables
- Anomaly Detection — schema-level freshness and completeness monitoring via the Data Quality API
- AI/BI Dashboards — Lakeview dashboards with monitoring widgets and business metrics
- SQL Alerts — config-driven alerting with severity-based routing
## Decision Tree

| Question | Action |
|---|---|
| Setting up observability end-to-end? | Use this skill — it orchestrates everything |
| Only need Lakehouse Monitoring? | Read `monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md` directly |
| Only need Anomaly Detection? | Read `monitoring/04-anomaly-detection/SKILL.md` directly |
| Only need AI/BI Dashboards? | Read `monitoring/02-databricks-aibi-dashboards/SKILL.md` directly |
| Only need SQL Alerts? | Read `monitoring/03-sql-alerting-patterns/SKILL.md` directly |
## Mandatory Skill Dependencies

**CRITICAL**: Before generating ANY code for observability, you MUST read and follow the patterns in these common skills. Do NOT generate these patterns from memory.

| Phase | MUST Read Skill (use Read tool on SKILL.md) | What It Provides |
|---|---|---|
| All phases | `common/databricks-expert-agent` | Core extraction principle: extract names from source, never hardcode |
| Monitor scripts | `common/databricks-python-imports` | Pure Python module patterns for helpers |
| Job deployment | `common/databricks-asset-bundles` | Job YAML, deployment patterns |
| Troubleshooting | `common/databricks-autonomous-operations` | Deploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail |
## Monitoring-Domain Dependencies

| Skill | Requirement | What It Provides |
|---|---|---|
| `monitoring/01-lakehouse-monitoring-comprehensive` | MUST read at Phase 1 | Monitor setup, custom metrics, graceful degradation |
| `monitoring/04-anomaly-detection` | MUST read at Phase 2 | Schema-level freshness/completeness via Data Quality API |
| `monitoring/02-databricks-aibi-dashboards` | MUST read at Phase 3 | Dashboard JSON, widget patterns, deployment |
| `monitoring/03-sql-alerting-patterns` | MUST read at Phase 4 | Config-driven alerts, SDK deployment, severity routing |
## 🔴 Non-Negotiable Defaults

| Default | Value | Applied Where | NEVER Do This Instead |
|---|---|---|---|
| Manifest required | `plans/manifests/observability-manifest.yaml` | Phase 0 — before any implementation | ❌ NEVER create artifacts via self-discovery; STOP if manifest is missing |
| Monitor type | `MonitorTimeSeries` or `MonitorSnapshot` | Every Lakehouse Monitor | ❌ NEVER skip monitor type selection |
| Custom metrics | `input_columns=[":table"]` for table-level KPIs | Every custom business metric | ❌ NEVER use column-level when table-level is needed |
| Anomaly detection ID | Schema UUID from `w.schemas.get()` | Every `create_monitor()` call | ❌ NEVER pass three-level name as `object_id` |
| Anomaly detection object | Wrap in `Monitor(object_type="schema", ...)` | Every `create_monitor()` call | ❌ NEVER pass keyword args directly |
| Dashboard deployment | Lakeview JSON with `dataset_catalog`/`dataset_schema` | Every dashboard | ❌ NEVER hardcode catalog/schema in dashboard queries |
| Alert queries | Fully qualified table names (no parameters) | Every SQL alert query | ❌ NEVER use parameterized table names in alerts |
| Serverless | `environments:` block with `environment_key` | Every monitoring job | ❌ NEVER define `job_clusters:` |
## Working Memory Management

This orchestrator spans five phases (Phases 0-4). To maintain coherence without context pollution, persist a brief summary note after each phase capturing:
- Phase 0 output: Manifest loaded, planning_mode (acceleration/workshop), artifact counts (monitors, anomaly detection schemas, dashboards, alerts), validation passed
- Phase 1 output: Monitor names and IDs, tables monitored, custom metrics config, any creation failures
- Phase 2 output: Anomaly detection enabled (yes/no per schema), schema UUIDs, excluded tables, any permission errors
- Phase 3 output: Dashboard names and IDs, Lakeview JSON paths, dataset_catalog/dataset_schema settings
- Phase 4 output: Alert names and configurations, notification destinations, query SQL paths
What to keep in working memory: Only the current phase's worker skill, the artifact inventory (from Phase 0), and the previous phase's summary note. Discard intermediate outputs (full JSON payloads, raw monitoring metrics) — they are on disk and reproducible.
## Phased Implementation Workflow

### Phase 0: Read Plan — MANDATORY (5 minutes)

**The observability manifest is REQUIRED. Do NOT proceed without it.**

This orchestrator implements exactly what the project plan defined — no more, no less. The manifest `plans/manifests/observability-manifest.yaml` is generated by the `planning/00-project-planning` skill (stage 5) and serves as the implementation contract.

🔴 If the manifest does not exist, STOP and tell the user:

> "The observability manifest (`plans/manifests/observability-manifest.yaml`) is missing. This orchestrator requires a project plan to define which monitors, dashboards, alerts, and anomaly detection schemas to create. Please run the `planning/00-project-planning` skill first (stage 5), then return here."
```python
import yaml
from pathlib import Path

manifest_path = Path("plans/manifests/observability-manifest.yaml")
if not manifest_path.exists():
    raise FileNotFoundError(
        "REQUIRED: plans/manifests/observability-manifest.yaml not found. "
        "Run planning/00-project-planning (stage 5) first to generate the "
        "observability manifest, then re-run this orchestrator."
    )

with open(manifest_path) as f:
    manifest = yaml.safe_load(f)

# Respect planning mode — workshop mode means strict artifact caps
planning_mode = manifest.get('planning_mode', 'acceleration')
if planning_mode == 'workshop':
    print("⚠️ Workshop mode active — creating ONLY the artifacts listed in the manifest")

# Extract implementation checklist from manifest
monitors = manifest.get('lakehouse_monitors', [])
anomaly_detection = manifest.get('anomaly_detection', {})
dashboards = manifest.get('dashboards', [])
alerts = manifest.get('alerts', [])

print(f"Plan: {len(monitors)} monitors, "
      f"{len(anomaly_detection.get('schemas', []))} anomaly detection schemas, "
      f"{len(dashboards)} dashboards, {len(alerts)} alerts")

# Validate summary counts match actual artifact counts
summary = manifest.get('summary', {})
assert len(monitors) == int(summary.get('total_monitors', len(monitors))), \
    f"Monitor count mismatch: {len(monitors)} actual vs {summary.get('total_monitors')} in summary"
assert len(dashboards) == int(summary.get('total_dashboards', len(dashboards))), \
    f"Dashboard count mismatch: {len(dashboards)} actual vs {summary.get('total_dashboards')} in summary"
assert len(alerts) == int(summary.get('total_alerts', len(alerts))), \
    f"Alert count mismatch: {len(alerts)} actual vs {summary.get('total_alerts')} in summary"
```
**What the manifest provides**:

- `lakehouse_monitors[]` — table name, monitor type, custom metrics, slicing expressions
- `anomaly_detection` — schemas to enable, tables to exclude per schema
- `dashboards[]` — name, domain, pages, widgets with query sources
- `alerts[]` — alert ID, severity, query SQL, threshold, schedule, notification destination
- `summary` — expected artifact counts for validation
- `planning_mode` — `acceleration` (full) or `workshop` (capped artifacts; do NOT expand via self-discovery)

**Key principle**: Create ONLY the artifacts listed in the manifest. Do NOT add monitors, dashboards, or alerts beyond what the plan specified. If the plan missed something, update the plan first — then re-run this orchestrator.
### Phase 1: Lakehouse Monitoring (1-2 hours)

**MANDATORY**: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
|---|---|---|
| 1 | `data_product_accelerator/skills/common/databricks-expert-agent/SKILL.md` | Extract-don't-generate principle |
| 2 | `data_product_accelerator/skills/monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md` | Monitor setup, custom metrics |
**Input**: `manifest['lakehouse_monitors']` — iterate over the manifest's monitor list. Each entry defines `table_name`, `monitor_type`, `custom_metrics`, and `slicing_exprs`. Do NOT add monitors for tables not listed in the manifest.
**Steps**:

- Read `manifest['lakehouse_monitors']` — this is your complete list of tables to monitor
- For each entry, use the manifest's `monitor_type` (TimeSeries or Snapshot) — do not override
- Define custom business metrics from the manifest's `custom_metrics` array using `input_columns=[":table"]` for table-level KPIs
- Create a monitor setup script with graceful degradation (delete-then-create pattern; see the sketch after this list)
- Deploy monitors and verify metric tables are populated
- Track completion: check off each manifest entry as its monitor is confirmed active
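A minimal sketch of the delete-then-create pattern, assuming the SDK's `quality_monitors` API. The table name, timestamp column, output schema, and metric definition below are hypothetical stand-ins for manifest values, and exact field encodings (e.g., `output_data_type`) vary by SDK version, so defer to the worker skill:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.catalog import (
    MonitorMetric,
    MonitorMetricType,
    MonitorTimeSeries,
)

w = WorkspaceClient()

# Hypothetical values for illustration; in practice these come from
# manifest['lakehouse_monitors'] entries.
table_name = "main.gold.fact_orders"

# Graceful degradation: drop any existing monitor before recreating it.
try:
    w.quality_monitors.delete(table_name=table_name)
except NotFound:
    pass  # No existing monitor; nothing to clean up

w.quality_monitors.create(
    table_name=table_name,
    assets_dir=f"/Workspace/Shared/monitoring/{table_name}",
    output_schema_name="main.gold_monitoring",
    time_series=MonitorTimeSeries(
        timestamp_col="order_ts",      # From the manifest entry
        granularities=["1 day"],
    ),
    custom_metrics=[
        MonitorMetric(
            name="total_revenue",
            definition="SUM(amount)",  # From the manifest's custom_metrics
            input_columns=[":table"],  # Table-level KPI, per the defaults table
            output_data_type="double",
            type=MonitorMetricType.CUSTOM_METRIC_TYPE_AGGREGATE,
        )
    ],
)
```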
### Phase 2: Anomaly Detection (15-30 minutes)

**MANDATORY**: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
|---|---|---|
| 1 | `data_product_accelerator/skills/monitoring/04-anomaly-detection/SKILL.md` | Schema-level freshness/completeness via Data Quality API |
**Input**: `manifest['anomaly_detection']` — the manifest defines which schemas to enable and which tables to exclude per schema. If the manifest's `anomaly_detection` section is empty or absent, skip this phase entirely.
Why Phase 2 (right after Lakehouse Monitors): Enable anomaly detection early so ML models start training on historical commit and row-count patterns. By the time you finish Phases 3-4, the first scan results will be available for dashboards and alerts.
**Steps**:

- Read `manifest['anomaly_detection']['schemas']` — this is your complete list of schemas to enable
- For each schema entry, retrieve the schema UUID via `w.schemas.get(full_name=f"{catalog}.{schema}")`
- Enable anomaly detection using `w.data_quality.create_monitor()` with a `Monitor` object:

  ```python
  from databricks.sdk.service.dataquality import Monitor, AnomalyDetectionConfig

  schema_info = w.schemas.get(full_name=f"{catalog}.{schema}")
  w.data_quality.create_monitor(
      monitor=Monitor(
          object_type="schema",
          object_id=schema_info.schema_id,
          anomaly_detection_config=AnomalyDetectionConfig(
              excluded_table_full_names=[
                  f"{catalog}.{schema}.staging_table",  # Optional: skip staging/temp tables
              ]
          )
      )
  )
  ```

- Exclude tables listed in the manifest entry's `excluded_tables` via `excluded_table_full_names`
- Handle idempotency — if already enabled, catch "already exists" and skip gracefully (see the sketch after this list)
- Verify enablement via Catalog Explorer (schema Details tab) or `w.data_quality.get_monitor()`
- Track completion: check off each manifest schema entry as its anomaly detection is confirmed active
- Note: the first scan takes 15-30 minutes; results appear in `system.data_quality_monitoring.table_results`
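A minimal idempotency sketch under the same assumptions; the exact exception type for an already-enabled schema may vary by SDK version, so this matches on the error message:

```python
from databricks.sdk.errors import DatabricksError

try:
    w.data_quality.create_monitor(monitor=monitor)  # `monitor` built as in the step above
    print(f"Anomaly detection enabled for {catalog}.{schema}")
except DatabricksError as e:
    if "already exists" in str(e).lower():
        print(f"Anomaly detection already enabled for {catalog}.{schema}; skipping")
    else:
        raise  # Unexpected failure: surface it for the diagnose/fix loop
```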
### Phase 3: AI/BI Dashboards (1-2 hours)

**MANDATORY**: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
|---|---|---|
| 1 | `data_product_accelerator/skills/monitoring/02-databricks-aibi-dashboards/SKILL.md` | Dashboard JSON, widget patterns |
**Input**: `manifest['dashboards']` — iterate over the manifest's dashboard list. Each entry defines `name`, `domain`, `pages`, and `widgets`. Do NOT create dashboards not listed in the manifest.
**Steps**:

- Read `manifest['dashboards']` — this is your complete list of dashboards to create
- For each dashboard, use the manifest's page layout and widget definitions to design the Lakeview JSON
- Create queries using monitoring profile/drift tables as specified in each widget's `query_source`
- Build widget configurations with proper number formatting
- Set `dataset_catalog` and `dataset_schema` for environment portability
- Deploy the dashboard via Asset Bundle or API (see the sketch after this list)
- Track completion: check off each manifest entry as its dashboard is confirmed deployed
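A minimal API-deployment sketch, assuming the SDK's Lakeview API (the `create` signature varies by SDK version). The display name, parent path, warehouse ID, and catalog/schema values are hypothetical; the page and widget layout itself comes from the manifest and the worker skill's JSON conventions:

```python
import json

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.dashboards import Dashboard

w = WorkspaceClient()

# Lakeview JSON assembled from the manifest's pages/widgets definitions.
# dataset_catalog / dataset_schema keep queries environment-portable,
# per this skill's dashboard conventions.
dashboard_json = {
    "pages": [],                         # Built from manifest['dashboards'][i]['pages']
    "dataset_catalog": "main",           # Hypothetical; from bundle target variables
    "dataset_schema": "gold_monitoring",
}

created = w.lakeview.create(
    dashboard=Dashboard(
        display_name="Data Quality Overview",        # Hypothetical name
        parent_path="/Workspace/Shared/dashboards",  # Hypothetical path
        serialized_dashboard=json.dumps(dashboard_json),
        warehouse_id="<sql-warehouse-id>",           # From your workspace
    )
)
print(f"Dashboard created: {created.dashboard_id}")
```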
### Phase 4: SQL Alerts (1 hour)

**MANDATORY**: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
|---|---|---|
| 1 | `data_product_accelerator/skills/monitoring/03-sql-alerting-patterns/SKILL.md` | Config-driven alerts, SDK deployment |
| 2 | `data_product_accelerator/skills/common/databricks-asset-bundles/SKILL.md` | Job YAML for alert deployment |
**Input**: `manifest['alerts']` — iterate over the manifest's alert list. Each entry defines `alert_id`, `severity`, `query`, `threshold`, `schedule`, and `notification_destination`. Do NOT create alerts not listed in the manifest.
**Steps**:

- Read `manifest['alerts']` — this is your complete list of alerts to create
- Create the alert configuration table (Delta table-based, severity-driven) using the manifest's alert definitions
- Use each alert's `query` SQL, `threshold`, `operator`, and `schedule` exactly as specified in the manifest
- Deploy alerts via the Databricks SDK (V2 dict-based or typed classes; see the sketch after this list)
- Configure notification destinations per the manifest's `notification_destination` and `severity` fields
- Set up Quartz cron schedules from the manifest's `schedule` field
- Track completion: check off each manifest `alert_id` as its alert is confirmed deployed
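A minimal sketch of the typed-classes path with severity-based routing. The SDK's alert classes are nested and their exact shape varies by version, so treat every field name below as an assumption and defer to the `sql-alerting-patterns` skill; the destination IDs, warehouse ID, and result column are hypothetical:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql

w = WorkspaceClient()

# Severity-based routing: map manifest severity to a notification destination.
# Destination IDs are hypothetical; look yours up in workspace settings.
SEVERITY_ROUTES = {
    "critical": "pagerduty-destination-id",
    "warning": "email-destination-id",
}

for alert_cfg in manifest["alerts"]:
    # The alert query must use fully qualified table names (no parameters),
    # per the non-negotiable defaults above.
    query = w.queries.create(
        query=sql.CreateQueryRequestQuery(
            display_name=f"alert-query-{alert_cfg['alert_id']}",
            query_text=alert_cfg["query"],
            warehouse_id="<sql-warehouse-id>",  # From your workspace
        )
    )
    alert = w.alerts.create(
        alert=sql.CreateAlertRequestAlert(
            display_name=alert_cfg["alert_id"],
            query_id=query.id,
            condition=sql.AlertCondition(
                op=sql.AlertOperator.GREATER_THAN,  # Mapped from alert_cfg['operator']
                operand=sql.AlertConditionOperand(
                    column=sql.AlertOperandColumn(name="metric_value")  # Hypothetical column
                ),
                threshold=sql.AlertConditionThreshold(
                    value=sql.AlertOperandValue(
                        double_value=float(alert_cfg["threshold"])
                    )
                ),
            ),
        )
    )
    destination = SEVERITY_ROUTES[alert_cfg["severity"]]
    print(f"Created alert {alert.id}; route notifications via {destination}")
```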
## Post-Creation Validation

### Manifest Compliance (CRITICAL)

- `plans/manifests/observability-manifest.yaml` was read at Phase 0 before any implementation
- Every Lakehouse Monitor maps 1:1 to a `lakehouse_monitors[]` entry in the manifest
- Every anomaly detection schema maps 1:1 to an `anomaly_detection.schemas[]` entry in the manifest
- Every dashboard maps 1:1 to a `dashboards[]` entry in the manifest
- Every alert maps 1:1 to an `alerts[]` entry in the manifest
- No artifacts were created via self-discovery (only manifest-driven)
- If `planning_mode: workshop`, artifact counts do NOT exceed manifest totals
- Manifest `summary` counts match actual deployed artifact counts
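A minimal compliance sketch for the monitor mapping, assuming the `quality_monitors` API (a missing monitor raises `NotFound`); the same probe-every-entry pattern extends to dashboards and alerts:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

w = WorkspaceClient()

# Verify every manifest monitor entry has a live monitor (1:1 mapping).
missing = []
for entry in manifest["lakehouse_monitors"]:
    try:
        w.quality_monitors.get(table_name=entry["table_name"])
    except NotFound:
        missing.append(entry["table_name"])

assert not missing, f"Monitors missing for manifest entries: {missing}"
print(f"✅ All {len(manifest['lakehouse_monitors'])} manifest monitors are deployed")
```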
### Common Skill Compliance

- Names extracted from Gold YAML (not generated) per `databricks-expert-agent`
- Asset Bundle YAML follows `databricks-asset-bundles` patterns
- Python imports follow `databricks-python-imports` patterns
### Observability Specifics

- Lakehouse Monitors created for all critical Gold tables
- Custom business metrics use `input_columns=[":table"]` syntax
- Monitor setup uses graceful degradation (delete-then-create)
- Anomaly detection enabled on all Gold schemas (and Silver schemas if applicable)
- Anomaly detection uses the schema UUID (not the three-level name) as `object_id`
- Anomaly detection `create_monitor()` wrapped in a `Monitor` object
- Staging/temp tables excluded via `excluded_table_full_names`
- Anomaly detection enablement is idempotent (handles "already exists" gracefully)
- Dashboard uses `dataset_catalog`/`dataset_schema` for portability
- Dashboard widgets align with query columns
- Alert queries use fully qualified table names (no parameters)
- Alert severity routing configured (critical → PagerDuty, warning → email)
- All monitoring jobs use serverless compute
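To spot-check that anomaly detection scans are writing results (after the 15-30 minute first scan), a notebook sketch against the system table named in Phase 2; no specific result columns are assumed:

```python
# Notebook sketch: confirm the first anomaly scan wrote results.
results = spark.sql(
    "SELECT * FROM system.data_quality_monitoring.table_results LIMIT 20"
)
display(results)  # Databricks notebook built-in
```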
## Pipeline Progression

**Previous stage**: `semantic-layer-setup` → Metric Views, TVFs, and Genie Spaces should exist

**Next stage**: After completing observability, proceed to:

- `ml/00-ml-pipeline-setup` — Set up ML models, experiments, and batch inference
## Related Skills

| Skill | Relationship | Path |
|---|---|---|
| `lakehouse-monitoring-comprehensive` | Mandatory — Monitor setup | `monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md` |
| `anomaly-detection` | Mandatory — Schema-level freshness/completeness | `monitoring/04-anomaly-detection/SKILL.md` |
| `databricks-aibi-dashboards` | Mandatory — Dashboard patterns | `monitoring/02-databricks-aibi-dashboards/SKILL.md` |
| `sql-alerting-patterns` | Mandatory — Alert framework | `monitoring/03-sql-alerting-patterns/SKILL.md` |
| `databricks-expert-agent` | Mandatory — Extraction principle | `common/databricks-expert-agent/SKILL.md` |
| `databricks-asset-bundles` | Mandatory — Deployment | `common/databricks-asset-bundles/SKILL.md` |
| `databricks-python-imports` | Mandatory — Python patterns | `common/databricks-python-imports/SKILL.md` |
## Post-Completion: Skill Usage Summary (MANDATORY)

After completing all phases of this orchestrator, output a Skill Usage Summary reflecting what you ACTUALLY did — not a pre-written summary.

### What to Include

- Every skill `SKILL.md` or `references/` file you read (via the Read tool), in the order you read them
- Which phase you were in when you read it
- Whether it was a Worker, Common, Cross-domain, or Reference file
- A one-line description of what you specifically used it for in this session
### Format

| # | Phase | Skill / Reference Read | Type | What It Was Used For |
|---|---|---|---|---|
| 1 | Phase N | `path/to/SKILL.md` | Worker / Common / Cross-domain / Reference | One-line description |
### Summary Footer

End with:

- **Totals**: X worker skills, Y common skills, Z reference files read across N phases
- **Skipped**: List any skills from the dependency table above that you did NOT need to read, and why (e.g., "phase not applicable", "user skipped", "no issues encountered")
- **Unplanned**: List any skills you read that were NOT listed in the dependency table (e.g., for troubleshooting, edge cases, or user-requested detours)