# Databricks Reference Architecture

## Overview
A production-ready lakehouse architecture built on Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, data governance, and CI/CD integration for enterprise data platforms.
## Prerequisites
- Databricks workspace with Unity Catalog enabled
- Understanding of medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Asset Bundles for infrastructure
## Architecture Diagram

```
┌──────────────────────────────────────────────────────────────┐
│                        Unity Catalog                         │
│   ┌──────────┐      ┌──────────┐      ┌──────────┐           │
│   │  Bronze  │      │  Silver  │      │   Gold   │           │
│   │ Catalog  │─────▶│ Catalog  │─────▶│ Catalog  │           │
│   │  (raw)   │      │ (clean)  │      │ (curated)│           │
│   └──────────┘      └──────────┘      └──────────┘           │
│        ▲                  │                                  │
│        │                  ▼                                  │
│   ┌──────────┐      ┌──────────────┐                         │
│   │ Ingestion│      │  ML Models   │                         │
│   │   Jobs   │      │  (MLflow)    │                         │
│   └──────────┘      └──────────────┘                         │
├──────────────────────────────────────────────────────────────┤
│  Compute: Job Clusters │ SQL Warehouses │ Interactive        │
├──────────────────────────────────────────────────────────────┤
│  Governance: Row-Level Security │ Column Masking │ Lineage   │
└──────────────────────────────────────────────────────────────┘
```
## Project Structure

```
databricks-project/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py
│   │   ├── bronze_api_data.py
│   │   └── bronze_file_uploads.py
│   ├── transformation/
│   │   ├── silver_clean_events.py
│   │   ├── silver_deduplicate.py
│   │   └── silver_schema_enforce.py
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py
│   │   ├── gold_user_features.py
│   │   └── gold_ml_features.py
│   └── ml/
│       ├── training/
│       └── inference/
├── tests/
│   ├── unit/
│   └── integration/
├── databricks.yml        # Asset Bundle config
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── conf/
    ├── dev.yml
    ├── staging.yml
    └── prod.yml
```
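The three-level Unity Catalog namespace (`catalog.schema.table`) recurs across every script in this layout. A small helper like the following (hypothetical, not part of the skill itself) keeps environment targeting in one place instead of hard-coding `prod_catalog.*` in each job:

```python
def fq_table(env: str, layer: str, table: str) -> str:
    """Build a Unity Catalog three-level name: <env>_catalog.<layer>.<table>.

    `env` and the layer set mirror the conf/ files and schemas used in this
    skill; adjust both to match your own naming convention.
    """
    layers = {"bronze", "silver", "gold", "ml_features"}
    if layer not in layers:
        raise ValueError(f"unknown layer: {layer}")
    return f"{env}_catalog.{layer}.{table}"

print(fq_table("prod", "bronze", "raw_events"))  # prod_catalog.bronze.raw_events
```

Jobs can then read the environment from their target's config file and build table names with `fq_table(env, "bronze", "raw_events")` rather than string literals.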
## Instructions

### Step 1: Configure Unity Catalog Hierarchy

```sql
-- Create catalog per environment
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Create schemas per medallion layer
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;

-- Grant permissions
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `analysts`;
```
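In practice you would drive the DDL above from configuration so dev, staging, and prod stay in sync. As a minimal sketch (plain Python, no Databricks dependency), a generator for the per-layer `CREATE SCHEMA` statements might look like:

```python
def schema_ddl(catalog: str, layers: list[str]) -> list[str]:
    """Emit one CREATE SCHEMA statement per medallion layer in a catalog."""
    return [f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer};" for layer in layers]

# Generate the same statements for every environment from one layer list
layers = ["bronze", "silver", "gold", "ml_features"]
for catalog in ["dev_catalog", "prod_catalog"]:
    for stmt in schema_ddl(catalog, layers):
        print(stmt)
```

On Databricks each emitted statement would be run via `spark.sql(stmt)`; the point is that the layer list lives in one place.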
### Step 2: Define Asset Bundle Configuration

```yaml
# databricks.yml
bundle:
  name: data-platform

workspace:
  host: https://your-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/ingestion/bronze_raw_events.py
          job_cluster_key: etl_cluster
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/transformation/silver_clean_events.py
          job_cluster_key: etl_cluster
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            # num_workers and autoscale are mutually exclusive; autoscale is used here
            autoscale:
              min_workers: 1
              max_workers: 4

targets:
  dev:
    default: true
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com
```
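`databricks bundle validate` catches schema errors, but dependency mistakes between tasks (a `depends_on` pointing at a misspelled `task_key`, or an accidental cycle) are easy to introduce as the job grows. A quick pre-deploy sanity check in plain Python, operating on the task list as a list of dicts (the names below mirror the bundle above; nothing here calls a Databricks API), might look like:

```python
def validate_tasks(tasks: list[dict]) -> list[str]:
    """Check that every depends_on references a defined task_key and the
    dependency graph is acyclic; return tasks in a valid execution order."""
    keys = {t["task_key"] for t in tasks}
    deps = {t["task_key"]: [d["task_key"] for d in t.get("depends_on", [])]
            for t in tasks}
    for key, ds in deps.items():
        for d in ds:
            if d not in keys:
                raise ValueError(f"{key} depends on undefined task {d}")
    # Kahn's algorithm: if no task is ever ready, there is a cycle
    ordered, remaining = [], dict(deps)
    while remaining:
        ready = [k for k, ds in remaining.items()
                 if all(d not in remaining for d in ds)]
        if not ready:
            raise ValueError("cycle detected in task dependencies")
        for k in ready:
            ordered.append(k)
            del remaining[k]
    return ordered

tasks = [
    {"task_key": "bronze_ingest"},
    {"task_key": "silver_transform",
     "depends_on": [{"task_key": "bronze_ingest"}]},
]
print(validate_tasks(tasks))  # ['bronze_ingest', 'silver_transform']
```

Loading the task list from `databricks.yml` with a YAML parser and running this in CI gives fast feedback before a bundle deploy.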
### Step 3: Implement Medallion Pipeline Pattern

```python
# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion with audit metadata, via Auto Loader
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/schema")
    .load("/data/raw/events/")
)

# input_file_name() is not supported on Unity Catalog;
# use the _metadata column to capture the source file path
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))
)

(bronze_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events")
    .toTable("prod_catalog.bronze.raw_events"))
```
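The silver layer (`silver_deduplicate.py` in the project structure) typically keeps the latest record per business key. In PySpark that is a window function or `dropDuplicates`; the core rule can be sketched in plain Python (the `event_id` and `_ingested_at` field names are illustrative, matching the bronze metadata column added above):

```python
def deduplicate(records: list[dict], key: str = "event_id",
                ts: str = "_ingested_at") -> list[dict]:
    """Keep the most recently ingested record per key (latest timestamp wins)."""
    latest: dict = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return sorted(latest.values(), key=lambda r: r[key])

records = [
    {"event_id": "a", "_ingested_at": 1, "value": "stale"},
    {"event_id": "a", "_ingested_at": 2, "value": "fresh"},
    {"event_id": "b", "_ingested_at": 1, "value": "only"},
]
print(deduplicate(records))  # keeps the "fresh" record for "a", plus "b"
```

The same latest-wins rule, expressed as a Spark window ordered by `_ingested_at` descending with `row_number() == 1`, scales this to the streaming silver job.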
## Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Schema evolution failure | New columns in source | Enable the `mergeSchema` option |
| Job cluster timeout | Long-running tasks | Use autoscaling; raise the task timeout |
| Permission denied | Missing catalog grants | Check Unity Catalog ACLs |
| Delta version conflict | Concurrent writes | Retry with backoff; prefer `MERGE` with a narrow match condition over blind `INSERT` |
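For the concurrent-write row, retry with exponential backoff is the usual pattern: Delta conflicts (e.g. `ConcurrentAppendException`) are transient, so re-reading and re-attempting the commit normally succeeds. A minimal sketch, with a simulated flaky write standing in for the real Delta operation:

```python
import time

def with_retries(fn, retries: int = 3, base_delay: float = 0.01):
    """Call fn, retrying on failure with exponential backoff (base_delay * 2^attempt)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky_write():
    """Simulated Delta commit that conflicts twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("ConcurrentAppendException (simulated)")
    return "committed"

print(with_retries(flaky_write))  # committed
```

In a real job the `except` clause would catch the specific Delta concurrency exceptions rather than bare `RuntimeError`.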
## Examples

### Quick Medallion Validation

```python
# Validate data flows through medallion layers
for layer in ["bronze", "silver", "gold"]:
    count = spark.table(f"prod_catalog.{layer}.events").count()
    print(f"{layer}: {count:,} rows")
```
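Beyond printing counts, the same loop can assert invariants: no layer should be empty, and silver should never hold more rows than bronze, since cleaning and deduplication only drop rows (gold may legitimately be much smaller, as it aggregates). A plain-Python check over already-collected counts, as a sketch:

```python
def check_layer_counts(counts: dict[str, int]) -> list[str]:
    """Flag suspicious medallion row counts: an empty layer, or silver
    exceeding bronze (silver should only remove rows, never add them)."""
    problems = []
    for layer, n in counts.items():
        if n == 0:
            problems.append(f"{layer} is empty")
    if counts.get("silver", 0) > counts.get("bronze", 0):
        problems.append("silver has more rows than bronze")
    return problems

counts = {"bronze": 1000, "silver": 950, "gold": 30}
print(check_layer_counts(counts))  # [] — nothing suspicious
```

Feeding this from `spark.table(...).count()` per layer turns the quick validation into a pass/fail gate suitable for an integration test.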
## Resources

## Output

- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale