
Databricks Reference Architecture

Overview

Production-ready lakehouse architecture with Unity Catalog, Delta Lake, and medallion pattern. Covers workspace organization, data governance, and CI/CD integration for enterprise data platforms.

Prerequisites

  • Databricks workspace with Unity Catalog enabled
  • Understanding of medallion architecture (bronze/silver/gold)
  • Databricks CLI configured
  • Terraform or Asset Bundles for infrastructure

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Unity Catalog                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ Bronze   │  │ Silver   │  │ Gold     │                   │
│  │ Catalog  │─▶│ Catalog  │─▶│ Catalog  │                   │
│  │ (raw)    │  │ (clean)  │  │ (curated)│                   │
│  └──────────┘  └──────────┘  └──────────┘                   │
│       ▲                            │                         │
│       │                            ▼                         │
│  ┌──────────┐              ┌──────────────┐                  │
│  │ Ingestion│              │ ML Models    │                  │
│  │ Jobs     │              │ (MLflow)     │                  │
│  └──────────┘              └──────────────┘                  │
├─────────────────────────────────────────────────────────────┤
│  Compute: Job Clusters │ SQL Warehouses │ Interactive       │
├─────────────────────────────────────────────────────────────┤
│  Governance: Row-Level Security │ Column Masking │ Lineage  │
└─────────────────────────────────────────────────────────────┘

Project Structure

databricks-project/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py
│   │   ├── bronze_api_data.py
│   │   └── bronze_file_uploads.py
│   ├── transformation/
│   │   ├── silver_clean_events.py
│   │   ├── silver_deduplicate.py
│   │   └── silver_schema_enforce.py
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py
│   │   ├── gold_user_features.py
│   │   └── gold_ml_features.py
│   └── ml/
│       ├── training/
│       └── inference/
├── tests/
│   ├── unit/
│   └── integration/
├── databricks.yml           # Asset Bundle config
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── conf/
    ├── dev.yml
    ├── staging.yml
    └── prod.yml

Instructions

Step 1: Configure Unity Catalog Hierarchy

-- Create catalog per environment
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Create schemas per medallion layer
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;

-- Grant permissions
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `analysts`;
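
The governance features shown in the diagram (row-level security, column masking) are also configured at this layer. A hedged sketch, assuming a gold table with `region` and `email` columns — the table, column, and function names below are illustrative, not part of the schemas created above:

```sql
-- Row filter: admins see everything, everyone else only US rows.
CREATE OR REPLACE FUNCTION prod_catalog.gold.region_filter(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('admins') OR region = 'US';

ALTER TABLE prod_catalog.gold.daily_metrics
  SET ROW FILTER prod_catalog.gold.region_filter ON (region);

-- Column mask: hide email addresses from non-admins.
CREATE OR REPLACE FUNCTION prod_catalog.gold.email_mask(email STRING)
RETURNS STRING
RETURN IF(is_account_group_member('admins'), email, '***');

ALTER TABLE prod_catalog.gold.daily_metrics
  ALTER COLUMN email SET MASK prod_catalog.gold.email_mask;
```

Filters and masks attach to the table, so they apply uniformly across SQL warehouses, notebooks, and jobs.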

Step 2: Define Asset Bundle Configuration

# databricks.yml
bundle:
  name: data-platform

workspace:
  host: https://your-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/ingestion/bronze_raw_events.py
          job_cluster_key: etl_cluster
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/transformation/silver_clean_events.py
          job_cluster_key: etl_cluster

      # job_clusters belongs inside the job definition, and autoscale
      # replaces a fixed num_workers (the two are mutually exclusive).
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            autoscale:
              min_workers: 1
              max_workers: 4

targets:
  dev:
    default: true
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com

Step 3: Implement Medallion Pipeline Pattern

# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion with metadata
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/schema")
    .load("/data/raw/events/")
)

bronze_df = raw_df.withColumn(
    "_ingested_at", current_timestamp()
).withColumn(
    "_source_file", input_file_name()
)

(bronze_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events")
    .toTable("prod_catalog.bronze.raw_events"))
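
Downstream, a task like silver_schema_enforce.py should verify incoming columns before writing. A plain-Python sketch of that check — the expected column set is a hypothetical contract, not derived from the pipeline above:

```python
# Hypothetical column contract for the silver events table.
EXPECTED_COLUMNS = {"event_id", "user_id", "event_type", "event_ts"}

def missing_columns(actual_columns):
    """Return expected columns that are absent from an incoming batch."""
    return sorted(EXPECTED_COLUMNS - set(actual_columns))

# In a notebook this would be called as missing_columns(df.columns);
# a non-empty result should fail the task or quarantine the batch.
```

Raising on a non-empty result surfaces the failure in the job run rather than silently writing a malformed silver table.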

Error Handling

Issue                    | Cause                  | Solution
Schema evolution failure | New columns in source  | Enable the mergeSchema option
Job cluster timeout      | Long-running tasks     | Use autoscaling; increase the timeout
Permission denied        | Missing catalog grants | Check Unity Catalog ACLs
Delta version conflict   | Concurrent writes      | Use MERGE instead of INSERT
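
For the concurrent-write case, an idempotent upsert looks roughly like this (the target table, source view, and key column are illustrative):

```sql
MERGE INTO prod_catalog.silver.events AS t
USING updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Because MERGE matches on a key, retrying the same batch does not duplicate rows the way a blind INSERT would.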

Examples

Quick Medallion Validation

# Validate row counts as data flows through the medallion layers
# (assumes each schema exposes an `events` table; adjust names to match yours)
for layer in ["bronze", "silver", "gold"]:
    count = spark.table(f"prod_catalog.{layer}.events").count()
    print(f"{layer}: {count:,} rows")

Resources

Output

  • Configuration files or code changes applied to the project
  • Validation report confirming correct implementation
  • Summary of changes made and their rationale