# Databricks Reference Architecture

## Overview
A production-ready lakehouse architecture built on Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, data governance, and CI/CD integration for enterprise data platforms.
## Prerequisites
- Databricks workspace with Unity Catalog enabled
- Understanding of medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Asset Bundles for infrastructure
## Architecture Diagram

```
┌──────────────────────────────────────────────────────────────┐
│                        Unity Catalog                         │
│   ┌──────────┐      ┌──────────┐      ┌──────────┐           │
│   │  Bronze  │      │  Silver  │      │   Gold   │           │
│   │ Catalog  │─────▶│ Catalog  │─────▶│ Catalog  │           │
│   │  (raw)   │      │ (clean)  │      │ (curated)│           │
│   └──────────┘      └──────────┘      └──────────┘           │
│        ▲                  │                                  │
│        │                  ▼                                  │
│   ┌──────────┐      ┌──────────────┐                         │
│   │ Ingestion│      │  ML Models   │                         │
│   │   Jobs   │      │  (MLflow)    │                         │
│   └──────────┘      └──────────────┘                         │
├──────────────────────────────────────────────────────────────┤
│  Compute: Job Clusters │ SQL Warehouses │ Interactive        │
├──────────────────────────────────────────────────────────────┤
│  Governance: Row-Level Security │ Column Masking │ Lineage   │
└──────────────────────────────────────────────────────────────┘
```
## Project Structure

```
databricks-project/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py
│   │   ├── bronze_api_data.py
│   │   └── bronze_file_uploads.py
│   ├── transformation/
│   │   ├── silver_clean_events.py
│   │   ├── silver_deduplicate.py
│   │   └── silver_schema_enforce.py
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py
│   │   ├── gold_user_features.py
│   │   └── gold_ml_features.py
│   └── ml/
│       ├── training/
│       └── inference/
├── tests/
│   ├── unit/
│   └── integration/
├── databricks.yml        # Asset Bundle config
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── conf/
    ├── dev.yml
    ├── staging.yml
    └── prod.yml
```
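The three-level Unity Catalog namespace (`catalog.schema.table`) recurs across every script in this layout. A small helper like the following (hypothetical, not part of the skill itself) keeps environment targeting in one place instead of hard-coding `prod_catalog.*` in each job:

```python
def fq_table(env: str, layer: str, table: str) -> str:
    """Build a Unity Catalog three-level name: <env>_catalog.<layer>.<table>.

    `env` and the layer set mirror the conf/ files and schemas used in this
    skill; adjust both to match your own naming convention.
    """
    layers = {"bronze", "silver", "gold", "ml_features"}
    if layer not in layers:
        raise ValueError(f"unknown layer: {layer}")
    return f"{env}_catalog.{layer}.{table}"

print(fq_table("prod", "bronze", "raw_events"))  # prod_catalog.bronze.raw_events
```

Jobs can then read the environment from their target's config file and build table names with `fq_table(env, "bronze", "raw_events")` rather than string literals.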
## Instructions

### Step 1: Configure Unity Catalog Hierarchy

```sql
-- Create catalog per environment
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Create schemas per medallion layer
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;

-- Grant permissions
GRANT USE CATALOG ON CATALOG prod_catalog TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `analysts`;
```
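In practice you would drive the DDL above from configuration so dev, staging, and prod stay in sync. As a minimal sketch (plain Python, no Databricks dependency), a generator for the per-layer `CREATE SCHEMA` statements might look like:

```python
def schema_ddl(catalog: str, layers: list[str]) -> list[str]:
    """Emit one CREATE SCHEMA statement per medallion layer in a catalog."""
    return [f"CREATE SCHEMA IF NOT EXISTS {catalog}.{layer};" for layer in layers]

# Generate the same statements for every environment from one layer list
layers = ["bronze", "silver", "gold", "ml_features"]
for catalog in ["dev_catalog", "prod_catalog"]:
    for stmt in schema_ddl(catalog, layers):
        print(stmt)
```

On Databricks each emitted statement would be run via `spark.sql(stmt)`; the point is that the layer list lives in one place.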
### Step 2: Define Asset Bundle Configuration

```yaml
# databricks.yml
bundle:
  name: data-platform

workspace:
  host: https://your-workspace.databricks.com

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/ingestion/bronze_raw_events.py
          job_cluster_key: etl_cluster
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          notebook_task:
            notebook_path: src/transformation/silver_clean_events.py
          job_cluster_key: etl_cluster
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            # num_workers and autoscale are mutually exclusive; autoscale is used here
            autoscale:
              min_workers: 1
              max_workers: 4

targets:
  dev:
    default: true
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.databricks.com
```
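`databricks bundle validate` catches schema errors, but dependency mistakes between tasks (a `depends_on` pointing at a misspelled `task_key`, or an accidental cycle) are easy to introduce as the job grows. A quick pre-deploy sanity check in plain Python, operating on the task list as a list of dicts (the names below mirror the bundle above; nothing here calls a Databricks API), might look like:

```python
def validate_tasks(tasks: list[dict]) -> list[str]:
    """Check that every depends_on references a defined task_key and the
    dependency graph is acyclic; return tasks in a valid execution order."""
    keys = {t["task_key"] for t in tasks}
    deps = {t["task_key"]: [d["task_key"] for d in t.get("depends_on", [])]
            for t in tasks}
    for key, ds in deps.items():
        for d in ds:
            if d not in keys:
                raise ValueError(f"{key} depends on undefined task {d}")
    # Kahn's algorithm: if no task is ever ready, there is a cycle
    ordered, remaining = [], dict(deps)
    while remaining:
        ready = [k for k, ds in remaining.items()
                 if all(d not in remaining for d in ds)]
        if not ready:
            raise ValueError("cycle detected in task dependencies")
        for k in ready:
            ordered.append(k)
            del remaining[k]
    return ordered

tasks = [
    {"task_key": "bronze_ingest"},
    {"task_key": "silver_transform",
     "depends_on": [{"task_key": "bronze_ingest"}]},
]
print(validate_tasks(tasks))  # ['bronze_ingest', 'silver_transform']
```

Loading the task list from `databricks.yml` with a YAML parser and running this in CI gives fast feedback before a bundle deploy.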
### Step 3: Implement Medallion Pipeline Pattern

```python
# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingestion with audit metadata, via Auto Loader
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/schema")
    .load("/data/raw/events/")
)

# input_file_name() is not supported on Unity Catalog;
# use the _metadata column to capture the source file path
bronze_df = (
    raw_df
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))
)

(bronze_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events")
    .toTable("prod_catalog.bronze.raw_events"))
```
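The silver layer (`silver_deduplicate.py` in the project structure) typically keeps the latest record per business key. In PySpark that is a window function or `dropDuplicates`; the core rule can be sketched in plain Python (the `event_id` and `_ingested_at` field names are illustrative, matching the bronze metadata column added above):

```python
def deduplicate(records: list[dict], key: str = "event_id",
                ts: str = "_ingested_at") -> list[dict]:
    """Keep the most recently ingested record per key (latest timestamp wins)."""
    latest: dict = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return sorted(latest.values(), key=lambda r: r[key])

records = [
    {"event_id": "a", "_ingested_at": 1, "value": "stale"},
    {"event_id": "a", "_ingested_at": 2, "value": "fresh"},
    {"event_id": "b", "_ingested_at": 1, "value": "only"},
]
print(deduplicate(records))  # keeps the "fresh" record for "a", plus "b"
```

The same latest-wins rule, expressed as a Spark window ordered by `_ingested_at` descending with `row_number() == 1`, scales this to the streaming silver job.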
## Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Schema evolution failure | New columns in source | Enable the `mergeSchema` option |
| Job cluster timeout | Long-running tasks | Use autoscaling; raise the task timeout |
| Permission denied | Missing catalog grants | Check Unity Catalog ACLs |
| Delta version conflict | Concurrent writes | Retry with backoff; prefer `MERGE` with a narrow match condition over blind `INSERT` |
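For the concurrent-write row, retry with exponential backoff is the usual pattern: Delta conflicts (e.g. `ConcurrentAppendException`) are transient, so re-reading and re-attempting the commit normally succeeds. A minimal sketch, with a simulated flaky write standing in for the real Delta operation:

```python
import time

def with_retries(fn, retries: int = 3, base_delay: float = 0.01):
    """Call fn, retrying on failure with exponential backoff (base_delay * 2^attempt)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky_write():
    """Simulated Delta commit that conflicts twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("ConcurrentAppendException (simulated)")
    return "committed"

print(with_retries(flaky_write))  # committed
```

In a real job the `except` clause would catch the specific Delta concurrency exceptions rather than bare `RuntimeError`.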
## Examples

### Quick Medallion Validation

```python
# Validate data flows through medallion layers
for layer in ["bronze", "silver", "gold"]:
    count = spark.table(f"prod_catalog.{layer}.events").count()
    print(f"{layer}: {count:,} rows")
```
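Beyond printing counts, the same loop can assert invariants: no layer should be empty, and silver should never hold more rows than bronze, since cleaning and deduplication only drop rows (gold may legitimately be much smaller, as it aggregates). A plain-Python check over already-collected counts, as a sketch:

```python
def check_layer_counts(counts: dict[str, int]) -> list[str]:
    """Flag suspicious medallion row counts: an empty layer, or silver
    exceeding bronze (silver should only remove rows, never add them)."""
    problems = []
    for layer, n in counts.items():
        if n == 0:
            problems.append(f"{layer} is empty")
    if counts.get("silver", 0) > counts.get("bronze", 0):
        problems.append("silver has more rows than bronze")
    return problems

counts = {"bronze": 1000, "silver": 950, "gold": 30}
print(check_layer_counts(counts))  # [] — nothing suspicious
```

Feeding this from `spark.table(...).count()` per layer turns the quick validation into a pass/fail gate suitable for an integration test.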
## Resources

## Output

- Configuration files or code changes applied to the project
- Validation report confirming correct implementation
- Summary of changes made and their rationale