01-yaml-table-setup
YAML-Driven Gold Layer Table Setup
Overview
Use YAML schema files as the single source of truth for Gold layer table definitions. A single Python script reads the YAMLs at runtime and creates tables dynamically; no code generation or embedded DDL is required.
Key Principle: Schema definitions live in YAML files. Python code is generic and reusable.
When to Use This Pattern
✅ Use when:
- Managing 10+ Gold layer tables across multiple domains
- Schema evolves frequently and needs easy updates
- Want to avoid embedded SQL DDL strings in Python
- Need consistent table properties across all tables
- Want YAML files to be reviewable/diffable
❌ Don't use when:
- Only 1-2 simple tables to create
- Complex transformation logic required during table creation
- Tables have highly custom DDL not expressible in YAML
Architecture
```
┌─────────────────────────────┐
│  YAML Schema Files          │  ← Source of Truth
│  gold_layer_design/         │
│    yaml/{domain}/*.yaml     │
└──────────┬──────────────────┘
           │
           │  (read at runtime)
           ▼
┌─────────────────────────────┐
│  setup_tables.py            │  ← Single Generic Script
│  - find_yaml_base()         │
│  - load_yaml()              │
│  - build_create_table_ddl() │
│  - create_table()           │
└──────────┬──────────────────┘
           │
           ▼
┌─────────────────────────────┐
│  Gold Layer Tables          │
│  (with PKs, comments, etc.) │
└─────────────────────────────┘
```
YAML Schema Structure Summary
Standard YAML Schema File
```yaml
# gold_layer_design/yaml/{domain}/{table_name}.yaml
table_name: dim_cluster
domain: compute
description: >
  Gold layer compute cluster dimension with configuration attributes.

primary_key:
  columns: ['workspace_id', 'cluster_id']
  composite: true

columns:
  - name: workspace_id
    type: STRING
    nullable: false
    description: "Workspace identifier..."

  - name: cluster_id
    type: STRING
    nullable: false
    description: "Cluster identifier..."
```
Column Definition Fields
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Column name |
| `type` | Yes | SQL data type (STRING, INT, BIGINT, TIMESTAMP, etc.) |
| `nullable` | No | Default: `true`. Set `false` for NOT NULL |
| `description` | No | Column comment (LLM-friendly description) |
Supported Data Types
```yaml
# Standard types
type: STRING
type: INT
type: BIGINT
type: DOUBLE
type: BOOLEAN
type: DATE
type: TIMESTAMP
type: DECIMAL(38,10)

# Complex types
type: ARRAY<STRING>
type: MAP<STRING,STRING>
```
See `assets/templates/gold-table-template.yaml` for the complete YAML template.
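The supported-type list above can be enforced before any DDL is built. A minimal pre-flight sketch — the allowlist pattern and the function name `is_supported_type` are assumptions here, not part of the setup script:

```python
import re

# Hypothetical allowlist mirroring the supported types listed above.
TYPE_PATTERN = re.compile(
    r"^(STRING|INT|BIGINT|DOUBLE|BOOLEAN|DATE|TIMESTAMP"
    r"|DECIMAL\(\d+,\d+\)"
    r"|ARRAY<\w+>"
    r"|MAP<\w+,\w+>)$"
)

def is_supported_type(type_str: str) -> bool:
    """Return True if a YAML `type` value matches one of the supported types."""
    return TYPE_PATTERN.match(type_str.strip()) is not None
```

Running column types through a check like this surfaces schema typos at validation time instead of as opaque DDL errors from `spark.sql`.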
Setup Script Implementation
Core Functions
```python
# src/gold/setup_tables.py
import yaml
from pathlib import Path

from pyspark.sql import SparkSession

STANDARD_PROPERTIES = {
    "delta.enableChangeDataFeed": "true",
    "delta.enableRowTracking": "true",
    "delta.autoOptimize.autoCompact": "true",
    "delta.autoOptimize.optimizeWrite": "true",
    "layer": "gold",
}


def escape_sql_string(text: str) -> str:
    """Escape single quotes for safe embedding in SQL string literals."""
    return text.replace("'", "''")


def find_yaml_base() -> Path:
    """Find the YAML directory - works in Databricks and locally."""
    possible_paths = [
        Path("../../gold_layer_design/yaml"),
        Path("gold_layer_design/yaml"),
    ]
    for path in possible_paths:
        if path.exists():
            return path
    raise FileNotFoundError("YAML directory not found")


def load_yaml(path: Path) -> dict:
    """Load and parse a YAML file."""
    with open(path, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def build_create_table_ddl(catalog: str, schema: str, config: dict) -> str:
    """Build CREATE TABLE DDL from a YAML config."""
    table_name = config['table_name']
    columns = config.get('columns', [])
    description = config.get('description', '')
    domain = config.get('domain', 'unknown')

    # Build column DDL
    col_ddls = []
    for col in columns:
        null_str = "" if col.get('nullable', True) else " NOT NULL"
        desc = escape_sql_string(col.get('description', ''))
        comment = f"\n    COMMENT '{desc}'" if desc else ""
        col_ddls.append(f"  {col['name']} {col['type']}{null_str}{comment}")
    columns_str = ",\n".join(col_ddls)

    # Build table properties
    props = STANDARD_PROPERTIES.copy()
    props['domain'] = domain
    props['entity_type'] = "dimension" if table_name.startswith("dim_") else "fact"
    props_str = ",\n  ".join([f"'{k}' = '{v}'" for k, v in props.items()])

    return f"""CREATE OR REPLACE TABLE {catalog}.{schema}.{table_name} (
{columns_str}
)
USING DELTA
CLUSTER BY AUTO
TBLPROPERTIES (
  {props_str}
)
COMMENT '{escape_sql_string(description)}'"""


def create_table(spark: SparkSession, catalog: str, schema: str, yaml_path: Path) -> dict:
    """Create a single table from a YAML file."""
    config = load_yaml(yaml_path)
    table_name = config.get('table_name', yaml_path.stem)

    # Create the table
    ddl = build_create_table_ddl(catalog, schema, config)
    spark.sql(ddl)

    # Add the primary key (informational in Unity Catalog: NOT ENFORCED)
    pk_config = config.get('primary_key', {})
    if pk_config and pk_config.get('columns'):
        pk_cols = ", ".join(pk_config['columns'])
        spark.sql(f"""
            ALTER TABLE {catalog}.{schema}.{table_name}
            ADD CONSTRAINT pk_{table_name}
            PRIMARY KEY ({pk_cols})
            NOT ENFORCED
        """)

    return {"table": table_name, "status": "success"}
```
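Descriptions frequently contain apostrophes, which must be doubled inside SQL string literals — this is why the script routes every comment through `escape_sql_string`. A self-contained sketch of that helper together with the per-column rendering step (condensed here onto one line per column for readability):

```python
def escape_sql_string(text: str) -> str:
    # Double single quotes so the value is safe inside a SQL '...' literal.
    return text.replace("'", "''")

def column_ddl(col: dict) -> str:
    """Render one YAML column entry as a single column-definition line."""
    null_str = "" if col.get("nullable", True) else " NOT NULL"
    desc = escape_sql_string(col.get("description", ""))
    comment = f" COMMENT '{desc}'" if desc else ""
    return f"{col['name']} {col['type']}{null_str}{comment}"

print(column_ddl({
    "name": "sku_name",
    "type": "STRING",
    "nullable": False,
    "description": "The SKU's billable name",
}))
# sku_name STRING NOT NULL COMMENT 'The SKU''s billable name'
```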
Main Entry Point
```python
def get_parameters():
    """Get job parameters from dbutils widgets.

    dbutils is provided by the Databricks notebook runtime.
    """
    catalog = dbutils.widgets.get("catalog")
    gold_schema = dbutils.widgets.get("gold_schema")
    domain = dbutils.widgets.get("domain")  # 'all' or a specific domain
    return catalog, gold_schema, domain


def main():
    catalog, gold_schema, domain = get_parameters()
    spark = SparkSession.builder.getOrCreate()
    yaml_base = find_yaml_base()

    # Process all domains or a specific one
    if domain.lower() == "all":
        domains = [d.name for d in yaml_base.iterdir() if d.is_dir()]
    else:
        domains = [domain]

    for d in domains:
        yaml_files = sorted((yaml_base / d).glob("*.yaml"))
        for yaml_file in yaml_files:
            result = create_table(spark, catalog, gold_schema, yaml_file)
            print(f"✓ {result['table']}")


if __name__ == "__main__":
    main()
```
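The domain and file discovery in `main()` is plain `pathlib`, so it can be exercised locally without Spark. A throwaway layout mirroring `gold_layer_design/yaml/{domain}/*.yaml` (file names below are illustrative; `domains` is sorted here for a deterministic listing):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "gold_layer_design" / "yaml"
    # Fake schema files, mirroring the layout the job expects
    for rel in ("billing/dim_sku.yaml", "billing/fact_usage.yaml",
                "compute/dim_cluster.yaml"):
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"table_name: {target.stem}\n")

    # Discovery as in main() with domain == "all"
    domains = sorted(d.name for d in base.iterdir() if d.is_dir())
    files = {d: [p.name for p in sorted((base / d).glob("*.yaml"))]
             for d in domains}

print(domains)  # ['billing', 'compute']
```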
Asset Bundle Configuration
Sync YAML Files to Workspace
```yaml
# databricks.yml
sync:
  include:
    - src/**/*.py
    - sql/**/*.sql
    - gold_layer_design/yaml/**/*.yaml  # ← CRITICAL: sync the YAMLs
```
Simplified Job Definition
```yaml
# resources/gold/gold_setup_job.yml
resources:
  jobs:
    gold_setup_job:
      name: "[${bundle.target}] Health Monitor - Gold Layer Setup"

      # PyYAML dependency required
      environments:
        - environment_key: default
          spec:
            environment_version: "4"
            dependencies:
              - "pyyaml>=6.0"

      tasks:
        # Single task creates ALL tables from the YAMLs
        - task_key: setup_all_tables
          environment_key: default
          notebook_task:
            notebook_path: ../../src/gold/setup_tables.py
            base_parameters:
              catalog: ${var.catalog}
              gold_schema: ${var.gold_schema}
              domain: all  # or a specific domain name
          timeout_seconds: 1800

        # FK constraints run after all PKs exist
        - task_key: add_fk_constraints
          depends_on:
            - task_key: setup_all_tables
          environment_key: default
          notebook_task:
            notebook_path: ../../src/gold/add_all_fk_constraints.py
            base_parameters:
              catalog: ${var.catalog}
              gold_schema: ${var.gold_schema}
```
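`add_all_fk_constraints.py` itself is not shown in this pattern. Its core is plausibly a small DDL builder like the sketch below — the `fk` dict layout and the constraint-naming scheme are assumptions, not the actual script — executed via `spark.sql` once every referenced PK exists. Unity Catalog FK constraints, like PKs, are informational (`NOT ENFORCED`):

```python
def build_fk_ddl(catalog: str, schema: str, fk: dict) -> str:
    """Render one ALTER TABLE ... ADD FOREIGN KEY statement.

    The fk layout (table, columns, ref_table, ref_columns) is a
    hypothetical shape for illustration.
    """
    cols = ", ".join(fk["columns"])
    ref_cols = ", ".join(fk["ref_columns"])
    return (
        f"ALTER TABLE {catalog}.{schema}.{fk['table']} "
        f"ADD CONSTRAINT fk_{fk['table']}_{fk['ref_table']} "
        f"FOREIGN KEY ({cols}) "
        f"REFERENCES {catalog}.{schema}.{fk['ref_table']} ({ref_cols}) "
        f"NOT ENFORCED"
    )

ddl = build_fk_ddl("main", "gold", {
    "table": "fact_usage", "columns": ["sku_name"],
    "ref_table": "dim_sku", "ref_columns": ["sku_name"],
})
```

Keeping FK creation in a separate, dependent task (as the job above does) guarantees the `dim_*` PK constraints exist before any fact table references them.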
Directory Structure
```
project_root/
├── databricks.yml                     # Bundle config (sync YAMLs!)
├── gold_layer_design/
│   └── yaml/                          # YAML schema files
│       ├── billing/
│       │   ├── dim_sku.yaml
│       │   ├── fact_usage.yaml
│       │   └── ...
│       ├── compute/
│       │   ├── dim_cluster.yaml
│       │   └── ...
│       └── ... (other domains)
├── src/
│   └── gold/
│       ├── setup_tables.py            # Single setup script
│       └── add_all_fk_constraints.py  # FK constraints script
└── resources/
    └── gold/
        └── gold_setup_job.yml         # Job definition
```
Workflow
Making Schema Changes
- Edit YAML: modify the schema in `gold_layer_design/yaml/{domain}/{table}.yaml`
- Deploy: `databricks bundle deploy -t dev` (YAMLs are synced)
- Run: `databricks bundle run -t dev gold_setup_job`
No code generation step! Changes are picked up at runtime.
Adding a New Table
- Create new YAML file in appropriate domain folder
- Deploy and run job
- Table is created automatically
Adding a New Domain
- Create a new folder under `gold_layer_design/yaml/`
- Add YAML files for the tables
- Deploy and run the job with `domain: all`
Comparison: Before vs After
❌ Before: Embedded DDL Strings
```python
# 14 separate setup scripts with embedded SQL
DDL = """
CREATE OR REPLACE TABLE ${catalog}.${schema}.dim_sku (
  sku_name STRING NOT NULL COMMENT '...',
  ...
)
"""
```
Problems:
- SQL embedded as strings (escaping issues)
- 14 separate files to maintain
- Schema changes require editing Python code
- Hard to diff/review schema changes
✅ After: YAML-Driven
```yaml
# dim_sku.yaml
table_name: dim_sku
columns:
  - name: sku_name
    type: STRING
    nullable: false
    description: "SKU identifier..."
```
Benefits:
- YAML is single source of truth
- One generic Python script for all domains
- Schema changes are pure YAML edits
- Easy to review schema changes in PRs
Validation Checklist
When using YAML-driven setup:
- YAML files synced in `databricks.yml` (`gold_layer_design/**/*.yaml`)
- PyYAML dependency in the job environment
- Every column has the required `name` and `type` fields
- Primary key columns marked `nullable: false`
- Column descriptions are LLM-friendly
- FK constraints handled in a separate script (not in YAML)
- Domain parameter supports 'all' or a specific domain
- Error handling for missing YAMLs/columns
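Several of these checklist items are mechanically checkable before deployment. A hedged sketch of such a validator — `validate_config` is a name invented here; the pattern itself ships no validator:

```python
def validate_config(config: dict) -> list:
    """Return a list of problems in one YAML table config; [] means it passes."""
    problems = []
    columns = config.get("columns") or []
    names = set()
    for col in columns:
        # Checklist: every column needs `name` and `type`
        if "name" not in col or "type" not in col:
            problems.append(f"column missing name/type: {col}")
            continue
        names.add(col["name"])
    pk = config.get("primary_key") or {}
    for pk_col in pk.get("columns", []):
        # Checklist: PK columns must exist and be marked nullable: false
        if pk_col not in names:
            problems.append(f"PK column not defined: {pk_col}")
        elif next(c for c in columns if c.get("name") == pk_col).get("nullable", True):
            problems.append(f"PK column should be nullable: false: {pk_col}")
    return problems
```

Running this over every loaded YAML before issuing any `CREATE TABLE` turns PK mismatches and missing fields into a readable report instead of mid-run DDL failures.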
Common Issues
Issue 1: YAML Files Not Found
Error: `FileNotFoundError: YAML directory not found`

Solution: add the YAMLs to `sync` in `databricks.yml`:

```yaml
sync:
  include:
    - gold_layer_design/yaml/**/*.yaml
```
Issue 2: PyYAML Not Available
Error: `ModuleNotFoundError: No module named 'yaml'`

Solution: add the dependency to the job environment:

```yaml
environments:
  - environment_key: default
    spec:
      dependencies:
        - "pyyaml>=6.0"
```
Issue 3: PK Column Mismatch
Error: `Column 'xyz' not found in table`

Solution: ensure the PK columns in the YAML match the actual column names:

```yaml
# CORRECT
primary_key:
  columns: ['table_catalog']  # ← must match a column definition
```
Assets
`gold-table-template.yaml`

Starter YAML template for new Gold layer tables. Copy and customize it for your table. See `assets/templates/gold-table-template.yaml` for the complete template.
Related Patterns
- Databricks Asset Bundles: see the `common/databricks-asset-bundles` skill
- Unity Catalog Constraints: see the `common/unity-catalog-constraints` skill
- Gold Layer Documentation: see the `gold/design-workers/06-table-documentation` skill
References
- Databricks Asset Bundles Sync
- Unity Catalog Constraints
- Delta Lake Table Properties
- PyYAML Documentation
Pattern Origin: December 2025
Trigger: Refactoring 14 domain-specific setup scripts with embedded DDL into single YAML-driven approach
Impact: Reduced code from ~5000 lines across 14 files to ~300 lines in 1 file + 39 YAML schemas
Inputs
- Gold YAML schema files from the design phase (`gold_layer_design/yaml/{domain}/*.yaml`)
- Catalog and schema names (from job parameters or dbutils widgets)
- Common skill patterns: `databricks-table-properties`, `unity-catalog-constraints`, `schema-management-patterns`
Outputs
- Created Gold tables with PKs, standard TBLPROPERTIES, CLUSTER BY AUTO
- Schema created with `CREATE SCHEMA IF NOT EXISTS`
- Predictive Optimization enabled on the Gold schema
- FK constraints applied (via the separate `add_all_fk_constraints.py`)
Pipeline Notes to Carry Forward
- Table inventory dict from `build_inventory()`: reuse in the merge scripts (Phase 2)
- YAML base path from `find_yaml_base()`: same path used by the merge scripts
- Any FK constraint failures may indicate missing PK tables that need creation first
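`build_inventory()` is referenced above but defined elsewhere in the pipeline. One plausible shape — a sketch, not the actual implementation — maps `table_name` to its YAML path so the merge scripts can look tables up without re-globbing:

```python
from pathlib import Path

def build_inventory(yaml_base: Path) -> dict:
    """Map table_name -> YAML path for every schema file under yaml_base.

    Sketch only: the real build_inventory() may also cache parsed configs.
    Uses the file stem as the table name, matching create_table()'s fallback.
    """
    return {
        path.stem: path
        for domain_dir in sorted(yaml_base.iterdir()) if domain_dir.is_dir()
        for path in sorted(domain_dir.glob("*.yaml"))
    }
```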
Next Step
Proceed to `pipeline-workers/02-merge-patterns` to implement Silver-to-Gold MERGE operations using the tables created in this phase.