01-yaml-table-setup
YAML-Driven Gold Layer Table Setup
Overview
Use YAML schema files as the single source of truth for Gold layer table definitions. A single Python script reads the YAMLs at runtime and creates tables dynamically; no code generation or embedded DDL is required.
Key Principle: Schema definitions live in YAML files. Python code is generic and reusable.
When to Use This Pattern
✅ Use when:
- Managing 10+ Gold layer tables across multiple domains
- Schema evolves frequently and needs easy updates
- Want to avoid embedded SQL DDL strings in Python
- Need consistent table properties across all tables
- Want YAML files to be reviewable/diffable
❌ Don't use when:
- Only 1-2 simple tables to create
- Complex transformation logic required during table creation
- Tables have highly custom DDL not expressible in YAML
Architecture
```
┌─────────────────────────────┐
│  YAML Schema Files          │  ← Source of Truth
│  gold_layer_design/         │
│    yaml/{domain}/*.yaml     │
└──────────┬──────────────────┘
           │
           │  (read at runtime)
           ▼
┌─────────────────────────────┐
│  setup_tables.py            │  ← Single Generic Script
│  - find_yaml_base()         │
│  - load_yaml()              │
│  - build_create_table_ddl() │
│  - create_table()           │
└──────────┬──────────────────┘
           │
           ▼
┌─────────────────────────────┐
│  Gold Layer Tables          │
│  (with PKs, comments, etc.) │
└─────────────────────────────┘
```
YAML Schema Structure Summary
Standard YAML Schema File
```yaml
# gold_layer_design/yaml/{domain}/{table_name}.yaml
table_name: dim_cluster
domain: compute
description: >
  Gold layer compute cluster dimension with configuration attributes.

primary_key:
  columns: ['workspace_id', 'cluster_id']
  composite: true

columns:
  - name: workspace_id
    type: STRING
    nullable: false
    description: "Workspace identifier..."

  - name: cluster_id
    type: STRING
    nullable: false
    description: "Cluster identifier..."
```
Column Definition Fields
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Column name |
| `type` | Yes | SQL data type (STRING, INT, BIGINT, TIMESTAMP, etc.) |
| `nullable` | No | Default: `true`. Set `false` for NOT NULL |
| `description` | No | Column comment (LLM-friendly description) |
Supported Data Types
```yaml
# Standard types
type: STRING
type: INT
type: BIGINT
type: DOUBLE
type: BOOLEAN
type: DATE
type: TIMESTAMP
type: DECIMAL(38,10)

# Complex types
type: ARRAY<STRING>
type: MAP<STRING,STRING>
```
See `assets/templates/gold-table-template.yaml` for the complete YAML template.
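The supported-type list above can be enforced before any DDL is built. A minimal pre-flight sketch — the allowlist pattern and the function name `is_supported_type` are assumptions here, not part of the setup script:

```python
import re

# Hypothetical allowlist mirroring the supported types listed above.
TYPE_PATTERN = re.compile(
    r"^(STRING|INT|BIGINT|DOUBLE|BOOLEAN|DATE|TIMESTAMP"
    r"|DECIMAL\(\d+,\d+\)"
    r"|ARRAY<\w+>"
    r"|MAP<\w+,\w+>)$"
)

def is_supported_type(type_str: str) -> bool:
    """Return True if a YAML `type` value matches one of the supported types."""
    return TYPE_PATTERN.match(type_str.strip()) is not None
```

Running column types through a check like this surfaces schema typos at validation time instead of as opaque DDL errors from `spark.sql`.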
Setup Script Implementation
Core Functions
```python
# src/gold/setup_tables.py
import yaml
from pathlib import Path

from pyspark.sql import SparkSession

STANDARD_PROPERTIES = {
    "delta.enableChangeDataFeed": "true",
    "delta.enableRowTracking": "true",
    "delta.autoOptimize.autoCompact": "true",
    "delta.autoOptimize.optimizeWrite": "true",
    "layer": "gold",
}


def escape_sql_string(text: str) -> str:
    """Escape single quotes for safe embedding in SQL string literals."""
    return text.replace("'", "''")


def find_yaml_base() -> Path:
    """Find the YAML directory - works in Databricks and locally."""
    possible_paths = [
        Path("../../gold_layer_design/yaml"),
        Path("gold_layer_design/yaml"),
    ]
    for path in possible_paths:
        if path.exists():
            return path
    raise FileNotFoundError("YAML directory not found")


def load_yaml(path: Path) -> dict:
    """Load and parse a YAML file."""
    with open(path, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def build_create_table_ddl(catalog: str, schema: str, config: dict) -> str:
    """Build CREATE TABLE DDL from a YAML config."""
    table_name = config['table_name']
    columns = config.get('columns', [])
    description = config.get('description', '')
    domain = config.get('domain', 'unknown')

    # Build column DDL
    col_ddls = []
    for col in columns:
        null_str = "" if col.get('nullable', True) else " NOT NULL"
        desc = escape_sql_string(col.get('description', ''))
        comment = f"\n    COMMENT '{desc}'" if desc else ""
        col_ddls.append(f"  {col['name']} {col['type']}{null_str}{comment}")
    columns_str = ",\n".join(col_ddls)

    # Build table properties
    props = STANDARD_PROPERTIES.copy()
    props['domain'] = domain
    props['entity_type'] = "dimension" if table_name.startswith("dim_") else "fact"
    props_str = ",\n  ".join([f"'{k}' = '{v}'" for k, v in props.items()])

    return f"""CREATE OR REPLACE TABLE {catalog}.{schema}.{table_name} (
{columns_str}
)
USING DELTA
CLUSTER BY AUTO
TBLPROPERTIES (
  {props_str}
)
COMMENT '{escape_sql_string(description)}'"""


def create_table(spark: SparkSession, catalog: str, schema: str, yaml_path: Path) -> dict:
    """Create a single table from a YAML file."""
    config = load_yaml(yaml_path)
    table_name = config.get('table_name', yaml_path.stem)

    # Create the table
    ddl = build_create_table_ddl(catalog, schema, config)
    spark.sql(ddl)

    # Add the primary key (informational in Unity Catalog: NOT ENFORCED)
    pk_config = config.get('primary_key', {})
    if pk_config and pk_config.get('columns'):
        pk_cols = ", ".join(pk_config['columns'])
        spark.sql(f"""
            ALTER TABLE {catalog}.{schema}.{table_name}
            ADD CONSTRAINT pk_{table_name}
            PRIMARY KEY ({pk_cols})
            NOT ENFORCED
        """)

    return {"table": table_name, "status": "success"}
```
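Descriptions frequently contain apostrophes, which must be doubled inside SQL string literals — this is why the script routes every comment through `escape_sql_string`. A self-contained sketch of that helper together with the per-column rendering step (condensed here onto one line per column for readability):

```python
def escape_sql_string(text: str) -> str:
    # Double single quotes so the value is safe inside a SQL '...' literal.
    return text.replace("'", "''")

def column_ddl(col: dict) -> str:
    """Render one YAML column entry as a single column-definition line."""
    null_str = "" if col.get("nullable", True) else " NOT NULL"
    desc = escape_sql_string(col.get("description", ""))
    comment = f" COMMENT '{desc}'" if desc else ""
    return f"{col['name']} {col['type']}{null_str}{comment}"

print(column_ddl({
    "name": "sku_name",
    "type": "STRING",
    "nullable": False,
    "description": "The SKU's billable name",
}))
# sku_name STRING NOT NULL COMMENT 'The SKU''s billable name'
```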
Main Entry Point
```python
def get_parameters():
    """Get job parameters from dbutils widgets.

    dbutils is provided by the Databricks notebook runtime.
    """
    catalog = dbutils.widgets.get("catalog")
    gold_schema = dbutils.widgets.get("gold_schema")
    domain = dbutils.widgets.get("domain")  # 'all' or a specific domain
    return catalog, gold_schema, domain


def main():
    catalog, gold_schema, domain = get_parameters()
    spark = SparkSession.builder.getOrCreate()
    yaml_base = find_yaml_base()

    # Process all domains or a specific one
    if domain.lower() == "all":
        domains = [d.name for d in yaml_base.iterdir() if d.is_dir()]
    else:
        domains = [domain]

    for d in domains:
        yaml_files = sorted((yaml_base / d).glob("*.yaml"))
        for yaml_file in yaml_files:
            result = create_table(spark, catalog, gold_schema, yaml_file)
            print(f"✓ {result['table']}")


if __name__ == "__main__":
    main()
```
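The domain and file discovery in `main()` is plain `pathlib`, so it can be exercised locally without Spark. A throwaway layout mirroring `gold_layer_design/yaml/{domain}/*.yaml` (file names below are illustrative; `domains` is sorted here for a deterministic listing):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "gold_layer_design" / "yaml"
    # Fake schema files, mirroring the layout the job expects
    for rel in ("billing/dim_sku.yaml", "billing/fact_usage.yaml",
                "compute/dim_cluster.yaml"):
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"table_name: {target.stem}\n")

    # Discovery as in main() with domain == "all"
    domains = sorted(d.name for d in base.iterdir() if d.is_dir())
    files = {d: [p.name for p in sorted((base / d).glob("*.yaml"))]
             for d in domains}

print(domains)  # ['billing', 'compute']
```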
Asset Bundle Configuration
Sync YAML Files to Workspace
```yaml
# databricks.yml
sync:
  include:
    - src/**/*.py
    - sql/**/*.sql
    - gold_layer_design/yaml/**/*.yaml  # ← CRITICAL: sync the YAMLs
```
Simplified Job Definition
```yaml
# resources/gold/gold_setup_job.yml
resources:
  jobs:
    gold_setup_job:
      name: "[${bundle.target}] Health Monitor - Gold Layer Setup"

      # PyYAML dependency required
      environments:
        - environment_key: default
          spec:
            environment_version: "4"
            dependencies:
              - "pyyaml>=6.0"

      tasks:
        # Single task creates ALL tables from the YAMLs
        - task_key: setup_all_tables
          environment_key: default
          notebook_task:
            notebook_path: ../../src/gold/setup_tables.py
            base_parameters:
              catalog: ${var.catalog}
              gold_schema: ${var.gold_schema}
              domain: all  # or a specific domain name
          timeout_seconds: 1800

        # FK constraints run after all PKs exist
        - task_key: add_fk_constraints
          depends_on:
            - task_key: setup_all_tables
          environment_key: default
          notebook_task:
            notebook_path: ../../src/gold/add_all_fk_constraints.py
            base_parameters:
              catalog: ${var.catalog}
              gold_schema: ${var.gold_schema}
```
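`add_all_fk_constraints.py` itself is not shown in this pattern. Its core is plausibly a small DDL builder like the sketch below — the `fk` dict layout and the constraint-naming scheme are assumptions, not the actual script — executed via `spark.sql` once every referenced PK exists. Unity Catalog FK constraints, like PKs, are informational (`NOT ENFORCED`):

```python
def build_fk_ddl(catalog: str, schema: str, fk: dict) -> str:
    """Render one ALTER TABLE ... ADD FOREIGN KEY statement.

    The fk layout (table, columns, ref_table, ref_columns) is a
    hypothetical shape for illustration.
    """
    cols = ", ".join(fk["columns"])
    ref_cols = ", ".join(fk["ref_columns"])
    return (
        f"ALTER TABLE {catalog}.{schema}.{fk['table']} "
        f"ADD CONSTRAINT fk_{fk['table']}_{fk['ref_table']} "
        f"FOREIGN KEY ({cols}) "
        f"REFERENCES {catalog}.{schema}.{fk['ref_table']} ({ref_cols}) "
        f"NOT ENFORCED"
    )

ddl = build_fk_ddl("main", "gold", {
    "table": "fact_usage", "columns": ["sku_name"],
    "ref_table": "dim_sku", "ref_columns": ["sku_name"],
})
```

Keeping FK creation in a separate, dependent task (as the job above does) guarantees the `dim_*` PK constraints exist before any fact table references them.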
Directory Structure
```
project_root/
├── databricks.yml                     # Bundle config (sync YAMLs!)
├── gold_layer_design/
│   └── yaml/                          # YAML schema files
│       ├── billing/
│       │   ├── dim_sku.yaml
│       │   ├── fact_usage.yaml
│       │   └── ...
│       ├── compute/
│       │   ├── dim_cluster.yaml
│       │   └── ...
│       └── ... (other domains)
├── src/
│   └── gold/
│       ├── setup_tables.py            # Single setup script
│       └── add_all_fk_constraints.py  # FK constraints script
└── resources/
    └── gold/
        └── gold_setup_job.yml         # Job definition
```
Workflow
Making Schema Changes
- Edit YAML: modify the schema in `gold_layer_design/yaml/{domain}/{table}.yaml`
- Deploy: `databricks bundle deploy -t dev` (YAMLs are synced)
- Run: `databricks bundle run -t dev gold_setup_job`
No code generation step! Changes are picked up at runtime.
Adding a New Table
- Create new YAML file in appropriate domain folder
- Deploy and run job
- Table is created automatically
Adding a New Domain
- Create a new folder under `gold_layer_design/yaml/`
- Add YAML files for the tables
- Deploy and run the job with `domain: all`
Comparison: Before vs After
❌ Before: Embedded DDL Strings
```python
# 14 separate setup scripts with embedded SQL
DDL = """
CREATE OR REPLACE TABLE ${catalog}.${schema}.dim_sku (
  sku_name STRING NOT NULL COMMENT '...',
  ...
)
"""
```
Problems:
- SQL embedded as strings (escaping issues)
- 14 separate files to maintain
- Schema changes require editing Python code
- Hard to diff/review schema changes
✅ After: YAML-Driven
```yaml
# dim_sku.yaml
table_name: dim_sku
columns:
  - name: sku_name
    type: STRING
    nullable: false
    description: "SKU identifier..."
```
Benefits:
- YAML is single source of truth
- One generic Python script for all domains
- Schema changes are pure YAML edits
- Easy to review schema changes in PRs
Validation Checklist
When using YAML-driven setup:
- YAML files synced in `databricks.yml` (`gold_layer_design/**/*.yaml`)
- PyYAML dependency in the job environment
- Every column has the required `name` and `type` fields
- Primary key columns marked `nullable: false`
- Column descriptions are LLM-friendly
- FK constraints handled in a separate script (not in YAML)
- Domain parameter supports 'all' or a specific domain
- Error handling for missing YAMLs/columns
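Several of these checklist items are mechanically checkable before deployment. A hedged sketch of such a validator — `validate_config` is a name invented here; the pattern itself ships no validator:

```python
def validate_config(config: dict) -> list:
    """Return a list of problems in one YAML table config; [] means it passes."""
    problems = []
    columns = config.get("columns") or []
    names = set()
    for col in columns:
        # Checklist: every column needs `name` and `type`
        if "name" not in col or "type" not in col:
            problems.append(f"column missing name/type: {col}")
            continue
        names.add(col["name"])
    pk = config.get("primary_key") or {}
    for pk_col in pk.get("columns", []):
        # Checklist: PK columns must exist and be marked nullable: false
        if pk_col not in names:
            problems.append(f"PK column not defined: {pk_col}")
        elif next(c for c in columns if c.get("name") == pk_col).get("nullable", True):
            problems.append(f"PK column should be nullable: false: {pk_col}")
    return problems
```

Running this over every loaded YAML before issuing any `CREATE TABLE` turns PK mismatches and missing fields into a readable report instead of mid-run DDL failures.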
Common Issues
Issue 1: YAML Files Not Found
Error: `FileNotFoundError: YAML directory not found`

Solution: add the YAMLs to `sync` in `databricks.yml`:

```yaml
sync:
  include:
    - gold_layer_design/yaml/**/*.yaml
```
Issue 2: PyYAML Not Available
Error: `ModuleNotFoundError: No module named 'yaml'`

Solution: add the dependency to the job environment:

```yaml
environments:
  - environment_key: default
    spec:
      dependencies:
        - "pyyaml>=6.0"
```
Issue 3: PK Column Mismatch
Error: `Column 'xyz' not found in table`

Solution: ensure the PK columns in the YAML match the actual column names:

```yaml
# CORRECT
primary_key:
  columns: ['table_catalog']  # ← must match a column definition
```
Assets
`gold-table-template.yaml`

Starter YAML template for new Gold layer tables. Copy and customize it for your table. See `assets/templates/gold-table-template.yaml` for the complete template.
Related Patterns
- Databricks Asset Bundles: see the `common/databricks-asset-bundles` skill
- Unity Catalog Constraints: see the `common/unity-catalog-constraints` skill
- Gold Layer Documentation: see the `gold/design-workers/06-table-documentation` skill
References
- Databricks Asset Bundles Sync
- Unity Catalog Constraints
- Delta Lake Table Properties
- PyYAML Documentation
Pattern Origin: December 2025
Trigger: Refactoring 14 domain-specific setup scripts with embedded DDL into single YAML-driven approach
Impact: Reduced code from ~5000 lines across 14 files to ~300 lines in 1 file + 39 YAML schemas
Inputs
- Gold YAML schema files from the design phase (`gold_layer_design/yaml/{domain}/*.yaml`)
- Catalog and schema names (from job parameters or dbutils widgets)
- Common skill patterns: `databricks-table-properties`, `unity-catalog-constraints`, `schema-management-patterns`
Outputs
- Created Gold tables with PKs, standard TBLPROPERTIES, CLUSTER BY AUTO
- Schema created with `CREATE SCHEMA IF NOT EXISTS`
- Predictive Optimization enabled on the Gold schema
- FK constraints applied (via the separate `add_all_fk_constraints.py`)
Pipeline Notes to Carry Forward
- Table inventory dict from `build_inventory()`: reuse in the merge scripts (Phase 2)
- YAML base path from `find_yaml_base()`: same path used by the merge scripts
- Any FK constraint failures may indicate missing PK tables that need creation first
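`build_inventory()` is referenced above but defined elsewhere in the pipeline. One plausible shape — a sketch, not the actual implementation — maps `table_name` to its YAML path so the merge scripts can look tables up without re-globbing:

```python
from pathlib import Path

def build_inventory(yaml_base: Path) -> dict:
    """Map table_name -> YAML path for every schema file under yaml_base.

    Sketch only: the real build_inventory() may also cache parsed configs.
    Uses the file stem as the table name, matching create_table()'s fallback.
    """
    return {
        path.stem: path
        for domain_dir in sorted(yaml_base.iterdir()) if domain_dir.is_dir()
        for path in sorted(domain_dir.glob("*.yaml"))
    }
```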
Next Step
Proceed to `pipeline-workers/02-merge-patterns` to implement Silver-to-Gold MERGE operations using the tables created in this phase.