Merge Schema Validation (Runtime)

Overview

Gold layer merge scripts must align perfectly with DDL-defined schemas. Mismatches between source DataFrame columns and target table schemas account for roughly a quarter of observed Gold layer bugs (26% of the 88 bugs tracked across 7 Gold domains). This skill provides runtime validation patterns to catch schema issues before deployment.

Key Principle: DDL (setup script) is the runtime source of truth, not YAML design specifications.

Companion skill: For design-time cross-validation of YAML ↔ ERD ↔ Lineage, see design-workers/07-design-validation/SKILL.md.

When to Use This Skill

  • Creating Gold layer merge scripts (Silver → Gold)
  • Ensuring DataFrame columns match target table schemas
  • Validating column mappings before MERGE operations
  • Debugging UNRESOLVED_COLUMN errors in merge operations
  • Running pre-deployment schema validation

Core Problem: Three Sources of Truth

Gold layer development involves three schema definitions:

  1. YAML Design (gold_layer_design/yaml/*.yaml) - Human-readable specifications
  2. DDL Scripts (src/gold/setup/*.py) - CREATE TABLE statements
  3. Merge Scripts (src/gold/merge/*.py) - DataFrame transformations

Problem: Manual transcription between these creates mismatches.

Common Failure Pattern

# YAML Design: user_email
# DDL Script: email (renamed during implementation)
# Merge Script: .select("user_email")  # ❌ Still references old name!
# Runtime Error: [UNRESOLVED_COLUMN]

Root Cause: DDL was updated, but merge script wasn't.

Critical Rules

Rule 1: DDL is Runtime Source of Truth

Always develop merge scripts against the tables actually created by the setup job, not the YAML designs.

Workflow:

1. YAML Design (Planning)
2. DDL Creation (Reality) ← SOURCE OF TRUTH
3. Run Setup Job (Create Tables)
4. Validate Schema (Read actual table; see the sketch below)
5. Write Merge Script (Against actual schema)
6. Pre-Deployment Validation (Compare DataFrame vs DDL)
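
A minimal sketch of steps 3 and 4, assuming catalog and gold_schema variables and the dim_user table used in later examples:

# Step 3 has run: the setup job created the Gold tables
# Step 4: inspect the actual table schema, not the YAML design
spark.sql(f"DESCRIBE TABLE {catalog}.{gold_schema}.dim_user").show(truncate=False)

# Or read it programmatically while writing the merge script
actual_columns = [f.name for f in spark.table(f"{catalog}.{gold_schema}.dim_user").schema.fields]
print(actual_columns)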

Rule 2: Always Validate Before Merge

Use the schema validation helper before every merge operation:

from helpers import validate_merge_schema

# Build the updates DataFrame from Silver (prepare_transformations is illustrative)
updates_df = prepare_transformations(silver_df)

# Fail fast if the DataFrame does not match the Gold DDL schema
validate_merge_schema(
    spark, updates_df, catalog, gold_schema, "dim_user",
    raise_on_mismatch=True
)

delta_gold.merge(updates_df, ...)

Rule 3: Explicit Column Mapping

Never assume Silver column names match Gold DDL. Always use explicit mapping:

from pyspark.sql.functions import col

updates_df = (
    silver_df
    .withColumn("store_id", col("store_number"))   # Silver store_number → Gold store_id
    .withColumn("product_id", col("upc_code"))     # Silver upc_code → Gold product_id
    .select("store_id", "product_id", ...)
)

Quick Reference

Schema Validation Helper

validate_merge_schema(
    spark: SparkSession,
    updates_df: DataFrame,
    catalog: str,
    schema: str,
    table_name: str,
    raise_on_mismatch: bool = True
) -> dict

Returns a validation result dict with:

  • valid: Boolean indicating if schemas match
  • missing_in_df: Columns in DDL but not in DataFrame
  • extra_in_df: Columns in DataFrame but not in DDL
  • type_mismatches: Columns with type differences

See references/validation-patterns.md for the complete implementation and full code examples; a minimal sketch of the expected behavior follows.
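
This sketch is illustrative, not the shipped code; it assumes only the signature and return shape documented above:

from pyspark.sql import SparkSession, DataFrame

def validate_merge_schema(
    spark: SparkSession,
    updates_df: DataFrame,
    catalog: str,
    schema: str,
    table_name: str,
    raise_on_mismatch: bool = True,
) -> dict:
    """Compare a DataFrame's schema against the target table's actual DDL schema."""
    target = spark.table(f"{catalog}.{schema}.{table_name}").schema
    ddl_types = {f.name: f.dataType for f in target.fields}
    df_types = {f.name: f.dataType for f in updates_df.schema.fields}

    result = {
        "missing_in_df": sorted(set(ddl_types) - set(df_types)),
        "extra_in_df": sorted(set(df_types) - set(ddl_types)),
        "type_mismatches": {
            name: (str(df_types[name]), str(ddl_types[name]))
            for name in set(ddl_types) & set(df_types)
            if df_types[name] != ddl_types[name]
        },
    }
    result["valid"] = not any(result.values())
    if raise_on_mismatch and not result["valid"]:
        raise ValueError(f"Schema mismatch for {table_name}: {result}")
    return result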

Pre-Merge Validation Checklist

Before writing any Gold merge script:

Pre-Development:

  • Read YAML design to understand intent
  • Run setup job to create actual tables
  • Read DDL schema from actual table (DESCRIBE table)
  • Document column mappings (Silver → Gold)
  • Identify any renamed columns

During Development:

  • Use explicit .withColumn() for all renames
  • Use .select() with explicit column list (no *)
  • Add comments documenting mappings
  • Cast data types explicitly if DDL differs from source
  • Include all DDL columns in final DataFrame

Pre-Deployment:

  • Run schema validation helper
  • Verify no extra columns in DataFrame
  • Verify no missing columns from DDL
  • Verify data types match DDL
  • Test merge with small sample data (see the sketch below)
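
A hedged sketch of the last two checklist items, assuming a fact_sales target and illustrative column names:

from pyspark.sql.functions import col

# Cast explicitly where the Gold DDL type differs from the Silver source
sample_df = (
    updates_df
    .withColumn("quantity_sold", col("quantity_sold").cast("decimal(10,2)"))
    .limit(100)  # small sample for a trial merge run
)

# Raises with exact column differences before any merge is attempted
validate_merge_schema(spark, sample_df, catalog, gold_schema, "fact_sales")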

Core Patterns

Pattern 1: Schema Inspector Helper

Validate DataFrame against target table schema before merge:

from helpers import validate_merge_schema

validate_merge_schema(spark, updates_df, catalog, schema, table_name)

Benefits:

  • Catches mismatches before Delta merge attempt
  • Clear error messages with exact column differences
  • Prevents cryptic Delta errors
  • Validates data types, not just column names
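
For debugging, the helper can also be called in report mode instead of raising (a usage sketch based on the return shape documented above):

result = validate_merge_schema(
    spark, updates_df, catalog, gold_schema, "dim_user",
    raise_on_mismatch=False,  # collect a report instead of raising
)
if not result["valid"]:
    print("Missing in DataFrame:", result["missing_in_df"])
    print("Extra in DataFrame:", result["extra_in_df"])
    print("Type mismatches:", result["type_mismatches"])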

Pattern 2: Explicit Column Mapping

Document and implement explicit Silver → Gold column mappings:

from pyspark.sql.functions import col

def merge_dim_store(spark, catalog, silver_schema, gold_schema):
    """
    Column Mapping (Silver → Gold):
    - store_number → store_number (same)
    - company_rcn → company_retail_control_number (renamed)
    - processed_timestamp → record_updated_timestamp (renamed)
    """
    # Source table name is illustrative
    silver_df = spark.table(f"{catalog}.{silver_schema}.stores")
    updates_df = (
        silver_df
        .withColumn("company_retail_control_number", col("company_rcn"))
        .withColumn("record_updated_timestamp", col("processed_timestamp"))
        .select("store_number", "company_retail_control_number", ...)
    )

Pattern 3: DDL Schema Reader

Read actual table schema before writing merge script:

schema = spark.table(f"{catalog}.{gold_schema}.dim_user").schema
print([field.name for field in schema.fields])

updates_df = df.select("email")  # Matches DDL, not YAML

Common Mistakes to Avoid

Mistake 1: Trusting YAML Over DDL

# ❌ BAD: Using YAML-defined column name
updates_df = df.select("user_email")  # Doesn't match DDL!

Fix: Always verify against actual DDL:

# ✅ GOOD: Check actual table schema first
schema = spark.table("catalog.schema.dim_user").schema
updates_df = df.select("email")

Mistake 2: Implicit Column Selection

# ❌ BAD: Selecting with *
updates_df = silver_df.select("*")  # Includes all Silver columns

Fix: Explicit selection:

# ✅ GOOD: Only select columns that exist in Gold DDL
updates_df = silver_df.select("column1", "column2", "column3")

Mistake 3: No Pre-Merge Validation

# ❌ BAD: Merge without validation
delta_gold.merge(updates_df, ...)  # Fails at runtime with cryptic error

Fix: Validate first:

# ✅ GOOD: Validate before merge
validate_merge_schema(spark, updates_df, catalog, schema, table_name)
delta_gold.merge(updates_df, ...)

Reference Files

  • Validation Patterns — Complete validate_merge_schema() implementation, DDL schema reader, column mapping patterns, pre-merge validation workflows, error handling, and testing patterns
  • Validation Rules — Complete schema inspector implementation with error handling
  • Validate Schema Script — Pre-deployment validation script for comparing all merge DataFrames against DDL schemas

Scripts

validate_schema.py

Pre-deployment validation script that compares all merge DataFrames against DDL schemas. Run before databricks bundle deploy to catch schema mismatches.

Usage:

python scripts/validate_schema.py \
  prashanth_subrahmanyam_catalog \
  dev_prashanth_subrahmanyam_system_gold

See scripts/validate_schema.py for complete implementation.
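
A simplified sketch of the script's overall shape (the table list and DataFrame construction below are stand-ins; the real script builds each merge DataFrame the way the corresponding merge script does):

import sys

from pyspark.sql import SparkSession

from helpers import validate_merge_schema

catalog, gold_schema = sys.argv[1], sys.argv[2]
spark = SparkSession.builder.getOrCreate()

# Gold tables to check (illustrative list)
TABLES = ["dim_store", "dim_user", "fact_sales"]

exit_code = 0
for table in TABLES:
    # Stand-in: re-read the Gold table in place of the real updates DataFrame
    updates_df = spark.table(f"{catalog}.{gold_schema}.{table}")
    result = validate_merge_schema(
        spark, updates_df, catalog, gold_schema, table, raise_on_mismatch=False
    )
    print(("OK" if result["valid"] else "FAIL") + f"  {table}")
    if not result["valid"]:
        exit_code = 1

sys.exit(exit_code)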

Related Skills

  • Design Consistency Validation (Design): design-workers/07-design-validation/SKILL.md — YAML↔ERD↔Lineage cross-validation during design
  • Grain Validation: pipeline-workers/04-grain-validation/SKILL.md — Pre-merge grain validation for fact tables
  • Gold Merge Deduplication: pipeline-workers/03-deduplication/SKILL.md — Deduplication patterns before MERGE

References

  • Pattern Origin: 88 bugs across 7 Gold domains; 26% were schema mismatch errors
  • Key Lesson: DDL is runtime truth, not YAML. Always validate the DataFrame against the actual table schema before merge.
  • Impact: Prevents 48% of schema-related bugs through pre-merge validation

Inputs

  • Validated DataFrame (from pipeline-workers/04-grain-validation or directly after deduplication for dimensions)
  • Target Gold table DDL schema (read via spark.table(gold_table).schema)
  • Gold table name from YAML (meta["table_name"])

Outputs

  • Schema validation report: valid/invalid, missing columns, extra columns, type mismatches
  • If valid: DataFrame is safe to MERGE
  • If invalid: detailed error report with exact column differences

Pipeline Notes to Carry Forward

  • DDL is runtime source of truth — not YAML design specifications
  • Missing columns in DataFrame = likely a Silver-to-Gold mapping issue
  • Extra columns in DataFrame = likely forgot to .select() only Gold columns
  • Type mismatches = may need explicit .cast() in column mapping

Next Step

If schema validation passes, execute the MERGE operation. If it fails, fix column mappings and re-validate before merging.
