# 05-schema-validation: Merge Schema Validation (Runtime)

## Overview
Gold layer merge scripts must align perfectly with DDL-defined schemas. Mismatches between source DataFrame columns and target table schemas cause 59% of Gold layer bugs. This skill provides runtime validation patterns to catch schema issues before deployment.
**Key Principle:** DDL (setup script) is the runtime source of truth, not YAML design specifications.

**Companion skill:** For design-time cross-validation of YAML ↔ ERD ↔ Lineage, see `design-workers/07-design-validation/SKILL.md`.
## When to Use This Skill
- Creating Gold layer merge scripts (Silver → Gold)
- Ensuring DataFrame columns match target table schemas
- Validating column mappings before MERGE operations
- Debugging `UNRESOLVED_COLUMN` errors in merge operations
- Running pre-deployment schema validation
## Core Problem: Three Sources of Truth

Gold layer development involves three schema definitions:

- **YAML Design** (`gold_layer_design/yaml/*.yaml`) - human-readable specifications
- **DDL Scripts** (`src/gold/setup/*.py`) - CREATE TABLE statements
- **Merge Scripts** (`src/gold/merge/*.py`) - DataFrame transformations

**Problem:** Manual transcription between these creates mismatches.
### Common Failure Pattern

```python
# YAML Design:   user_email
# DDL Script:    email (renamed during implementation)
# Merge Script:  .select("user_email")   # ❌ Still references the old name!
# Runtime Error: [UNRESOLVED_COLUMN]
```

**Root Cause:** The DDL was updated, but the merge script wasn't.
## Critical Rules

### Rule 1: DDL is the Runtime Source of Truth

Always develop merge scripts against the actual created tables, not the YAML designs.
Workflow:

```
1. YAML Design (Planning)
        ↓
2. DDL Creation (Reality)   ← SOURCE OF TRUTH
        ↓
3. Run Setup Job (Create Tables)
        ↓
4. Validate Schema (Read actual table)
        ↓
5. Write Merge Script (Against actual schema)
        ↓
6. Pre-Deployment Validation (Compare DataFrame vs DDL)
```
### Rule 2: Always Validate Before Merge

Use the schema validation helper before every merge operation:

```python
from helpers import validate_merge_schema

updates_df = prepare_transformations(silver_df)

validate_merge_schema(
    spark, updates_df, catalog, gold_schema, "dim_user",
    raise_on_mismatch=True
)

delta_gold.merge(updates_df, ...)
```
### Rule 3: Explicit Column Mapping

Never assume Silver column names match the Gold DDL. Always use an explicit mapping:

```python
from pyspark.sql.functions import col

updates_df = (
    silver_df
    .withColumn("store_id", col("store_number"))
    .withColumn("product_id", col("upc_code"))
    .select("store_id", "product_id", ...)
)
```
## Quick Reference

### Schema Validation Helper

```python
validate_merge_schema(
    spark: SparkSession,
    updates_df: DataFrame,
    catalog: str,
    schema: str,
    table_name: str,
    raise_on_mismatch: bool = True
) -> dict
```

Returns a validation result with:

- `valid`: Boolean indicating whether the schemas match
- `missing_in_df`: Columns in the DDL but not in the DataFrame
- `extra_in_df`: Columns in the DataFrame but not in the DDL
- `type_mismatches`: Columns with type differences
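The comparison logic behind this result dict can be sketched in plain Python. This is a hypothetical `diff_schemas` helper, not the shipped implementation; with PySpark, the `{name: type}` dicts would come from the target table's and DataFrame's `schema.fields`.

```python
def diff_schemas(ddl_schema: dict, df_schema: dict) -> dict:
    """Compare a target DDL schema against a DataFrame schema.

    Both arguments map column name -> type string, e.g. built with
    {f.name: f.dataType.simpleString() for f in spark.table(tbl).schema.fields}.
    """
    missing_in_df = sorted(set(ddl_schema) - set(df_schema))
    extra_in_df = sorted(set(df_schema) - set(ddl_schema))
    type_mismatches = {
        name: (df_schema[name], ddl_schema[name])  # (actual, expected)
        for name in set(ddl_schema) & set(df_schema)
        if df_schema[name] != ddl_schema[name]
    }
    return {
        "valid": not (missing_in_df or extra_in_df or type_mismatches),
        "missing_in_df": missing_in_df,
        "extra_in_df": extra_in_df,
        "type_mismatches": type_mismatches,
    }
```

A `raise_on_mismatch` wrapper would simply raise when `valid` is `False`, surfacing the three diff fields in the error message.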
See `references/validation-patterns.md` for the complete implementation with full code examples.
## Pre-Merge Validation Checklist

Before writing any Gold merge script:

**Pre-Development:**
- Read the YAML design to understand intent
- Run the setup job to create the actual tables
- Read the DDL schema from the actual table (`DESCRIBE table`)
- Document column mappings (Silver → Gold)
- Identify any renamed columns

**During Development:**
- Use explicit `.withColumn()` for all renames
- Use `.select()` with an explicit column list (no `*`)
- Add comments documenting mappings
- Cast data types explicitly if the DDL differs from the source
- Include all DDL columns in the final DataFrame

**Pre-Deployment:**
- Run the schema validation helper
- Verify there are no extra columns in the DataFrame
- Verify no columns from the DDL are missing
- Verify data types match the DDL
- Test the merge with small sample data
## Core Patterns

### Pattern 1: Schema Inspector Helper

Validate the DataFrame against the target table schema before merging:

```python
from helpers import validate_merge_schema

validate_merge_schema(spark, updates_df, catalog, schema, table_name)
```
**Benefits:**
- Catches mismatches before the Delta merge attempt
- Clear error messages with exact column differences
- Prevents cryptic Delta errors
- Validates data types, not just column names
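The "clear error messages" benefit can be illustrated with a small formatter over the validation result dict. This is a sketch; the field names follow the result shape described under Quick Reference, and the helper name is an assumption.

```python
def format_schema_errors(result: dict, table_name: str) -> str:
    """Render a readable report from a schema validation result dict."""
    lines = [f"Schema mismatch for {table_name}:"]
    if result["missing_in_df"]:
        lines.append(f"  missing in DataFrame (present in DDL): {result['missing_in_df']}")
    if result["extra_in_df"]:
        lines.append(f"  extra in DataFrame (absent from DDL): {result['extra_in_df']}")
    if result["type_mismatches"]:
        lines.append(f"  type mismatches (actual vs DDL): {result['type_mismatches']}")
    return "\n".join(lines)
```

A message like this names the exact columns to fix, instead of the generic `UNRESOLVED_COLUMN` failure Delta would raise mid-merge.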
### Pattern 2: Explicit Column Mapping

Document and implement explicit Silver → Gold column mappings:

```python
from pyspark.sql.functions import col

def merge_dim_store(spark, catalog, silver_schema, gold_schema):
    """
    Column Mapping (Silver → Gold):
    - store_number        → store_number (same)
    - company_rcn         → company_retail_control_number (renamed)
    - processed_timestamp → record_updated_timestamp (renamed)
    """
    # Source table name assumed for illustration
    silver_df = spark.table(f"{catalog}.{silver_schema}.store")
    updates_df = (
        silver_df
        .withColumn("company_retail_control_number", col("company_rcn"))
        .withColumn("record_updated_timestamp", col("processed_timestamp"))
        .select("store_number", "company_retail_control_number", ...)
    )
```
### Pattern 3: DDL Schema Reader

Read the actual table schema before writing the merge script:

```python
schema = spark.table(f"{catalog}.{gold_schema}.dim_user").schema
print([field.name for field in schema.fields])

updates_df = df.select("email")  # Matches the DDL, not the YAML
```
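Taking this a step further, the DDL column list read above can drive the select expressions mechanically. This is a sketch; the helper name and `rename_map` are assumptions, not part of the real codebase.

```python
def build_select_exprs(ddl_columns: list, rename_map: dict) -> list:
    """Build explicit selectExpr strings, one per Gold DDL column.

    rename_map maps Gold column name -> Silver source column name;
    columns absent from the map are assumed to keep the same name.
    """
    return [
        f"{rename_map.get(gold_col, gold_col)} AS {gold_col}"
        for gold_col in ddl_columns
    ]

# With PySpark this would be used roughly as:
#   ddl_columns = [f.name for f in spark.table(gold_table).schema.fields]
#   updates_df = silver_df.selectExpr(*build_select_exprs(ddl_columns, rename_map))
```

Because the list is derived from the actual DDL, the DataFrame cannot silently drift from the table schema: a rename missing from the map fails fast at select time rather than at merge time.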
## Common Mistakes to Avoid

### Mistake 1: Trusting YAML Over DDL

```python
# ❌ BAD: Using the YAML-defined column name
updates_df = df.select("user_email")  # Doesn't match the DDL!
```

**Fix:** Always verify against the actual DDL:

```python
# ✅ GOOD: Check the actual table schema first
schema = spark.table("catalog.schema.dim_user").schema
updates_df = df.select("email")
```
### Mistake 2: Implicit Column Selection

```python
# ❌ BAD: Selecting with *
updates_df = silver_df.select("*")  # Includes all Silver columns
```

**Fix:** Use explicit selection:

```python
# ✅ GOOD: Only select columns that exist in the Gold DDL
updates_df = silver_df.select("column1", "column2", "column3")
```
### Mistake 3: No Pre-Merge Validation

```python
# ❌ BAD: Merge without validation
delta_gold.merge(updates_df, ...)  # Fails at runtime with a cryptic error
```

**Fix:** Validate first:

```python
# ✅ GOOD: Validate before merging
validate_merge_schema(spark, updates_df, catalog, schema, table_name)
delta_gold.merge(updates_df, ...)
```
## Reference Files

- **Validation Patterns** - Complete `validate_merge_schema()` implementation, DDL schema reader, column mapping patterns, pre-merge validation workflows, error handling, and testing patterns
- **Validation Rules** - Complete schema inspector implementation with error handling
- **Validate Schema Script** - Pre-deployment validation script for comparing all merge DataFrames against DDL schemas
## Scripts

### validate_schema.py

Pre-deployment validation script that compares all merge DataFrames against the DDL schemas. Run it before `databricks bundle deploy` to catch schema mismatches.

Usage:

```bash
python scripts/validate_schema.py \
    prashanth_subrahmanyam_catalog \
    dev_prashanth_subrahmanyam_system_gold
```

See `scripts/validate_schema.py` for the complete implementation.
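The script's driver loop can be sketched as follows. This is a hypothetical shape, assuming a per-table validation callable; the real logic lives in `scripts/validate_schema.py`.

```python
def run_validation(tables: list, validate_fn) -> dict:
    """Validate each table's merge DataFrame; collect failures.

    validate_fn(table_name) is expected to return a dict with at least
    a "valid" key, like the validate_merge_schema result described earlier.
    """
    failures = {}
    for table in tables:
        result = validate_fn(table)
        if not result["valid"]:
            failures[table] = result
    return failures

# In the real script, validate_fn would build the merge DataFrame for the
# table and call validate_merge_schema(..., raise_on_mismatch=False); a
# non-empty failures dict would then abort the deploy with a non-zero exit.
```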
## Related Skills

- **Design Consistency Validation (Design):** `design-workers/07-design-validation/SKILL.md` - YAML ↔ ERD ↔ Lineage cross-validation during design
- **Grain Validation:** `pipeline-workers/04-grain-validation/SKILL.md` - Pre-merge grain validation for fact tables
- **Gold Merge Deduplication:** `pipeline-workers/03-deduplication/SKILL.md` - Deduplication patterns before MERGE
## References

- Delta Lake Merge - official documentation
- Rule Improvement Case Study - complete Gold layer debugging methodology

**Pattern Origin:** 88 bugs across 7 Gold domains; 26% were schema mismatch errors.

**Key Lesson:** DDL is runtime truth, not YAML. Always validate the DataFrame against the actual table schema before merging.

**Impact:** Prevents 48% of schema-related bugs through pre-merge validation.
## Inputs

- Validated DataFrame (from `pipeline-workers/04-grain-validation`, or directly after deduplication for dimensions)
- Target Gold table DDL schema (read via `spark.table(gold_table).schema`)
- Gold table name from the YAML (`meta["table_name"]`)
## Outputs

- Schema validation report: valid/invalid, missing columns, extra columns, type mismatches
- If valid: the DataFrame is safe to MERGE
- If invalid: a detailed error report with exact column differences
## Pipeline Notes to Carry Forward

- DDL is the runtime source of truth, not the YAML design specifications
- Missing columns in the DataFrame = likely a Silver-to-Gold mapping issue
- Extra columns in the DataFrame = likely forgot to `.select()` only the Gold columns
- Type mismatches = may need an explicit `.cast()` in the column mapping
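The last note above can be mechanized: given the `type_mismatches` part of the validation result, the explicit casts can be generated. This is a hypothetical helper sketch, not part of the shipped helpers.

```python
def build_cast_exprs(type_mismatches: dict) -> list:
    """Build CAST expressions coercing each mismatched column to its DDL type.

    type_mismatches maps column -> (actual_type, expected_ddl_type),
    matching the result shape described under Quick Reference.
    """
    return [
        f"CAST({column} AS {expected}) AS {column}"
        for column, (_actual, expected) in sorted(type_mismatches.items())
    ]

# With PySpark, the resulting strings would feed df.selectExpr(...), or be
# replaced by explicit .withColumn(column, col(column).cast(expected)) calls.
```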
## Next Step
If schema validation passes, execute the MERGE operation. If it fails, fix column mappings and re-validate before merging.