---
name: gold-layer-design
description: This skill orchestrates the complete Gold layer design process, ensuring all mandatory deliverables are created with proper dependencies on specialized Gold layer skills.
---

# Gold Layer Design Orchestrator
## When to Use This Skill

- Designing a Gold layer from scratch for a new project
- Creating dimensional models (facts and dimensions)
- Documenting business processes and data lineage
- Preparing for Gold layer implementation (before `03b-gold-layer-setup-prompt`)
- Generating mandatory documentation (Business Onboarding Guide, Source Table Mapping, Column Lineage CSV)
## Critical Dependencies (Read at Indicated Phase)

This skill orchestrates the following skills. Each MUST be read and followed at the phase where it is invoked:

| Skill | Read At | Purpose |
|---|---|---|
| `design-workers/01-grain-definition` | Phase 2 | Grain type decision tree for fact tables |
| `design-workers/02-dimension-patterns` | Phase 2 | Dimension design patterns (role-playing, junk, degenerate, hierarchies) |
| `design-workers/03-fact-table-patterns` | Phase 2 | Fact table patterns (measure additivity, factless, accumulating snapshots) |
| `design-workers/04-conformed-dimensions` | Phase 2 | Enterprise integration (bus matrix, conformed dims, drill-across) |
| `design-workers/05-erd-diagrams` | Phase 3 | ERD creation and organization strategy |
| `design-workers/06-table-documentation` | Phase 4 | Dual-purpose documentation standards |
| `design-workers/07-design-validation` | Phase 8 | YAML ↔ ERD ↔ Lineage cross-validation |
## 🔴 Non-Negotiable Defaults (YAML Schemas MUST Encode These)

The design phase produces YAML schemas that drive all downstream table creation. These YAML schemas MUST include the following properties so that `setup_tables.py` generates compliant DDL.

| Default | YAML Location | Value | NEVER Do This Instead |
|---|---|---|---|
| Auto Liquid Clustering | `clustering:` field | `auto` | ❌ NEVER specify column names or omit clustering |
| Change Data Feed | `table_properties:` | `'delta.enableChangeDataFeed' = 'true'` | ❌ NEVER omit (required for incremental propagation) |
| Row Tracking | `table_properties:` | `'delta.enableRowTracking' = 'true'` | ❌ NEVER omit (breaks downstream MV refresh) |
| Auto-Optimize | `table_properties:` | `optimizeWrite` + `autoCompact` = `'true'` | ❌ NEVER omit |
| Layer Tag | `table_properties:` | `'layer' = 'gold'` | ❌ NEVER omit or use wrong layer |
| PK NOT NULL | `columns:` with `nullable: false` | All PRIMARY KEY columns | ❌ NEVER leave PK columns nullable |
```yaml
# ✅ CORRECT: Every Gold YAML schema MUST include these
table_name: dim_example
clustering: auto  # 🔴 MANDATORY (generates CLUSTER BY AUTO)
table_properties:
  delta.enableChangeDataFeed: "true"  # 🔴 MANDATORY
  delta.enableRowTracking: "true"     # 🔴 MANDATORY
  delta.autoOptimize.optimizeWrite: "true"
  delta.autoOptimize.autoCompact: "true"
  layer: "gold"
```
Note: Serverless and Environments V4 are job-level concerns enforced by gold/01-gold-layer-setup and databricks-asset-bundles during the implementation phase, not during design.
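The defaults above lend themselves to a pre-flight check before any DDL is generated. The sketch below is illustrative (the function name, schema layout, and `primary_key`/`nullable` column keys are assumptions based on the YAML example above, not part of `setup_tables.py`):

```python
# Hypothetical pre-flight check: verify a parsed Gold YAML schema encodes the
# non-negotiable defaults. In practice the dict would come from yaml.safe_load().
REQUIRED_PROPERTIES = {
    "delta.enableChangeDataFeed": "true",
    "delta.enableRowTracking": "true",
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
    "layer": "gold",
}

def check_defaults(schema: dict) -> list:
    """Return a list of violations; an empty list means the schema is compliant."""
    errors = []
    if schema.get("clustering") != "auto":
        errors.append("clustering must be 'auto' (generates CLUSTER BY AUTO)")
    props = schema.get("table_properties", {})
    for key, expected in REQUIRED_PROPERTIES.items():
        if str(props.get(key)).lower() != expected:
            errors.append(f"table_properties missing or wrong: {key} = '{expected}'")
    for col in schema.get("columns", []):
        # PK columns must be explicitly non-nullable
        if col.get("primary_key") and col.get("nullable", True):
            errors.append(f"PK column {col['name']} must have nullable: false")
    return errors
```

Running this over every file in `yaml/{domain}/` before Phase 8 catches default violations early instead of at validation time.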
## Quick Start (4-8 hours)

### Deliverables Checklist

ERD Organization (based on table count):

- Master ERD (`erd_master.md`) - ALWAYS required
- Domain ERDs (`erd/erd_{domain}.md`) - If 9+ tables
- Summary ERD (`erd_summary.md`) - If 20+ tables

YAML Schemas (organized by domain):

- `gold_layer_design/yaml/{domain}/dim_*.yaml` - Dimension schemas
- `gold_layer_design/yaml/{domain}/fact_*.yaml` - Fact schemas

Mandatory Documentation (ALL required):

- BUSINESS_ONBOARDING_GUIDE.md - Business processes + real-world stories
- COLUMN_LINEAGE.csv - Machine-readable column lineage
- COLUMN_LINEAGE.md - Human-readable lineage
- SOURCE_TABLE_MAPPING.csv - Source table inclusion/exclusion rationale
- DESIGN_SUMMARY.md - Design decisions
- DESIGN_GAP_ANALYSIS.md - Coverage analysis
- README.md - Navigation hub
### ERD Organization Decision
| Tables | Strategy | Required Deliverables |
|---|---|---|
| 1-8 | Master only | erd_master.md |
| 9-20 | Master + Domain | erd_master.md + erd/erd_{domain}.md |
| 20+ | Master + Domain + Summary | erd_master.md + erd_summary.md + erd/erd_{domain}.md |
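The decision tree above reduces to a few thresholds. A minimal sketch (the function name is illustrative, not part of any skill script):

```python
# Sketch of the ERD organization decision tree: map a table count to the
# list of required ERD deliverables, per the table above.
def erd_strategy(table_count: int) -> list:
    deliverables = ["erd_master.md"]  # Master ERD is ALWAYS required
    if table_count >= 9:
        deliverables.append("erd/erd_{domain}.md")  # one file per domain
    if table_count > 20:
        deliverables.append("erd_summary.md")
    return deliverables
```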
## Working Memory Management

This orchestrator spans ten phases (Phase 0 through Phase 9) over 4-8 hours. To maintain coherence without context pollution:
After each phase, persist a brief summary note capturing:
- Phase 0: Table inventory dict, entity classifications (dim/fact/bridge), FK relationships, suggested domains
- Phase 1: Project context (name, schemas, use cases, stakeholders)
- Phase 2: Dimensional model: dims, facts, measures, relationships, bus matrix, domain assignments
- Phase 3: ERD file paths, strategy used (master-only / master+domain / full)
- Phase 4: YAML file paths per domain, schema count, lineage gaps
- Phases 5-7: Output file paths (COLUMN_LINEAGE.csv, BUSINESS_ONBOARDING_GUIDE.md, SOURCE_TABLE_MAPPING.csv)
- Phase 8: Validation pass/fail summary, inconsistencies to fix
What to keep in working memory: the current phase's design-worker skill, the table inventory dict (Phase 0), and the previous phase's summary. Discard intermediate outputs (full CSV data, raw ERD source, complete YAML contents); they are on disk.
What to offload: Each design-worker skill has Design Notes to Carry Forward and Next Step sections. Read them to know what to pass to the next phase.
## Step-by-Step Workflow

### Phase 0: Source Schema Intake (MANDATORY First Step)

This is the entry point for the entire data platform build. The customer provides a source schema CSV (e.g., `context/Wanderbricks_Schema.csv`) containing table and column metadata. This phase parses it into a structured inventory that drives all subsequent design decisions.

Input: `context/{ProjectName}_Schema.csv` - a CSV with columns: `table_catalog`, `table_schema`, `table_name`, `column_name`, `ordinal_position`, `full_data_type`, `data_type`, `is_nullable`, `comment`
Steps:

1. Read and parse the schema CSV from `context/`:
   - MANDATORY: Read `references/schema-intake-patterns.md` for complete implementations of `parse_schema_csv()`, `classify_tables()`, and `infer_relationships()`.
   - `parse_schema_csv(csv_path)` - Reads the CSV into a structured dict keyed by table name, with columns, types, nullability, and comments per table.
   - Output: `{table_name: {"columns": [...], "column_count": N, "catalog": ..., "schema": ...}}`

2. Classify tables as dimensions, facts, or bridge/junction:
   - `classify_tables(schema)` - Classifies each table based on column patterns:
     - bridge: 3 or fewer columns AND 2+ FK-like columns
     - fact: 2+ FK columns AND numeric measures, OR 2+ timestamps AND 2+ FKs
     - dimension: everything else
   - Also identifies: PK candidates, FK columns, measures, and timestamps per table

3. Identify FK relationships from column comments and naming patterns:
   - `infer_relationships(classified)` - Two inference strategies:
     - Pattern 1: Column comment contains "Foreign Key to 'X'" - direct FK
     - Pattern 2: Column name `other_table_id` matches a known table name
   - Returns a list of `{from_table, from_column, to_table, to_column, source}`

4. Produce the Schema Intake Report:
   - The report summarizes: table inventory, entity classification, inferred relationships, and domain suggestions. This feeds Phase 1 (Requirements) and Phase 2 (Dimensional Modeling).

Output:

- Printed table inventory with classification (dimension/fact/bridge)
- Inferred FK relationships list
- Suggested domain groupings
- Recommended dimensional model skeleton (which source tables map to which Gold dims/facts)

Critical: The schema CSV is the single source of truth for table and column names. ALL downstream phases must extract names from the parsed schema; never generate table/column names from memory.
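The classification rules in step 2 can be sketched as follows. This is illustrative only; the real `classify_tables()` implementation lives in `references/schema-intake-patterns.md`, and the `_id` suffix and type-name heuristics here are assumptions:

```python
# Hedged sketch of the Phase 0 classification heuristics for a single table.
# columns: [{"name": str, "data_type": str}, ...] from the parsed schema CSV.
def classify_table(columns: list) -> str:
    fk_like = [c for c in columns
               if c["name"].endswith("_id")]                 # assumed FK naming
    measures = [c for c in columns
                if c["data_type"] in ("int", "bigint", "decimal", "double")
                and not c["name"].endswith("_id")]           # numeric, non-key
    timestamps = [c for c in columns
                  if c["data_type"] in ("timestamp", "date")]
    if len(columns) <= 3 and len(fk_like) >= 2:
        return "bridge"       # 3 or fewer columns AND 2+ FK-like columns
    if len(fk_like) >= 2 and (measures or len(timestamps) >= 2):
        return "fact"         # 2+ FKs AND measures, OR 2+ FKs AND 2+ timestamps
    return "dimension"        # everything else
```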
### Phase 1: Requirements Gathering
Collect the following project context (enhanced with schema intake):
| Field | Example |
|---|---|
| Project Name | wanderbricks_analytics |
| Source Schema | wanderbricks (from schema CSV) |
| Gold Schema | wanderbricks_gold (convention: {project}_gold) |
| Business Domain | travel, hospitality (inferred from schema CSV) |
| Primary Use Cases | booking analytics, revenue reporting |
| Key Stakeholders | Revenue Ops, Marketing |
| Reporting Frequency | Daily, Weekly, Monthly |
| Table Count | 15 tables → 8 dims, 5 facts, 2 bridge (from Phase 0) |
| Inferred Relationships | 12 FK relationships (from Phase 0) |
Output: Populated project context document (with Phase 0 schema intake data incorporated)
### Phase 2: Dimensional Model Design

Read and Follow: `references/dimensional-modeling-guide.md`

Activities:

- Identify 2-5 dimensions with SCD type decisions (Type 1 vs Type 2)
- Identify 1-3 facts with explicit grain definitions
- Define measures and metrics with calculation logic
- Define relationships (FK constraints)
- Assign tables to domains (Location, Product, Time, Sales, etc.)

MANDATORY: Read these skills using the Read tool, in order, during Phase 2:

- `data_product_accelerator/skills/gold/design-workers/01-grain-definition/SKILL.md` - Grain inference from PRIMARY KEY structure, transaction vs aggregated vs snapshot patterns, PK-grain decision tree
- `data_product_accelerator/skills/gold/design-workers/02-dimension-patterns/SKILL.md` - Role-playing, degenerate, junk, mini-dimensions, hierarchy flattening, NULL handling
- `data_product_accelerator/skills/gold/design-workers/03-fact-table-patterns/SKILL.md` - Measure additivity, factless facts, accumulating snapshots, late-arriving data
- `data_product_accelerator/skills/gold/design-workers/04-conformed-dimensions/SKILL.md` - Enterprise bus matrix, conformed dimensions, shrunken dims, drill-across patterns
Apply from skills:
- Infer grain type from PRIMARY KEY structure (transaction, aggregated, snapshot)
- Document grain explicitly for each fact table
- Validate PRIMARY KEY matches grain type using decision tree
- Apply dimension patterns (denormalize hierarchies, handle NULLs with "Unknown" rows, combine flags into junk dimensions)
- Classify measures as additive, semi-additive, or non-additive
- Build the enterprise bus matrix mapping fact tables to dimensions
- Identify conformed dimensions shared across multiple facts
Key Outputs:
- Dimensions table (name, business key, SCD type, source Silver table, dimension pattern applied)
- Facts table (name, grain, source Silver tables, update frequency, fact type)
- Measures table (name, data type, calculation logic, business purpose, additivity classification)
- Relationships table (fact FK → dimension PK, cardinality)
- Domain assignments table (table β domain mapping)
- Enterprise bus matrix (fact tables × dimensions)
- NULL handling strategy (Unknown member rows per dimension)
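The bus matrix output is just a cross-tabulation of facts against the dimensions they reference, so it can be rendered mechanically once Phase 2's relationships are known. A minimal sketch (the function name and input shape are illustrative; the table names in the usage are examples, not a prescribed model):

```python
# Illustrative: render an enterprise bus matrix (fact tables x dimensions)
# as a Markdown table from a {fact: set_of_dimensions} mapping.
def bus_matrix(facts_to_dims: dict) -> str:
    dims = sorted({d for ds in facts_to_dims.values() for d in ds})
    header = "| Fact | " + " | ".join(dims) + " |"
    sep = "|---" * (len(dims) + 1) + "|"
    rows = ["| " + fact + " | "
            + " | ".join("X" if d in ds else "" for d in dims) + " |"
            for fact, ds in sorted(facts_to_dims.items())]
    return "\n".join([header, sep] + rows)
```

An "X" at an intersection marks a conformed-dimension candidate shared across facts.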
### Phase 3: ERD Creation

MANDATORY: Read this skill using the Read tool BEFORE creating any ERD diagrams:

- `data_product_accelerator/skills/gold/design-workers/05-erd-diagrams/SKILL.md` - ERD organization strategy (master/domain/summary), Mermaid syntax standards, domain emoji markers, relationship patterns, cross-domain references

Activities:

- Count tables to determine ERD strategy (1-8, 9-20, 20+)
- Create Master ERD (`erd_master.md`) with ALL tables grouped by domain
- Create Domain ERDs (`erd/erd_{domain}.md`) if 9+ tables
- Create Summary ERD (`erd_summary.md`) if 20+ tables
- Add a Domain Index table to the Master ERD

Critical Rules (from 05-erd-diagrams):

- Use domain emoji markers (🏪 Location, 📦 Product, 📅 Time, 💰 Sales, 📊 Inventory)
- Use `PK` markers only (no inline descriptions in the ERD)
- Use the `by_{column}` pattern for relationship labels
- Group all relationships at the end of the diagram
- Use bracketed notation for cross-domain references: `dim_store["🏪 dim_store (Location)"]`

Outputs:

- `gold_layer_design/erd_master.md` (ALWAYS)
- `gold_layer_design/erd_summary.md` (if 20+ tables)
- `gold_layer_design/erd/erd_{domain}.md` (if 9+ tables)
### Phase 4: YAML Schema Generation

MANDATORY: Read this skill using the Read tool BEFORE generating any YAML schema files:

- `data_product_accelerator/skills/gold/design-workers/06-table-documentation/SKILL.md` - Dual-purpose description pattern (`[Definition]. Business: [...]. Technical: [...].`), surrogate key naming, TBLPROPERTIES metadata, column comment standards

Activities:

- Create the domain-organized YAML directory structure (`yaml/{domain}/`)
- Generate one YAML file per table using templates from `01-yaml-table-setup`
- Include complete column definitions with lineage metadata
- Document PRIMARY KEY and FOREIGN KEY constraints
- Apply standard table properties (CDF, row tracking, auto-optimize)
- Write dual-purpose descriptions following `06-table-documentation` patterns

Critical Rules (from design-workers/06-table-documentation + Non-Negotiable Defaults):

- Pattern: `[Definition]. Business: [context]. Technical: [details].`
- Surrogate keys as PRIMARY KEYS (not business keys)
- Include all TBLPROPERTIES (layer, domain, entity_type, grain, scd_type, etc.)
- Every column MUST have a `lineage:` section
- 🔴 `clustering: auto` in EVERY YAML schema (generates `CLUSTER BY AUTO`)
- 🔴 `delta.enableChangeDataFeed: "true"` in EVERY YAML `table_properties:`
- 🔴 `delta.enableRowTracking: "true"` in EVERY YAML `table_properties:`
- 🔴 All PRIMARY KEY columns MUST have `nullable: false`

YAML Template Reference: See `references/yaml-schema-patterns.md`

Outputs:

- `gold_layer_design/yaml/{domain}/{table}.yaml` for each table
- Domain-organized folder structure matching ERD domains
### Phase 5: Column-Level Lineage Documentation

Read: `references/lineage-documentation-guide.md`

Activities:

- Extract lineage from all YAML files
- Document the Bronze → Silver → Gold mapping for EVERY column
- Specify the transformation type from the standard list (DIRECT_COPY, RENAME, CAST, AGGREGATE_SUM, AGGREGATE_SUM_CONDITIONAL, AGGREGATE_COUNT, DERIVED_CALCULATION, HASH_MD5, COALESCE, DATE_TRUNC, GENERATED, LOOKUP)
- Document transformation logic as executable PySpark/SQL
- Generate both CSV and Markdown formats

Use Script: `scripts/generate_lineage_csv.py`

Outputs:

- `gold_layer_design/COLUMN_LINEAGE.csv` (machine-readable, MANDATORY)
- `gold_layer_design/COLUMN_LINEAGE.md` (human-readable)
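The shape of this extraction can be sketched as below. This is an illustration of the idea behind `scripts/generate_lineage_csv.py`, not its actual implementation; the lineage keys (`silver_table`, `silver_column`, `transformation_type`) are assumptions about the YAML layout:

```python
# Illustrative: flatten each YAML column's lineage: section into one CSV row.
import csv
import io

FIELDS = ["gold_table", "gold_column", "silver_table",
          "silver_column", "transformation_type"]

def lineage_rows(schemas: list) -> list:
    rows = []
    for schema in schemas:
        for col in schema.get("columns", []):
            lin = col.get("lineage", {})
            rows.append({
                "gold_table": schema["table_name"],
                "gold_column": col["name"],
                "silver_table": lin.get("silver_table", ""),
                "silver_column": lin.get("silver_column", ""),
                "transformation_type": lin.get("transformation_type", ""),
            })
    return rows

def write_lineage_csv(schemas: list) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(lineage_rows(schemas))
    return buf.getvalue()
```

Because the CSV is derived directly from the YAML, Phase 8's YAML ↔ Lineage consistency check reduces to a set comparison.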
### Phase 6: Business Onboarding Guide (MANDATORY)
Read: references/business-documentation-guide.md
Use Template: assets/templates/business-onboarding-template.md
Required Sections:
- Introduction to Business Domain
- The Business Lifecycle (Key Stages)
- Key Business Entities (Players/Actors)
- The Gold Layer Data Model (Overview)
- Business Processes & Tracking (with ASCII flow diagrams)
- 5B. Real-World Scenarios (minimum 3-4 detailed stories)
- Analytics Use Cases
- AI & ML Opportunities
- Self-Service Analytics with Genie
- Data Quality & Monitoring
- Getting Started (per role: Engineer, Analyst, Scientist, Business User)
Critical: Each story must show source system data → Gold layer updates → analytics impact at each stage.
Output: gold_layer_design/docs/BUSINESS_ONBOARDING_GUIDE.md
### Phase 7: Source Table Mapping (MANDATORY)
Use Template: assets/templates/source-table-mapping-template.csv
Activities:
- List ALL source tables (included, excluded, planned)
- Assign status: INCLUDED, EXCLUDED, or PLANNED
- Provide rationale for EVERY row (required, no exceptions)
- Map INCLUDED tables to Gold tables
- Assign domains and implementation phases
Output: gold_layer_design/SOURCE_TABLE_MAPPING.csv
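The per-row rules above are easy to enforce mechanically. A hedged sketch (the column names `status`, `rationale`, and `gold_table` are assumptions about the template layout in `assets/templates/source-table-mapping-template.csv`):

```python
# Illustrative row-level validation for SOURCE_TABLE_MAPPING.csv.
VALID_STATUSES = {"INCLUDED", "EXCLUDED", "PLANNED"}

def validate_mapping_row(row: dict) -> list:
    """Return violations for one CSV row; empty list means the row is valid."""
    errors = []
    if row.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {row.get('status')}")
    if not row.get("rationale", "").strip():
        errors.append("rationale is required for EVERY row, no exceptions")
    if row.get("status") == "INCLUDED" and not row.get("gold_table"):
        errors.append("INCLUDED tables must map to a Gold table")
    return errors
```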
### Phase 8: Design Validation

MANDATORY: Read this skill using the Read tool BEFORE running design validation:

- `data_product_accelerator/skills/gold/design-workers/07-design-validation/SKILL.md` - YAML ↔ ERD ↔ Lineage cross-validation, PK/FK reference consistency, mandatory field validation

Also read: `references/validation-checklists.md`

Validation Activities:

- Consistency check: YAML ↔ ERD ↔ Lineage CSV (all columns match)
- Validate all ERD columns exist in YAML definitions
- Validate all YAML columns have lineage metadata
- Validate PRIMARY KEY definitions match grain type
- Validate FOREIGN KEY references point to valid tables/columns
- Run the lineage validation script
- Upstream cross-reference (conditional): If upstream source tables already exist (check via `spark.catalog.tableExists()`), run `cross_reference_silver_at_design_time()` from `references/schema-intake-patterns.md` to validate YAML lineage `silver_column` values against actual source table schemas. Fix any mismatches found. If source tables do not exist yet, note that this validation will be enforced as a hard gate during Phase 0 of `01-gold-layer-setup`.
Outputs:
- Validation report (pass/fail for each category)
- List of inconsistencies to fix (including any upstream column mismatches)
- Completed design sign-off checklist
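The YAML ↔ Lineage leg of the consistency check amounts to a two-way set difference. A sketch in the spirit of `07-design-validation` (the input structures mirror the earlier phases' outputs and are assumptions, not that skill's actual API):

```python
# Illustrative cross-check: every YAML column must appear in the lineage CSV
# rows, and every lineage row must point at a defined YAML column.
def cross_validate(schemas: list, lineage_rows: list) -> list:
    yaml_cols = {(s["table_name"], c["name"])
                 for s in schemas for c in s["columns"]}
    lineage_cols = {(r["gold_table"], r["gold_column"]) for r in lineage_rows}
    errors = []
    for table, col in sorted(yaml_cols - lineage_cols):
        errors.append(f"{table}.{col} defined in YAML but missing from lineage CSV")
    for table, col in sorted(lineage_cols - yaml_cols):
        errors.append(f"{table}.{col} in lineage CSV but not defined in YAML")
    return errors
```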
### Phase 9: Stakeholder Review
Activities:
- Present ERD hierarchy to business stakeholders
- Review grain definitions for each fact table
- Confirm measures are complete for reporting needs
- Validate naming conventions with data governance
- Review Business Onboarding Guide stories for accuracy
- Obtain formal design sign-off
Output: Stakeholder approval document
## Final File Organization

```
gold_layer_design/
├── README.md                          # Navigation hub
├── erd_master.md                      # Complete ERD (ALWAYS)
├── erd_summary.md                     # Domain overview (20+ tables)
├── erd/                               # Domain ERDs (9+ tables)
│   └── erd_{domain}.md
├── yaml/                              # Domain-organized schemas
│   └── {domain}/
│       └── {table}.yaml
├── docs/
│   └── BUSINESS_ONBOARDING_GUIDE.md   # MANDATORY
├── COLUMN_LINEAGE.csv                 # MANDATORY
├── COLUMN_LINEAGE.md                  # Human-readable
├── SOURCE_TABLE_MAPPING.csv           # MANDATORY
├── DESIGN_SUMMARY.md                  # Design decisions
└── DESIGN_GAP_ANALYSIS.md             # Coverage analysis
```
## Key Design Principles

1. Schema Extraction Over Generation - Always extract table/column names from YAML. Never generate from scratch. Prevents hallucinations and schema mismatches.
2. YAML as Single Source of Truth - All column names, types, constraints, and descriptions are defined in YAML first. Implementation reads YAML.
3. Dual-Purpose Documentation - ALL descriptions serve both business users and technical users (including LLMs like Genie). Pattern: `[Definition]. Business: [...]. Technical: [...].`
4. Progressive Disclosure - Core workflow here in SKILL.md. Detailed patterns in `references/`. Templates in `assets/templates/`.
5. Dependency Orchestration - Read and follow dependency skills at the phase indicated. Don't skip dependency skills.
## Common Mistakes to Avoid
| Mistake | Impact | Prevention |
|---|---|---|
| Skipping mandatory docs | 2-3 weeks of ramp-up time wasted | Always create all 3 MANDATORY deliverables |
| Wrong ERD organization | Unreadable diagrams or unnecessary complexity | Follow table count decision tree |
| Incomplete column lineage | 33% of implementation bugs | Every column needs a Bronze → Silver → Gold mapping |
| No grain documentation | Table rewrite required | Document grain in every fact YAML |
| Generic business docs | Slow analytics adoption | Include real-world stories with data flow |
## Time Estimates
| Complexity | Tables | Time |
|---|---|---|
| Small | 1-8 | 3-4 hours |
| Medium | 9-20 | 5-6 hours |
| Large | 20+ | 6-10 hours |
## Next Steps After Design

After completing design and obtaining stakeholder sign-off, read the implementation orchestrator skill:

- `data_product_accelerator/skills/gold/01-gold-layer-setup/SKILL.md` - End-to-end implementation of tables, merge scripts, FK constraints, and Asset Bundle jobs from the YAML designs created here

That skill will in turn invoke:

- `pipeline-workers/02-merge-patterns` - SCD Type 1/2 dimension merges, fact aggregation patterns
- `pipeline-workers/03-deduplication` - Mandatory deduplication before MERGE
- `pipeline-workers/01-yaml-table-setup` - YAML-driven DDL generation
- `pipeline-workers/05-schema-validation` - DataFrame ↔ DDL schema validation
- `pipeline-workers/04-grain-validation` - Pre-merge grain validation
- All other implementation dependencies

## Pipeline Progression

This is the first stage in the Design-First pipeline. Start here after receiving the customer's source schema CSV.

Next stage: After completing Gold layer design, proceed to:

- `bronze/00-bronze-layer-setup` - Create Bronze tables matching the source schema and generate test data with Faker
## Reference Files
- Dimensional Modeling Guide - Dimensions, facts, measures, relationships, domain assignment
- ERD Organization Strategy - Master/Domain/Summary ERD patterns
- YAML Schema Patterns - YAML templates for dimensions, facts, date dimensions
- Lineage Documentation Guide - Column lineage, transformation types, CSV generation
- Business Documentation Guide - Business Onboarding Guide and Source Table Mapping
- Validation Checklists - All design validation checklists
## Post-Completion: Skill Usage Summary (MANDATORY)

After completing all phases of this orchestrator, output a Skill Usage Summary reflecting what you ACTUALLY did, not a pre-written summary.

### What to Include

- Every skill `SKILL.md` or `references/` file you read (via the Read tool), in the order you read them
- Which phase you were in when you read it
- Whether it was a Worker, Common, Cross-domain, or Reference file
- A one-line description of what you specifically used it for in this session

### Format

| # | Phase | Skill / Reference Read | Type | What It Was Used For |
|---|---|---|---|---|
| 1 | Phase N | `path/to/SKILL.md` | Worker / Common / Cross-domain / Reference | One-line description |

### Summary Footer

End with:

- Totals: X worker skills, Y common skills, Z reference files read across N phases
- Skipped: List any skills from the dependency table above that you did NOT need to read, and why (e.g., "phase not applicable", "user skipped", "no issues encountered")
- Unplanned: List any skills you read that were NOT listed in the dependency table (e.g., for troubleshooting, edge cases, or user-requested detours)