01-grain-definition

SKILL.md

Fact Table Grain Definition Patterns

Overview

Fact tables have a grain — the level of detail at which measurements are stored. Misunderstanding grain (transaction-level vs aggregated) causes 5% of Gold layer bugs but has high impact (complete table rewrite required). This skill provides patterns to define and document grain during the design phase so that implementation is unambiguous.

Key Principle: The PRIMARY KEY you choose in the YAML schema reveals the grain. Get this right during design and the merge script writes itself.

Companion skill: For runtime grain validation before MERGE, see pipeline-workers/04-grain-validation/SKILL.md.

When to Use This Skill

  • Designing fact table YAML schemas (choosing grain type)
  • Deciding between transaction, aggregated, or snapshot patterns
  • Documenting grain in YAML table_properties and column comments
  • Applying the PK-grain decision tree to determine aggregation needs
  • Reviewing fact table designs for grain ambiguity

Understanding Fact Table Grain

Grain Definition

Grain: The combination of dimensions that uniquely identifies one measurement row.

Grain Examples

Grain Description Primary Key Row Represents
Transaction One row per event transaction_id Individual sale
Daily Summary One row per day-store-product date_key, store_key, product_key Daily totals
Hourly Aggregate One row per hour-cluster date_key, hour_of_day, cluster_key Hourly metrics
Snapshot One row per entity-date entity_key, snapshot_date_key Daily snapshot

Grain Type Decision Tree

PRIMARY KEY column count?
├─ 1 column
│  ├─ Ends with _id, _uid? → Transaction Grain
│  ├─ Is date_key? → Daily Snapshot Grain
│  └─ Otherwise → Unknown (manual review)
└─ Multiple columns (Composite PK)
   ├─ Contains date_key + dimension keys? → Aggregated Grain
   ├─ Contains entity_key + date_key? → Snapshot Grain
   └─ Otherwise → Composite (manual review)

Grain Type Descriptions

Transaction-Level Fact

When to Use:

  • One row per business event (sale, click, API call)
  • Primary Key: Single surrogate key (transaction_id, event_id, request_id)
  • Measures: Individual event metrics (amount, duration, count=1)

Key Characteristics:

  • ✅ No .groupBy() or .agg() in merge script
  • ✅ Single surrogate key as PRIMARY KEY
  • ✅ Measures are direct pass-through (no SUM, AVG, COUNT)
  • ✅ One source row → one target row

Aggregated Fact

When to Use:

  • Pre-aggregated summaries (daily sales, hourly usage)
  • Primary Key: Composite key of dimensions defining grain (date_key, store_key, product_key)
  • Measures: Aggregated metrics (total_revenue, avg_latency, request_count)

Key Characteristics:

  • ✅ Uses .groupBy() on grain dimensions
  • ✅ Uses .agg() with SUM, COUNT, AVG, MAX
  • ✅ Composite PRIMARY KEY matches .groupBy() columns
  • ✅ Multiple source rows → one target row per grain

Snapshot Fact

When to Use:

  • Point-in-time state (daily inventory levels, account balances)
  • Primary Key: Entity + date (entity_key, snapshot_date_key)
  • Measures: Current values at snapshot time (on_hand_quantity, balance_amount)

Key Characteristics:

  • ✅ No aggregation (snapshots already at correct grain)
  • ✅ May need deduplication (latest snapshot wins)
  • ✅ Composite PK: entity keys + snapshot date
  • ✅ Measures are point-in-time values (not SUMs)

YAML Grain Documentation

Common Mistake: Ambiguous YAML Grain

# ❌ BAD: Grain not explicitly documented
table_name: fact_model_serving_inference
primary_key:
  columns:
    - date_key
    - endpoint_key
    - model_key
# What grain is this? Daily aggregate? Hourly? Per request?

Correct: Document Grain Explicitly

# ✅ GOOD: Grain explicitly documented
table_name: fact_model_serving_inference
grain: "Daily aggregate per endpoint-model combination"
grain_type: aggregated
primary_key:
  columns:
    - date_key
    - endpoint_key
    - model_key
measures:
  - name: request_count
    aggregation: SUM
  - name: avg_latency_ms
    aggregation: AVG

Required YAML Fields for Grain Clarity

Every fact table YAML schema MUST include:

Field Location Example
grain table_properties.grain "daily_store_product"
grain_type table_properties.grain_type or top-level "aggregated" / "transaction" / "snapshot"
Composite PK primary_key.columns ["date_key", "store_key", "product_key"]
Measure aggregation measures[].aggregation (optional) SUM, AVG, COUNT

Grain Documentation Template

Use this docstring pattern in fact table merge scripts (enforced during implementation):

def merge_fact_[table_name](spark, catalog, silver_schema, gold_schema):
    """
    Merge fact_[table_name] from Silver to Gold.
    
    GRAIN: [Describe grain in plain English]
    - Example: "One row per date-store-product combination (daily aggregate)"
    - Example: "One row per individual query execution (transaction level)"
    
    PRIMARY KEY: [List PK columns]
    - Example: (date_key, store_key, product_key)
    - Example: (query_key)
    
    GRAIN TYPE: [transaction | aggregated | snapshot]
    
    AGGREGATION: [Required | Not Required]
    - If Required: GroupBy on [dimensions], Aggregate [measures]
    """

Validation Checklist (Design Phase)

Before finalizing any fact table YAML schema:

  • Read PRIMARY KEY columns and infer grain type using decision tree
  • Document grain type explicitly in YAML (grain, grain_type)
  • Document grain in plain English in YAML description
  • Determine if aggregation will be required during implementation

Common Mistakes to Avoid

Mistake Impact Prevention
No grain in YAML Implementation guesses wrong grain → table rewrite Always document grain in YAML
Composite PK without grain_type Ambiguous: daily aggregate or hourly? Add grain_type: aggregated
Single PK that looks composite Confusion during merge script writing Document clearly: "transaction-level"

Reference Files

  • Grain Definition Patterns — Detailed pattern descriptions for transaction, aggregated, and snapshot grains with DDL examples

Related Skills

  • Grain Validation (Implementation): pipeline-workers/04-grain-validation/SKILL.md — Runtime validation of grain before MERGE operations
  • Fact Table Merge Patterns: pipeline-workers/02-merge-patterns/SKILL.md — SCD Type 1/2 and aggregation merge patterns

References

Inputs

  • From orchestrator Phase 0/1: Source schema inventory with table classifications (dimension/fact/bridge), inferred FK relationships, and domain groupings

Outputs

  • Grain type decision for each fact table (transaction / aggregated / snapshot)
  • Documented grain and grain_type fields in each fact table YAML schema
  • PK structure aligned to grain type

Design Notes to Carry Forward

After completing this skill, note:

  • Grain type for each fact table (transaction/aggregated/snapshot)
  • Which fact tables require .groupBy().agg() during implementation (aggregated grain)
  • Which fact tables have snapshot semantics (need dedup-latest logic)

Next Step

Proceed to design-workers/02-dimension-patterns/SKILL.md to apply dimension design patterns (role-playing, junk, degenerate, hierarchy flattening).


Pattern Origin: Bug #84 (wrong fact table grain), 2% of Gold bugs but high impact Key Lesson: DDL PRIMARY KEY reveals grain. Composite PK = aggregated, single PK = transaction. Impact: Prevents costly table rewrites by catching grain mismatches before implementation

Weekly Installs
1
GitHub Stars
2
First Seen
11 days ago
Installed on
amp1
cline1
opencode1
cursor1
kimi-cli1
codex1