faker-data-generation
Faker Data Generation Patterns
Overview
When generating synthetic data for Databricks Bronze layer tables, use Faker with configurable data corruption to test Silver layer data quality expectations.
Upstream: Synthetic Data Generation Workflow
The upstream databricks-synthetic-data-generation skill in AI-Dev-Kit introduces a file-based workflow:
File-Based Execution
- Write Python code to a local file (e.g., scripts/generate_data.py)
- Execute on Databricks using the run_python_file_on_databricks MCP tool
- If execution fails, edit the local file and re-execute
Context Reuse
The first execution auto-selects a running cluster and creates an execution context. Reuse cluster_id and context_id for follow-up calls (faster: ~1s vs ~15s).
Raw Data Only
By default, generate raw transactional data only — no total_x, sum_x, avg_x fields. SDP pipelines compute aggregations downstream.
Volume-First Storage
Save data to Volumes as parquet files, not directly to tables:
VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data"
spark.createDataFrame(df).write.mode("overwrite").parquet(f"{VOLUME_PATH}/table_name")
Dynamic Date Ranges
Generate data for the last ~6 months from today using datetime.now() - timedelta(days=180).
When to Use This Skill
Use when:
- Creating test data for data quality validation
- Testing DLT expectations with intentional violations
- Simulating production-like datasets for development/staging
- Validating referential integrity between dimensions and facts
Core Principles
- Realistic Data: Use Faker with non-linear distributions and temporal patterns
- Referential Integrity: Maintain proper FK relationships between dimensions and facts
- Configurable Corruption: Add intentional data quality issues for testing
- DQ Mapping: Each corruption type maps to specific DLT expectations
- Row Coherence: Attributes within a row must correlate logically
- Raw Data Only: Generate transactional records -- aggregation happens in Gold
- Reproducible: Always seed both np.random.seed() and Faker.seed()
- Documentation: Document corruption patterns and their DQ impacts
Critical Rules
Standard Function Signature
def generate_<entity>_data(
    dimension_keys: dict,
    num_records: int = 1000,
    corruption_rate: float = 0.05
) -> list:
    """
    Generate fake <entity> data with realistic patterns.

    Args:
        dimension_keys: Dictionary containing dimension keys for referential integrity
        num_records: Number of records to generate
        corruption_rate: Fraction of records to intentionally corrupt (0.0 to 1.0)

    Returns:
        List of <entity> dictionaries
    """
    fake = Faker()
    records = []

    print(f"\nGenerating {num_records} <entities> (corruption rate: {corruption_rate*100}%)")

    for _ in range(num_records):
        # Generate valid data first
        record_data = generate_valid_record(fake, dimension_keys)

        # Apply corruption if selected
        should_corrupt = random.random() < corruption_rate
        if should_corrupt:
            record_data = apply_corruption(record_data, corruption_rate)

        records.append(record_data)

    return records
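The generate_valid_record and apply_corruption helpers are entity-specific. Below is a minimal sketch of generate_valid_record for a hypothetical orders fact, assuming dimension_keys carries a customer_ids list; the field names and key structure are illustrative, not prescribed by this skill:

import random
import numpy as np
from faker import Faker

def generate_valid_record(fake: Faker, dimension_keys: dict) -> dict:
    """Illustrative valid-record builder for a hypothetical orders fact."""
    return {
        # Foreign key sampled from existing dimension keys (referential integrity)
        "customer_id": random.choice(dimension_keys["customer_ids"]),
        "order_id": f"ORD-{fake.unique.random_number(digits=6):06d}",
        # Log-normal amount, not uniform
        "amount": round(float(np.random.lognormal(mean=4.5, sigma=0.8)), 2),
        "order_date": fake.date_time_between(start_date="-180d", end_date="now"),
        "status": str(np.random.choice(["completed", "pending", "cancelled"], p=[0.80, 0.15, 0.05])),
    }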
🔴 MANDATORY: Seed for Reproducibility
EVERY generation script MUST seed numpy, Faker, and the stdlib random module (corruption selection uses random.random()):

import random

import numpy as np
from faker import Faker

SEED = 42
random.seed(SEED)     # stdlib RNG used for corruption selection
np.random.seed(SEED)  # numpy distributions (lognormal, exponential, choice)
Faker.seed(SEED)      # Faker providers
fake = Faker()
Why: Without seeding, re-running generation produces different data, making debugging impossible and breaking snapshot tests.
🔴 MANDATORY: Non-Linear Distributions
NEVER use random.uniform() for values. Real data is never uniformly distributed:
# ❌ WRONG - Uniform (unrealistic)
prices = [random.uniform(10, 1000) for _ in range(N)]
# ✅ CORRECT - Log-normal for monetary values (prices, salaries, amounts)
prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N)
# ✅ CORRECT - Exponential for durations (resolution time, session length)
durations = np.random.exponential(scale=24, size=N)
# ✅ CORRECT - Weighted categorical (not equal probability)
regions = np.random.choice(
['North', 'South', 'East', 'West'],
size=N, p=[0.40, 0.25, 0.20, 0.15]
)
🔴 MANDATORY: Dynamic Date Range (Last 6 Months)
from datetime import datetime, timedelta
END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
START_DATE = END_DATE - timedelta(days=180)
Why: Ensures data feels current for demos and dashboards, with enough history for trend analysis.
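To place individual records inside this window, Faker can sample timestamps between the computed bounds directly. A minimal usage sketch, reusing the seeded fake instance from the seeding rule above:

order_date = fake.date_time_between(start_date=START_DATE, end_date=END_DATE)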
🔴 MANDATORY: Row Coherence
Attributes within a row MUST correlate logically:
# ✅ CORRECT - tier drives amount, priority, and behavior
if tier == 'Enterprise':
    amount = np.random.lognormal(7, 0.8)  # Higher amounts
    priority = np.random.choice(['Critical', 'High', 'Medium'], p=[0.3, 0.5, 0.2])
else:
    amount = np.random.lognormal(3.5, 0.6)  # Lower amounts
    priority = np.random.choice(['High', 'Medium', 'Low'], p=[0.2, 0.5, 0.3])

# ❌ WRONG - independent random values (no correlation)
amount = random.uniform(10, 10000)  # Amount unrelated to tier
priority = random.choice(['Critical', 'High', 'Medium', 'Low'])  # Random priority
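The same idea extends to chained correlations (the validation checklist later calls out priority → resolution_time → CSAT). A sketch for a hypothetical support-ticket row, where priority comes from the tier-driven choice above and the scale values are illustrative:

# Higher priority -> tighter resolution-time scale; longer resolutions -> lower CSAT
scale_hours = {'Critical': 4, 'High': 12, 'Medium': 36, 'Low': 72}[priority]
resolution_hours = float(np.random.exponential(scale=scale_hours))
csat = round(float(np.clip(np.random.normal(4.6 - 0.02 * resolution_hours, 0.4), 1.0, 5.0)), 1)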
🔴 MANDATORY: Raw Data Only (No Pre-Aggregated Fields)
Generate one row per event/transaction. NEVER add aggregated columns:
# ❌ WRONG - pre-aggregated fields (aggregation belongs in Gold layer)
{"customer_id": cid, "total_orders": 47, "total_revenue": 12500.00, "avg_order_value": 265.95}
# ✅ CORRECT - one row per transaction
{"order_id": "ORD-000001", "customer_id": cid, "amount": 150.00, "order_date": "2025-10-15"}
Why: The Medallion pipeline (Silver DLT → Gold MERGE) computes aggregations downstream.
🔴 MANDATORY: Weighted Sampling for Facts
Dimension characteristics MUST drive fact generation volume and behavior:
# Build weighted lookup from dimensions
tier_weights = customers_pdf["tier"].map({'Enterprise': 5.0, 'Pro': 2.0, 'Free': 1.0})
customer_weights = (tier_weights / tier_weights.sum()).tolist()
customer_ids = customers_pdf["customer_id"].tolist()
# Enterprise customers generate 5x more events than Free
cid = np.random.choice(customer_ids, p=customer_weights)
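When generating a large fact table, the same weights can be applied in a single vectorized draw instead of one np.random.choice call per record. A sketch, with N_ORDERS as the assumed target fact count:

# One draw for all fact rows; Enterprise customer_ids appear ~5x as often as Free
assigned_customer_ids = np.random.choice(customer_ids, size=N_ORDERS, p=customer_weights)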
Corruption Pattern Structure
# Determine if this record should be corrupted for DQ testing
should_corrupt = random.random() < corruption_rate

if should_corrupt:
    # Apply various DQ violations to test expectations
    corruption_type = random.choice([
        'corruption_type_1',
        'corruption_type_2',
        'corruption_type_3',
    ])

    if corruption_type == 'corruption_type_1':
        # Will fail: <expectation_name>
        field = invalid_value  # Description of violation
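A concrete instance of this pattern for a hypothetical orders record, with each branch commented as described in the next section; the expectation names are illustrative, not fixed by this skill:

if should_corrupt:
    corruption_type = random.choice([
        'null_customer_id',
        'negative_amount',
        'future_order_date',
    ])

    if corruption_type == 'null_customer_id':
        # Will fail: valid_customer_id (customer_id IS NOT NULL)
        record_data['customer_id'] = None
    elif corruption_type == 'negative_amount':
        # Will fail: positive_amount (amount > 0)
        record_data['amount'] = -abs(record_data['amount'])
    elif corruption_type == 'future_order_date':
        # Will fail: no_future_dates (order_date <= current date)
        record_data['order_date'] = datetime.now() + timedelta(days=30)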
Comments Must Include
- Corruption type name: Descriptive identifier
- DQ expectation failed: Which expectation(s) this triggers
- Violation description: What makes the data invalid
Parameter Handling
Function Parameters
import sys

def get_parameters():
    """Get parameters from notebook widgets or command line."""
    try:
        # Try Databricks widgets first (notebook mode)
        catalog = dbutils.widgets.get("catalog")
        schema = dbutils.widgets.get("schema")
        num_records = int(dbutils.widgets.get("num_records"))
        corruption_rate = float(dbutils.widgets.get("corruption_rate"))
    except Exception:
        # Fall back to command-line arguments or defaults
        catalog = "default_catalog"
        schema = "default_schema"
        num_records = 1000
        corruption_rate = 0.05  # 5% corruption by default
        for arg in sys.argv[1:]:
            if arg.startswith("--catalog="):
                catalog = arg.split("=")[1]
            elif arg.startswith("--schema="):
                schema = arg.split("=")[1]
            elif arg.startswith("--num_records="):
                num_records = int(arg.split("=")[1])
            elif arg.startswith("--corruption_rate="):
                corruption_rate = float(arg.split("=")[1])
    return catalog, schema, num_records, corruption_rate
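For notebook mode, the widgets read above must already exist; they can be declared once near the top of the notebook. A sketch with assumed defaults:

# Declare widgets with defaults so dbutils.widgets.get() succeeds when run as a notebook
dbutils.widgets.text("catalog", "default_catalog")
dbutils.widgets.text("schema", "default_schema")
dbutils.widgets.text("num_records", "1000")
dbutils.widgets.text("corruption_rate", "0.05")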
Job Configuration (YAML)
tasks:
  - task_key: generate_data
    environment_key: default
    notebook_task:
      notebook_path: ../src/layer/generate_data.py
      base_parameters:
        catalog: ${var.catalog}
        schema: ${var.schema}
        num_records: "1000"
        corruption_rate: "0.05"  # 5% corruption for DQ testing
Quick Patterns
Corruption Type Categories
- Missing Required Fields - Null or empty required fields
- Invalid Format/Length - Wrong format or below minimum length
- Out of Range Values - Excessive or negative values
- Business Logic Violations - Field relationships that violate rules
- Temporal Issues - Dates too old or in the future
- Referential Integrity Issues - Missing or invalid foreign keys
Dimension vs Fact Patterns
- Dimensions: referenced by facts, so generate them first. Use locale-specific Faker providers for realistic attribute values.
- Facts: generated after dimensions. Load the dimension keys and sample only from them so every foreign key resolves, as sketched below.
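A sketch of that ordering, assuming generate_customer_data and generate_order_data follow the standard signature above (the names and volumes are illustrative):

# 1. Dimensions first - no foreign keys to satisfy
customers = generate_customer_data(dimension_keys={}, num_records=2500, corruption_rate=0.02)

# 2. Collect the keys that facts are allowed to reference
dimension_keys = {"customer_ids": [c["customer_id"] for c in customers]}

# 3. Facts second - sampled only from existing dimension keys
orders = generate_order_data(dimension_keys=dimension_keys, num_records=25000, corruption_rate=0.05)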
Data Volume Guidance
Generate enough records so patterns survive downstream aggregation (daily/weekly/regional GROUP BY):
| Grain | Minimum Records | Rationale |
|---|---|---|
| Daily time series | 50-100/day | Trends visible after weekly rollup |
| Per category | 500+ per category | Statistical significance in charts |
| Per customer | 5-20 events/customer | Customer-level analysis works |
| Total rows | 10K-50K minimum | Patterns survive GROUP BY |
# Example: 180 days of data
N_CUSTOMERS = 2500 # Dimension
N_ORDERS = 25000 # ~10 orders/customer, ~139/day
N_TICKETS = 8000 # ~44/day, enough for weekly trends
Common Mistakes to Avoid
❌ DON'T: Use uniform distributions
# BAD - everything equally likely (unrealistic)
prices = [random.uniform(10, 1000) for _ in range(N)]
regions = [random.choice(['N', 'S', 'E', 'W']) for _ in range(N)]
✅ DO: Use realistic distributions
# GOOD - log-normal for values, weighted for categories
prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N)
regions = np.random.choice(['N', 'S', 'E', 'W'], size=N, p=[0.4, 0.25, 0.2, 0.15])
❌ DON'T: Generate flat temporal data
# BAD - ignores weekends, holidays, seasonality
dates = [fake.date_between(start_date='-180d', end_date='today') for _ in range(N)]
✅ DO: Add temporal patterns
# GOOD - weekday/weekend/holiday/spike effects
def get_daily_multiplier(date, us_holidays):
    mult = 1.0
    if date.weekday() >= 5: mult *= 0.6   # Weekend drop
    if date in us_holidays: mult *= 0.3   # Holiday drop
    mult *= 1 + 0.15 * (date.month - 6) / 6  # Q4 seasonality ramp
    return max(0.1, mult * np.random.normal(1, 0.1))
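One way to apply the multiplier is to scale how many records land on each day, using the holidays package already listed under Required Libraries. A sketch; BASE_PER_DAY is an assumed constant, and START_DATE/END_DATE come from the dynamic date range rule above:

import holidays
from datetime import timedelta

us_holidays = holidays.US()
BASE_PER_DAY = 140  # assumed average records per day

daily_counts = {}
day = START_DATE
while day <= END_DATE:
    daily_counts[day.date()] = int(BASE_PER_DAY * get_daily_multiplier(day, us_holidays))
    day += timedelta(days=1)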
❌ DON'T: Add pre-aggregated fields
# BAD - aggregation belongs in Gold layer
{"customer_id": cid, "total_orders": 47, "avg_csat": 4.2}
✅ DO: Generate raw transactional records
# GOOD - one row per event
{"order_id": "ORD-001", "customer_id": cid, "amount": 150.00}
❌ DON'T: Apply corruption before generating valid data
# BAD - hard to maintain
if should_corrupt:
    field = generate_invalid_field()
else:
    field = generate_valid_field()
✅ DO: Generate valid data first, then corrupt
# GOOD - clean separation
field = generate_valid_field()
if should_corrupt:
    field = corrupt_field(field)  # Modify valid data
❌ DON'T: Hardcode corruption without comments
# BAD - no DQ mapping
if corruption_type == 'bad_data':
    field = None
✅ DO: Document which expectation fails
# GOOD - clear DQ mapping
if corruption_type == 'null_required_field':
    # Will fail: valid_field_name
    field = None
❌ DON'T: Use magic numbers
# BAD - unclear threshold
if random.random() < 0.05:
    ...  # What is 0.05?
✅ DO: Use named parameter
# GOOD - explicit parameter
should_corrupt = random.random() < corruption_rate
Testing Scenarios
Development: High Corruption
corruption_rate: "0.10" # 10% for thorough testing
Staging: Realistic Corruption
corruption_rate: "0.05" # 5% production-like
Production: No Synthetic Corruption
corruption_rate: "0.0" # Real data only
Validation Checklist
Realism (CRITICAL)
- np.random.seed(SEED) AND Faker.seed(SEED) called at script top
- Monetary values use log-normal distribution (NOT uniform)
- Duration values use exponential distribution (NOT uniform)
- Categorical values use weighted probabilities (NOT equal)
- Row coherence: tier→amount, priority→resolution_time→CSAT correlations exist
- Time patterns: weekday/weekend/holiday/seasonality multipliers applied
- Dynamic date range: last 6 months from datetime.now()
- No pre-aggregated fields (total_x, sum_x, avg_x)
- Data volume: 10K-50K rows minimum, 50-100/day for time series
Corruption
- corruption_rate parameter added with default 0.05 (5%)
- Each corruption type has a comment: # Will fail: <expectation_name>
- Corruption types map 1:1 to DLT expectations
- Valid data generated FIRST, then corrupted
Structure
- Parameter handling uses dbutils.widgets.get() (NOT argparse)
- Job YAML includes a corruption_rate parameter
- Dimensions generated BEFORE facts
- Weighted sampling: dimension characteristics drive fact volume
- Referential integrity maintained (facts reference valid dimension keys)
- Validation prints at end (distribution checks, corruption stats)
Required Libraries
# ✅ ALWAYS include these in environment dependencies
dependencies:
- "Faker==22.0.0"
- "holidays>=0.40"
- "numpy>=1.24.0"
- "pandas>=2.0.0"
Use pandas for generation (faster for row-by-row logic), then convert to a Spark DataFrame when saving to Delta tables.
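A sketch of that hand-off, reusing the Volume-First Storage path from the overview; generate_order_data is the illustrative generator from the standard signature, and dimension_keys is assumed to have been built from the customer dimension:

import pandas as pd

# Row-by-row generation stays in plain Python/pandas ...
orders_pdf = pd.DataFrame(generate_order_data(dimension_keys, num_records=25000))

# ... then one conversion to Spark for the Volume write
spark.createDataFrame(orders_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/orders")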
Reference Files
- Faker Providers - Detailed provider examples, corruption patterns, non-linear distribution patterns, time-based pattern functions, row coherence patterns, data volume guidance, and complete implementation examples. Includes locale-specific providers, business-specific providers, and domain-specific constants.
- Generate Data Script - Data generation utility with standard function signatures, numpy-based distributions, weighted sampling, temporal patterns, seeding, and parameter handling. Includes generate_dimension_data(), generate_fact_data(), apply_corruption(), get_daily_multiplier(), and get_parameters() functions.