databricks-synthetic-data-gen
Catalog and schema are always user-supplied — never default to any value. If the user hasn't provided them, ask. For any UC write, always create the schema if it doesn't exist before writing data.
# Databricks Synthetic Data Generation

Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).

## Quick Reference
| Topic | Guide | When to Use |
|---|---|---|
| Setup & Execution | references/1-setup-and-execution.md | Setting up environment, choosing compute, installing dependencies |
| Generation Approaches | references/2-generation-approaches.md | Choosing Spark UDFs vs Polars local, writing generation code |
| Data Patterns | references/3-data-patterns.md | Creating realistic distributions, referential integrity, time patterns |
| Domain Guidance | references/4-domain-guidance.md | E-commerce, IoT, financial, support/CRM domain patterns |
| Output Formats | references/5-output-formats.md | Choosing output format, saving to volumes/tables |
| Troubleshooting | references/6-troubleshooting.md | Fixing errors, debugging issues |
| Example Script | scripts/generate_synthetic_data.py | Complete Spark + Pandas UDF example |
## Package Manager

Prefer uv for all Python operations. Fall back to pip only if uv is not available.

```bash
# Preferred
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
uv run python generate_data.py

# Fallback if uv is not available
pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
python generate_data.py
```
## Critical Rules

- Strongly prefer Spark + Faker + Pandas UDFs for data generation (scalable, parallel)
- If the user specifies local, use Polars locally instead of Spark, but suggest Spark for more than 30,000 rows
- Present a plan for user approval before generating any code
- Ask for catalog/schema - do not default
- Use serverless compute unless the user explicitly requests a classic cluster
- Generate raw data only - no pre-aggregated fields (unless the user requests them)
- Create master tables first - then generate related tables with valid FKs
- NEVER use `.cache()` or `.persist()` with serverless compute - these operations are NOT supported and will fail with `AnalysisException: PERSIST TABLE is not supported on serverless compute`. Instead, write master tables to Delta first, then read them back for FK joins.
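The master-first rule can be illustrated locally with pandas (a minimal sketch with illustrative table sizes and key formats; on Databricks the master table would be written to Delta and read back before the join): generate customers first, then sample their keys when building orders, so every FK is valid by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Master table first: customers with stable primary keys.
customers = pd.DataFrame({
    "customer_id": [f"CUST-{i:05d}" for i in range(100)],
    "tier": rng.choice(["Free", "Pro", "Enterprise"], size=100, p=[0.6, 0.3, 0.1]),
})

# Related table second: sample FKs from the master's existing keys,
# so referential integrity holds without any post-hoc filtering.
orders = pd.DataFrame({
    "order_id": [f"ORD-{i:06d}" for i in range(500)],
    "customer_id": rng.choice(customers["customer_id"], size=500),
    "amount": rng.lognormal(mean=5.0, sigma=0.7, size=500).round(2),
})

# Every order points at an existing customer.
assert set(orders["customer_id"]) <= set(customers["customer_id"])
```

The same ordering applies at Spark scale: persist the master to Delta, read it back, then join or sample its keys.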
## Generation Planning Workflow

Before generating any code, you MUST present a plan for user approval.

### ⚠️ MUST DO: Confirm Catalog Before Proceeding

You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.

Example prompt to user:

> "Which Unity Catalog should I use for this data?"

When presenting your plan, always show the selected catalog prominently:

```
📍 Output Location: catalog_name.schema_name
   Volume: /Volumes/catalog_name/schema_name/raw_data/
```

This makes it easy for the user to spot and correct if needed.
### Step 1: Gather Requirements

Ask the user about:

- Catalog/Schema - which catalog to use?
- What domain/scenario? (e-commerce, support tickets, IoT sensors, etc.)
- How many tables? What relationships between them?
- Approximate row counts per table?
- Output format preference? (Delta table is the default)
### Step 2: Present Table Specification

Show a clear specification with YOUR ASSUMPTIONS surfaced. Always start with the output location:

```
📍 Output Location: {user_catalog}.ecommerce_demo
   Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/
```

| Table | Columns | Rows | Key Assumptions |
|---|---|---|---|
| customers | customer_id, name, email, tier, region | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
| orders | order_id, customer_id (FK), amount, status | 15,000 | Enterprise customers generate 5x more orders |

Assumptions I'm making:

- Amount distribution: log-normal by tier (Enterprise ~$1800, Pro ~$245, Free ~$55)
- Status: 65% delivered, 15% shipped, 10% processing, 5% pending, 5% cancelled

Ask the user: "Does this look correct? Any adjustments to the catalog, tables, or distributions?"
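The status split in the example above can be sampled with one weighted choice (a local numpy sketch; in Spark the equivalent is chained `F.when` branches over a single `rand()` column):

```python
import numpy as np

rng = np.random.default_rng(3)
statuses = ["delivered", "shipped", "processing", "pending", "cancelled"]
weights = [0.65, 0.15, 0.10, 0.05, 0.05]

# One weighted draw per order; weights must sum to 1.
order_status = rng.choice(statuses, size=15_000, p=weights)

# Delivered should dominate at roughly 65% of orders.
delivered_share = (order_status == "delivered").mean()
```

With 15,000 draws the realized share lands within about a percentage point of the target, so the spec's percentages survive into the generated data.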
### Step 3: Ask About Data Features

- Skew (non-uniform distributions) - enabled by default
- Joins (referential integrity) - enabled by default
- Bad data injection (for data quality testing)
- Multi-language text
- Incremental mode (append vs overwrite)
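Bad data injection, if the user enables it, can be sketched as a post-processing pass (the 2% null and 1% negative rates below are illustrative assumptions, not defaults from this skill):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

df = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(1000)],
    "amount": rng.lognormal(5.0, 0.7, size=1000),
})

# Null out ~2% of emails and flip ~1% of amounts negative,
# so downstream quality checks have real defects to catch.
null_mask = rng.random(len(df)) < 0.02
bad_amount_mask = rng.random(len(df)) < 0.01
df.loc[null_mask, "email"] = None
df.loc[bad_amount_mask, "amount"] *= -1
```

Keeping injection as a separate pass means the clean generator stays reusable and the defect rate is a single tunable knob.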
## Pre-Generation Checklist

- Catalog confirmed - user explicitly approved which catalog to use
- Output location shown prominently in the plan (easy to spot/change)
- Table specification shown and approved
- Assumptions about distributions confirmed
- User confirmed compute preference (serverless recommended)
- Data features selected

Do NOT proceed to code generation until the user approves the plan, including the catalog.
## Quick Start: Spark + Faker + Pandas UDFs

```python
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np

# Setup with managed dependencies (databricks-connect 16.4+)
env = DatabricksEnv().withDependencies("faker", "pandas", "numpy")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()

# Define Pandas UDFs
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    amounts = []
    for tier in tiers:
        if tier == "Enterprise":
            amounts.append(float(np.random.lognormal(7.5, 0.8)))
        elif tier == "Pro":
            amounts.append(float(np.random.lognormal(5.5, 0.7)))
        else:
            amounts.append(float(np.random.lognormal(4.0, 0.6)))
    return pd.Series(amounts)

# Generate customers. Use a single rand column so the tier split is
# exactly 60/30/10 - chaining separate F.rand() calls would sample
# independent values and skew the distribution.
customers_df = (
    spark.range(0, 10000, numPartitions=16)
    .withColumn("r", F.rand())
    .select(
        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
        fake_name(F.col("id")).alias("name"),
        F.when(F.col("r") < 0.6, "Free")
        .when(F.col("r") < 0.9, "Pro")
        .otherwise("Enterprise").alias("tier"),
    )
    .withColumn("arr", generate_amount(F.col("tier")))
)

# Save to Unity Catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")
```
## Common Patterns

### Weighted Tier Distribution

Use a single `rand()` column - chaining separate `F.rand()` calls samples independent values per branch and skews the split away from the intended 60/30/10.

```python
df.withColumn("r", F.rand()).withColumn(
    "tier",
    F.when(F.col("r") < 0.6, "Free")
     .when(F.col("r") < 0.9, "Pro")
     .otherwise("Enterprise"),
)
```
### Log-Normal Amounts (Realistic Pricing)

```python
@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    return pd.Series([
        float(np.random.lognormal({"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0}[t], 0.7))
        for t in tiers
    ])
```
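A quick way to sanity-check the mu values: a log-normal's median is exp(mu), so mu = 7.5 gives a median near $1,808, mu = 5.5 near $245, and mu = 4.0 near $55, which is where the tier assumptions in the plan example come from. A local numpy check:

```python
import math

import numpy as np

rng = np.random.default_rng(0)
medians = {}
for tier, mu in {"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0}.items():
    samples = rng.lognormal(mean=mu, sigma=0.7, size=100_000)
    medians[tier] = float(np.median(samples))
    # Sample median should sit close to the analytic median exp(mu).
    assert abs(medians[tier] / math.exp(mu) - 1) < 0.05
```

This makes it easy to back out a mu for any target median the user gives you: mu = ln(target_median).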
### Date Range (Last 6 Months)

```python
from datetime import datetime, timedelta

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int")).alias("order_date")
```
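The uniform date range above can be given a weekday/weekend pattern (see references/3-data-patterns.md) by weighting each candidate day before sampling. A local numpy sketch, where the 3x weekday weight is an illustrative assumption:

```python
from datetime import date, timedelta

import numpy as np

rng = np.random.default_rng(1)
start = date(2024, 1, 1)
days = [start + timedelta(days=d) for d in range(180)]

# Weekdays get ~3x the weight of weekends (illustrative ratio).
weights = np.array([3.0 if d.weekday() < 5 else 1.0 for d in days])
weights /= weights.sum()  # normalize so probabilities sum to 1

# Sample day indices with the weekday-skewed probabilities.
idx = rng.choice(len(days), size=10_000, p=weights)
order_dates = [days[i] for i in idx]
weekday_orders = sum(d.weekday() < 5 for d in order_dates)
weekend_orders = 10_000 - weekday_orders
```

The same weight vector can fold in holidays or seasonality by multiplying in extra factors per day before normalizing.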
## Infrastructure Creation

```python
# Always in the script - assume the catalog exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
```
## Execution Modes

| Mode | Best For | Setup |
|---|---|---|
| DB Connect 16.4+ Serverless | Local dev, Python 3.12+ | `DatabricksEnv().withDependencies(...)` |
| Serverless Job | Production, scheduled | Job with `environments` parameter |
| Classic Cluster | Fallback only | Install libraries via the Databricks CLI: `databricks libraries install --json '{"cluster_id": "<cluster_id>", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}'` |

See references/1-setup-and-execution.md for detailed setup instructions.
## Output Formats

| Format | Use Case | Code |
|---|---|---|
| Parquet (default) | SDP pipeline input | `df.write.parquet(path)` |
| JSON | Log-style ingestion | `df.write.json(path)` |
| CSV | Legacy systems | `df.write.option("header", "true").csv(path)` |
| Delta Table | Direct analytics | `df.write.saveAsTable("catalog.schema.table")` |

See references/5-output-formats.md for detailed options.
## Best Practices Summary

### Execution

- Use serverless (instant start, no cluster wait)
- Ask for catalog/schema
- Present a plan before generating

### Data Generation

- Spark + Faker + Pandas UDFs for all cases
- Master tables first, then related tables with valid FKs
- Non-linear distributions (log-normal, Pareto, exponential)
- Time patterns (weekday/weekend, holidays, seasonality)
- Row coherence (correlated attributes)

### Output

- Create infrastructure in the script (`CREATE SCHEMA/VOLUME IF NOT EXISTS`)
- Do NOT create catalogs - assume they exist
- Delta tables as the default
## Related Skills

- databricks-unity-catalog - managing catalogs, schemas, and volumes
- databricks-asset-bundles - DABs for production deployment
## Common Issues

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: faker` | See references/1-setup-and-execution.md |
| Faker UDF is slow | Use `pandas_udf` for batch processing |
| Out of memory | Increase `numPartitions` in `spark.range()` |
| Referential integrity errors | Write the master table to Delta first, read it back for FK joins |
| `PERSIST TABLE is not supported on serverless` | NEVER use `.cache()` or `.persist()` with serverless - write to a Delta table first, then read it back |
| `F.window` vs `Window` confusion | Use `from pyspark.sql.window import Window` for `row_number()`, `rank()`, etc. `F.window` is for streaming only. |

See references/6-troubleshooting.md for the full troubleshooting guide.