ecommerce-data-warehouse
E-commerce Data Warehouse
Overview
An ecommerce data warehouse centralizes data from your store, ad platforms, shipping carriers, and accounting system into a single analytics layer — enabling reports that no single platform can produce on its own. Most merchants start needing a warehouse when they outgrow their platform's built-in analytics (typically around $1M+ ARR or when managing multiple channels), want to combine ad spend with order data for true ROAS, or need SQL-level access to run custom cohort and attribution analyses.
This skill guides you through extracting data from your platform, loading it into BigQuery (recommended for cost), and transforming it with dbt so your team can build dashboards in Looker Studio, Metabase, or Tableau.
When to Use This Skill
- When your platform's built-in analytics cannot answer important business questions
- When you need to combine data from multiple sources (Shopify + Meta Ads + Google Ads + shipping)
- When you want to build custom cohort retention, LTV, and attribution models
- When analysts or data scientists need SQL access to raw order data
- When you need a single source of truth across multiple sales channels
Core Instructions
Step 1: Choose your warehouse and set up the infrastructure
For most ecommerce merchants, BigQuery is the best starting point:
- Free tier: first 10 GB storage + 1 TB queries per month free
- No infrastructure to manage — serverless
- Excellent integrations with Looker Studio (free BI tool), dbt, and every major data integration tool
Alternatives:
- Snowflake: Better for teams that need time-travel, data sharing, or advanced concurrency; more expensive at small scale
- Redshift: Good if you are already in AWS; more operational overhead than BigQuery or Snowflake
Set up BigQuery:
- Go to console.cloud.google.com and create a new Google Cloud project
- Enable the BigQuery API
- Create a dataset (e.g.,
ecommerce_raw) for raw ingested data and a second dataset (ecommerce_analytics) for transformed tables
Step 2: Extract data from your ecommerce platform
The easiest way to get platform data into BigQuery is a managed connector. Avoid building custom extractors unless you have a specific requirement.
Shopify
Option A: Fivetran (recommended, $$$)
- Sign up at fivetran.com
- Add the Shopify connector — enter your shop URL and grant API access
- Fivetran syncs orders, customers, products, inventory, and financial data incrementally to BigQuery
- Pre-built dbt package:
dbt-fivetran/shopify— provides staging and mart models out of the box
Option B: Stitch (mid-price)
- Sign up at stitchdata.com
- Add the Shopify integration — select tables to sync (orders, order_line_items, customers, products)
- Stitch loads raw data to BigQuery; you transform with dbt
Option C: Shopify's native data export (free, manual)
- Go to Analytics → Reports → [any report] and click Export — available for orders, customers, products
- For bulk data: Go to Settings → Bulk operations or use the Shopify Admin API bulk query to export all data as JSONL
- Load JSONL files to BigQuery using bq load CLI command or Google Cloud Storage
Option D: Polar Analytics (simplest, all-in-one)
- Install Polar Analytics from the Shopify App Store — it connects Shopify + all ad platforms + shipping and provides a pre-built BigQuery export with a standard schema; skip building the pipeline yourself
WooCommerce
Option A: Stitch or Airbyte
- Stitch: Does not have a native WooCommerce connector; use their MySQL connector to connect directly to your WordPress database
- Airbyte (open source, self-hosted): Has a WooCommerce connector; run on a $20/month VM or use Airbyte Cloud
Option B: Direct MySQL replication WooCommerce stores all data in MySQL. For a technical team:
- Enable binary log replication on your MySQL instance
- Use Debezium or Airbyte's MySQL CDC connector to stream changes to BigQuery
- This requires DevOps expertise but provides the most real-time data
Option C: Metorik export
- Use Metorik to pull and transform WooCommerce data into clean CSVs
- Schedule CSV exports and load to BigQuery via Cloud Functions or a simple Google Apps Script
BigCommerce
Option A: Fivetran BigCommerce connector
- BigCommerce is available as a Fivetran source; set up the same way as Shopify
Option B: BigCommerce Data Solutions
- BigCommerce offers a native Insights product (available on higher-tier plans) that provides pre-built analytics; check if this meets your needs before building a warehouse
Option C: Stitch BigCommerce connector
- Available in Stitch; syncs orders, customers, and products to BigQuery
Step 3: Install and configure dbt
dbt (data build tool) transforms raw ingested data into clean analytics-ready tables using SQL. It handles dependency management, testing, and documentation.
Install dbt:
pip install dbt-bigquery # for BigQuery
# or: pip install dbt-snowflake / dbt-redshift
Initialize a project:
dbt init ecommerce_analytics
cd ecommerce_analytics
Configure your profile in ~/.dbt/profiles.yml (points to your BigQuery project).
Step 4: Build dbt models for your ecommerce data
If using Fivetran + Shopify, install the pre-built Shopify dbt package:
# packages.yml
packages:
- package: fivetran/shopify
version: [">=0.10.0", "<0.11.0"]
Run dbt deps to install, then dbt run to build all models.
For other platforms or custom schemas, build models in this three-layer pattern:
Staging layer — clean raw data, no business logic:
-- models/staging/stg_orders.sql
with source as (select * from {{ source('shopify_raw', 'order') }}),
renamed as (
select
id::varchar as order_id,
name as order_number,
customer_id::varchar as customer_id,
email,
financial_status,
fulfillment_status,
total_price::numeric(12,2) as total_price,
subtotal_price::numeric(12,2) as subtotal_price,
total_discounts::numeric(10,2) as total_discounts,
currency,
source_name as channel,
created_at as ordered_at
from source
where _fivetran_deleted = false
)
select * from renamed
Mart layer — business-ready tables for dashboards:
-- models/marts/fct_orders.sql
with orders as (select * from {{ ref('stg_orders') }}),
customers as (select * from {{ ref('stg_customers') }}),
first_orders as (
select customer_id, min(ordered_at) as first_order_at
from orders group by 1
)
select
o.order_id,
o.order_number,
o.customer_id,
o.ordered_at,
o.total_price,
o.subtotal_price,
o.total_discounts,
o.channel,
o.ordered_at = fo.first_order_at as is_first_order,
date_trunc(o.ordered_at, month) as order_month
from orders o
left join first_orders fo on o.customer_id = fo.customer_id
where o.financial_status not in ('voided', 'pending')
Step 5: Add data quality tests
Add tests in your schema YAML files so broken pipelines are caught before they reach dashboards:
# models/marts/schema.yml
models:
- name: fct_orders
columns:
- name: order_id
tests: [unique, not_null]
- name: total_price
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
Run tests: dbt test
Step 6: Connect a BI tool to your warehouse
Once data is in BigQuery, connect a visualization layer:
- Google Looker Studio (free): Go to lookerstudio.google.com → Add data source → BigQuery → select your project and tables; build dashboards with drag-and-drop
- Metabase (open source, self-hosted): Install on a $10/month VM; connect to BigQuery; lets non-technical users browse data and build charts with a GUI
- Tableau, Looker, Power BI: Standard enterprise options; Looker is the most powerful for ecommerce analytics but has significant cost and setup overhead
Best Practices
- Start with managed connectors, not custom code — Fivetran or Stitch cost $200–$500/month but save 40+ engineering hours; the breakeven is fast unless you have a very low traffic store
- Use the dbt Shopify package if you are on Shopify + Fivetran — it provides production-quality models for 15+ commonly needed tables out of the box; no need to reinvent
- Implement Slowly Changing Dimensions (SCD Type 2) for products and customers — track historical changes so you can analyze orders with the attributes that were true at the time of purchase
- Build a date dimension table — pre-populate with a full calendar including fiscal periods, holidays, and week numbers; every fact table should join to it
- Add dbt tests before connecting dashboards — uniqueness, not-null, and referential integrity tests catch data pipeline failures before they produce wrong numbers in reports
- Separate staging, intermediate, and mart layers — staging cleans raw data, intermediate models join and enrich, marts are the final analytics-ready tables your BI tool queries
Common Pitfalls
| Problem | Solution |
|---|---|
| Revenue in warehouse does not match platform analytics | Verify currency handling (cents vs. dollars), discount application order, and tax inclusion/exclusion; build a daily reconciliation check that compares warehouse totals to platform API totals |
| Historical product prices not captured | Implement SCD Type 2 on the product dimension and join fact tables to the product row that was current at the time of the order |
| ETL fails mid-run leaving partial data | Use dbt's on_schema_change: sync_all_columns and atomic table swaps; for incremental models, configure a unique_key to handle re-runs idempotently |
| Dashboard queries are too slow | Pre-aggregate in daily summary fact tables; partition large tables by date in BigQuery; avoid running heavy SQL on load |
| Customer identity not resolved across channels | Build a customer identity resolution table that maps email, phone, and platform-specific IDs to a single canonical customer key |
Related Skills
- @attribution-modeling
- @customer-analytics
- @sales-reporting-dashboard
- @product-analytics