# Build a Data Pipeline with MotherDuck (`motherduck-build-data-pipeline`)
Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates `motherduck-connect`, `motherduck-load-data`, `motherduck-model-data`, `motherduck-query`, `motherduck-share-data`, and `motherduck-ducklake`.
## Start Here: Is a MotherDuck Server Active?
Always determine this first.
- If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
- If the user already knows the destination database, confirm it before designing stages.
- Explore the live environment:
  - current databases and schemas
  - raw, staging, and analytics boundaries if they already exist
  - source tables, target tables, and table grain
  - key columns, date fields, and join keys
Use that discovery to decide whether the pipeline is:
- landing into an empty workspace
- extending an existing warehouse layout
- publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.
## Use This Skill When
- The user needs ingestion plus transformation plus serving output.
- The work spans raw landing, curation, and publication.
- The user needs a stage-by-stage pipeline pattern rather than one command.
- The problem is bigger than a single import step or one ad hoc transformation.
## Pipeline Defaults
- batch over streaming
- raw landing before curation
- explicit raw -> staging -> analytics boundaries
- bulk ingest paths over row-by-row writes
- idempotent stage rebuilds or append contracts before scheduled automation
- verify the MotherDuck-supported DuckDB client version before recommending upstream-only write, checkpoint, or lakehouse features
- native MotherDuck storage unless DuckLake is explicitly required
## Workflow

1. Confirm whether live MotherDuck discovery is available.
2. Inspect the current workspace and target data model.
3. Define raw, staging, and analytics boundaries.
4. Ingest raw data.
5. Deduplicate, type, and promote into staging.
6. Materialize analytics-ready outputs.
7. Validate counts, freshness, uniqueness, and business metrics before publishing downstream assets.
When this skill produces a native DuckDB (`md:`) connection, watermark it with `custom_user_agent=agent-skills/2.2.0(harness-<harness>;llm-<llm>)`. If metadata is missing, fall back to `harness-unknown` and `llm-unknown`.
## Output
The output of this skill should be:
- the stage layout
- the ingestion method
- the transformation sequence
- the serving tables or views
- the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it. This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
Use this exact top-level shape when JSON is requested:

```json
{
  "summary": {},
  "assumptions": [],
  "implementation_plan": [],
  "validation_plan": [],
  "risks": []
}
```
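A caller assembling that shape might do something like the following; the field contents here are purely illustrative, only the top-level keys are fixed by the contract.

```python
import json

# Only the five top-level keys are contractual; the values below are
# invented examples of what a pipeline run might report.
result = {
    "summary": {"pipeline": "raw -> staging -> analytics"},
    "assumptions": ["batch loads, not streaming"],
    "implementation_plan": ["land raw", "dedupe into staging", "publish marts"],
    "validation_plan": ["row counts", "key uniqueness", "freshness"],
    "risks": ["source schema drift"],
}

# Raw JSON only: no Markdown fences, no surrounding prose.
payload = json.dumps(result)
print(payload)
```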
## References
- `references/dlt-dbt-motherduck-project/` -- fully runnable MotherDuck reference project using `dlt`, `dbt-duckdb`, and validation queries
- `references/PIPELINE_IMPLEMENTATION_GUIDE.md` -- preserved detailed pipeline guidance that used to live in this skill
- `../motherduck-load-data/references/INGESTION_PATTERNS.md` -- lower-level ingestion patterns
## Runnable Artifact
- `artifacts/pipeline_stage_example.py` -- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
- `artifacts/pipeline_stage_example.ts` -- TypeScript companion artifact with the same stage layout and output contract
- `references/dlt-dbt-motherduck-project/` -- end-to-end MotherDuck example that bootstraps the target database, lands raw data with `dlt`, builds staging and analytics models with `dbt`, and validates the final mart
Run it with:

```shell
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Run the same stage pattern against temporary MotherDuck databases:

```shell
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
```

Validate the TypeScript companion artifact:

```shell
uv run scripts/test_typescript_artifacts.py
```

For the full MotherDuck project:

```shell
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
```
## Verified Notes
- Bootstrap the target MotherDuck database before running `dlt`. The `motherduck` destination does not create the database for you.
- Keep this stack on Python 3.11 or 3.12 for now. The tested `dbt-duckdb` path here was not reliable on Python 3.14.
- If you want exact schema names like `raw`, `staging`, and `analytics` in dbt, override `generate_schema_name`.
- When a long-lived Python process loads data and a separate `dbt` subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
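The `generate_schema_name` override mentioned above is a standard dbt macro pattern. A typical sketch, placed in `macros/generate_schema_name.sql` (the file path and formatting are conventional, not mandated by this skill):

```sql
-- macros/generate_schema_name.sql
-- Use the custom schema name verbatim (raw/staging/analytics) instead of
-- dbt's default "<target_schema>_<custom_schema>" concatenation.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```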
## Related Skills
- `motherduck-connect` -- choose the right connection path
- `motherduck-load-data` -- ingestion mechanics
- `motherduck-model-data` -- shape the analytics layer
- `motherduck-query` -- write transformations and validations
- `motherduck-share-data` -- publish curated outputs
- `motherduck-ducklake` -- only when open-table-format storage is a real requirement