wren-dlt-connector

Originally from canner/wren-engine

Connect SaaS data to Wren Engine for SQL analysis — from zero to a verified, queryable project in one conversation.

Who this is for

Data analysts who know SQL and some Python, but may not have used dlt or Wren before. Explain concepts briefly when they first appear, but don't over-explain things a SQL-literate person would already know.

Overview

This skill walks through a four-phase workflow:

  1. Extract — Use dlt (data load tool) to pull data from a SaaS API into a local DuckDB file
  2. Model — Introspect the DuckDB schema and auto-generate a Wren semantic project (YAML models, relationships, profile)
  3. Build & Verify — Build the project and run actual SQL queries to confirm everything works end-to-end
  4. Handoff — Show the user their data and next steps

The user might enter at any phase. Ask which phase they're starting from — they may already have a .duckdb file and just need phases 2–4.

The goal is a project that actually queries successfully, not just files that look correct. Always run the verification step before declaring success.

Critical: DuckDB catalog naming

When Wren Engine connects to a DuckDB file, it ATTACHes it using the filename (without the .duckdb extension) as the catalog alias:

ATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)

This means every model's table_reference.catalog must equal the DuckDB filename stem. If the file is hubspot.duckdb, the catalog is hubspot. If it's my_pipeline.duckdb, the catalog is my_pipeline.

Getting this wrong causes "table not found" errors at query time. The introspect_dlt.py script handles this automatically.
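
The rule can be captured in a tiny helper — a sketch, assuming only that the catalog must equal the filename stem (the function name is illustrative, not part of the script):

```python
from pathlib import Path

def expected_catalog(duckdb_path: str) -> str:
    # Wren Engine ATTACHes the file under its filename stem, so every
    # model's table_reference.catalog must equal this value.
    return Path(duckdb_path).stem

print(expected_catalog("stripe_data.duckdb"))       # -> stripe_data
print(expected_catalog("/tmp/my_pipeline.duckdb"))  # -> my_pipeline
```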

Critical: Type normalization

Column types must be normalized using wren SDK's type_mapping.parse_type() function, which uses sqlglot to convert database-specific types (like DuckDB's HUGEINT, TIMESTAMP WITH TIME ZONE) into canonical SQL types that wren-core understands. Do not hardcode type mappings — always delegate to parse_type(raw_type, "duckdb").

The introspect_dlt.py script does this automatically when wren SDK is installed.
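
If you need to do this outside the script, a minimal sketch looks like the following — the module path and signature are the ones this skill uses, and the pass-through fallback is for illustration only:

```python
def normalize_type(raw_type: str) -> str:
    # Delegate to the wren SDK when it is installed (module path and
    # signature as used by introspect_dlt.py).
    try:
        from wren.type_mapping import parse_type
    except ImportError:
        # Illustration-only fallback: pass the raw type through rather
        # than hardcoding a mapping table.
        return raw_type
    return parse_type(raw_type, "duckdb")

print(normalize_type("TIMESTAMP WITH TIME ZONE"))
```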

Phase 1: Extract — dlt Pipeline Setup

Step 1: Pick the SaaS source

Ask the user which SaaS service they want to connect. Read references/dlt_sources.md for a list of popular verified sources and their auth requirements. If the source isn't listed, check whether dlt has a verified source for it by searching dlthub.com/docs/dlt-ecosystem/verified-sources.

Step 2: Install dlt

pip install "dlt[duckdb]" --break-system-packages

Step 3: Write the pipeline script

Create a Python script that:

  1. Imports the dlt source function for the chosen SaaS
  2. Configures the pipeline with destination='duckdb' and a local file path
  3. Runs the pipeline with pipeline.run(source)

Here's the general pattern — adapt it per source (check references/dlt_sources.md for source-specific templates):

import dlt

pipeline = dlt.pipeline(
    pipeline_name="<source>_pipeline",
    destination="duckdb",
    dataset_name="<source>_data",
)

# Source-specific: check references/dlt_sources.md for auth patterns
source = <source_function>(api_key=dlt.secrets.value)

info = pipeline.run(source)
print(info)

Step 4: Set up credentials

dlt reads credentials from environment variables or .dlt/secrets.toml. The simplest approach for a one-time run:

# Set the credential as an environment variable
# The exact variable name depends on the source — check references/dlt_sources.md
export SOURCES__<SOURCE>__API_KEY="the-actual-key"

Ask the user for their API key or token. Remind them:

  • Never commit credentials to git
  • Environment variables are the simplest way for a one-time run
  • For repeated use, they can create .dlt/secrets.toml
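
For the repeated-use case, a secrets.toml entry mirrors the environment variable path: SOURCES__STRIPE__API_KEY becomes [sources.stripe] api_key. The source name and key name below are a sketch — the exact key name varies by source (check references/dlt_sources.md):

```toml
# .dlt/secrets.toml — hypothetical entry for a source named "stripe"
[sources.stripe]
api_key = "the-actual-key"
```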

Step 5: Run the pipeline

python <pipeline_script>.py

After the run, confirm:

  1. The pipeline completed without errors
  2. A .duckdb file was created (usually at <pipeline_name>.duckdb)
  3. Print discovered tables and their column counts:

import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
    SELECT table_schema, table_name,
           (SELECT COUNT(*) FROM information_schema.columns c
            WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_count
    FROM information_schema.tables t
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
      -- '_' is a LIKE wildcard, so test the literal prefix instead
      AND NOT starts_with(table_name, '_dlt_')
    ORDER BY table_schema, table_name
""").fetchall():
    print(f"  {row[0]}.{row[1]} ({row[2]} columns)")
con.close()

Phase 2: Model — Generate Wren Project

Run the introspection script to auto-generate a complete Wren project from the DuckDB file:

python <path-to-this-skill>/scripts/introspect_dlt.py \
    --duckdb-path <path-to-duckdb-file> \
    --output-dir <project-directory> \
    --project-name <name>

This script:

  • Connects to the DuckDB file (read-only)
  • Sets table_reference.catalog to the DuckDB filename stem (matching Wren Engine's ATTACH behavior)
  • Discovers all tables and columns via information_schema
  • Filters out dlt internal tables (_dlt_loads, _dlt_pipeline_state, etc.)
  • Filters out dlt metadata columns (_dlt_id, _dlt_load_id, _dlt_list_idx) from model definitions
  • Detects parent-child relationships from _dlt_parent_id columns and table naming conventions
  • Normalizes column types using wren.type_mapping.parse_type() (sqlglot-based)
  • Generates a complete v2 YAML project (wren_project.yml, models/, relationships.yml, instructions.md)

After running, show the user what was generated:

# Show project summary
cat <project-directory>/wren_project.yml
echo "---"
ls <project-directory>/models/
echo "---"
cat <project-directory>/relationships.yml

Verify model correctness

Spot-check one generated model to confirm:

  1. table_reference.catalog matches the DuckDB filename (e.g., stripe_data for stripe_data.duckdb)
  2. table_reference.schema matches the DuckDB schema (usually main)
  3. No _dlt_* columns appear in the columns list
  4. Column types look reasonable (VARCHAR, BIGINT, BOOLEAN, TIMESTAMP, etc.)
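
The four checks above can be expressed as assertions against a parsed model. This sketch uses a hand-written dict standing in for a loaded YAML model (the model name and columns are illustrative):

```python
from pathlib import Path

# Stand-in for one generated model, parsed from YAML into a dict.
model = {
    "name": "charges",
    "table_reference": {"catalog": "stripe_data", "schema": "main", "table": "charges"},
    "columns": [
        {"name": "id", "type": "VARCHAR"},
        {"name": "amount", "type": "BIGINT"},
    ],
}

duckdb_file = "stripe_data.duckdb"
assert model["table_reference"]["catalog"] == Path(duckdb_file).stem
assert model["table_reference"]["schema"] == "main"
assert not any(c["name"].startswith("_dlt_") for c in model["columns"])
assert all(c["type"].isupper() for c in model["columns"])  # canonical SQL types
print("model spot-check passed")
```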

Set up the connection profile

Create a Wren profile so the user can query without specifying connection details every time. The url must point to the directory containing the .duckdb file (not the file itself):

import yaml
from pathlib import Path

wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"

existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})

profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
    "datasource": "duckdb",
    "url": str(Path("<duckdb-path>").resolve().parent),
    "format": "duckdb",
}
existing["active"] = profile_name

profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))

Phase 3: Build & Verify — The Project Must Actually Work

This phase is not optional. A project that generates YAML but fails at query time is not a success.

Step 1: Build the MDL

cd <project-directory>
wren context build

This compiles the YAML models into target/mdl.json. If this fails, fix the issues before proceeding (see Troubleshooting below).

Step 2: Validate with a real query

Run at least one query per generated model to confirm the project is functional:

# For each model, verify it resolves correctly
wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'

If any query fails, debug and fix the model before moving on. Common issues:

  • Wrong catalog in table_reference → "table not found"
  • Type mismatch → fix the column type in metadata.yml
  • Missing profile → check wren profile list

Step 3: Run interesting queries

Once basic queries pass, run 2–3 more interesting queries to show the user what their data looks like:

# Preview data
wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'

# If there's a relationship, verify both models are queryable
wren --sql 'SELECT * FROM "<parent>" LIMIT 5'
wren --sql 'SELECT * FROM "<child>" LIMIT 5'

Show the results to the user and explain what they're seeing. This is their first look at the data through Wren — make it count.

Step 4: Confirm success

Only after queries return real data should you tell the user the setup is complete. Summarize:

  • How many models were created
  • What relationships were detected
  • Which profile is active
  • Example queries they can try next

Troubleshooting

If wren context build fails:

  • Check that data_source: duckdb is set in wren_project.yml
  • Verify the DuckDB file path in the profile is correct
  • Run wren context validate for detailed error messages

If queries fail with "table not found":

  • Most likely cause: table_reference.catalog doesn't match the DuckDB filename. If the file is pipeline.duckdb, the catalog must be pipeline, not empty string.
  • Check the profile's url points to the directory containing the .duckdb file
  • Table names with double underscores need quoting: "hubspot__contacts"

If queries fail with type errors:

  • Check column types in the model YAML — they should be canonical SQL types (VARCHAR, BIGINT, etc.)
  • Re-run introspect_dlt.py with wren SDK installed to get proper type normalization

General:

  • Check that the profile is active: wren profile list
  • The DuckDB file might be locked if a dlt pipeline is running — wait for it to finish

Important notes

  • dlt's _dlt_parent_id / _dlt_id columns are kept in the actual DuckDB tables but hidden from Wren model definitions. They're only used in relationship conditions.
  • DuckDB has a single-writer limitation. Don't run a dlt sync while querying. For concurrent access, dlt should write to a separate file and swap atomically.
  • The generated models use table_reference (not ref_sql) since they map directly to DuckDB tables created by dlt.
  • Column types are normalized using wren SDK's parse_type() with sqlglot's DuckDB dialect. If a type looks wrong, the user can edit the model's metadata.yml directly.
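
The atomic-swap pattern mentioned above can be sketched with os.replace, which is atomic on POSIX when both paths are on the same filesystem (the filenames are illustrative, with plain files standing in for .duckdb snapshots):

```python
import os
import tempfile

def swap_in_new_snapshot(tmp_path: str, live_path: str) -> None:
    # os.replace atomically renames tmp_path over live_path, so readers
    # never observe a half-written file.
    os.replace(tmp_path, live_path)

# Demo: dlt would write the new snapshot to a sibling file, then swap it in.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "pipeline.duckdb")
new = os.path.join(workdir, "pipeline.duckdb.new")
open(live, "w").write("old snapshot")
open(new, "w").write("new snapshot")
swap_in_new_snapshot(new, live)
print(open(live).read())  # -> new snapshot
```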
Repository: canner/wrenai