dlt-extract

SKILL.md

DLT Extraction Skill for Claude

Expert guidance for building portable DLT pipelines focused on file-based sources and consulting data extraction.

When to Use This Skill

Activate when:

  • Building DLT pipelines for file sources (CSV, Excel, Parquet, JSON)
  • Configuring filesystem or SharePoint sources
  • Swapping destinations (DuckDB dev to warehouse prod)
  • Defining schema contracts for data cleaning
  • Building portable pipelines for client handoff
  • Processing files from SFTP, local directories, or cloud storage buckets

Do NOT use for:

  • Core DLT REST API or SQL database sources (use data-integration)
  • Pipeline scheduling and orchestration (use data-pipelines)
  • Standalone DuckDB queries without DLT (use duckdb)
  • Stream processing frameworks (use event-streaming)

Scope Constraints

File-based extraction and consulting portability only. Hands off to data-integration for REST API, SQL database, and enterprise connector patterns. Reference files loaded on demand — do not pre-load all references.

Model Routing

reasoning_demand preferred acceptable minimum
medium Sonnet Sonnet, Opus Sonnet

Core Principles

Destination-Agnostic: Write pipelines that swap DuckDB, Snowflake, BigQuery, or Databricks via a single environment variable. Never hardcode destination config in pipeline logic.

Schema-First: Define schema contracts before data flows. Catch drift, type mismatches, and unexpected columns at ingestion time rather than downstream.

File-Native: Treat messy file sources as first-class citizens. Handle encoding issues, inconsistent headers, multi-sheet Excel files, and mixed delimiters explicitly.

Portable: A pipeline should run with pip install + environment variables. No vendor-specific infrastructure required for development. Package dependencies in requirements.txt.

Incremental-by-Default: Track file modification timestamps or filenames to process only new files. Avoid reprocessing entire directories on every run.

File Source Decision Matrix

File Type DLT Approach Auto-Schema Incremental Support
CSV filesystem + read_csv() Yes (infer types) Modified-since tracking
Excel (.xlsx) Custom @dlt.resource with openpyxl Manual (define columns) Filename-based dedup
Parquet filesystem + read_parquet() Yes (preserved from file) Modified-since tracking
JSON/NDJSON filesystem + read_jsonl() Yes (infer from structure) Modified-since tracking
SharePoint filesystem with fsspec SharePoint backend Depends on file type Modified-since tracking
SFTP filesystem with fsspec SFTP backend Depends on file type Modified-since tracking

Destination Swapping Pattern

Same pipeline code targets different destinations via environment variable:

import os
import dlt

DESTINATION = os.getenv("DLT_DESTINATION", "duckdb")

pipeline = dlt.pipeline(
    pipeline_name="client_extract",
    destination=DESTINATION,
    dataset_name="raw",
)

# Development: DLT_DESTINATION unset → DuckDB (local, zero config)
# Production:  DLT_DESTINATION=snowflake → Snowflake (credentials in .dlt/secrets.toml)

Add staging for warehouse destinations to improve load performance:

if DESTINATION != "duckdb":
    pipeline = dlt.pipeline(
        pipeline_name="client_extract",
        destination=DESTINATION,
        staging="filesystem",  # S3 or GCS bucket
        dataset_name="raw",
    )

Schema Contracts for Cleaning

Define expected schema, detect drift, and validate before loading. Schema contracts control how DLT handles unexpected data:

Contract Mode Tables Columns Data Types Use Case
evolve Add new Add new Coerce Discovery phase
freeze Reject new Reject new Reject mismatch Production lockdown
discard_rows Reject new Drop row Drop row Strict validation
discard_columns Add new Drop column Coerce Flexible with guardrails

Apply contracts per resource:

@dlt.resource(
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "freeze"}
)
def client_data():
    yield from read_files()

See Schema Contracts Reference for Pydantic models, drift detection, and per-table strategies.

Pipeline Structure

Standard project layout for portable consulting pipelines:

pipeline_project/
├── pipeline.py          # Main pipeline definition
├── sources/             # Custom source functions
│   ├── __init__.py
│   └── file_source.py
├── schemas/             # Schema contract files
│   └── expected.yaml
├── .dlt/
│   ├── config.toml      # Non-secret config (committed)
│   └── secrets.toml     # Secrets (gitignored)
├── requirements.txt     # Python dependencies
├── .gitignore           # Ignore .dlt/secrets.toml, *.duckdb
└── tests/               # Pipeline tests
    └── test_pipeline.py

Quick Start Example

Filesystem source reading CSVs into DuckDB:

import dlt
from dlt.sources.filesystem import filesystem, read_csv

pipeline = dlt.pipeline(
    pipeline_name="client_files",
    destination="duckdb",
    dataset_name="raw"
)

source = filesystem(
    bucket_url="./data/raw",
    file_glob="*.csv"
) | read_csv()

load_info = pipeline.run(source)
print(load_info)

Verify loaded data:

with pipeline.sql_client() as client:
    for table in pipeline.default_schema.tables:
        if not table.startswith("_dlt"):
            result = client.execute_sql(f"SELECT COUNT(*) FROM {table}")
            print(f"  {table}: {result[0][0]} rows")

Security Posture

See Security & Compliance Patterns and Security Tier Model for the full framework.

Credentials: All secrets stored in .dlt/secrets.toml (gitignored). Use environment variables in CI/CD and production.

Capability Tier 1 (Cloud-Native) Tier 2 (Regulated) Tier 3 (Air-Gapped)
File extraction Execute against dev data Generate for review Generate only
Destination config Deploy to dev Generate for review Generate only
Schema contracts Deploy and test Generate with validation Generate only
Pipeline templates Generate and run Generate for review Generate only

Credential best practices: Never commit .dlt/secrets.toml. Use dlt.secrets.value for auto-resolution from env vars. Prefer IAM/service accounts over static credentials for cloud destinations.

Reference Files

Load the appropriate reference for deep-dive guidance:

  • Source Patterns -- Custom source decorators, factory patterns, error handling, yield patterns, testing
  • File Sources -- CSV, Excel, Parquet, JSON, SharePoint, SFTP ingestion patterns
  • Schema Contracts -- Contract modes, Pydantic models, drift detection, per-table strategies
  • Destinations -- DuckDB, Snowflake, BigQuery config, staging, write dispositions, format selection
Weekly Installs
1
GitHub Stars
1
First Seen
7 days ago
Installed on
zencoder1
amp1
cline1
openclaw1
opencode1
cursor1