# DuckDB Local Skill for Claude

Local-first SQL analytics on files. Read, profile, clean, and export data without a warehouse.

## When to Use This Skill

**Activate when:** reading local files (CSV, Parquet, JSON, Excel) into DuckDB, profiling data quality, running SQL analytics on local data, cleaning/transforming datasets, or exporting results to various formats.

**Don't use for:** building dbt models (use dbt-transforms with the DuckDB adapter), cloud warehouse administration, pipeline scheduling (use data-pipelines), or Python DataFrame transformations (use python-data-engineering).

## Scope Constraints

- Local files only -- does not manage cloud warehouse connections or remote databases.
- No cloud warehouse administration (Snowflake, BigQuery, Redshift).
- Reference files are loaded on demand -- do not pre-load multiple references.
- File paths must be validated before use; never execute user-provided paths without checking existence.

## Model Routing

| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Sonnet, Opus | Sonnet |

## Core Principles

1. **Local-first** -- process data on the user's machine; no cloud credentials required.
2. **SQL-native** -- use SQL for all transformations; avoid unnecessary Python wrappers.
3. **Schema inference** -- let DuckDB auto-detect types; override only when inference fails.
4. **Zero config** -- start with `import duckdb` or the `duckdb` CLI; no server setup.
5. **Format-agnostic** -- read CSV, Parquet, JSON, and Excel through a unified SQL interface.
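As a sketch of the schema-inference principle: when auto-detection gets a column wrong (a classic case is ZIP codes inferred as integers, dropping leading zeros), override only the affected columns via the `types` parameter of `read_csv`. The file and column names here are hypothetical:

```sql
-- Force types only where auto-detection fails; everything else stays inferred
SELECT *
FROM read_csv('data.csv',
    header = true,
    types = {'zip_code': 'VARCHAR', 'order_date': 'DATE'});
```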

## Default Engine Positioning

DuckDB is the default analytical engine for this skill suite. Assume DuckDB unless told otherwise.

When to target a warehouse instead:

- Data volume exceeds ~100 GB or requires concurrent multi-user access
- Client provides warehouse access (Snowflake, BigQuery, Databricks)
- Production workload needs scheduling, monitoring, and SLA guarantees
- Regulatory requirements mandate data residency in a managed platform

When a warehouse is needed, DuckDB still serves as the local dev and validation target: build and test locally, then deploy to the warehouse.

## Core Usage Patterns

| Pattern | Description | When to Use |
|---|---|---|
| DuckDB as dbt target | dbt-duckdb adapter for local transforms | Prototyping, client demos, small datasets |
| Standalone query engine | Direct SQL on files, no framework | Ad-hoc analysis, data profiling, one-off cleaning |
| DuckDB + DLT | DLT loads files into DuckDB | Development pipeline with destination swapping |
| Dev target, warehouse deploy | DuckDB locally, Snowflake/BigQuery in prod | Full engagement lifecycle |
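The "dev target, warehouse deploy" pattern can be sketched as a dbt `profiles.yml` with two outputs. This is a minimal sketch assuming the dbt-duckdb adapter; the project name and all Snowflake connection values are illustrative placeholders:

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb        # local file database, no server needed
      threads: 4
    prod:
      type: snowflake         # swapped in for production runs
      account: "<account>"
      user: "<user>"
      database: ANALYTICS
      warehouse: TRANSFORMING
      schema: PUBLIC
```

Switching engines is then a flag, e.g. `dbt run --target prod`, with the same models validated locally first.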

## Quick Start

```shell
pip install duckdb
```

### Python

```python
import duckdb

# Read a CSV and query it
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10")
result.show()
```

### CLI

```shell
duckdb -c "SELECT count(*) FROM read_csv_auto('data.csv')"
```

### Persistent Database

```python
con = duckdb.connect("my_analysis.duckdb")
con.sql("CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv')")
con.sql("SELECT count(*) FROM customers").show()
```

### Excel Files

```sql
INSTALL excel; LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet = 'Sheet1');
```

## File Ingestion

| Format | Function | Auto-Detect | Glob Support | Notes |
|---|---|---|---|---|
| CSV/TSV | `read_csv_auto()` | delimiter, header, types | Yes | Most common; handles most encodings |
| Parquet | `read_parquet()` | schema from metadata | Yes | Fastest; preserves types exactly |
| JSON | `read_json_auto()` | structure, nesting | Yes | Handles NDJSON and arrays |
| Excel | `read_xlsx()` (excel ext) | sheet, headers | No | Requires `INSTALL excel; LOAD excel;` (DuckDB >= 1.2) |

### Basic Ingestion

```sql
-- CSV with auto-detection
SELECT * FROM read_csv_auto('customers.csv');

-- Parquet with glob
SELECT * FROM read_parquet('data/events/*.parquet');

-- JSON (newline-delimited)
SELECT * FROM read_json_auto('logs.ndjson', format = 'newline_delimited');

-- Multiple CSVs, union by column name
SELECT * FROM read_csv_auto('data/*.csv', union_by_name = true);
```

For detailed ingestion options, see: File Ingestion Reference. For Excel-specific patterns, see: Excel Specifics.

## Data Profiling Patterns

### Column Statistics

```sql
SELECT
    column_name
    , column_type
    , count
    , null_percentage
    , approx_unique
    , min
    , max
FROM (SUMMARIZE SELECT * FROM read_csv_auto('data.csv'));
```

### NULL Rates and Cardinality

```sql
SELECT
    count(*) as total_rows
    , count(email) as email_present
    , round(100.0 * count(email) / count(*), 1) as email_pct
    , count(distinct email) as email_unique
    , count(*) - count(distinct email) as email_duplicates
FROM read_csv_auto('customers.csv');
```

### Value Distribution

```sql
-- Top 10 values for a column
SELECT
    status
    , count(*) as cnt
    , round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY cnt DESC
LIMIT 10;
```

## Cleaning in SQL

Common transformations executed directly in DuckDB:

```sql
-- Dedup: keep the most recent row per email
CREATE TABLE customers_clean AS
SELECT
    * EXCLUDE (row_num)
FROM (
    SELECT
        *
        , row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) as row_num
    FROM read_csv_auto('customers.csv')
)
WHERE row_num = 1;

-- Type casting and trimming
SELECT
    trim(name) as name
    , lower(trim(email)) as email
    , try_cast(revenue as decimal(12,2)) as revenue
    , strptime(date_str, '%m/%d/%Y')::date as order_date
    , regexp_replace(phone, '[^0-9]', '', 'g') as phone_digits
FROM read_csv_auto('messy_data.csv');
```

## Export Patterns

```sql
-- CSV
COPY (SELECT * FROM customers_clean) TO 'output/customers.csv' (HEADER, DELIMITER ',');

-- Parquet (recommended for downstream analytics)
COPY (SELECT * FROM customers_clean) TO 'output/customers.parquet' (FORMAT PARQUET);

-- JSON (newline-delimited)
COPY (SELECT * FROM customers_clean) TO 'output/customers.json' (FORMAT JSON);
```

For detailed export options, see: Export Patterns Reference.

## Performance Tips

| Setting | Command | Use When |
|---|---|---|
| Memory limit | `SET memory_limit = '4GB';` | Large datasets on constrained machines |
| Threads | `SET threads = 4;` | Control parallelism |
| Temp directory | `SET temp_directory = '/tmp/duckdb';` | Spill to disk for large sorts |
| Progress bar | `SET enable_progress_bar = true;` | Long-running queries |

File format performance: Parquet >> CSV >> JSON. Convert to Parquet early for repeated queries.
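The convert-early tip can be sketched as a one-time `COPY`, after which repeated queries read the columnar file instead of re-parsing CSV (the file names here are hypothetical):

```sql
-- Convert once; Parquet stores types and column statistics in its metadata
COPY (SELECT * FROM read_csv_auto('big.csv'))
TO 'big.parquet' (FORMAT PARQUET);

-- Subsequent queries hit the Parquet file directly
SELECT count(*) FROM read_parquet('big.parquet');
```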

For detailed performance tuning, see: Performance Reference

## Security Posture

See Security & Compliance Patterns for the full framework. See Consulting Security Tier Model for tier definitions.

| Capability | Tier 1 (Schema-Only) | Tier 2 (Sampled) | Tier 3 (Full Access) |
|---|---|---|---|
| Read local files | Metadata only | Anonymized samples | Full data |
| Write/export files | Schema DDL only | Sample outputs | Full exports |
| Profiling | Column types/names | Stats on samples | Full profiling |
| Cleaning transforms | Compile/validate SQL | Execute on samples | Execute on full data |

- Never hardcode file paths containing credentials or sensitive data in scripts.
- Never commit data files to version control; use .gitignore.
- Validate that file paths exist before reading; reject paths outside the working directory.
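The path rules above can be sketched in Python with only the standard library; `validate_data_path` is a hypothetical helper name, not part of this skill:

```python
from pathlib import Path

def validate_data_path(user_path: str, base_dir: str = ".") -> Path:
    """Resolve user_path under base_dir; reject escapes and missing files."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # blocks '../' escapes (Python 3.9+)
        raise ValueError(f"path escapes working directory: {user_path}")
    if not candidate.is_file():
        raise FileNotFoundError(f"no such file: {user_path}")
    return candidate
```

Run this check before passing any user-supplied path into `read_csv_auto()` and friends.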

## Reference Files

Reference files loaded on demand:

- **File Ingestion** -- CSV, Parquet, JSON parameters, glob patterns, schema inference, common issues
- **Excel Specifics** -- Excel extension (`read_xlsx`), sheet selection, merged cells, date conversion, spatial fallback
- **Export Patterns** -- CSV/Parquet/JSON options, Python API export, partitioned writes
- **Performance** -- Memory limits, threads, `EXPLAIN ANALYZE`, pragma settings, format comparison