# DuckDB Local Skill for Claude

Local-first SQL analytics on files. Read, profile, clean, and export data without a warehouse.

## When to Use This Skill

**Activate when:** reading local files (CSV, Parquet, JSON, Excel) into DuckDB, profiling data quality, running SQL analytics on local data, cleaning/transforming datasets, or exporting results to various formats.

**Don't use for:** building dbt models (use dbt-transforms with the DuckDB adapter), cloud warehouse administration, pipeline scheduling (use data-pipelines), or Python DataFrame transformations (use python-data-engineering).

## Scope Constraints

- Local files only -- does not manage cloud warehouse connections or remote databases.
- No cloud warehouse administration (Snowflake, BigQuery, Redshift).
- Reference files are loaded on demand -- do not pre-load multiple references.
- File paths must be validated before use; never execute user-provided paths without checking existence.

## Model Routing

| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Sonnet, Opus | Sonnet |

## Core Principles

1. **Local-first** -- process data on the user's machine; no cloud credentials required.
2. **SQL-native** -- use SQL for all transformations; avoid unnecessary Python wrappers.
3. **Schema inference** -- let DuckDB auto-detect types; override only when inference fails.
4. **Zero config** -- start with `import duckdb` or the `duckdb` CLI; no server setup.
5. **Format-agnostic** -- read CSV, Parquet, JSON, and Excel through a unified SQL interface.
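As a sketch of the schema-inference principle: when auto-detection gets a column wrong (a classic case is ZIP codes inferred as integers, dropping leading zeros), override only the affected columns via the `types` parameter of `read_csv`. The file and column names here are hypothetical:

```sql
-- Force types only where auto-detection fails; everything else stays inferred
SELECT *
FROM read_csv('data.csv',
    header = true,
    types = {'zip_code': 'VARCHAR', 'order_date': 'DATE'});
```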

## Default Engine Positioning

DuckDB is the default analytical engine for this skill suite. Assume DuckDB unless told otherwise.

When to target a warehouse instead:

- Data volume exceeds ~100 GB or requires concurrent multi-user access
- Client provides warehouse access (Snowflake, BigQuery, Databricks)
- Production workload needs scheduling, monitoring, and SLA guarantees
- Regulatory requirements mandate data residency in a managed platform

When a warehouse is needed, DuckDB still serves as the local dev and validation target: build and test locally, then deploy to the warehouse.

## Core Usage Patterns

| Pattern | Description | When to Use |
|---|---|---|
| DuckDB as dbt target | dbt-duckdb adapter for local transforms | Prototyping, client demos, small datasets |
| Standalone query engine | Direct SQL on files, no framework | Ad-hoc analysis, data profiling, one-off cleaning |
| DuckDB + DLT | DLT loads files into DuckDB | Development pipeline with destination swapping |
| Dev target, warehouse deploy | DuckDB locally, Snowflake/BigQuery in prod | Full engagement lifecycle |
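The "dev target, warehouse deploy" pattern can be sketched as a dbt `profiles.yml` with two outputs. This is a minimal sketch assuming the dbt-duckdb adapter; the project name and all Snowflake connection values are illustrative placeholders:

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb        # local file database, no server needed
      threads: 4
    prod:
      type: snowflake         # swapped in for production runs
      account: "<account>"
      user: "<user>"
      database: ANALYTICS
      warehouse: TRANSFORMING
      schema: PUBLIC
```

Switching engines is then a flag, e.g. `dbt run --target prod`, with the same models validated locally first.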

## Quick Start

```shell
pip install duckdb
```

### Python

```python
import duckdb

# Read a CSV and query it
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10")
result.show()
```

### CLI

```shell
duckdb -c "SELECT count(*) FROM read_csv_auto('data.csv')"
```

### Persistent Database

```python
con = duckdb.connect("my_analysis.duckdb")
con.sql("CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv')")
con.sql("SELECT count(*) FROM customers").show()
```

### Excel Files

```sql
INSTALL excel; LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet = 'Sheet1');
```

## File Ingestion

| Format | Function | Auto-Detect | Glob Support | Notes |
|---|---|---|---|---|
| CSV/TSV | `read_csv_auto()` | delimiter, header, types | Yes | Most common; handles most encodings |
| Parquet | `read_parquet()` | schema from metadata | Yes | Fastest; preserves types exactly |
| JSON | `read_json_auto()` | structure, nesting | Yes | Handles NDJSON and arrays |
| Excel | `read_xlsx()` (excel ext) | sheet, headers | No | Requires `INSTALL excel; LOAD excel;` (DuckDB >= 1.2) |

### Basic Ingestion

```sql
-- CSV with auto-detection
SELECT * FROM read_csv_auto('customers.csv');

-- Parquet with glob
SELECT * FROM read_parquet('data/events/*.parquet');

-- JSON (newline-delimited)
SELECT * FROM read_json_auto('logs.ndjson', format = 'newline_delimited');

-- Multiple CSVs, union by column name
SELECT * FROM read_csv_auto('data/*.csv', union_by_name = true);
```

For detailed ingestion options, see: File Ingestion Reference. For Excel-specific patterns, see: Excel Specifics.

## Data Profiling Patterns

### Column Statistics

```sql
SELECT
    column_name
    , column_type
    , count
    , null_percentage
    , approx_unique
    , min
    , max
FROM (SUMMARIZE SELECT * FROM read_csv_auto('data.csv'));
```

### NULL Rates and Cardinality

```sql
SELECT
    count(*) as total_rows
    , count(email) as email_present
    , round(100.0 * count(email) / count(*), 1) as email_pct
    , count(distinct email) as email_unique
    , count(*) - count(distinct email) as email_duplicates
FROM read_csv_auto('customers.csv');
```

### Value Distribution

```sql
-- Top 10 values for a column
SELECT
    status
    , count(*) as cnt
    , round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY cnt DESC
LIMIT 10;
```

## Cleaning in SQL

Common transformations executed directly in DuckDB:

```sql
-- Dedup: keep the most recent row per email
CREATE TABLE customers_clean AS
SELECT
    * EXCLUDE (row_num)
FROM (
    SELECT
        *
        , row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) as row_num
    FROM read_csv_auto('customers.csv')
)
WHERE row_num = 1;

-- Type casting and trimming
SELECT
    trim(name) as name
    , lower(trim(email)) as email
    , try_cast(revenue as decimal(12,2)) as revenue
    , strptime(date_str, '%m/%d/%Y')::date as order_date
    , regexp_replace(phone, '[^0-9]', '', 'g') as phone_digits
FROM read_csv_auto('messy_data.csv');
```

## Export Patterns

```sql
-- CSV
COPY (SELECT * FROM customers_clean) TO 'output/customers.csv' (HEADER, DELIMITER ',');

-- Parquet (recommended for downstream analytics)
COPY (SELECT * FROM customers_clean) TO 'output/customers.parquet' (FORMAT PARQUET);

-- JSON (newline-delimited)
COPY (SELECT * FROM customers_clean) TO 'output/customers.json' (FORMAT JSON);
```

For detailed export options, see: Export Patterns Reference.

## Performance Tips

| Setting | Command | Use When |
|---|---|---|
| Memory limit | `SET memory_limit = '4GB';` | Large datasets on constrained machines |
| Threads | `SET threads = 4;` | Control parallelism |
| Temp directory | `SET temp_directory = '/tmp/duckdb';` | Spill to disk for large sorts |
| Progress bar | `SET enable_progress_bar = true;` | Long-running queries |

File format performance: Parquet >> CSV >> JSON. Convert to Parquet early for repeated queries.
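The convert-early tip can be sketched as a one-time `COPY`, after which repeated queries read the columnar file instead of re-parsing CSV (the file names here are hypothetical):

```sql
-- Convert once; Parquet stores types and column statistics in its metadata
COPY (SELECT * FROM read_csv_auto('big.csv'))
TO 'big.parquet' (FORMAT PARQUET);

-- Subsequent queries hit the Parquet file directly
SELECT count(*) FROM read_parquet('big.parquet');
```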

For detailed performance tuning, see: Performance Reference

## Security Posture

See Security & Compliance Patterns for the full framework. See Consulting Security Tier Model for tier definitions.

| Capability | Tier 1 (Schema-Only) | Tier 2 (Sampled) | Tier 3 (Full Access) |
|---|---|---|---|
| Read local files | Metadata only | Anonymized samples | Full data |
| Write/export files | Schema DDL only | Sample outputs | Full exports |
| Profiling | Column types/names | Stats on samples | Full profiling |
| Cleaning transforms | Compile/validate SQL | Execute on samples | Execute on full data |

- Never hardcode file paths containing credentials or sensitive data in scripts.
- Never commit data files to version control; use .gitignore.
- Validate that file paths exist before reading; reject paths outside the working directory.
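The path rules above can be sketched in Python with only the standard library; `validate_data_path` is a hypothetical helper name, not part of this skill:

```python
from pathlib import Path

def validate_data_path(user_path: str, base_dir: str = ".") -> Path:
    """Resolve user_path under base_dir; reject escapes and missing files."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # blocks '../' escapes (Python 3.9+)
        raise ValueError(f"path escapes working directory: {user_path}")
    if not candidate.is_file():
        raise FileNotFoundError(f"no such file: {user_path}")
    return candidate
```

Run this check before passing any user-supplied path into `read_csv_auto()` and friends.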

## Reference Files

Reference files loaded on demand:

- **File Ingestion** -- CSV, Parquet, JSON parameters, glob patterns, schema inference, common issues
- **Excel Specifics** -- Excel extension (`read_xlsx`), sheet selection, merged cells, date conversion, spatial fallback
- **Export Patterns** -- CSV/Parquet/JSON options, Python API export, partitioned writes
- **Performance** -- Memory limits, threads, `EXPLAIN ANALYZE`, pragma settings, format comparison