duckdb
DuckDB Local Skill for Claude
Local-first SQL analytics on files. Read, profile, clean, and export data without a warehouse.
When to Use This Skill
Activate when: Reading local files (CSV, Parquet, JSON, Excel) into DuckDB, profiling data quality, running SQL analytics on local data, cleaning/transforming datasets, exporting results to various formats.
Don't use for: Building dbt models (use dbt-transforms with DuckDB adapter), cloud warehouse administration, pipeline scheduling (use data-pipelines), Python DataFrame transformations (use python-data-engineering).
Scope Constraints
- Local files only -- does not manage cloud warehouse connections or remote databases.
- No cloud warehouse administration (Snowflake, BigQuery, Redshift).
- Reference files loaded on demand -- do not pre-load multiple references.
- File paths must be validated before use; never execute user-provided paths without checking existence.
Model Routing
| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Sonnet, Opus | Sonnet |
Core Principles
- Local-first -- Process data on the user's machine; no cloud credentials required.
- SQL-native -- Use SQL for all transformations; avoid unnecessary Python wrappers.
- Schema inference -- Let DuckDB auto-detect types; override only when inference fails.
- Zero config -- Start with `import duckdb` or the `duckdb` CLI; no server setup.
- Format-agnostic -- Read CSV, Parquet, JSON, Excel through a unified SQL interface.
Default Engine Positioning
DuckDB is the default analytical engine for this skill suite. Assume DuckDB unless told otherwise.
When to target a warehouse instead:
- Data volume exceeds ~100GB or requires concurrent multi-user access
- Client provides warehouse access (Snowflake, BigQuery, Databricks)
- Production workload needs scheduling, monitoring, and SLA guarantees
- Regulatory requirements mandate data residency in a managed platform
When a warehouse is needed, DuckDB still serves as the local dev and validation target. Build and test locally, deploy to warehouse.
Core Usage Patterns
| Pattern | Description | When to Use |
|---|---|---|
| DuckDB as dbt target | dbt-duckdb adapter for local transforms | Prototyping, client demos, small datasets |
| Standalone query engine | Direct SQL on files, no framework | Ad-hoc analysis, data profiling, one-off cleaning |
| DuckDB + DLT | DLT loads files into DuckDB | Development pipeline with destination swapping |
| Dev target, warehouse deploy | DuckDB locally, Snowflake/BigQuery in prod | Full engagement lifecycle |
Quick Start
pip install duckdb
Python
import duckdb
# Read a CSV and query it
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10")
result.show()
CLI
duckdb -c "SELECT count(*) FROM read_csv_auto('data.csv')"
Persistent Database
con = duckdb.connect("my_analysis.duckdb")
con.sql("CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv')")
con.sql("SELECT count(*) FROM customers").show()
Excel Files
INSTALL excel; LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet='Sheet1');
File Ingestion
| Format | Function | Auto-Detect | Glob Support | Notes |
|---|---|---|---|---|
| CSV/TSV | read_csv_auto() | delimiter, header, types | Yes | Most common; handles most encodings |
| Parquet | read_parquet() | schema from metadata | Yes | Fastest; preserves types perfectly |
| JSON | read_json_auto() | structure, nesting | Yes | Handles NDJSON and arrays |
| Excel | read_xlsx() (excel ext) | sheet, headers | No | Requires INSTALL excel; LOAD excel; (DuckDB >= 1.2) |
Basic Ingestion
-- CSV with auto-detection
SELECT * FROM read_csv_auto('customers.csv');
-- Parquet with glob
SELECT * FROM read_parquet('data/events/*.parquet');
-- JSON (newline-delimited)
SELECT * FROM read_json_auto('logs.ndjson', format='newline_delimited');
-- Multiple CSVs, union by column name
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);
For detailed ingestion options, see: File Ingestion Reference. For Excel-specific patterns, see: Excel Specifics.
Data Profiling Patterns
Column Statistics
SELECT
column_name
, column_type
, count
, null_percentage
, approx_unique
, min
, max
FROM (SUMMARIZE SELECT * FROM read_csv_auto('data.csv'));
NULL Rates and Cardinality
SELECT
count(*) as total_rows
, count(email) as email_present
, round(100.0 * count(email) / count(*), 1) as email_pct
, count(distinct email) as email_unique
, count(*) - count(distinct email) as email_duplicates
FROM read_csv_auto('customers.csv');
Value Distribution
-- Top 10 values for a column
SELECT status, count(*) as cnt, round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY cnt DESC
LIMIT 10;
Cleaning in SQL
Common transformations executed directly in DuckDB:
CREATE TABLE customers_clean AS
SELECT
-- Dedup
* EXCLUDE (row_num)
FROM (
SELECT
*
, row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) as row_num
FROM read_csv_auto('customers.csv')
)
WHERE row_num = 1;
-- Type casting and trimming
SELECT
trim(name) as name
, lower(trim(email)) as email
, try_cast(revenue as decimal(12,2)) as revenue
, strptime(date_str, '%m/%d/%Y')::date as order_date
, regexp_replace(phone, '[^0-9]', '', 'g') as phone_digits
FROM read_csv_auto('messy_data.csv');
Export Patterns
-- CSV
COPY (SELECT * FROM customers_clean) TO 'output/customers.csv' (HEADER, DELIMITER ',');
-- Parquet (recommended for downstream analytics)
COPY (SELECT * FROM customers_clean) TO 'output/customers.parquet' (FORMAT PARQUET);
-- JSON (newline-delimited)
COPY (SELECT * FROM customers_clean) TO 'output/customers.json' (FORMAT JSON);
For detailed export options, see: Export Patterns Reference
Performance Tips
| Setting | Command | Use When |
|---|---|---|
| Memory limit | SET memory_limit = '4GB'; | Large datasets on constrained machines |
| Threads | SET threads = 4; | Control parallelism |
| Temp directory | SET temp_directory = '/tmp/duckdb'; | Spill to disk for large sorts |
| Progress bar | SET enable_progress_bar = true; | Long-running queries |
File format performance: Parquet >> CSV >> JSON. Convert to Parquet early for repeated queries.
For detailed performance tuning, see: Performance Reference
Security Posture
See Security & Compliance Patterns for the full framework. See Consulting Security Tier Model for tier definitions.
| Capability | Tier 1 (Schema-Only) | Tier 2 (Sampled) | Tier 3 (Full Access) |
|---|---|---|---|
| Read local files | Metadata only | Anonymized samples | Full data |
| Write/export files | Schema DDL only | Sample outputs | Full exports |
| Profiling | Column types/names | Stats on samples | Full profiling |
| Cleaning transforms | Compile/validate SQL | Execute on samples | Execute on full data |
- Never hardcode file paths containing credentials or sensitive data in scripts.
- Never commit data files to version control; use .gitignore.
- Validate file paths exist before reading; reject paths outside the working directory.
Reference Files
Reference files loaded on demand:
- File Ingestion -- CSV, Parquet, JSON parameters, glob patterns, schema inference, common issues
- Excel Specifics -- Excel extension (read_xlsx), sheet selection, merged cells, date conversion, spatial fallback
- Export Patterns -- CSV/Parquet/JSON options, Python API export, partitioned writes
- Performance -- Memory limits, threads, EXPLAIN ANALYZE, pragma settings, format comparison