duckdb
DuckDB Local Skill for Claude
Local-first SQL analytics on files. Read, profile, clean, and export data without a warehouse.
When to Use This Skill
Activate when: Reading local files (CSV, Parquet, JSON, Excel) into DuckDB, profiling data quality, running SQL analytics on local data, cleaning/transforming datasets, exporting results to various formats.
Don't use for: Building dbt models (use dbt-transforms with DuckDB adapter), cloud warehouse administration, pipeline scheduling (use data-pipelines), Python DataFrame transformations (use python-data-engineering).
Scope Constraints
- Local files only -- does not manage cloud warehouse connections or remote databases.
- No cloud warehouse administration (Snowflake, BigQuery, Redshift).
- Reference files loaded on demand -- do not pre-load multiple references.
- File paths must be validated before use; never execute user-provided paths without checking existence.
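A minimal sketch of that path check, assuming Python 3.9+ (for Path.is_relative_to) and treating the working directory as the boundary; the validate_path name is illustrative:
from pathlib import Path

def validate_path(user_path: str, base_dir: str = ".") -> Path:
    # Resolve to absolute paths so "../" segments cannot escape the base directory
    base = Path(base_dir).resolve()
    path = Path(user_path).resolve()
    if not path.is_relative_to(base):
        raise ValueError(f"path outside working directory: {user_path}")
    if not path.exists():
        raise FileNotFoundError(f"file not found: {user_path}")
    return path

# Only a validated path should reach DuckDB, e.g.:
# duckdb.sql(f"SELECT * FROM read_csv_auto('{validate_path('data.csv')}')")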
Model Routing
| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| medium | Sonnet | Sonnet, Opus | Sonnet |
Core Principles
- Local-first -- Process data on the user's machine; no cloud credentials required.
- SQL-native -- Use SQL for all transformations; avoid unnecessary Python wrappers.
- Schema inference -- Let DuckDB auto-detect types; override only when inference fails.
- Zero config -- Start with import duckdb or the duckdb CLI; no server setup.
- Format-agnostic -- Read CSV, Parquet, JSON, Excel through a unified SQL interface.
Default Engine Positioning
DuckDB is the default analytical engine for this skill suite. Assume DuckDB unless told otherwise.
When to target a warehouse instead:
- Data volume exceeds ~100GB or requires concurrent multi-user access
- Client provides warehouse access (Snowflake, BigQuery, Databricks)
- Production workload needs scheduling, monitoring, and SLA guarantees
- Regulatory requirements mandate data residency in a managed platform
When a warehouse is needed, DuckDB still serves as the local dev and validation target. Build and test locally, deploy to warehouse.
Core Usage Patterns
| Pattern | Description | When to Use |
|---|---|---|
| DuckDB as dbt target | dbt-duckdb adapter for local transforms | Prototyping, client demos, small datasets |
| Standalone query engine | Direct SQL on files, no framework | Ad-hoc analysis, data profiling, one-off cleaning |
| DuckDB + DLT | DLT loads files into DuckDB | Development pipeline with destination swapping |
| Dev target, warehouse deploy | DuckDB locally, Snowflake/BigQuery in prod | Full engagement lifecycle |
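A hedged sketch of the DuckDB + DLT row above, assuming the dlt package is installed; the pipeline, dataset, and table names are illustrative:
import dlt

# Destination is a string, so swapping "duckdb" for a warehouse later is a one-line change
pipeline = dlt.pipeline(pipeline_name="local_files", destination="duckdb", dataset_name="raw")
rows = [{"id": 1, "status": "new"}, {"id": 2, "status": "shipped"}]
load_info = pipeline.run(rows, table_name="orders")  # lands in a local .duckdb file
print(load_info)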
Quick Start
pip install duckdb
Python
import duckdb
# Read a CSV and query it
result = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10")
result.show()
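If pandas is installed, the same relation can be materialized as a DataFrame; a small sketch under that assumption:
import duckdb

# .df() converts the query result to a pandas DataFrame (requires pandas)
df = duckdb.sql("SELECT * FROM read_csv_auto('data.csv') LIMIT 10").df()
print(df.dtypes)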
CLI
duckdb -c "SELECT count(*) FROM read_csv_auto('data.csv')"
Persistent Database
con = duckdb.connect("my_analysis.duckdb")
con.sql("CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv')")
con.sql("SELECT count(*) FROM customers").show()
Excel Files
INSTALL excel; LOAD excel;
SELECT * FROM read_xlsx('data.xlsx', sheet='Sheet1');
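The extension can also be installed and loaded from Python; a sketch assuming DuckDB >= 1.2 with the excel extension available for your platform:
import duckdb

con = duckdb.connect()
con.install_extension("excel")  # one-time download of the extension
con.load_extension("excel")
con.sql("SELECT * FROM read_xlsx('data.xlsx', sheet = 'Sheet1') LIMIT 5").show()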
File Ingestion
| Format | Function | Auto-Detect | Glob Support | Notes |
|---|---|---|---|---|
| CSV/TSV | read_csv_auto() | delimiter, header, types | Yes | Most common; handles most encodings |
| Parquet | read_parquet() | schema from metadata | Yes | Fastest; preserves types perfectly |
| JSON | read_json_auto() | structure, nesting | Yes | Handles NDJSON and arrays |
| Excel | read_xlsx() (excel ext) | sheet, headers | No | Requires INSTALL excel; LOAD excel; (DuckDB >= 1.2) |
Basic Ingestion
-- CSV with auto-detection
SELECT * FROM read_csv_auto('customers.csv');
-- Parquet with glob
SELECT * FROM read_parquet('data/events/*.parquet');
-- JSON (newline-delimited)
SELECT * FROM read_json_auto('logs.ndjson', format='newline_delimited');
-- Multiple CSVs, union by column name
SELECT * FROM read_csv_auto('data/*.csv', union_by_name=true);
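When globbing many files it helps to know which file each row came from; a sketch assuming read_csv_auto accepts the filename option alongside union_by_name:
import duckdb

# Count rows per source file while unioning mismatched CSVs by column name
duckdb.sql("""
    SELECT filename, count(*) AS rows
    FROM read_csv_auto('data/*.csv', union_by_name = true, filename = true)
    GROUP BY filename
    ORDER BY rows DESC
""").show()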
For detailed ingestion options, see: File Ingestion Reference
For Excel-specific patterns, see: Excel Specifics
Data Profiling Patterns
Column Statistics
SELECT
column_name
, column_type
, count
, null_percentage
, approx_unique
, min
, max
FROM (SUMMARIZE SELECT * FROM read_csv_auto('data.csv'));
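The same profile can be captured from Python for a report or notebook; a sketch assuming pandas is installed:
import duckdb

# Keep the SUMMARIZE output as a DataFrame instead of printing it
profile = duckdb.sql("SUMMARIZE SELECT * FROM read_csv_auto('data.csv')").df()
print(profile[["column_name", "column_type", "null_percentage", "approx_unique"]])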
NULL Rates and Cardinality
SELECT
count(*) as total_rows
, count(email) as email_present
, round(100.0 * count(email) / count(*), 1) as email_pct
, count(distinct email) as email_unique
, count(*) - count(distinct email) as email_duplicates
FROM read_csv_auto('customers.csv');
Value Distribution
-- Top 10 values for a column
SELECT status, count(*) as cnt, round(100.0 * count(*) / sum(count(*)) over (), 1) as pct
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY cnt DESC
LIMIT 10;
Cleaning in SQL
Common transformations executed directly in DuckDB:
CREATE TABLE customers_clean AS
SELECT
    -- Dedup: keep the most recent row per email
* EXCLUDE (row_num)
FROM (
SELECT
*
, row_number() OVER (PARTITION BY email ORDER BY updated_at DESC) as row_num
FROM read_csv_auto('customers.csv')
)
WHERE row_num = 1;
-- Type casting and trimming
SELECT
trim(name) as name
, lower(trim(email)) as email
, try_cast(revenue as decimal(12,2)) as revenue
, strptime(date_str, '%m/%d/%Y')::date as order_date
, regexp_replace(phone, '[^0-9]', '', 'g') as phone_digits
FROM read_csv_auto('messy_data.csv');
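A quick sanity check on the dedup step above, assuming the CREATE TABLE ran against the persistent database from the Quick Start:
import duckdb

con = duckdb.connect("my_analysis.duckdb")
# Compare raw rows to rows that survived keeping the latest record per email
raw = con.sql("SELECT count(*) FROM read_csv_auto('customers.csv')").fetchone()[0]
deduped = con.sql("SELECT count(*) FROM customers_clean").fetchone()[0]
print(f"removed {raw - deduped} duplicate rows")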
Export Patterns
-- CSV
COPY (SELECT * FROM customers_clean) TO 'output/customers.csv' (HEADER, DELIMITER ',');
-- Parquet (recommended for downstream analytics)
COPY (SELECT * FROM customers_clean) TO 'output/customers.parquet' (FORMAT PARQUET);
-- JSON (newline-delimited)
COPY (SELECT * FROM customers_clean) TO 'output/customers.json' (FORMAT JSON);
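The Export Patterns reference also covers partitioned writes; a minimal sketch of a Hive-partitioned Parquet export, assuming the source data has an order_date column to partition on:
import duckdb

# One output folder per order_date value (Hive-style partitioning)
duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('orders.csv'))
    TO 'output/orders' (FORMAT PARQUET, PARTITION_BY (order_date))
""")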
For detailed export options, see: Export Patterns Reference
Performance Tips
| Setting | Command | Use When |
|---|---|---|
| Memory limit | SET memory_limit = '4GB'; | Large datasets on constrained machines |
| Threads | SET threads = 4; | Control parallelism |
| Temp directory | SET temp_directory = '/tmp/duckdb'; | Spill to disk for large sorts |
| Progress bar | SET enable_progress_bar = true; | Long-running queries |
File format performance: Parquet >> CSV >> JSON. Convert to Parquet early for repeated queries.
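A sketch of the convert-to-Parquet-early advice: pay the CSV parse cost once, then point repeated queries at the Parquet copy (events.csv is an illustrative filename):
import duckdb

# One-time conversion: parse the CSV once, keep a typed columnar copy
duckdb.sql("COPY (SELECT * FROM read_csv_auto('events.csv')) TO 'events.parquet' (FORMAT PARQUET)")
# Repeated queries hit the much faster Parquet file
duckdb.sql("SELECT count(*) FROM read_parquet('events.parquet')").show()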
For detailed performance tuning, see: Performance Reference
Security Posture
See Security & Compliance Patterns for the full framework. See Consulting Security Tier Model for tier definitions.
| Capability | Tier 1 (Schema-Only) | Tier 2 (Sampled) | Tier 3 (Full Access) |
|---|---|---|---|
| Read local files | Metadata only | Anonymized samples | Full data |
| Write/export files | Schema DDL only | Sample outputs | Full exports |
| Profiling | Column types/names | Stats on samples | Full profiling |
| Cleaning transforms | Compile/validate SQL | Execute on samples | Execute on full data |
- Never hardcode file paths containing credentials or sensitive data in scripts.
- Never commit data files to version control; use .gitignore.
- Validate file paths exist before reading; reject paths outside the working directory.
Reference Files
Reference files loaded on demand:
- File Ingestion -- CSV, Parquet, JSON parameters, glob patterns, schema inference, common issues
- Excel Specifics -- Excel extension (read_xlsx), sheet selection, merged cells, date conversion, spatial fallback
- Export Patterns -- CSV/Parquet/JSON options, Python API export, partitioned writes
- Performance -- Memory limits, threads, EXPLAIN ANALYZE, pragma settings, format comparison