data-engineering-storage-remote-access-integrations-duckdb
DuckDB Remote Storage Integration
DuckDB provides multiple ways to access cloud storage (S3, GCS, Azure) from within the database.
HTTPFS Extension (Native)
The HTTPFS extension enables direct queries on remote files.
import duckdb
from contextlib import contextmanager

@contextmanager
def get_duckdb_connection():
    """Context manager ensures connection cleanup."""
    con = duckdb.connect()
    try:
        con.execute("INSTALL httpfs; LOAD httpfs;")
        yield con
    finally:
        con.close()

# Configure and query
with get_duckdb_connection() as con:
    # S3 configuration
    con.execute("""
        SET s3_region='us-east-1';
        SET s3_access_key_id='AKIA...';
        SET s3_secret_access_key='...';
        -- For temporary creds: SET s3_session_token='...'
        -- For S3-compatible stores (e.g. MinIO): SET s3_endpoint='minio:9000'; SET s3_use_ssl=false;
    """)

    # Query Parquet directly
    df = con.sql("""
        SELECT category, SUM(value) AS total
        FROM read_parquet('s3://bucket/data/*.parquet')
        WHERE date >= '2024-01-01'
        GROUP BY category
    """).pl()

    # Read from GCS (configure via environment or default credentials)
    df = con.sql("SELECT * FROM read_csv('gs://bucket/data.csv')").pl()
Configuration via Environment Variables
Instead of hardcoding credentials, use environment variables:
import os
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_REGION'] = 'us-east-1'
# DuckDB HTTPFS reads these automatically on first use
import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()
Via fsspec
Register fsspec filesystems for protocols DuckDB doesn't natively support (the matching fsspec backend, e.g. gcsfs for GCS, must be installed):
import fsspec
import duckdb
# Register GCS (or any fsspec protocol)
duckdb.register_filesystem(fsspec.filesystem('gcs'))
# Now use gcs:// URIs natively
df = duckdb.sql("""
    SELECT * FROM read_parquet('gcs://bucket/data.parquet')
""").pl()
Copy Operations
Copy data between DuckDB tables and cloud storage:
import duckdb
with duckdb.connect() as con:
    # Export table to S3
    con.sql("""
        COPY (SELECT * FROM my_table)
        TO 's3://bucket/output.parquet'
        (FORMAT PARQUET)
    """)

    # Import from S3
    con.sql("""
        CREATE TABLE imported AS
        SELECT * FROM read_parquet('s3://bucket/input.parquet')
    """)
Delta Lake Integration
Read Delta tables from cloud storage:
import duckdb
with duckdb.connect() as con:
    con.execute("INSTALL delta; LOAD delta;")

    # Query Delta table
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/')
        WHERE date >= '2024-01-01'
    """).pl()

    # Time travel (read specific version)
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/', version => 5)
    """).pl()
Connection Management
# ✅ DO: Use a context manager
with duckdb.connect("analytics.db") as con:
    con.sql("CREATE TABLE ...")

# ❌ DON'T: Leak connections
con = duckdb.connect("analytics.db")
con.sql("...")  # Never closed → leak

# ✅ DO: If you must manage the connection manually, close it in finally
con = duckdb.connect("analytics.db")
try:
    con.sql("...")
finally:
    con.close()
Authentication
See @data-engineering-storage-authentication for S3, GCS, Azure patterns. DuckDB HTTPFS reads standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, etc.) automatically.
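On recent DuckDB versions (0.10+), credentials can also be managed through the Secrets Manager instead of individual SET statements. A minimal sketch with placeholder credential values and a hypothetical secret name:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Scoped S3 secret; the secret name and KEY_ID/SECRET/REGION values are placeholders
con.execute("""
    CREATE SECRET my_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")

df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()
con.close()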
Performance Tips
- ✅ Predicate pushdown: Filter in SQL query, not after loading
- ✅ Column pruning: Only select needed columns
- ✅ Parquet format: Use Parquet (not CSV) for remote queries
- ✅ Partitioning: Store data partitioned (e.g., by date) for efficient queries
- ⚠️ Row group filtering: Parquet row-group statistics let DuckDB skip row groups that cannot match the filter
- ⚠️ Use EXPLAIN to verify pushdown: con.sql("EXPLAIN SELECT ...").pl() (see the sketch below)
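To confirm that filters and projections actually reach the remote Parquet scan, inspect the query plan. A minimal sketch against the same hypothetical s3://bucket path used above:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# The Parquet scan node in the printed plan should list only the selected
# columns and carry the date filter, confirming projection and predicate pushdown.
con.sql("""
    EXPLAIN
    SELECT category, SUM(value) AS total
    FROM read_parquet('s3://bucket/data/*.parquet')
    WHERE date >= '2024-01-01'
    GROUP BY category
""").show()
con.close()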
References
- DuckDB HTTPFS Documentation
- DuckDB Delta Lake Extension
- @data-engineering-core - DuckDB basics