
Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python deltalake package.

Installation

pip install deltalake pyarrow

Configuration Patterns

Method 1: storage_options (Recommended)

The simplest approach using dictionary-based configuration:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1"
    # Older deltalake releases also need a locking provider for S3 writes,
    # or "AWS_S3_ALLOW_UNSAFE_RENAME": "true" for single-writer setups
}
# Alternatively, use environment variables (preferred for production)
# os.environ['AWS_ACCESS_KEY_ID'], etc.

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"]
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options
)
df = dt.to_pandas()
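In production the same options can be assembled from the environment so no secrets are hardcoded. A minimal sketch (the helper name s3_options_from_env is illustrative, not part of the deltalake API; deltalake also reads these variables directly, so an explicit dict is only needed when options must differ from the process environment):

```python
import os

def s3_options_from_env() -> dict:
    """Collect S3 settings for storage_options from environment variables.

    Only variables that are actually set are included, so missing keys
    fall back to deltalake's own credential resolution.
    """
    keys = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]
    return {k: os.environ[k] for k in keys if k in os.environ}
```

The resulting dict can be passed straight to `storage_options=` in the calls above.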

GCS configuration:

storage_options = {
    "GOOGLE_SERVICE_ACCOUNT": "/path/to/key.json"  # path to a service-account file
    # Or use env var GOOGLE_APPLICATION_CREDENTIALS
}

Azure configuration:

storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "myaccount",
    "AZURE_STORAGE_ACCOUNT_KEY": "...",
    # OR a SAS token via "AZURE_STORAGE_SAS_TOKEN"
}

Method 2: PyArrow Filesystem (Advanced)

PyArrow filesystem objects give more control over I/O, but support varies by deltalake version (the filesystem argument to write_deltalake belongs to the older pyarrow engine):

import pyarrow.fs as fs
from deltalake import write_deltalake, DeltaTable

# Create a filesystem rooted at the table path
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (pyarrow engine in older deltalake releases)
write_deltalake(
    "delta-table",  # relative to the filesystem root
    data=pa_table,
    filesystem=filesystem,
    mode="append"
)

# Read: DeltaTable resolves the transaction log from the URI (via env vars
# or storage_options); a filesystem can be supplied when materializing a dataset
dt = DeltaTable("s3://bucket/delta-table")
dataset = dt.to_pyarrow_dataset(filesystem=raw_fs)

Time Travel

from deltalake import DeltaTable
import pandas as pd

dt = DeltaTable("s3://bucket/delta-table")

# Load a specific version (older releases: load_version)
dt.load_as_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp (older releases: load_with_datetime)
dt.load_as_version("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: a list of commit-info dicts, one per version
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
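Timestamps often arrive as ISO-8601 strings with a trailing "Z"; a small stdlib helper (illustrative, not part of deltalake) converts them into aware datetime objects for the datetime-based load, since datetime.fromisoformat rejected the "Z" suffix before Python 3.11:

```python
from datetime import datetime

def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp like '2024-01-01T12:00:00Z' into an
    aware datetime suitable for a datetime-based time-travel load."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"  # rewrite 'Z' for older fromisoformat
    return datetime.fromisoformat(ts)
```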

Maintenance Operations

# Vacuum old files (retention in hours; dry_run defaults to True)
# Retention below 168h (7 days) also needs the safety check disabled
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Compact small files
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # Parquet files in the current table version

# Get metadata and schema
print(dt.metadata())
print(dt.schema())
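Because vacuum guards against retention windows below the 7-day default, it can help to derive the hour count from a timedelta in one place. A sketch (the helper name and constant are illustrative):

```python
from datetime import timedelta

DEFAULT_RETENTION_HOURS = 168  # deltalake's 7-day safety floor

def retention_hours(window: timedelta) -> int:
    """Convert a retention window to whole hours for a vacuum call."""
    return int(window.total_seconds() // 3600)

# Windows shorter than DEFAULT_RETENTION_HOURS require
# enforce_retention_duration=False on the vacuum call.
```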

Incremental Processing

For incremental, CDC-style processing driven by table versions:

from deltalake import DeltaTable
from datetime import datetime

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since last checkpoint
last_version = get_checkpoint()  # Your checkpoint tracking

# Commits made after the checkpoint (history() returns a list of dicts
# with commit metadata, not row-level changes)
changes = [h for h in dt.history() if h["version"] > last_version]

# Or read full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())
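The get_checkpoint/save_checkpoint calls above are placeholders; a minimal file-based implementation using only the stdlib might look like this (the JSON file location is an assumption — any durable store works):

```python
import json
from pathlib import Path

CHECKPOINT_PATH = Path("checkpoint.json")  # illustrative location

def get_checkpoint(path: Path = CHECKPOINT_PATH) -> int:
    """Return the last processed Delta version, or -1 if none is saved."""
    if path.exists():
        return json.loads(path.read_text())["version"]
    return -1

def save_checkpoint(version: int, path: Path = CHECKPOINT_PATH) -> None:
    """Persist the latest processed Delta version."""
    path.write_text(json.dumps({"version": version}))
```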

Best Practices

  1. Use environment variables for credentials in production (never hardcode)
  2. Partition tables by date/region for efficient querying
  3. Vacuum regularly to clean up old files (but retain enough for your time travel needs)
  4. Optimize periodically to compact small files
  5. Track versions for incremental processing using dt.version() and dt.history()
  6. ⚠️ Don't disable vacuum entirely - unreferenced files accumulate and storage costs grow
  7. ⚠️ Don't vacuum too aggressively - files needed by older versions are removed, breaking time travel

Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.

For S3:

  • Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
  • IAM roles (EC2, ECS, Lambda) are picked up automatically when no explicit credentials are set; explicit env vars take precedence
  • For S3-compatible stores (MinIO): set AWS_ENDPOINT_URL via env var or storage_options
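For a local MinIO deployment, the extra keys might look like this (endpoint and credentials are placeholder values; AWS_ALLOW_HTTP is required for plain-HTTP endpoints):

```python
# storage_options for an S3-compatible MinIO deployment (illustrative values)
minio_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ALLOW_HTTP": "true",   # needed because the endpoint is not TLS
    "AWS_REGION": "us-east-1",  # MinIO accepts any region string
}
```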

Related

  • @data-engineering-storage-lakehouse/delta-lake - Delta Lake concepts and API
  • @data-engineering-core - Using Delta with DuckDB
  • @data-engineering-storage-lakehouse - Comparisons with Iceberg, Hudi
