
Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python deltalake package.

Installation

pip install deltalake pyarrow

Configuration Patterns

Method 1: storage_options (Recommended)

The simplest approach using dictionary-based configuration:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1"
    # Older deltalake releases also need a locking provider for S3 writes,
    # or "AWS_S3_ALLOW_UNSAFE_RENAME": "true" for single-writer setups
}
# Alternatively, use environment variables (preferred for production)
# os.environ['AWS_ACCESS_KEY_ID'], etc.

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"]
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options
)
df = dt.to_pandas()
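In production the same options can be assembled from the environment so no secrets are hardcoded. A minimal sketch (the helper name s3_options_from_env is illustrative, not part of the deltalake API; deltalake also reads these variables directly, so an explicit dict is only needed when options must differ from the process environment):

```python
import os

def s3_options_from_env() -> dict:
    """Collect S3 settings for storage_options from environment variables.

    Only variables that are actually set are included, so missing keys
    fall back to deltalake's own credential resolution.
    """
    keys = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"]
    return {k: os.environ[k] for k in keys if k in os.environ}
```

The resulting dict can be passed straight to `storage_options=` in the calls above.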

GCS configuration:

storage_options = {
    "GOOGLE_SERVICE_ACCOUNT": "/path/to/key.json"  # path to a service-account file
    # Or use env var GOOGLE_APPLICATION_CREDENTIALS
}

Azure configuration:

storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "myaccount",
    "AZURE_STORAGE_ACCOUNT_KEY": "...",
    # OR a SAS token via "AZURE_STORAGE_SAS_TOKEN"
}

Method 2: PyArrow Filesystem (Advanced)

PyArrow filesystem objects give more control over I/O, but support varies by deltalake version (the filesystem argument to write_deltalake belongs to the older pyarrow engine):

import pyarrow.fs as fs
from deltalake import write_deltalake, DeltaTable

# Create a filesystem rooted at the table path
raw_fs, subpath = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Write (pyarrow engine in older deltalake releases)
write_deltalake(
    "delta-table",  # relative to the filesystem root
    data=pa_table,
    filesystem=filesystem,
    mode="append"
)

# Read: DeltaTable resolves the transaction log from the URI (via env vars
# or storage_options); a filesystem can be supplied when materializing a dataset
dt = DeltaTable("s3://bucket/delta-table")
dataset = dt.to_pyarrow_dataset(filesystem=raw_fs)

Time Travel

from deltalake import DeltaTable
import pandas as pd

dt = DeltaTable("s3://bucket/delta-table")

# Load a specific version (older releases: load_version)
dt.load_as_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp (older releases: load_with_datetime)
dt.load_as_version("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: a list of commit-info dicts, one per version
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
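Timestamps often arrive as ISO-8601 strings with a trailing "Z"; a small stdlib helper (illustrative, not part of deltalake) converts them into aware datetime objects for the datetime-based load, since datetime.fromisoformat rejected the "Z" suffix before Python 3.11:

```python
from datetime import datetime

def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp like '2024-01-01T12:00:00Z' into an
    aware datetime suitable for a datetime-based time-travel load."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"  # rewrite 'Z' for older fromisoformat
    return datetime.fromisoformat(ts)
```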

Maintenance Operations

# Vacuum old files (retention in hours; dry_run defaults to True)
# Retention below 168h (7 days) also needs the safety check disabled
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Compact small files
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # Parquet files in the current table version

# Get metadata and schema
print(dt.metadata())
print(dt.schema())
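Because vacuum guards against retention windows below the 7-day default, it can help to derive the hour count from a timedelta in one place. A sketch (the helper name and constant are illustrative):

```python
from datetime import timedelta

DEFAULT_RETENTION_HOURS = 168  # deltalake's 7-day safety floor

def retention_hours(window: timedelta) -> int:
    """Convert a retention window to whole hours for a vacuum call."""
    return int(window.total_seconds() // 3600)

# Windows shorter than DEFAULT_RETENTION_HOURS require
# enforce_retention_duration=False on the vacuum call.
```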

Incremental Processing

For incremental, CDC-style processing driven by table versions:

from deltalake import DeltaTable
from datetime import datetime

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since last checkpoint
last_version = get_checkpoint()  # Your checkpoint tracking

# Commits made after the checkpoint (history() returns a list of dicts
# with commit metadata, not row-level changes)
changes = [h for h in dt.history() if h["version"] > last_version]

# Or read full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())
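The get_checkpoint/save_checkpoint calls above are placeholders; a minimal file-based implementation using only the stdlib might look like this (the JSON file location is an assumption — any durable store works):

```python
import json
from pathlib import Path

CHECKPOINT_PATH = Path("checkpoint.json")  # illustrative location

def get_checkpoint(path: Path = CHECKPOINT_PATH) -> int:
    """Return the last processed Delta version, or -1 if none is saved."""
    if path.exists():
        return json.loads(path.read_text())["version"]
    return -1

def save_checkpoint(version: int, path: Path = CHECKPOINT_PATH) -> None:
    """Persist the latest processed Delta version."""
    path.write_text(json.dumps({"version": version}))
```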

Best Practices

  1. Use environment variables for credentials in production (never hardcode)
  2. Partition tables by date/region for efficient querying
  3. Vacuum regularly to clean up old files (but retain enough for your time travel needs)
  4. Optimize periodically to compact small files
  5. Track versions for incremental processing using dt.version() and dt.history()
  6. ⚠️ Don't disable vacuum entirely - unreferenced files accumulate and storage costs grow
  7. ⚠️ Don't vacuum too aggressively - files needed by older versions are removed, breaking time travel

Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.

For S3:

  • Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
  • IAM roles (EC2, ECS, Lambda) are picked up automatically when no explicit credentials are set; explicit env vars take precedence
  • For S3-compatible stores (MinIO): set AWS_ENDPOINT_URL via env var or storage_options
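For a local MinIO deployment, the extra keys might look like this (endpoint and credentials are placeholder values; AWS_ALLOW_HTTP is required for plain-HTTP endpoints):

```python
# storage_options for an S3-compatible MinIO deployment (illustrative values)
minio_options = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT_URL": "http://localhost:9000",
    "AWS_ALLOW_HTTP": "true",   # needed because the endpoint is not TLS
    "AWS_REGION": "us-east-1",  # MinIO accepts any region string
}
```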

Related

  • @data-engineering-storage-lakehouse/delta-lake - Delta Lake concepts and API
  • @data-engineering-core - Using Delta with DuckDB
  • @data-engineering-storage-lakehouse - Comparisons with Iceberg, Hudi
