data-engineering-storage-remote-access-integrations-delta-lake
Originally from legout/data-platform-agent-skills
Delta Lake on Cloud Storage
Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the Python deltalake package (bindings to the Rust delta-rs library; no Spark or JVM required).
Installation
pip install deltalake pyarrow
Configuration Patterns
Method 1: storage_options (Recommended)
The simplest approach passes credentials and settings as a dictionary:
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# S3 configuration (placeholder values; never commit real credentials)
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1",
}
# Preferred for production: leave the keys out of the dictionary and export
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION as environment variables

# Example data; "date" is the partition column used below
pa_table = pa.table({"date": ["2024-01-01", "2024-01-02"], "value": [1, 2]})

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"],
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options,
)
df = dt.to_pandas()
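The environment-variable approach mentioned above can be sketched as a small helper that builds `storage_options` from whatever is set, so no secret appears in source code. The function name `s3_storage_options_from_env` is hypothetical, and the inclusion of `AWS_SESSION_TOKEN` and `AWS_ENDPOINT_URL` in the key list is an assumption about what your deployment needs.

```python
import os
from typing import Dict, Mapping, Optional

# Keys to forward when present; adjust the list to your deployment (assumption)
S3_ENV_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_REGION",
    "AWS_ENDPOINT_URL",
)


def s3_storage_options_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Collect S3 settings from the environment, skipping unset or empty values."""
    source = os.environ if env is None else env
    return {key: source[key] for key in S3_ENV_KEYS if source.get(key)}


# Build the options once and pass them as storage_options=... when reading or writing
storage_options = s3_storage_options_from_env()
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to test and makes it obvious which settings reach the storage layer.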
GCS configuration:
storage_options = {
    # Path to the service-account key file
    "GOOGLE_SERVICE_ACCOUNT_PATH": "/path/to/key.json",
    # Or use env var GOOGLE_APPLICATION_CREDENTIALS
}
Azure configuration:
storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "...",
    "AZURE_STORAGE_ACCOUNT_KEY": "...",
    # "AZURE_STORAGE_CONNECTION_STRING" may not be recognized by every
    # deltalake release; confirm support before relying on it
}
Method 2: PyArrow Filesystem (Advanced)
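Pass a PyArrow filesystem object when you need finer control over how data files are read:
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://bucket/delta-table"

# Create a filesystem rooted at the table directory
raw_fs, subpath = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(subpath, raw_fs)

# Read: DeltaTable resolves the transaction log from the URI; the custom
# filesystem is handed to PyArrow when the data files are scanned
dt = DeltaTable(table_uri)
dataset = dt.to_pyarrow_dataset(filesystem=filesystem)
df = dataset.to_table().to_pandas()

# Write: older deltalake releases accepted
#   write_deltalake(table_uri, data=pa_table, filesystem=filesystem, mode="append")
# with the filesystem rooted at the table directory. Check your installed
# version; newer releases expect storage_options (Method 1) instead.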
Time Travel
import pandas as pd
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load specific version (older releases: dt.load_version(5))
dt.load_as_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp (older releases: dt.load_with_datetime(...))
dt.load_as_version("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history: history() returns a list of dicts, one per commit
history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])
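When a pipeline needs reproducible reads, it can help to resolve a timestamp to a concrete version number once and then pin that version. The sketch below does this over history entries; it assumes each entry is a dict carrying a `version` and a `timestamp` in milliseconds since the Unix epoch, and the helper name `version_at` and the sample entries are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Mapping, Optional


def version_at(history: Iterable[Mapping], when: datetime) -> Optional[int]:
    """Return the highest version committed at or before `when`, or None."""
    cutoff_ms = int(when.timestamp() * 1000)
    candidates = [
        entry["version"]
        for entry in history
        if entry.get("version") is not None
        and entry.get("timestamp") is not None
        and entry["timestamp"] <= cutoff_ms
    ]
    return max(candidates, default=None)


# Hypothetical entries shaped like the dicts in a table history
sample_history = [
    {"version": 2, "timestamp": 1704110400000, "operation": "WRITE"},
    {"version": 1, "timestamp": 1704067200000, "operation": "WRITE"},
    {"version": 0, "timestamp": 1703980800000, "operation": "CREATE TABLE"},
]
pinned = version_at(sample_history, datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc))
print(pinned)  # 1
```

The resolved number can then be passed to the version-loading call, so later reruns read exactly the same snapshot even if new commits arrive.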
Maintenance Operations
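# Vacuum old files (retention in hours). vacuum() defaults to a dry run that
# only lists candidates, and a retention below the table's configured minimum
# (7 days by default) is rejected unless enforce_retention_duration=False
dt.vacuum(retention_hours=168, dry_run=False)  # Delete unreferenced files older than 7 days

# Optimize: compact small files into larger ones
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # List of Parquet files in the table

# Get metadata
metadata = dt.metadata()
print(metadata)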
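Because vacuum deletes files permanently, one cautious pattern is to preview the candidates with a dry run and only delete when the list looks reasonable. The sketch below assumes the table object exposes `vacuum(retention_hours=..., dry_run=...)` returning a list of file paths; the wrapper name `vacuum_with_preview` and the `max_files` guard are hypothetical.

```python
from typing import List


def vacuum_with_preview(table, retention_hours: int = 168, max_files: int = 10_000) -> List[str]:
    """Dry-run first, then delete only if the candidate count is within max_files."""
    candidates = table.vacuum(retention_hours=retention_hours, dry_run=True)
    if len(candidates) > max_files:
        # Refuse to delete an unexpectedly large number of files
        raise RuntimeError(
            f"vacuum would remove {len(candidates)} files (limit {max_files}); aborting"
        )
    if not candidates:
        return []
    # Second call performs the actual deletion
    return table.vacuum(retention_hours=retention_hours, dry_run=False)
```

The default of 168 hours mirrors a one-week retention window; raise it if you need to time travel further back.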
Incremental Processing
For change data capture (CDC) patterns:
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Last version already processed; get_checkpoint() and save_checkpoint()
# are your own checkpoint-tracking helpers
last_version = get_checkpoint()

# List commits made since the checkpoint (history() returns a list of dicts)
new_commits = [c for c in dt.history() if c.get("version", -1) > last_version]

# Commit metadata does not contain rows, so read the full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())
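The checkpoint helpers are left to the reader above. A minimal sketch, assuming a single-process job that stores the last processed version in a local JSON file, could look like this; the file name `delta_checkpoint.json` and the convention of returning -1 when no checkpoint exists are assumptions, not part of the deltalake API.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical checkpoint location; point this at durable storage in practice
CHECKPOINT_PATH = Path("delta_checkpoint.json")


def get_checkpoint(path: Path = CHECKPOINT_PATH) -> int:
    """Return the last processed table version, or -1 if nothing was processed yet."""
    try:
        return int(json.loads(Path(path).read_text())["version"])
    except FileNotFoundError:
        return -1


def save_checkpoint(version: int, path: Path = CHECKPOINT_PATH) -> None:
    """Write the version to a temp file, then atomically replace the checkpoint."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"version": version}, handle)
    os.replace(tmp_name, path)
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous checkpoint intact, so the job reprocesses a few commits instead of skipping them.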
Best Practices
- ✅ Use environment variables for credentials in production (never hardcode)
- ✅ Partition tables by date/region for efficient querying
- ✅ Vacuum regularly to clean up old files (but retain enough for your time travel needs)
- ✅ Optimize periodically to compact small files
- ✅ Track versions for incremental processing using dt.version() and dt.history()
- ⚠️ Don't disable vacuum entirely - storage bloat
- ⚠️ Don't vacuum too aggressively - you'll lose time travel capability
Authentication
See @data-engineering-storage-authentication for detailed cloud auth patterns.
For S3:
- Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
- IAM roles (EC2, ECS, Lambda) are used when no static keys are supplied
- For S3-compatible (MinIO): set AWS_ENDPOINT_URL in the environment or in storage_options
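For an S3-compatible endpoint such as MinIO, the options can be assembled as in the sketch below. The helper name `minio_storage_options` is hypothetical, and setting `AWS_ALLOW_HTTP` for plain-HTTP endpoints is an assumption about your deltalake version; verify the key against the storage options your release documents.

```python
from typing import Dict


def minio_storage_options(
    endpoint_url: str,
    access_key: str,
    secret_key: str,
    region: str = "us-east-1",
) -> Dict[str, str]:
    """Build storage_options for an S3-compatible endpoint."""
    options = {
        "AWS_ENDPOINT_URL": endpoint_url,
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_REGION": region,
    }
    if endpoint_url.startswith("http://"):
        # Local test servers often run without TLS (assumed option name)
        options["AWS_ALLOW_HTTP"] = "true"
    return options


# Placeholder endpoint and credentials for a local test server
local_options = minio_storage_options("http://localhost:9000", "minio-user", "minio-password")
```

The resulting dictionary is passed as `storage_options` in the same way as the S3 configuration in Method 1.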
Related
- @data-engineering-storage-lakehouse/delta-lake - Delta Lake concepts and API
- @data-engineering-core - Using Delta with DuckDB
- @data-engineering-storage-lakehouse - Comparisons with Iceberg, Hudi