data-engineering-storage-lakehouse
# Lakehouse Formats
Lakehouse formats add ACID transactions, schema evolution, and time travel to data lakes stored on object storage (S3, GCS, Azure). This skill covers the three major open table formats: Delta Lake, Apache Iceberg, and Apache Hudi.
## Quick Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID Transactions | ✅ | ✅ | ✅ |
| Time Travel | ✅ | ✅ | ✅ |
| Schema Evolution | ✅ | Advanced (branching) | ✅ |
| Primary Ecosystem | Spark/Databricks | Engine-agnostic | Spark (CDC focus) |
| Write Optimization | Copy-on-write | CoW, Merge-on-Read | CoW, Merge-on-Read |
| Python API | `deltalake` (pure), PySpark | `pyiceberg` (pure) | PySpark only |
| Best For | Spark ecosystems, Databricks | Multi-engine analytics | Change data capture, streaming |
## When to Use Which?

- Delta Lake: You're in the Spark/Databricks ecosystem and want mature tooling with the pure-Python `deltalake` library
- Apache Iceberg: You need engine-agnostic tables (Spark, Trino, Flink, DuckDB) and advanced schema branching
- Apache Hudi: You're building CDC pipelines from Kafka/DB logs and need upsert/delete support
## Interoperability
Apache XTable and Delta UniForm (2024+) enable cross-format reads without conversion. Platforms like Databricks Unity Catalog and Snowflake support multiple formats natively, reducing vendor lock-in.
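With Delta UniForm, for example, a Delta table can expose Iceberg metadata by setting table properties at creation time. A hedged config sketch (property names as documented for Delta Lake 3.x; the table and columns are placeholders):

```sql
-- Delta UniForm: ask Delta to also write Iceberg metadata for this table
CREATE TABLE sales (id BIGINT, amount DOUBLE)
USING DELTA
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

Iceberg-capable engines can then read the same files through the generated Iceberg metadata, with no copy of the data.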
## Related Skills

- @data-engineering-storage-remote-access/integrations/delta-lake - Delta Lake on S3/GCS/Azure
- @data-engineering-storage-remote-access/integrations/iceberg - Iceberg with cloud catalogs
- @data-engineering-orchestration/dbt - dbt adapters for Delta/Iceberg
- @data-engineering-storage-remote-access - fsspec, PyArrow filesystem for cloud access
## Skill Dependencies

This skill assumes familiarity with:

- @data-engineering-core - Polars, DuckDB, PyArrow basics
- @data-engineering-storage-remote-access - Cloud storage access patterns
## Detailed Guides

### Delta Lake

See: @data-engineering-storage-lakehouse/delta-lake.md

- Pure-Python API (`deltalake` package)
- PySpark integration
- Time travel (`versionAsOf`, `timestampAsOf`)
- Schema evolution (add/drop/rename/upcast)
- Vacuum and optimize
- S3/GCS/Azure storage integration
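Delta's time travel falls out of its transaction-log design: each commit is a JSON file of add/remove actions, and a table version is just the replay of all commits up to that point. A minimal stdlib-only sketch of the mechanism (the in-memory `commits` dict and file names stand in for real `_delta_log` files):

```python
import json

# Miniature Delta-style log: one JSON-lines commit per version, each line
# an action that adds or removes a data file.
commits = {
    0: ['{"add": {"path": "part-0.parquet"}}'],
    1: ['{"add": {"path": "part-1.parquet"}}'],
    2: ['{"remove": {"path": "part-0.parquet"}}',
        '{"add": {"path": "part-2.parquet"}}'],
}

def files_at_version(version):
    """Replay commit actions 0..version to get the live file set."""
    live = set()
    for v in range(version + 1):
        for line in commits[v]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(files_at_version(1))  # ['part-0.parquet', 'part-1.parquet']
print(files_at_version(2))  # ['part-1.parquet', 'part-2.parquet']
```

Reading an old version only requires replaying the log to that point, which is why time travel is cheap until old files are vacuumed away.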
### Apache Iceberg
See: @data-engineering-storage-lakehouse/iceberg.md
- PyIceberg catalog abstraction (Hive, AWS Glue, REST)
- Schema evolution with branch support
- Partition evolution
- Time travel and versioning
- Local catalog for development
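Partition evolution is possible because Iceberg derives partition values from source columns through transforms recorded in the table spec, rather than storing partition columns literally. A simplified stdlib sketch of two such transforms (real Iceberg buckets with a murmur3 hash; the modulo here is only an illustrative stand-in):

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    # Iceberg's day() transform: derive a date partition from a timestamp
    return ts.strftime("%Y-%m-%d")

def bucket_transform(n: int, value: int) -> int:
    # Stand-in for Iceberg's bucket(n) transform (really murmur3 % n)
    return value % n

# Queries filter on id/ts; the engine maps predicates onto these derived
# partition values, so the spec can change without rewriting old data.
row = {"id": 42, "ts": datetime(2024, 1, 15, 9, 30)}
partition = (day_transform(row["ts"]), bucket_transform(16, row["id"]))
print(partition)  # ('2024-01-15', 10)
```

Because readers see only the source columns, swapping `day(ts)` for `month(ts)` later changes how new files are laid out without breaking existing queries.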
### Apache Hudi
See: @data-engineering-storage-lakehouse/hudi.md
- Copy-on-write and Merge-on-Read modes
- CDC integration (Debezium, Kafka)
- Hoodie tables, indexes, bloom filters
- Querying via Spark
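The trade-off between the two modes can be sketched in a few lines: Copy-on-Write pays a file-rewrite cost on every update, while Merge-on-Read appends changes to a log and pays a merge cost at query time. A toy model (not the Hudi API; dicts stand in for parquet base files and log blocks) of the MoR path:

```python
# Merge-on-Read sketch: cheap appends on write, merge work on read.
base = {"k1": "v1", "k2": "v2"}  # contents of the immutable base file
log = []                          # delta log entries appended by upserts

def upsert_mor(key, value):
    log.append((key, value))      # write path: append only, no rewrite

def read_mor():
    merged = dict(base)
    for key, value in log:        # read path: replay the log over the base
        merged[key] = value
    return merged

upsert_mor("k2", "v2-new")
upsert_mor("k3", "v3")
print(read_mor())  # {'k1': 'v1', 'k2': 'v2-new', 'k3': 'v3'}
```

Copy-on-Write would instead rebuild `base` on each upsert, which favors read-heavy workloads; MoR favors the write-heavy, streaming CDC workloads Hudi targets.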
## Common Patterns

### Time Travel Queries

```python
# Delta Lake: pin the table to a specific version at load time
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table", version=5)
df = dt.to_pandas()
```

```python
# Iceberg (PyIceberg): scan the table at a specific snapshot
table = catalog.load_table("db.table")
snapshot_id = table.history()[-2].snapshot_id  # pick an earlier snapshot
df = table.scan(snapshot_id=snapshot_id).to_pandas()
```
### Schema Evolution

```python
# Delta Lake: opt in to schema evolution at write time
from deltalake import write_deltalake

write_deltalake(
    "s3://bucket/delta-table",
    df_with_new_column,
    mode="append",
    schema_mode="merge",  # merge new columns into the table schema
)
```

```python
# Iceberg: explicit, transactional schema update
from pyiceberg.types import StringType

with table.update_schema() as update:
    update.add_column("new_field", StringType(), required=False)
```
### Incremental Processing

```python
# Read only changes since the last checkpoint
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")
last_version = get_last_processed_version()  # your checkpoint store

# history() returns a list of commit-info dicts; find unprocessed commits
new_commits = [c for c in dt.history() if c["version"] > last_version]

# With Change Data Feed enabled on the table, read the row-level changes
# since the checkpoint as a PyArrow table
changes = dt.load_cdf(starting_version=last_version + 1).read_all()
```
## Best Practices

- Use partitions: Partition by date/region to enable predicate pushdown
- Vacuum regularly: Clean up old files to avoid storage bloat (Delta: `vacuum()`, Iceberg: `expire_snapshots()`)
- Optimize layouts: Compact small files (Delta: `OPTIMIZE`, Iceberg: `rewrite_data_files()`)
- Choose the right catalog: AWS Glue on AWS, Hive Metastore on-prem, REST catalogs for SaaS
- Size transactions sensibly: Batch writes for throughput, but avoid overly large transactions
- Monitor table metadata: Metadata grows with every commit; expire old snapshots and versions
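Compaction is essentially bin-packing: group small files up to a target size so each rewrite emits one well-sized file. A stdlib sketch of that idea (file names, sizes, and the 128 MB target are made up; real `OPTIMIZE`/`rewrite_data_files()` also handle sort order, filters, and concurrency):

```python
# Bin-pack small files into compaction groups near a target size (MB).
TARGET_MB = 128

files = [("a.parquet", 5), ("b.parquet", 60), ("c.parquet", 90),
         ("d.parquet", 10), ("e.parquet", 120)]

groups, current, current_size = [], [], 0
for name, size in sorted(files, key=lambda f: f[1]):  # smallest first
    if current and current_size + size > TARGET_MB:
        groups.append(current)  # close the group before it overshoots
        current, current_size = [], 0
    current.append(name)
    current_size += size
if current:
    groups.append(current)

# Each inner list would be rewritten into a single larger file
print(groups)  # [['a.parquet', 'd.parquet', 'b.parquet'], ['c.parquet'], ['e.parquet']]
```

Files already near the target end up in singleton groups and can be skipped, which is why compaction mainly pays off on tables with many tiny files.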