# Designing Data Storage

Comprehensive guide to selecting and using modern data storage formats for analytics and machine learning. Covers six file formats (Parquet, Arrow, Lance, Zarr, Avro, ORC) and three lakehouse table formats (Delta Lake, Apache Iceberg, Apache Hudi).

## Table of Contents
- Quick Format Comparison
- When to Use Which?
- Format Selection Matrix
- Detailed Reference Guides
- Code Examples
- Best Practices
- Compression Codec Comparison
- Related Skills
- References
## Quick Format Comparison

### File Formats
| Format | Type | Best For | Compression | Schema Evolution | Random Access |
|---|---|---|---|---|---|
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) |
| Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) |
| Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ | ✅ (multi-modal) |
| Zarr | Chunked arrays | ML, geospatial, N-dim | ✅ (Blosc, gzip) | ✅ (chunks) | ✅ (chunk-level) |
| Avro | Row-based | Streaming, Kafka | ✅ (deflate, snappy) | ✅ (full schema) | ❌ (sequential) |
| ORC | Columnar | Hive, Hadoop | ✅ (ZLIB, Snappy) | Limited | ✅ (stripe-level) |
### Lakehouse Table Formats
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID Transactions | ✅ | ✅ | ✅ |
| Time Travel | ✅ | ✅ | ✅ |
| Schema Evolution | ✅ | Advanced (branching) | ✅ |
| Primary Ecosystem | Spark/Databricks | Engine-agnostic | Spark (CDC focus) |
| Write Optimization | Copy-on-write | CoW, Merge-on-Read | CoW, Merge-on-Read |
| Python API | deltalake (pure), PySpark | pyiceberg (pure) | PySpark only |
| Best For | Spark ecosystems | Multi-engine analytics | Change data capture |
## When to Use Which?

### File Formats

**Choose Parquet when:**
- You need broad compatibility (Spark, DuckDB, Polars, pandas)
- Analytics queries with filtering/aggregation
- Data lake storage with partitioning
- Mature ecosystem with best compression support
**Choose Arrow/Feather when:**
- Zero-copy sharing between processes (IPC)
- Fast serialization/deserialization for ML training
- In-memory format persistence
- Need the Arrow ecosystem (compute kernels, CUDA, etc.)
**Choose Lance when:**
- Machine learning pipelines with embeddings/vectors
- Need multi-modal data (text + images + audio + vectors)
- Versioned datasets with Git-like branching
- Cloud-native (S3/GCS) with no metadata catalog required
**Choose Zarr when:**
- N-dimensional arrays (tensors, satellite imagery, medical scans)
- Chunked, compressed storage for ML
- Parallel reads/writes across chunks
- Cloud-optimized (s3://, gs:// with fsspec)
**Choose Avro when:**
- Streaming to Kafka/Kinesis
- Schema evolution is critical (backward/forward)
- Row-based access pattern
- Need to serialize objects/records
**Choose ORC when:**
- Working primarily with Hive/Hadoop
- Hive ACID transactions
- Legacy big data pipelines
### Lakehouse Table Formats

**Choose Delta Lake when:**
- You're in the Spark/Databricks ecosystem
- Need mature tooling with the pure-Python `deltalake` library
- Want the simplest time travel and schema evolution
**Choose Apache Iceberg when:**
- You need engine-agnostic tables (Spark, Trino, Flink, DuckDB)
- Advanced schema branching and partition evolution
- Prefer explicit catalog abstraction
**Choose Apache Hudi when:**
- You're building CDC pipelines from Kafka/DB logs
- Need upsert/delete support with streaming
- Low-latency incremental processing
## Format Selection Matrix
| Use Case | Recommended Format | Reason |
|---|---|---|
| Data lake analytics | Parquet or Delta Lake | Mature, partitioning, ecosystem |
| ML training data | Arrow/Feather or Lance | Zero-copy, vector support |
| Geospatial arrays | Zarr | Chunked, N-dimensional, cloud-optimized |
| Streaming/Kafka | Avro | Schema evolution, row-based |
| Legacy Hive | ORC | Compatibility |
| Feature stores | Lance or Delta Lake | Versioning, vectors |
| IPC between processes | Arrow IPC or Feather | Zero-copy, fast |
| Multi-engine analytics | Apache Iceberg | Engine-agnostic, catalog flexibility |
| CDC pipelines | Apache Hudi | Built for upserts/streaming |
| Quick exports | Parquet (Zstd) | Good compression/decompression speed |
## Detailed Reference Guides

### File Format References

- `references/parquet.md` - Deep dive on Parquet layout, writer settings, partitioning patterns
- `references/format-selection-guide.md` - Comprehensive decision matrix and selection criteria

### Lakehouse Table Format References

- `references/delta-lake.md` - Pure-Python API, PySpark integration, time travel, cloud storage
- `references/iceberg.md` - Catalog configuration, schema evolution, partition evolution
- `references/hudi.md` - CDC patterns, upserts, Copy-on-Write vs Merge-on-Read
## Code Examples

### Parquet (Columnar Analytics)

```python
import polars as pl
import pyarrow.parquet as pq
import pyarrow as pa

# Write with Polars
df = pl.DataFrame({"id": [1, 2, 3], "value": [100.0, 200.0, 150.0]})
df.write_parquet("data.parquet", compression="zstd")

# Write with PyArrow (more control)
table = df.to_arrow()  # Polars -> Arrow; use pa.Table.from_pandas for pandas frames
pq.write_table(
    table,
    "data.parquet",
    compression="ZSTD",
    compression_level=3,
    row_group_size=100000,
    use_dictionary=True
)

# Read with column pruning
df = pl.read_parquet("data.parquet", columns=["id", "value"])

# Dataset scanning with predicate pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = lazy_df.filter(pl.col("value") > 100).collect()
```
### Delta Lake (ACID Table Format)

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# Create Delta table
data = pa.table({
    "id": [1, 2, 3],
    "value": [100.0, 200.0, 150.0],
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"]
})
write_deltalake("data/delta-table", data, mode="overwrite")

# Read with time travel
dt = DeltaTable("data/delta-table")
dt.load_as_version(1)  # load a specific version (load_version in older releases)
df = dt.to_pandas()

# Upsert (merge); aliases must match the names used in the predicate
new_data = pa.table({"id": [2, 4], "value": [250.0, 400.0]})
dt.merge(
    source=new_data,
    predicate="target.id = source.id",
    source_alias="source",
    target_alias="target"
).when_matched_update_all().when_not_matched_insert_all().execute()
```
### Apache Iceberg (Engine-Agnostic)

```python
from pyiceberg.catalog import load_catalog
import pyarrow as pa

# Load a catalog (AWS Glue shown; Hive/REST/SQL catalogs configure the same way)
catalog = load_catalog("glue", **{
    "type": "glue",
    "s3.region": "us-east-1",
})

# Create a table from an Arrow schema and append Arrow data
data = pa.table({"city": ["NYC", "SF"], "population": [8_300_000, 870_000]})
table = catalog.create_table("default.events", schema=data.schema)
table.append(data)

# Filtered scan; pass snapshot_id=... to time-travel to an older snapshot
df = table.scan(row_filter="population > 1000000").to_pandas()
```
### Lance (ML-Native)

```python
import lancedb
import polars as pl

# Create Lance dataset
db = lancedb.connect("./data.lance")
df = pl.DataFrame({
    "id": [1, 2, 3],
    "text": ["Hello", "World", "ML"],
    "vector": [[0.1] * 128, [0.2] * 128, [0.3] * 128]
})
table = db.create_table("my_table", df)

# Vector search
results = table.search([0.1] * 128).limit(5).to_pandas()
```
### Zarr (N-Dimensional Arrays)

```python
import zarr
import numpy as np

# Create chunked, compressed array (zarr-python 2.x API;
# 3.x renames compressor= to compressors=)
z = zarr.open(
    'data.zarr',
    mode='w',
    shape=(1000000, 1000),
    chunks=(10000, 1000),
    dtype='f4',
    compressor=zarr.Blosc(cname='zstd', clevel=3)
)

# Write and read partial (only loads needed chunks)
z[:10000, :] = np.random.rand(10000, 1000).astype('f4')
slice_data = z[5000:6000, :]
```
## Best Practices

### File Formats
- ✅ Default to Parquet - Broadest compatibility, good compression
- ✅ Use Arrow/Feather for ML staging - Zero-copy between frameworks
- ✅ Use Lance when vectors are first-class - No separate vector DB needed
- ✅ Use Zarr for N-dim arrays - Geospatial, video, 3D data
- ✅ Compress everything - Snappy (fast) or Zstd (balanced)
- ✅ Partition wisely - By date/region/tenant to enable pruning
- ❌ Don't use Avro for analytics - No column pruning, row-based
- ❌ Don't use ORC unless in Hive/Hadoop world
### Lakehouse Formats
- ✅ Use partitions - Partition by date/region for predicate pushdown
- ✅ Vacuum regularly - Clean up old files to avoid storage bloat
- ✅ Optimize layouts - Compact small files for better performance
- ✅ Catalog choice - AWS Glue for AWS, Hive for on-prem, REST for SaaS
- ✅ Monitor metadata - Table metadata grows with operations
- ❌ Don't vacuum too aggressively - Keep time travel window
- ❌ Don't skip OPTIMIZE - Small files hurt performance
## Compression Codec Comparison
| Codec | Compression Ratio | Speed | Best For |
|---|---|---|---|
| Snappy | Low (~2:1) | ⚡⚡⚡ Fast | Fast analytics, default |
| Zstd | Medium-High (~4:1) | ⚡⚡ Fast | General purpose |
| LZ4 | Low-Medium (~2.5:1) | ⚡⚡⚡ Very fast | Real-time streaming |
| Gzip | High (~5:1) | ⚡ Slow | Archival, cold storage |
| Blosc (zstd) | Medium | ⚡⚡ | Zarr arrays |
## Related Skills

- @accessing-cloud-storage - S3, GCS, Azure authentication and I/O patterns
- @building-data-pipelines - ETL patterns using these storage formats
- @engineering-ai-pipelines - Vector storage, Lance, embeddings workflows
- @managing-data-catalogs - Hive, Glue, REST catalogs for Iceberg
## References

See the format-specific guides under `references/` in the Detailed Reference Guides section above.