data-engineering-storage-formats
Data Storage Formats
Comprehensive guide to modern data serialization formats for analytics and machine learning: Parquet, Apache Arrow, Lance, Zarr, Avro, and ORC. Learn compression tradeoffs, partitioning strategies, and when to use each format.
Quick Comparison
| Format | Type | Best For | Compression | Schema Evolution | Random Access |
|---|---|---|---|---|---|
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) |
| Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) |
| Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ | ✅ (multi-modal) |
| Zarr | Chunked arrays | ML, geospatial, N-dim | ✅ (Blosc, gzip) | ✅ (chunks) | ✅ (chunk-level) |
| Avro | Row-based | Streaming, Kafka | ✅ (deflate, snappy) | ✅ (full schema) | ❌ (sequential) |
| ORC | Columnar | Hive, Hadoop | ✅ (ZLIB, Snappy) | Limited | ✅ (stripe-level) |
When to Use Which?
Choose Parquet when:
- You need broad compatibility (Spark, DuckDB, Polars, pandas)
- Analytics queries with filtering/aggregation
- Data lake storage with partitioning
- Mature ecosystem with best compression support
Choose Arrow/Feather when:
- Zero-copy sharing between processes (IPC)
- Fast serialization/deserialization for ML training
- In-memory format persistence
- Need the Arrow ecosystem (compute kernels, Flight, CUDA support, etc.)
Choose Lance when:
- Machine learning pipelines with embeddings/vectors
- Need multi-modal data (text + images + audio + vectors)
- Versioned datasets with Git-like branching
- Cloud-native (S3/GCS) with no metadata catalog required
Choose Zarr when:
- N-dimensional arrays (tensors, satellite imagery, medical scans)
- Chunked, compressed storage for ML
- Parallel reads/writes across chunks
- Cloud-optimized (s3://, gs:// with fsspec)
Choose Avro when:
- Streaming to Kafka/Kinesis
- Schema evolution is critical (backward/forward)
- Row-based access pattern
- Need to serialize objects/records
Choose ORC when:
- Working primarily with Hive/Hadoop
- Hive ACID transactions
- Legacy big data pipelines
Skill Dependencies
- @data-engineering-core - Polars/DuckDB to read/write these formats
- @data-engineering-storage-remote-access - Cloud storage backends
- @data-engineering-storage-lakehouse - Table formats (Delta/Iceberg/Hudi) built on these
Detailed Guides
Parquet
See: parquet.md (detailed deep dive)
Parquet is the de facto standard for columnar analytics storage.
```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

# Write with Polars
df = pl.DataFrame({"id": [1, 2, 3], "value": [100.0, 200.0, 150.0]})
df.write_parquet("data.parquet", compression="zstd")

# Write with PyArrow (more control)
table = df.to_arrow()  # Polars -> Arrow conversion (zero-copy)
pq.write_table(
    table,
    "data.parquet",
    compression="zstd",
    compression_level=3,
    row_group_size=100_000,  # Target rows per row group
    use_dictionary=True,     # Dictionary encoding for strings
)

# Read with column pruning
df = pl.read_parquet("data.parquet", columns=["id", "value"])

# Dataset scanning with predicate pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = lazy_df.filter(pl.col("value") > 100).collect()
```
Key concepts:
- Row groups: Horizontal partitioning; min/max statistics let readers skip entire row groups
- Column chunks: Within each row group, each column is stored contiguously, enabling column pruning
- Pages: Smallest unit of encoding/compression within a column chunk
- Statistics: min/max/null counts used for predicate pushdown
- Dictionary encoding: Compact representation for low-cardinality strings
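The row-group skipping above can be sketched in pure Python (not a real Parquet reader; the statistics below are made up for illustration). For a predicate like `value > 100`, a reader keeps only the row groups whose max could satisfy it:

```python
def prune_row_groups(row_group_stats, predicate_min):
    """Keep row groups whose max value can satisfy `value > predicate_min`.

    row_group_stats: list of (min, max) tuples, one per row group,
    mimicking the statistics stored in Parquet footers.
    """
    return [
        i for i, (lo, hi) in enumerate(row_group_stats)
        if hi > predicate_min  # otherwise the whole group is skipped
    ]

# Hypothetical min/max statistics for four row groups of a `value` column
stats = [(0, 50), (40, 120), (90, 300), (10, 99)]

# Scanning for `value > 100` only reads row groups 1 and 2
print(prune_row_groups(stats, 100))  # → [1, 2]
```

Engines like DuckDB and Polars apply exactly this kind of pruning automatically when you use `scan_parquet` with a filter.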
Apache Arrow (Feather/IPC)
Arrow is an in-memory columnar format. Feather (v1/v2) and IPC are on-disk/serialization formats.
```python
import pyarrow as pa
import pyarrow.feather as feather
import polars as pl

# Create Arrow table
table = pa.table({
    "id": [1, 2, 3],
    "value": [100.0, 200.0, 150.0],
    "category": ["A", "B", "A"],
})

# Write Feather file (Arrow IPC on disk)
feather.write_feather(table, "data.feather")

# Read back
table2 = feather.read_table("data.feather")

# Polars integration (zero-copy)
df = pl.from_arrow(table)  # No copy!
df.write_ipc("data.ipc")   # IPC format (stream or file)

# Arrow Flight RPC (network streaming)
from pyarrow.flight import FlightClient, Ticket
client = FlightClient("grpc+tcp://localhost:5005")
reader = client.do_get(Ticket(b"my-query"))  # ticket contents are server-defined
table = reader.read_all()
```
When to use Arrow/Feather:
- Fast serialization for ML (TensorFlow, PyTorch)
- Inter-process communication (shared memory, files)
- Zero-copy between Polars/Pandas/PyArrow
- Not ideal for large-scale data lakes (no built-in partitioning)
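The zero-copy idea behind Arrow IPC can be illustrated with only the standard library: instead of deserializing, a reader memory-maps the file and views the bytes in place as typed values. This is a conceptual sketch, not Arrow's actual IPC layout; the file name is hypothetical.

```python
import mmap
import struct

# Write a little-endian float64 "column" to disk (stand-in for an IPC buffer)
values = [100.0, 200.0, 150.0]
with open("column.bin", "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))

# Map the file and view it as float64s without copying the bytes
with open("column.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    col = memoryview(mm).cast("d")  # zero-copy typed view over the mapping
    second = col[1]                 # reads straight from the page cache
    col.release()
    mm.close()

print(second)  # → 200.0
```

Arrow does the same thing at scale: `pyarrow.memory_map` plus `pyarrow.ipc.open_file` yields record batches backed by the mapped file rather than copies.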
Lance
Lance is an ML-native columnar format built on Arrow, with integrated vector search and versioning.
```python
import lancedb
import polars as pl

# Connect to a LanceDB database (a directory holding .lance tables)
db = lancedb.connect("./lancedb")
df = pl.DataFrame({
    "id": [1, 2, 3],
    "text": ["Hello world", "Goodbye world", "Machine learning"],
    "vector": [[0.1] * 128, [0.2] * 128, [0.3] * 128],  # Embeddings
})

# Write (creates a .lance dataset for the table)
table = db.create_table("my_table", df)

# Append more rows (any frame with the same schema)
table.add(df)

# Vector search
results = table.search([0.1] * 128).limit(5).to_pandas()

# Versioned updates (like git)
table = db.create_table("versioned", df, mode="overwrite")
# Each write creates a new version
table.checkout(1)  # Read a previous version

# Cloud storage
db = lancedb.connect("s3://bucket/dataset/")
```
Lance advantages:
- Built-in vector indexes (IVF_PQ, HNSW) - no separate DB needed
- Multi-modal: store images, audio, text, vectors in same table
- Version control for datasets
- Zero-copy reads via memory mapping
- No metadata catalog needed (self-contained)
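Lance's IVF_PQ index rests on the inverted-file (IVF) idea: assign vectors to their nearest centroid at index-build time, then at query time probe only the closest partitions instead of scanning everything. A toy pure-Python sketch (omitting the product-quantization step; all data here is made up):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid's partition."""
    parts = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        best = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        parts[best].append(vid)
    return parts

def search(query, vectors, centroids, parts, nprobe=1):
    """Probe only the nprobe nearest partitions, not the full dataset."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in parts[c]]
    return min(candidates, key=lambda vid: sq_dist(query, vectors[vid]))

vectors = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [1.0, 1.0]]
centroids = [[0.15, 0.1], [0.95, 0.9]]  # in practice produced by k-means
parts = build_ivf(vectors, centroids)
print(search([0.95, 0.95], vectors, centroids, parts))  # → 3
```

Raising `nprobe` trades latency for recall, which is the same knob LanceDB exposes on IVF indexes.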
Zarr
Zarr is a chunked, compressed N-dimensional array format, popular in ML, geospatial, and scientific computing.
```python
import numpy as np
import zarr
from numcodecs import Blosc

# Create Zarr array (chunked, compressed) - zarr v2-style API
z = zarr.open(
    "data.zarr",
    mode="w",
    shape=(1_000_000, 1000),  # Large 2D array
    chunks=(10_000, 1000),    # Chunk size
    dtype="f4",
    compressor=Blosc(cname="zstd", clevel=3),
)

# Write chunks
z[:10_000, :] = np.random.rand(10_000, 1000).astype("f4")

# Read partial (only loads the needed chunks)
window = z[5_000:6_000, :]

# Group (like HDF5 groups); placeholder arrays for illustration
g = zarr.open_group("experiment.zarr", mode="w")
g.create_dataset("images", data=np.zeros((10, 64, 64), dtype="f4"), chunks=True)
g.create_dataset("labels", data=np.zeros(10, dtype="i4"))

# Cloud storage (s3://)
import s3fs
fs = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root="mybucket/data.zarr", s3=fs)
zarr.open(store=store, mode="w", shape=(1000, 1000), chunks=(100, 100), dtype="f4")
```
Zarr advantages:
- Parallel reads/writes across chunks
- Cloud-optimized (each chunk is a separate object)
- Schema flexibility (groups, hierarchies)
- Good for terabyte-scale arrays
- Growing ecosystem: xarray, dask, napari
Avro
Row-based format with rich schema evolution. Common in streaming (Kafka).
```python
import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "event_type", "type": "string"},
        {"name": "timestamp", "type": "long"},  # Unix epoch
    ],
}

# Write Avro file
with open("events.avro", "wb") as out:
    fastavro.writer(out, schema, [
        {"id": 1, "event_type": "click", "timestamp": 1700000000},
        {"id": 2, "event_type": "view", "timestamp": 1700000001},
    ])

# Read
with open("events.avro", "rb") as fo:
    records = list(fastavro.reader(fo))

# Kafka integration (confluent-kafka)
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry.avro import AvroSerializer
# Schema Registry integration ensures producer/consumer compatibility
```
Avro vs Parquet:
- Avro: row-based, append-only, streaming-friendly
- Parquet: columnar, analytics-friendly, predicate pushdown
- Convert: `pl.read_avro()` → `write_parquet()` for ETL
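Avro's schema evolution works by resolving a writer schema against a reader schema at read time: fields the reader added are filled from defaults, fields it dropped are ignored. A loose pure-Python sketch of the idea (not fastavro's implementation; using `None` as a "no default" sentinel is a simplification, since Avro can have genuine null defaults):

```python
def resolve(record, reader_fields):
    """Project a writer-schema record onto a reader schema.

    reader_fields: list of (name, default) pairs; default=None marks
    a required field with no default.
    """
    out = {}
    for name, default in reader_fields:
        if name in record:
            out[name] = record[name]  # field present in both schemas
        elif default is not None:
            out[name] = default       # field added by the reader: use its default
        else:
            raise ValueError(f"no value or default for required field {name!r}")
    return out

old_record = {"id": 1, "event_type": "click"}  # written before `source` existed
reader_fields = [("id", None), ("event_type", None), ("source", "web")]
print(resolve(old_record, reader_fields))
# → {'id': 1, 'event_type': 'click', 'source': 'web'}
```

This is why new fields in an Avro schema should always carry defaults: without one, old records become unreadable under the new schema.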
ORC
Optimized Row Columnar, primarily for Hive/Hadoop ecosystems.
```python
import pyarrow as pa
import pyarrow.orc as orc

# Write
table = pa.table({
    "id": [1, 2, 3],
    "value": [100.0, 200.0, 150.0],
})
orc.write_table(table, "data.orc")

# Read
table = orc.read_table("data.orc")
df = table.to_pandas()

# ORC stores stripe-level statistics (similar to Parquet row groups)
```
ORC vs Parquet:
- ORC: Hive-centric, ACID transactions, better compression for Hive queries
- Parquet: More ecosystem support (Spark, DuckDB, Polars), better column pruning
- Modern stacks typically prefer Parquet
Format Selection Guide
Use Case Matrix
| Use Case | Recommended Format | Reason |
|---|---|---|
| Data lake analytics | Parquet | Mature, partitioning, ecosystem |
| ML training data | Arrow/Feather or Lance | Zero-copy, vector support |
| Geospatial arrays | Zarr | Chunked, N-dimensional, cloud-optimized |
| Streaming/Kafka | Avro | Schema evolution, row-based |
| Legacy Hive | ORC | Compatibility |
| Feature stores | Lance or Delta | Versioning, vectors |
| IPC between processes | Arrow IPC or Feather | Zero-copy, fast |
| Quick exports | Parquet (Zstd) | Good compression/decompression speed |
Compression Codec Comparison
| Codec | Compression Ratio | Speed (Compress/Decompress) | Best For |
|---|---|---|---|
| Snappy | Low (~2:1) | ⚡⚡⚡ Fast | Fast analytics, default Parquet |
| Zstd | Medium-High (~4:1) | ⚡⚡ Fast | General purpose, good balance |
| LZ4 | Low-Medium (~2.5:1) | ⚡⚡⚡ Very fast | Real-time streaming |
| Gzip | High (~5:1) | ⚡ Slow | Archival, cold storage |
| Blosc (zstd) | Medium | ⚡⚡ | Zarr arrays |
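The ratio/speed tradeoff in the table can be measured with standard-library codecs alone (zlib is essentially the gzip row; Snappy and Zstd need third-party packages). A small harness over deliberately repetitive sample data:

```python
import lzma
import time
import zlib

# Repetitive CSV-like sample: compressible, like real log/event data
data = b"user_id,event,ts\n" + b"1001,click,1700000000\n" * 20_000

for name, compress in [
    ("zlib level 1", lambda d: zlib.compress(d, 1)),
    ("zlib level 9", lambda d: zlib.compress(d, 9)),
    ("lzma        ", lzma.compress),
]:
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name} ratio {len(data) / len(out):7.1f}:1 in {dt * 1000:6.1f} ms")
```

Exact numbers depend on the data and machine, but the shape of the result matches the table: higher levels and heavier codecs buy ratio at the cost of compression time.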
Advanced Patterns
Converting Formats
```python
import polars as pl
import pyarrow.feather as feather
import lancedb

# Avro → Parquet (ETL)
df = pl.read_avro("input.avro")
df.write_parquet("output.parquet")

# Arrow → Lance (ML pipeline)
table = feather.read_table("data.feather")
db = lancedb.connect("./dataset")
db.create_table("embeddings", table)

# Zarr → Parquet (geospatial to analytics)
import dask.array as da
z = da.from_zarr("sar.zarr")
ddf = z.mean(axis=0).to_dask_dataframe()  # Aggregate, then convert lazily
ddf.to_parquet("summary.parquet")
```
Partitioning Strategies
Parquet partition discovery:
```python
import pyarrow.dataset as ds

# Hive-style paths: year=2024/month=01/day=01/
dataset = ds.dataset(
    "s3://bucket/events/",
    filesystem=s3_fs,  # e.g. a pyarrow.fs.S3FileSystem
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Pointing at a single partition directory: year=2024/month=01/
dataset = ds.dataset(
    "s3://bucket/events/year=2024/month=01/",
    filesystem=s3_fs,
)
```
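What partition pruning buys: the directory names encode column values, so a filter can discard whole directories before a single file is opened. A pure-Python sketch with hypothetical paths:

```python
def partition_values(path):
    """Parse Hive-style key=value segments out of a file path."""
    parts = {}
    for seg in path.split("/"):
        if "=" in seg:
            key, _, val = seg.partition("=")
            parts[key] = val
    return parts

paths = [
    "events/year=2024/month=01/part-0.parquet",
    "events/year=2024/month=02/part-0.parquet",
    "events/year=2023/month=12/part-0.parquet",
]

# Filter `year = 2024` without opening any file
kept = [p for p in paths if partition_values(p).get("year") == "2024"]
print(kept)  # the 2023 path is pruned
```

This is why partitioning by the columns you filter on most (date, region, tenant) matters: pruning happens on path metadata, not file contents.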
Lance: Lance datasets are self-describing and rely on indexes rather than Hive-style directory partitioning; a dataset is written with `lance.write_dataset(table, "data.lance")`.
Zarr chunking:
```python
# Chunk by spatial region for geospatial data
z = zarr.open(
    "satellite.zarr",
    mode="w",
    shape=(10_000, 10_000),  # 10000x10000 pixels
    chunks=(1_000, 1_000),   # 1000x1000 tiles
    dtype="float32",
)
```
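Chunk shape should match the access pattern because a read must fetch every chunk its slice overlaps. The set of chunks a slice touches is simple index arithmetic, sketched here in pure Python for the 1000x1000 tiling above:

```python
def chunks_touched(start, stop, chunk):
    """Chunk indices along one axis that a [start:stop] slice must load."""
    return list(range(start // chunk, (stop - 1) // chunk + 1))

# 10000x10000 array in 1000x1000 tiles: a 500-row, full-width window
rows = chunks_touched(4_750, 5_250, 1_000)  # row chunks 4 and 5
cols = chunks_touched(0, 10_000, 1_000)     # all 10 column chunks
print(len(rows) * len(cols), "chunks read")  # → 20 chunks read
```

A window that straddles a chunk boundary doubles the I/O along that axis, which is why time-series workloads chunk along time and image workloads chunk into spatial tiles.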
Performance Tuning
Parquet
- Row group size: 100K-1M rows for optimal skipping
- Dictionary encoding: Enable for low-cardinality strings
- Compression: Zstd level 3 for balance, Snappy for speed
- Column order: Put high-selectivity columns first (min/max statistics better)
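Dictionary encoding (the bullet above) stores each distinct string once plus small integer codes per row. A minimal sketch of the idea, not Parquet's actual page layout:

```python
def dictionary_encode(values):
    """Encode a column as (dictionary, codes), like Parquet's dictionary pages."""
    dictionary, codes = [], []
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

col = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, codes = dictionary_encode(col)
print(dictionary, codes)  # → ['US', 'DE', 'FR'] [0, 1, 0, 0, 2, 1]
```

The win grows with repetition: a million-row country column compresses to three strings plus a run of tiny integers, which is why the encoding only pays off for low-cardinality columns.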
Lance
- Vector index type: IVF_PQ for large datasets (>1M), HNSW for smaller/higher recall
- Use `create_index()` after bulk load, not during writes
- Batch writes for throughput
Zarr
- Chunk size: Align with access pattern (e.g., time-series: chunk by time)
- Compression: Blosc+zstd, tune clevel (3-5)
- Consider Blosc's shuffle/bitshuffle filters for structured arrays
Emerging Formats (2024-2025)
- Lance (2022): Gaining traction in the ML community, with Polars, DuckDB, and PyTorch integrations
- Vortex: An extensible, compressed columnar file format targeting faster random access and scans than Parquet; still early-stage
- DuckDB's `.duckdb` format: Embedded, SQLite-like persistence for DuckDB
- Delta Lake / Iceberg (table formats): Layered on Parquet; covered in the lakehouse skill
Best Practices
- ✅ Default to Parquet - Broadest compatibility, good compression, ecosystem tooling
- ✅ Use Arrow/Feather for ML staging - Zero-copy between training frameworks
- ✅ Use Lance when vectors are first-class - No separate vector DB needed
- ✅ Use Zarr for N-dim arrays - Geospatial, video, 3D data
- ✅ Compress everything - Snappy (fast) or Zstd (balanced)
- ✅ Partition wisely - By date/region/tenant to enable pruning
- ❌ Don't use Avro for analytics (no column pruning, row-based)
- ❌ Don't use ORC unless in Hive Hadoop world
- ❌ Don't store wide tables in single Parquet files - Partition or use Delta/Iceberg
References
- Apache Parquet Format
- Apache Arrow Format
- Lance Format
- Zarr Format
- Apache Avro Specification
- @data-engineering-core - Reading/writing with Polars/DuckDB
- @data-engineering-storage-lakehouse - Table formats built on these (Delta/Iceberg)
- @data-engineering-storage-remote-access - Using these formats with cloud storage backends (S3, GCS, Azure)
- @data-engineering-storage-authentication - Credential patterns for accessing cloud storage