# obstore: High-Performance Rust-Based Storage
obstore (released 2025) provides a minimal, stateless API built on Rust's `object_store` crate, offering substantially higher throughput for concurrent operations; the project's benchmarks report speedups of up to 9x over Python-based alternatives such as fsspec.
## Installation

```bash
pip install obstore

# Or with conda
conda install -c conda-forge obstore
```
## Core Concepts

obstore exposes a functional API: every operation is a top-level function that takes the store as its first argument, e.g. `obs.get(store, path)` rather than `store.get(path)`.
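A minimal sketch of that call pattern, using the in-memory `MemoryStore` so it runs without any credentials (the object key is arbitrary):

```python
import obstore as obs
from obstore.store import MemoryStore

# Stores are plain configuration objects; all I/O goes through module-level functions.
store = MemoryStore()

obs.put(store, "example.txt", b"hello")
data = obs.get(store, "example.txt").bytes()
assert bytes(data) == b"hello"
```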
## Creating Stores

```python
import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore, LocalStore

# S3 store
s3 = S3Store(
    bucket="my-bucket",
    region="us-east-1",
    access_key_id="AKIA...",
    secret_access_key="...",
    # Or omit credentials to use environment credentials
)

# GCS store
gcs = GCSStore(
    bucket="my-bucket",
    # Uses GOOGLE_APPLICATION_CREDENTIALS by default
)

# Azure store
azure = AzureStore(
    container="my-container",
    account_name="myaccount",
    account_key="...",
    # Or use DefaultAzureCredential
)

# Local filesystem
local = LocalStore("/path/to/root")

# From environment (picks up standard env vars)
s3 = S3Store.from_env(bucket="my-bucket")
gcs = GCSStore.from_env(bucket="my-bucket")
```
## Basic Operations

```python
import obstore as obs
from obstore.store import S3Store
from obstore.exceptions import NotFoundError

store = S3Store(bucket="my-bucket", region="us-east-1")

# Put object (bytes)
obs.put(store, "hello.txt", b"Hello, World!")

# Put from file
with open("local-file.csv", "rb") as f:
    obs.put(store, "data/file.csv", f)

# Get object
response = obs.get(store, "hello.txt")
print(response.bytes())  # b"Hello, World!"
print(response.meta)     # Object metadata (path, size, last_modified, e_tag, ...)

# Get range (efficient partial reads)
partial = obs.get_range(store, "large-file.bin", start=0, length=1024)

# Stream download
response = obs.get(store, "large-file.bin")
for chunk in response.stream(min_chunk_size=8 * 1024 * 1024):
    process(chunk)  # process() is a placeholder for your own handling

# List objects (streaming; results arrive in chunks, no manual pagination)
for batch in obs.list(store, prefix="data/2024/"):
    for obj in batch:
        print(f"{obj['path']}: {obj['size']} bytes")

# List with delimiter (like a directory listing)
result = obs.list_with_delimiter(store, prefix="data/")
print(result["common_prefixes"])  # "directories"
print(result["objects"])          # files

# Delete
obs.delete(store, "old-file.txt")

# Copy within the same store
obs.copy(store, "src/file.txt", "dst/file.txt")

# Rename/move
obs.rename(store, "old-name.txt", "new-name.txt")

# Check existence (via head)
try:
    meta = obs.head(store, "file.txt")
    print(f"Exists: {meta['size']} bytes")
except NotFoundError:
    print("File not found")
```
## Async API

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")

    # Concurrent uploads
    await asyncio.gather(
        obs.put_async(store, "file1.txt", b"content1"),
        obs.put_async(store, "file2.txt", b"content2"),
        obs.put_async(store, "file3.txt", b"content3"),
    )

    # Concurrent downloads
    responses = await asyncio.gather(
        obs.get_async(store, "file1.txt"),
        obs.get_async(store, "file2.txt"),
        obs.get_async(store, "file3.txt"),
    )
    for resp in responses:
        print(await resp.bytes_async())

asyncio.run(main())
```
## Streaming Uploads

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Upload from a generator (streaming, memory-efficient)
def data_generator():
    for i in range(1000):
        yield f"Row {i}\n".encode()

obs.put(store, "output.txt", data_generator())

# Upload from an async iterator
async def async_data():
    for i in range(1000):
        await asyncio.sleep(0)
        yield f"Row {i}\n".encode()

async def upload_async():
    await obs.put_async(store, "output-async.txt", async_data())

asyncio.run(upload_async())

# Automatic multipart upload for large files
# (triggered automatically based on size)
with open("huge-file.bin", "rb") as f:
    obs.put(store, "huge-file.bin", f)  # Multipart upload happens automatically
```
## Arrow Integration

```python
import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Return list results as Arrow record batches (faster, more memory-efficient)
stream = obs.list(store, prefix="data/", return_arrow=True)
batch = stream.collect()  # collect the remaining results into a single Arrow batch
print(batch.schema)
# Columns include: path, last_modified, size, e_tag

# Process with PyArrow/Polars (zero-copy via the Arrow PyCapsule interface)
import polars as pl

df = pl.from_arrow(batch)
```
## fsspec Compatibility

obstore provides an fsspec-compatible wrapper:

```python
import fsspec
import pyarrow.parquet as pq

from obstore.fsspec import FsspecStore, register

# Method 1: Register obstore as the default handler for these protocols
register()

# Now fsspec uses obstore internally
fs = fsspec.filesystem("s3", region="us-east-1")

# Method 2: Use FsspecStore directly
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")
# or
fs = FsspecStore.from_store(s3_store_object)

# Use with PyArrow
parquet_file = pq.ParquetFile(
    "s3://bucket/data/file.parquet",
    filesystem=fs,
)
```
## When to Use obstore

Choose obstore when:

- ✅ Performance is paramount (many small files, high concurrency; see the sketch after this list)
- ✅ You need async/await for concurrent operations
- ✅ Minimal dependencies are desired (a single Rust-backed package with no required Python dependencies)
- ✅ You want streaming uploads from generators/iterators
- ✅ You are doing large-scale data ingestion or export
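A sketch of the high-concurrency case (downloading many small objects under a prefix). The bucket, prefix, and concurrency limit are illustrative, and `fetch_all` is a hypothetical helper, not part of obstore:

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

async def fetch_all(store, paths, max_concurrency=64):
    """Download many small objects concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch(path):
        async with sem:
            resp = await obs.get_async(store, path)
            return path, await resp.bytes_async()

    return await asyncio.gather(*(fetch(p) for p in paths))

async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")
    # Gather paths first (list() yields chunks of metadata), then fetch concurrently.
    paths = [obj["path"] for batch in obs.list(store, prefix="data/2024/") for obj in batch]
    results = await fetch_all(store, paths)
    print(f"Downloaded {len(results)} objects")

asyncio.run(main())
```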
## Performance Comparison
| Operation | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Concurrent small files | Moderate | Moderate | 9x faster |
| Async support | Yes (aiohttp) | Limited | Native |
| Streaming uploads | Yes | Limited | Yes (efficient) |
| Parquet pushdown | Via PyArrow | Excellent | Via PyArrow |
| Maturity (2025) | Very high | High | Rapidly growing |
## Authentication

See @data-engineering-storage-authentication for credential patterns. The `S3Store`, `GCSStore`, and `AzureStore` constructors all accept explicit credentials, or can pick up standard environment variables via `from_env()`.
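As a sketch, standard provider environment variables are read by `from_env()` (the values below are placeholders; in practice they would already be set in your environment):

```python
import os

from obstore.store import S3Store

# Standard AWS environment variables (placeholder values)
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# from_env() reads credentials and region from the environment
store = S3Store.from_env(bucket="my-bucket")
```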