# obstore: High-Performance Rust-Based Storage
obstore (released 2025) provides a minimal, stateless API built on Rust's `object_store` crate, offering substantially higher throughput for concurrent operations; the project's benchmarks report speedups of up to 9x over Python-based alternatives such as fsspec.
## Installation

```bash
pip install obstore

# Or with conda
conda install -c conda-forge obstore
```
## Core Concepts

obstore exposes a functional API: every operation is a top-level function that takes the store as its first argument, e.g. `obs.get(store, path)` rather than `store.get(path)`.
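A minimal sketch of that call pattern, using the in-memory `MemoryStore` so it runs without any credentials (the object key is arbitrary):

```python
import obstore as obs
from obstore.store import MemoryStore

# Stores are plain configuration objects; all I/O goes through module-level functions.
store = MemoryStore()

obs.put(store, "example.txt", b"hello")
data = obs.get(store, "example.txt").bytes()
assert bytes(data) == b"hello"
```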
## Creating Stores

```python
import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore, LocalStore

# S3 store
s3 = S3Store(
    bucket="my-bucket",
    region="us-east-1",
    access_key_id="AKIA...",
    secret_access_key="...",
    # Or omit credentials to use environment credentials
)

# GCS store
gcs = GCSStore(
    bucket="my-bucket",
    # Uses GOOGLE_APPLICATION_CREDENTIALS by default
)

# Azure store
azure = AzureStore(
    container="my-container",
    account_name="myaccount",
    account_key="...",
    # Or use DefaultAzureCredential
)

# Local filesystem
local = LocalStore("/path/to/root")

# From environment (picks up standard env vars)
s3 = S3Store.from_env(bucket="my-bucket")
gcs = GCSStore.from_env(bucket="my-bucket")
```
## Basic Operations

```python
import obstore as obs
from obstore.store import S3Store
from obstore.exceptions import NotFoundError

store = S3Store(bucket="my-bucket", region="us-east-1")

# Put object (bytes)
obs.put(store, "hello.txt", b"Hello, World!")

# Put from file
with open("local-file.csv", "rb") as f:
    obs.put(store, "data/file.csv", f)

# Get object
response = obs.get(store, "hello.txt")
print(response.bytes())  # b"Hello, World!"
print(response.meta)     # Object metadata (path, size, last_modified, e_tag, ...)

# Get range (efficient partial reads)
partial = obs.get_range(store, "large-file.bin", start=0, length=1024)

# Stream download
response = obs.get(store, "large-file.bin")
for chunk in response.stream(min_chunk_size=8 * 1024 * 1024):
    process(chunk)  # process() is a placeholder for your own handling

# List objects (streaming; results arrive in chunks, no manual pagination)
for batch in obs.list(store, prefix="data/2024/"):
    for obj in batch:
        print(f"{obj['path']}: {obj['size']} bytes")

# List with delimiter (like a directory listing)
result = obs.list_with_delimiter(store, prefix="data/")
print(result["common_prefixes"])  # "directories"
print(result["objects"])          # files

# Delete
obs.delete(store, "old-file.txt")

# Copy within the same store
obs.copy(store, "src/file.txt", "dst/file.txt")

# Rename/move
obs.rename(store, "old-name.txt", "new-name.txt")

# Check existence (via head)
try:
    meta = obs.head(store, "file.txt")
    print(f"Exists: {meta['size']} bytes")
except NotFoundError:
    print("File not found")
```
## Async API

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")

    # Concurrent uploads
    await asyncio.gather(
        obs.put_async(store, "file1.txt", b"content1"),
        obs.put_async(store, "file2.txt", b"content2"),
        obs.put_async(store, "file3.txt", b"content3"),
    )

    # Concurrent downloads
    responses = await asyncio.gather(
        obs.get_async(store, "file1.txt"),
        obs.get_async(store, "file2.txt"),
        obs.get_async(store, "file3.txt"),
    )
    for resp in responses:
        print(await resp.bytes_async())

asyncio.run(main())
```
## Streaming Uploads

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Upload from a generator (streaming, memory-efficient)
def data_generator():
    for i in range(1000):
        yield f"Row {i}\n".encode()

obs.put(store, "output.txt", data_generator())

# Upload from an async iterator
async def async_data():
    for i in range(1000):
        await asyncio.sleep(0)
        yield f"Row {i}\n".encode()

async def upload_async():
    await obs.put_async(store, "output-async.txt", async_data())

asyncio.run(upload_async())

# Automatic multipart upload for large files
# (triggered automatically based on size)
with open("huge-file.bin", "rb") as f:
    obs.put(store, "huge-file.bin", f)  # Multipart upload happens automatically
```
## Arrow Integration

```python
import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Return list results as Arrow record batches (faster, more memory-efficient)
stream = obs.list(store, prefix="data/", return_arrow=True)
batch = stream.collect()  # collect the remaining results into a single Arrow batch
print(batch.schema)
# Columns include: path, last_modified, size, e_tag

# Process with PyArrow/Polars (zero-copy via the Arrow PyCapsule interface)
import polars as pl

df = pl.from_arrow(batch)
```
## fsspec Compatibility

obstore provides an fsspec-compatible wrapper:

```python
import fsspec
import pyarrow.parquet as pq

from obstore.fsspec import FsspecStore, register

# Method 1: Register obstore as the default handler for these protocols
register()

# Now fsspec uses obstore internally
fs = fsspec.filesystem("s3", region="us-east-1")

# Method 2: Use FsspecStore directly
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")
# or
fs = FsspecStore.from_store(s3_store_object)

# Use with PyArrow
parquet_file = pq.ParquetFile(
    "s3://bucket/data/file.parquet",
    filesystem=fs,
)
```
## When to Use obstore

Choose obstore when:

- ✅ Performance is paramount (many small files, high concurrency; see the sketch after this list)
- ✅ You need async/await for concurrent operations
- ✅ Minimal dependencies are desired (a single Rust-backed package with no required Python dependencies)
- ✅ You want streaming uploads from generators/iterators
- ✅ You are doing large-scale data ingestion or export
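A sketch of the high-concurrency case (downloading many small objects under a prefix). The bucket, prefix, and concurrency limit are illustrative, and `fetch_all` is a hypothetical helper, not part of obstore:

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

async def fetch_all(store, paths, max_concurrency=64):
    """Download many small objects concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch(path):
        async with sem:
            resp = await obs.get_async(store, path)
            return path, await resp.bytes_async()

    return await asyncio.gather(*(fetch(p) for p in paths))

async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")
    # Gather paths first (list() yields chunks of metadata), then fetch concurrently.
    paths = [obj["path"] for batch in obs.list(store, prefix="data/2024/") for obj in batch]
    results = await fetch_all(store, paths)
    print(f"Downloaded {len(results)} objects")

asyncio.run(main())
```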
## Performance Comparison
| Operation | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Concurrent small files | Moderate | Moderate | 9x faster |
| Async support | Yes (aiohttp) | Limited | Native |
| Streaming uploads | Yes | Limited | Yes (efficient) |
| Parquet pushdown | Via PyArrow | Excellent | Via PyArrow |
| Maturity (2025) | Very high | High | Rapidly growing |
## Authentication

See @data-engineering-storage-authentication for credential patterns. The `S3Store`, `GCSStore`, and `AzureStore` constructors all accept explicit credentials, or can pick up standard environment variables via `from_env()`.
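As a sketch, standard provider environment variables are read by `from_env()` (the values below are placeholders; in practice they would already be set in your environment):

```python
import os

from obstore.store import S3Store

# Standard AWS environment variables (placeholder values)
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# from_env() reads credentials and region from the environment
store = S3Store.from_env(bucket="my-bucket")
```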