# obstore: High-Performance Rust-Based Storage

obstore provides a minimal, stateless API built on Rust's `object_store` crate, offering superior performance for concurrent operations (up to 9x faster than Python-based alternatives).
## Installation

```bash
pip install obstore

# Or with conda
conda install -c conda-forge obstore
```
## Core Concepts

obstore exposes a functional API: every operation is a top-level function that takes the store as its first argument, e.g. `obs.get(store, path)` rather than `store.get(path)`.
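A minimal sketch of the round trip with a local store (the root directory below is just an example and should already exist):

```python
import obstore as obs
from obstore.store import LocalStore

store = LocalStore("/tmp/obstore-demo")  # example root directory

# Top-level functions take the store as their first argument
obs.put(store, "greeting.txt", b"hello")
print(obs.get(store, "greeting.txt").bytes())  # b"hello"
```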
## Creating Stores

```python
import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore, LocalStore

# S3 Store
s3 = S3Store(
    bucket="my-bucket",
    region="us-east-1",
    access_key_id="AKIA...",
    secret_access_key="...",
    # Or use environment credentials
)

# GCS Store
gcs = GCSStore(
    bucket="my-bucket",
    # Uses GOOGLE_APPLICATION_CREDENTIALS by default
)

# Azure Store
azure = AzureStore(
    container="my-container",
    account_name="myaccount",
    account_key="...",
    # Or use DefaultAzureCredential
)

# Local filesystem
local = LocalStore("/path/to/root")

# From environment (picks up standard env vars)
s3 = S3Store.from_env(bucket="my-bucket")
gcs = GCSStore.from_env(bucket="my-bucket")
```
## Basic Operations

```python
import obstore as obs
from obstore.store import S3Store
from obstore.exceptions import NotFoundError

store = S3Store(bucket="my-bucket", region="us-east-1")

# Put object (bytes)
obs.put(store, "hello.txt", b"Hello, World!")

# Put from file
with open("local-file.csv", "rb") as f:
    obs.put(store, "data/file.csv", f)

# Get object
response = obs.get(store, "hello.txt")
print(response.bytes())  # b"Hello, World!"
print(response.meta)     # Object metadata (size, last_modified, etag, etc.)

# Get range (efficient partial reads)
partial = obs.get_range(store, "large-file.bin", start=0, length=1024)

# Stream download
response = obs.get(store, "large-file.bin")
for chunk in response.stream(min_chunk_size=8 * 1024 * 1024):
    process(chunk)

# List objects (streaming, no pagination needed!)
# The list stream yields batches of metadata dicts
for batch in obs.list(store, prefix="data/2024/"):
    for obj in batch:
        print(f"{obj['path']}: {obj['size']} bytes")

# List with delimiter (like a directory listing)
result = obs.list_with_delimiter(store, prefix="data/")
print(result["common_prefixes"])  # "directories"
print(result["objects"])          # files

# Delete
obs.delete(store, "old-file.txt")

# Copy within the same store
obs.copy(store, "src/file.txt", "dst/file.txt")

# Rename/move
obs.rename(store, "old-name.txt", "new-name.txt")

# Check existence (via head)
try:
    meta = obs.head(store, "file.txt")
    print(f"Exists: {meta['size']} bytes")
except NotFoundError:
    print("File not found")
```
## Async API

```python
import asyncio

import obstore as obs
from obstore.store import S3Store


async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")

    # Concurrent uploads
    await asyncio.gather(
        obs.put_async(store, "file1.txt", b"content1"),
        obs.put_async(store, "file2.txt", b"content2"),
        obs.put_async(store, "file3.txt", b"content3"),
    )

    # Concurrent downloads
    responses = await asyncio.gather(
        obs.get_async(store, "file1.txt"),
        obs.get_async(store, "file2.txt"),
        obs.get_async(store, "file3.txt"),
    )
    for resp in responses:
        print(await resp.bytes_async())


asyncio.run(main())
```
## Streaming Uploads

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")


# Upload from a generator (streaming, memory-efficient)
def data_generator():
    for i in range(1000):
        yield f"Row {i}\n".encode()


obs.put(store, "output.txt", data_generator())


# Upload from an async iterator
async def async_data():
    for i in range(1000):
        await asyncio.sleep(0)
        yield f"Row {i}\n".encode()


async def upload_async():
    await obs.put_async(store, "output-async.txt", async_data())


asyncio.run(upload_async())

# Automatic multipart upload for large files
# (triggered automatically based on size)
with open("huge-file.bin", "rb") as f:
    obs.put(store, "huge-file.bin", f)  # Multipart automatically
```
## Arrow Integration

```python
import obstore as obs
import polars as pl
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# With return_arrow=True, each batch of list results is yielded as an
# Arrow RecordBatch (faster and more memory-efficient than lists of dicts)
for batch in obs.list(store, prefix="data/", return_arrow=True):
    print(batch.schema)
    # Columns: path, last_modified, size, e_tag, version

    # Process with Polars (or PyArrow) via the Arrow PyCapsule interface
    df = pl.from_arrow(batch)
```
## fsspec Compatibility

obstore provides an fsspec-compatible wrapper:

```python
import fsspec
import pyarrow.parquet as pq
from obstore.fsspec import FsspecStore, register

# Method 1: Register obstore as the default handler for protocols
register()

# Now fsspec uses obstore internally
fs = fsspec.filesystem("s3", region="us-east-1")

# Method 2: Use FsspecStore directly
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")
# or wrap an existing store object
fs = FsspecStore.from_store(s3_store_object)

# Use with PyArrow
parquet_file = pq.ParquetFile(
    "s3://bucket/data/file.parquet",
    filesystem=fs,
)
```
## When to Use obstore

Choose obstore when:
- ✅ Performance is paramount (many small files, high concurrency)
- ✅ You need async/await for concurrent operations (see the sketch after this list)
- ✅ Minimal dependencies are desired (Rust-based, no Python C extensions)
- ✅ You are streaming uploads from generators/iterators
- ✅ You are doing large-scale data ingestion/egress
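As a rough illustration of the high-concurrency case, the sketch below fans out many small downloads with a semaphore-bounded `asyncio.gather`; the bucket, prefix, and concurrency limit are placeholder values:

```python
import asyncio

import obstore as obs
from obstore.store import S3Store


async def fetch_all(prefix: str, limit: int = 64) -> dict[str, bytes]:
    """Download every object under `prefix` with at most `limit` requests in flight."""
    store = S3Store(bucket="my-bucket", region="us-east-1")  # placeholder bucket
    sem = asyncio.Semaphore(limit)

    async def fetch(path: str) -> tuple[str, bytes]:
        async with sem:
            resp = await obs.get_async(store, path)
            return path, bytes(await resp.bytes_async())

    # Collect the paths first (the sync list stream yields batches of dicts)
    paths = []
    for batch in obs.list(store, prefix=prefix):
        for obj in batch:
            paths.append(obj["path"])

    results = await asyncio.gather(*(fetch(p) for p in paths))
    return dict(results)


# data = asyncio.run(fetch_all("data/2024/"))
```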
## Performance Comparison

| Operation | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Concurrent small files | Moderate | Moderate | 9x faster |
| Async support | Yes (aiohttp) | Limited | Native |
| Streaming uploads | Yes | Limited | Yes (efficient) |
| Parquet pushdown | Via PyArrow | Excellent | Via PyArrow |
| Maturity (2025) | Very high | High | Rapidly growing |
## Authentication

See @data-engineering-storage-authentication for credential patterns. The S3Store, GCSStore, and AzureStore constructors all accept explicit credentials, or pick up standard environment variables (e.g. via from_env()).
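A small sketch of environment-based construction for S3, assuming the standard AWS variables are exported in the shell (the variable names below are the usual AWS conventions, not obstore-specific):

```python
import os

from obstore.store import S3Store

# Typically exported in the environment rather than set in code:
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")  # example only

# Credentials and region are read from the environment
s3 = S3Store.from_env(bucket="my-bucket")
```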