---
name: data-engineering-storage-remote-access
description: Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.
---

# Remote Storage Access
## Quick Comparison
| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best For | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | 9x faster for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async Support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame Integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |
## When to Use Which?

**Use fsspec when:**
- You need broad ecosystem compatibility (pandas, xarray, Dask)
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features (see the sketch after this list)
- Your workflow involves diverse data formats beyond Parquet
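fsspec's URL chaining composes filesystems by stacking protocol prefixes, which is also how its caching layer is typically applied. A minimal sketch, assuming a hypothetical bucket and local cache directory:

```python
import fsspec
import polars as pl

# 'simplecache::' wraps the s3 backend so repeated reads are served from a
# local cache instead of refetching over the network.
# Bucket name and cache path are hypothetical.
with fsspec.open(
    'simplecache::s3://my-bucket/data.parquet',
    mode='rb',
    simplecache={'cache_storage': '/tmp/fsspec-cache'},
) as f:
    df = pl.read_parquet(f)
```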
**Use pyarrow.fs when:**
- Your pipeline is Arrow/Parquet-native
- You need zero-copy integration with PyArrow datasets
- Predicate pushdown and column pruning are critical (sketched after this list)
- Working with partitioned Parquet datasets
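A minimal pushdown sketch, assuming a hypothetical hive-partitioned dataset and column names:

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical layout: s3://my-bucket/events/date=YYYY-MM-DD/*.parquet
s3 = fs.S3FileSystem(region='us-east-1')
dataset = ds.dataset('my-bucket/events/', filesystem=s3, partitioning='hive')

# Predicate pushdown: the date filter prunes partitions before data is read.
# Column pruning: only user_id and amount are fetched from the Parquet files.
table = dataset.to_table(
    columns=['user_id', 'amount'],
    filter=ds.field('date') == '2025-01-01',
)
```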
**Use obstore when:**
- Performance is paramount (many small files, high concurrency)
- You need async/await support for concurrent operations (see the async sketch after this list)
- You want minimal dependencies (Rust-based)
- Working with large-scale data ingestion and egress
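A minimal async sketch using obstore's `get_async`; the bucket and object keys are hypothetical:

```python
import asyncio

import obstore as obs
from obstore.store import S3Store

store = S3Store('my-bucket', region='us-east-1')  # hypothetical bucket

async def fetch(path: str) -> bytes:
    # get_async issues the request without blocking the event loop;
    # bytes_async reads the body the same way.
    result = await obs.get_async(store, path)
    return bytes(await result.bytes_async())

async def main() -> list[bytes]:
    # Fan out over many small objects concurrently - obstore's strong suit.
    paths = [f'chunks/part-{i:04d}.bin' for i in range(100)]
    return await asyncio.gather(*(fetch(p) for p in paths))

blobs = asyncio.run(main())
```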
## Skill Dependencies

**Prerequisites:**
- @data-engineering-core - Polars, DuckDB, PyArrow basics
- @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns
- @data-engineering-storage-formats - Parquet, Arrow, Lance, Zarr, Avro, ORC

**Related:**
- @data-engineering-storage-lakehouse - Delta Lake, Iceberg on cloud storage
- @data-engineering-orchestration - dbt with cloud storage
## Detailed Guides

### Library Deep Dives
- @data-engineering-storage-remote-access-libraries-fsspec - Universal filesystem interface
- @data-engineering-storage-remote-access-libraries-pyarrow-fs - Native Arrow integration
- @data-engineering-storage-remote-access-libraries-obstore - High-performance Rust

### DataFrame Integrations
- @data-engineering-storage-remote-access-integrations-polars - Polars + cloud URIs
- @data-engineering-storage-remote-access-integrations-duckdb - DuckDB HTTPFS extension
- @data-engineering-storage-remote-access-integrations-pandas - Pandas + remote files
- @data-engineering-storage-remote-access-integrations-pyarrow - PyArrow datasets
- @data-engineering-storage-remote-access-integrations-delta-lake - Delta on S3/GCS/Azure
- @data-engineering-storage-remote-access-integrations-iceberg - Iceberg with cloud catalogs

### Infrastructure Patterns
- @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns, IAM roles, service principals
- See `performance.md` in this skill - Caching, concurrency, async
- See `patterns.md` in this skill - Incremental loading, partitioned writes, cross-cloud copy

### Storage Formats
- @data-engineering-storage-formats - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC
## Quick Start Example

```python
import fsspec
import obstore as obs
import polars as pl
import pyarrow.fs as fs
import pyarrow.parquet as pq
from obstore.store import S3Store

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://my-bucket/data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table('my-bucket/data.parquet', filesystem=s3_pa)

# Method 3: obstore (high-performance)
store = S3Store('my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()

# All three approaches work - choose based on your performance and ecosystem needs
```
## Authentication
All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.
See: @data-engineering-storage-authentication
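For instance, explicit credentials (the highest-precedence step in that chain) look roughly like this; parameter spellings vary by library and the key values are placeholders, so treat this as a sketch:

```python
import fsspec
import pyarrow.fs as fs
from obstore.store import S3Store

# Explicit credentials override env vars, config files, and IAM roles.
s3_fsspec = fsspec.filesystem('s3', key='<ACCESS_KEY>', secret='<SECRET_KEY>')
s3_pa = fs.S3FileSystem(access_key='<ACCESS_KEY>', secret_key='<SECRET_KEY>',
                        region='us-east-1')
store = S3Store('my-bucket', access_key_id='<ACCESS_KEY>',
                secret_access_key='<SECRET_KEY>', region='us-east-1')

# Omit the credential arguments entirely to fall back to the rest of the chain.
```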
## Performance Optimization

Key strategies:
- Caching: fsspec's `SimpleCacheFileSystem` for repeated access
- Concurrency: obstore's async API for many small files
- Predicate pushdown: filter at the storage layer using partitioning (combined with pruning in the sketch below)
- Column pruning: read only the required columns
See: @data-engineering-storage-remote-access/performance.md
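A sketch of the last two strategies together, using a lazy Polars scan (bucket, glob, and column names are hypothetical):

```python
import polars as pl

# Nothing is read until collect(): the filter prunes row groups / hive
# partitions and the select limits the fetched columns, so only the
# needed bytes leave object storage.
df = (
    pl.scan_parquet('s3://my-bucket/events/**/*.parquet')
    .filter(pl.col('date') == '2025-01-01')
    .select('user_id', 'amount')
    .collect()
)
```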