
Remote Storage Access

Comprehensive guide to accessing cloud object storage (S3, GCS, Azure) and remote filesystems from Python. Covers three major libraries (fsspec, pyarrow.fs, and obstore) and their integration with data engineering tools.

Quick Comparison

| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best for | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | 9x faster for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |

When to Use Which?

Use fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)
  • Working with multiple storage backends (S3, GCS, Azure, HTTP)
  • You need protocol chaining and caching features
  • Your workflow involves diverse data formats beyond Parquet
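A hedged sketch of the uniform interface and protocol chaining. fsspec's in-memory backend stands in for a cloud bucket so this runs without credentials; against real storage the chained URL would be `simplecache::s3://bucket/key` (requires s3fs):

```python
import fsspec

# Write a file to fsspec's in-memory backend, standing in for a cloud bucket.
with fsspec.open("memory://bucket/data.txt", "wb") as f:
    f.write(b"hello")

# Protocol chaining: "simplecache::" transparently caches reads to local disk,
# so repeated access hits the cache instead of the remote store.
with fsspec.open("simplecache::memory://bucket/data.txt", "rb") as f:
    print(f.read())  # b'hello'
```

The same `fsspec.open`/`fsspec.filesystem` calls work unchanged across backends, which is what makes fsspec the integration layer for pandas, Dask, and xarray.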

Use pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native
  • You need zero-copy integration with PyArrow datasets
  • Predicate pushdown and column pruning are critical
  • Working with partitioned Parquet datasets

Use obstore when:

  • Performance is paramount (many small files, high concurrency)
  • You need async/await support for concurrent operations
  • You want minimal dependencies (Rust-based)
  • Working with large-scale data ingestion/egress

Skill Dependencies

Prerequisites:

  • @data-engineering-core - Polars, DuckDB, PyArrow basics
  • @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns
  • @data-engineering-storage-formats - Parquet, Arrow, Lance, Zarr, Avro, ORC

Related:

  • @data-engineering-storage-lakehouse - Delta Lake, Iceberg on cloud storage
  • @data-engineering-orchestration - dbt with cloud storage

Detailed Guides

Library Deep Dives

  • @data-engineering-storage-remote-access-libraries-fsspec - Universal filesystem interface
  • @data-engineering-storage-remote-access-libraries-pyarrow-fs - Native Arrow integration
  • @data-engineering-storage-remote-access-libraries-obstore - High-performance Rust

DataFrame Integrations

  • @data-engineering-storage-remote-access-integrations-polars - Polars + cloud URIs
  • @data-engineering-storage-remote-access-integrations-duckdb - DuckDB HTTPFS extension
  • @data-engineering-storage-remote-access-integrations-pandas - Pandas + remote files
  • @data-engineering-storage-remote-access-integrations-pyarrow - PyArrow datasets
  • @data-engineering-storage-remote-access-integrations-delta-lake - Delta on S3/GCS/Azure
  • @data-engineering-storage-remote-access-integrations-iceberg - Iceberg with cloud catalogs

Infrastructure Patterns

  • @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns, IAM roles, service principals
  • See performance.md in this skill - Caching, concurrency, async
  • See patterns.md in this skill - Incremental loading, partitioned writes, cross-cloud copy

Storage Formats

  • @data-engineering-storage-formats - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC

Quick Start Example

import fsspec
import obstore as obs
import polars as pl
import pyarrow.fs as fs
import pyarrow.parquet as pq
from obstore.store import S3Store

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# Method 2: pyarrow.fs (Arrow-native; paths omit the s3:// scheme)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table('bucket/data.parquet', filesystem=s3_pa)

# Method 3: obstore (high-performance)
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()

# All approaches work - choose based on your performance and ecosystem needs

Authentication

All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.

See: @data-engineering-storage-authentication

Performance Optimization

Key strategies:

  • Caching: fsspec's SimpleCache for repeated access
  • Concurrency: obstore async API for many small files
  • Predicate pushdown: Filter at storage layer using partitioning
  • Column pruning: Read only required columns
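The concurrency strategy above can be sketched with stdlib asyncio alone: a semaphore bounds in-flight requests, and any async storage client (e.g. obstore's `get_async`) slots into the same shape. The `fetch` stub here is a stand-in for a real storage call:

```python
import asyncio


async def fetch(key: str) -> bytes:
    # Stand-in for a real storage call, e.g. obstore's get_async.
    await asyncio.sleep(0.01)
    return key.encode()


async def fetch_all(keys: list[str], max_concurrency: int = 16) -> list[bytes]:
    # Bound in-flight requests so thousands of small objects don't overwhelm
    # the event loop or the backend's rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(key: str) -> bytes:
        async with sem:
            return await fetch(key)

    return await asyncio.gather(*(bounded(k) for k in keys))


results = asyncio.run(fetch_all([f"part-{i}.parquet" for i in range(100)]))
print(len(results))  # 100
```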

See: @data-engineering-storage-remote-access/performance.md

