data-engineering-storage-remote-access-libraries-fsspec
fsspec: Universal Filesystem Interface
fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.
Installation
# Core only (no remote support)
pip install fsspec
# With specific backends
pip install fsspec[s3] # S3 via s3fs
pip install fsspec[gcs] # GCS via gcsfs
pip install fsspec[s3,gcs,abfs] # Multiple backends
# Or install backends directly
pip install s3fs gcsfs adlfs
Basic Usage
import fsspec
import pandas as pd
# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]
# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False) # Uses the standard AWS credential chain
gcs_fs = fsspec.filesystem('gcs') # Uses GCP credentials
# Basic operations
s3_fs.ls('my-bucket/data/') # List files
s3_fs.exists('my-bucket/data/file.csv') # Check existence
s3_fs.mkdir('my-bucket/new-folder') # Create directory
# Read file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()
# Read a gzipped CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
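For one-off reads and writes, fsspec.open() picks the backend from the URL protocol, so no explicit filesystem object is needed. A minimal sketch (the output key below is a placeholder):
# Write a small text file straight to S3 via a URL
with fsspec.open('s3://my-bucket/output/summary.csv', 'w') as f:
    f.write('a,b,c\n1,2,3\n')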
Protocol Chaining & Caching
# SimpleCache: Cache remote files locally for faster repeated access
import fsspec
# First read downloads, subsequent reads use cache
cached_file = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache', 'compression': None}
)  # open_local returns the path of the local cached copy
# Chain multiple protocols
# Read from HTTPS, cache locally, decompress on the fly
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)
# Other useful chain steps:
# - "filecache::" - persistent disk cache (keeps metadata between sessions)
# - "blockcache::" - caches only the byte ranges actually read
# - "zip::" / "tar::" - access files inside archives
# (gzip/bz2/zstd decompression is handled by the compression= argument, not a chain step)
Advanced S3 Features
import s3fs
# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',  # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True  # Always build a fresh instance instead of reusing fsspec's cached one
)
# Async operations
import asyncio
async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish async session
    # Concurrent reads (use _cat_file for bytes)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data
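# Usage sketch: run the coroutine from synchronous code
# (the bucket/key names above are placeholders, not real objects)
results = asyncio.run(read_multiple())
print([len(b) for b in results])  # each element holds one object's bytes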
# S3-specific features
fs.find('my-bucket', prefix='data/2024') # List with prefix
fs.du('my-bucket/data') # Disk usage
fs.rm('my-bucket/temp/', recursive=True) # Recursive delete
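The same filesystem object also handles globbing and bulk transfer. A minimal sketch (bucket, prefix, and local paths are placeholders):
# Pattern-match keys under a prefix
parquet_files = fs.glob('my-bucket/data/2024/**/*.parquet')
# Download a prefix recursively, then upload a single file back
fs.get('my-bucket/data/2024/', '/tmp/local-copy/', recursive=True)
fs.put('/tmp/local-copy/part-0.parquet', 'my-bucket/backup/part-0.parquet')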
Authentication
fsspec backends resolve credentials through the standard cloud mechanisms, in the usual order of precedence (see the sketch after this list):
- Explicit credentials (passed to constructor)
- Environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.)
- Config files (~/.aws/credentials, gcloud CLI)
- IAM roles / managed identities
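A rough sketch of the explicit-credential style for two backends (the profile name and key path are placeholders; s3fs accepts profile=, gcsfs accepts token=):
import fsspec
# AWS: use a named profile from ~/.aws/credentials
s3 = fsspec.filesystem('s3', profile='analytics')
# GCS: point gcsfs at a service-account key file (token='anon' works for public buckets)
gcs = fsspec.filesystem('gcs', token='/path/to/service-account.json')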
See @data-engineering-storage-authentication for detailed patterns.
When to Use fsspec
Choose fsspec when:
- You need broad ecosystem compatibility (pandas, xarray, Dask)
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet
Performance Considerations
- ✅ Use filecache:: instead of simplecache:: for persistent caching across sessions (example below)
- ✅ Increase max_pool_connections for high concurrency
- ✅ Use the async API for many concurrent small-file operations
- ⚠️ For pure Parquet workflows with high throughput, consider pyarrow.fs instead
- ⚠️ For maximum performance on large concurrent operations, consider obstore
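For instance, a persistent on-disk cache that survives across sessions might look like this (a sketch; the cache directory and object path are placeholders):
import fsspec
import pandas as pd
# filecache keeps downloaded files plus their metadata on disk, so a new
# process can reuse them without re-downloading
with fsspec.open(
    "filecache::s3://my-bucket/data/large.csv",
    filecache={'cache_storage': '/var/cache/fsspec'}
) as f:
    df = pd.read_csv(f)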
Integration with Data Engineering Tools
- Polars: pl.read_parquet("s3://bucket/file.parquet", storage_options={...})
- DuckDB: duckdb.register_filesystem(fsspec.filesystem('s3'))
- Pandas: pd.read_csv("s3://bucket/file.csv") (auto-detects fsspec)
- PyArrow: wrap fsspec with pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs)) (sketched below)
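The PyArrow and DuckDB hooks look roughly like this (a sketch with placeholder bucket names):
import duckdb
import fsspec
import pyarrow.dataset as ds
import pyarrow.fs as pafs

s3 = fsspec.filesystem('s3')
# PyArrow: wrap the fsspec filesystem so datasets can read through it
arrow_fs = pafs.PyFileSystem(pafs.FSSpecHandler(s3))
dataset = ds.dataset('my-bucket/data/', format='parquet', filesystem=arrow_fs)
# DuckDB: register the same fsspec filesystem, then query s3:// paths
con = duckdb.connect()
con.register_filesystem(s3)
con.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')")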
For detailed integration patterns, see:
- @data-engineering-storage-remote-access/integrations/polars
- @data-engineering-storage-remote-access/integrations/duckdb
- @data-engineering-storage-remote-access/integrations/pandas