# fsspec: Universal Filesystem Interface
fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.
## Installation
```bash
# Core only (no remote support)
pip install fsspec

# With specific backends
pip install fsspec[s3]            # S3 via s3fs
pip install fsspec[gcs]           # GCS via gcsfs
pip install fsspec[s3,gcs,azure]  # Multiple backends

# Or install the backends directly
pip install s3fs gcsfs adlfs
```
## Basic Usage
```python
import fsspec
import pandas as pd

# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)  # Uses boto3 credentials
gcs_fs = fsspec.filesystem('gcs')            # Uses GCP credentials

# Basic operations
s3_fs.ls('my-bucket/data/')              # List files
s3_fs.exists('my-bucket/data/file.csv')  # Check existence
s3_fs.mkdir('my-bucket/new-folder')      # "Create directory" (S3 has no real
                                         # folders; only bucket creation is real)

# Read file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Read a compressed CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
```
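For one-off reads and writes you can skip the explicit filesystem object and pass a URL straight to `fsspec.open`; a minimal sketch, with bucket and key names as placeholders:

```python
import fsspec
import pandas as pd

# fsspec.open accepts full URLs; extra kwargs go to the backend constructor
with fsspec.open("s3://my-bucket/data/file.csv", "r", anon=False) as f:
    df = pd.read_csv(f)

# Writing works the same way: open in 'w' or 'wb' mode
with fsspec.open("s3://my-bucket/output/result.csv", "w") as f:
    df.to_csv(f, index=False)
```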
## Protocol Chaining & Caching
```python
import fsspec
import pandas as pd

# SimpleCache: cache remote files locally for faster repeated access.
# The first read downloads; subsequent reads in this session use the cache.
cached_file = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache'}
)

# Chain protocols: read from HTTPS, cache locally, decompress on the fly.
# Decompression is handled by the `compression` kwarg, not a chained protocol.
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)

# Other useful wrappers:
# - "filecache::"     - persistent disk cache (survives across sessions)
# - "blockcache::"    - caches file blocks for random access
# - "zip::" / "tar::" - access files inside archives
```
## Advanced S3 Features
```python
import asyncio

import s3fs

# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',  # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True  # Always build a fresh instance instead of
                              # reusing fsspec's cached filesystem instances
)

# Async operations
async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish the async session

    # Concurrent reads (use the underscore coroutine methods, e.g. _cat_file)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data

# S3-specific features
fs.find('my-bucket', prefix='data/2024')  # List keys matching a prefix
fs.du('my-bucket/data')                   # Disk usage
fs.rm('my-bucket/temp/', recursive=True)  # Recursive delete
```
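A filesystem created with `asynchronous=True` is meant to be driven through its underscore coroutine methods from an event loop; a minimal usage sketch for the helper above:

```python
# Drive the coroutine from synchronous code; `data` is a list of
# raw bytes objects, one per file, in the order requested
data = asyncio.run(read_multiple())
print(f"Fetched {len(data)} objects")
```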
## Authentication
fsspec backends follow the standard cloud credential chain:

- Explicit credentials (passed to the constructor)
- Environment variables (`AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`, etc.)
- Config files (`~/.aws/credentials`, gcloud CLI)
- IAM roles / managed identities

See @data-engineering-storage-authentication for detailed patterns.
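In practice this means the same code can move between environments unchanged; a minimal sketch of the first two patterns, with all key values as placeholders:

```python
import fsspec

# 1. Explicit credentials: passed straight to the backend constructor
fs_explicit = fsspec.filesystem(
    "s3",
    key="AKIA...",   # placeholder access key
    secret="...",    # placeholder secret key
)

# 2. Ambient credentials: with no arguments, s3fs falls back to the
#    standard boto3 chain (env vars, ~/.aws/credentials, IAM role)
fs_ambient = fsspec.filesystem("s3")
```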
## When to Use fsspec
Choose fsspec when:
- You need broad ecosystem compatibility (pandas, xarray, Dask); see the sketch after this list
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet
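The payoff of that flexibility is that the storage location becomes just a parameter; a sketch, with all URLs as placeholders, in which one function serves local, S3, and HTTP sources alike:

```python
import fsspec
import pandas as pd

def load_csv(url: str, **storage_options) -> pd.DataFrame:
    """Read a CSV from any fsspec-supported location."""
    with fsspec.open(url, "r", **storage_options) as f:
        return pd.read_csv(f)

df_local = load_csv("data/local.csv")
df_s3 = load_csv("s3://my-bucket/data/remote.csv", anon=False)
df_http = load_csv("https://example.com/data.csv")
```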
## Performance Considerations
- ✅ Use `filecache::` instead of `simplecache::` for persistent caching across sessions (sketched below)
- ✅ Increase `max_pool_connections` for high concurrency
- ✅ Use the async API for many concurrent small-file operations
- ⚠️ For pure Parquet workflows with high throughput, consider `pyarrow.fs` instead
- ⚠️ For maximum performance on large concurrent operations, consider `obstore`
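As a concrete version of the first tip, `filecache::` keeps downloaded copies on disk so they survive interpreter restarts; a minimal sketch, with bucket and cache directory as placeholders:

```python
import fsspec
import pandas as pd

# First run downloads into ./fsspec-cache; later runs, even in a new
# process, reuse the cached copy from disk
with fsspec.open(
    "filecache::s3://my-bucket/data/large.csv",
    filecache={"cache_storage": "./fsspec-cache", "same_names": True},
    s3={"anon": False},
) as f:
    df = pd.read_csv(f)
```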
## Integration with Data Engineering Tools
- Polars: `pl.read_parquet("s3://bucket/file.parquet", storage_options={...})`
- DuckDB: `duckdb.register_filesystem(fsspec.filesystem('s3'))`
- Pandas: `pd.read_csv("s3://bucket/file.csv")` (auto-detects fsspec)
- PyArrow: wrap fsspec with `pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs))`, as sketched below
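For the PyArrow case, the wrapping looks as follows; a minimal sketch (bucket path is a placeholder) that hands the wrapped filesystem to `pyarrow.dataset`:

```python
import fsspec
import pyarrow.dataset as ds
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Wrap any fsspec filesystem so PyArrow APIs can use it
fs = fsspec.filesystem("s3", anon=False)
arrow_fs = PyFileSystem(FSSpecHandler(fs))

dataset = ds.dataset("my-bucket/data/", format="parquet", filesystem=arrow_fs)
table = dataset.to_table()
```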
For detailed integration patterns, see:

- @data-engineering-storage-remote-access/integrations/polars
- @data-engineering-storage-remote-access/integrations/duckdb
- @data-engineering-storage-remote-access/integrations/pandas