
fsspec: Universal Filesystem Interface

fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.

Installation

# Core only (no remote support)
pip install fsspec

# With specific backends
pip install fsspec[s3]        # S3 via s3fs
pip install fsspec[gcs]       # GCS via gcsfs
pip install fsspec[s3,gcs,abfs]   # Multiple backends (the Azure extra is "abfs")

# Or install backends directly
pip install s3fs gcsfs adlfs

Basic Usage

import fsspec
import pandas as pd

# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)  # Uses the standard AWS credential chain
gcs_fs = fsspec.filesystem('gcs')             # Uses GCP credentials

# Basic operations
s3_fs.ls('my-bucket/data/')                   # List files
s3_fs.exists('my-bucket/data/file.csv')       # Check existence
s3_fs.mkdir('my-bucket/new-folder')           # Creates the bucket if needed; S3 "dirs" are virtual prefixes

# Read file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Read CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
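
# One-shot alternative: fsspec.open() handles the URL directly,
# without creating a filesystem object first
with fsspec.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()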

Protocol Chaining & Caching

# SimpleCache: Cache remote files locally for faster repeated access
import fsspec

# First read downloads, subsequent reads use cache
local_path = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache'}
)  # Returns a path to the locally cached copy

# Chain protocols: read from HTTPS with a local cache,
# decompressing gzip via the compression argument
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)

# Other useful wrappers:
# - "filecache::"  - Persistent disk cache (survives across sessions)
# - "blockcache::" - Caches file blocks on demand
# - "zip::"        - Access members of a zip archive
# (Decompression is handled by the compression= argument, not a chained protocol)

Advanced S3 Features

import s3fs

# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',              # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True   # Don't reuse cached filesystem instances
)

# Async operations
import asyncio

async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish async session

    # Concurrent reads (use _cat_file for bytes)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data
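
# From synchronous code, run the coroutine with:
# results = asyncio.run(read_multiple())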

# S3-specific features
fs.find('my-bucket', prefix='data/2024')  # List with prefix
fs.du('my-bucket/data')                   # Disk usage
fs.rm('my-bucket/temp/', recursive=True)  # Recursive delete

Authentication

fsspec backends resolve credentials in the standard cloud order:

  1. Explicit credentials (passed to constructor)
  2. Environment variables (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.)
  3. Config files (~/.aws/credentials, gcloud CLI)
  4. IAM roles / managed identities

See @data-engineering-storage-authentication for detailed patterns.
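
A minimal sketch of the resolution order in practice (key values are placeholders):

import fsspec

# Step 1: explicit credentials passed to the constructor
fs_explicit = fsspec.filesystem('s3', key='AKIA...', secret='...')

# Steps 2-4: pass nothing and the backend falls back to environment
# variables, config files, then IAM roles / managed identities
fs_default = fsspec.filesystem('s3')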

When to Use fsspec

Choose fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)
  • Working with multiple storage backends (S3, GCS, Azure, HTTP)
  • You need protocol chaining and caching features
  • Your workflow involves diverse data formats beyond Parquet

Performance Considerations

  • ✅ Use filecache:: instead of simplecache:: for caching that persists across sessions (see the sketch after this list)
  • ✅ Increase max_pool_connections for high-concurrency workloads
  • ✅ Use the async API for many concurrent small-file operations
  • ⚠️ For high-throughput, Parquet-only workflows, consider pyarrow.fs instead
  • ⚠️ For maximum performance on large concurrent operations, consider obstore
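
A minimal persistent-cache sketch (bucket and cache path are placeholders):

import fsspec
import pandas as pd

# filecache keeps downloads on disk between sessions; simplecache
# defaults to temporary storage that is discarded at exit
with fsspec.open(
    "filecache::s3://my-bucket/data/large.csv",
    filecache={'cache_storage': '/tmp/fsspec_persistent_cache'}
) as f:
    df = pd.read_csv(f)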

Integration with Data Engineering Tools

  • Polars: pl.read_parquet("s3://bucket/file.parquet", storage_options={...})
  • DuckDB: duckdb.register_filesystem(fsspec.filesystem('s3'))
  • Pandas: pd.read_csv("s3://bucket/file.csv") (auto-detects fsspec)
  • PyArrow: Wrap with pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs)) (sketched below)
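
The PyArrow wrapping pattern, as a sketch (bucket path is a placeholder):

import fsspec
import pyarrow.dataset as ds
import pyarrow.fs

s3 = fsspec.filesystem('s3')
arrow_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3))

# Any PyArrow API that accepts a filesystem can now read through fsspec
dataset = ds.dataset('my-bucket/data/', format='parquet', filesystem=arrow_fs)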

For detailed integration patterns, see:

  • @data-engineering-storage-remote-access/integrations/polars
  • @data-engineering-storage-remote-access/integrations/duckdb
  • @data-engineering-storage-remote-access/integrations/pandas
