# Pandas Integration with Remote Storage
Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.
## Auto-Detection (Simplest)
Pandas automatically uses fsspec for cloud URIs:
```python
import pandas as pd

# Read CSV/Parquet/JSON directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected from the file extension
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```
Note: auto-detection uses the default credential chain. For explicit authentication, see the sketch below and the Authentication section.
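If you only need to supply credentials for a one-off read, pandas also forwards a `storage_options` dict to fsspec. A minimal sketch, with placeholder bucket and credential values:

```python
import pandas as pd

# storage_options is passed straight through to fsspec/s3fs;
# the bucket name and credential values below are placeholders.
df = pd.read_csv(
    "s3://bucket/data.csv",
    storage_options={
        "key": "MY_ACCESS_KEY",
        "secret": "MY_SECRET_KEY",
        "client_kwargs": {"region_name": "us-east-1"},
    },
)
```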
## Explicit Filesystem (More Control)
```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with explicit configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses the default credential chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                 # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row-group filtering
)
```
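The same pattern works for Google Cloud Storage through gcsfs. A sketch, assuming gcsfs is installed and a service-account key file (the path is a placeholder):

```python
import fsspec
import pandas as pd

# gcsfs accepts a `token` argument: a path to a service-account JSON key,
# "google_default", or "anon" for public buckets; the path is a placeholder.
gcs = fsspec.filesystem("gcs", token="service-account.json")

with gcs.open("bucket/data.csv") as f:
    df = pd.read_csv(f)
```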
## PyArrow Filesystem Backend
For better Arrow integration and zero-copy transfers:
```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using a pyarrow filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)

# Write to cloud storage (again without the s3:// prefix)
df.to_parquet(
    "bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
```
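`S3FileSystem` also accepts explicit credentials and a custom endpoint when the default credential chain is not enough; a sketch with placeholder values:

```python
import pyarrow.fs as fs

# Explicit credentials and endpoint for an S3-compatible store;
# all values here are placeholders.
s3_fs = fs.S3FileSystem(
    access_key="MY_ACCESS_KEY",
    secret_key="MY_SECRET_KEY",
    region="us-east-1",
    endpoint_override="http://minio.local:9000",
)
```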
## Partitioned Writes
Write partitioned datasets efficiently:
```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0],
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)

# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet
```
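Reading the partitioned dataset back, the `filters=` parameter prunes whole partitions before any data is transferred. A sketch assuming the layout written above and default credentials:

```python
import pandas as pd

# Partition columns (year, month) come back as regular columns;
# the filter skips every partition except year=2024.
df_2024 = pd.read_parquet(
    "s3://bucket/output/",
    filters=[("year", "=", 2024)],
)
```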
## Authentication

- Auto-detection: uses the default credential chain (`AWS_PROFILE`, `~/.aws/credentials`, IAM role)
- Explicit: pass `key=` / `secret=` to the `fsspec.filesystem()` constructor (see the sketch below)
- S3-compatible stores (MinIO, Ceph): pass a custom endpoint:

  ```python
  fs = fsspec.filesystem("s3", client_kwargs={"endpoint_url": "http://minio.local:9000"})
  ```
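A minimal sketch of the explicit form (credential values are placeholders; prefer the default chain or environment variables where possible):

```python
import fsspec

# s3fs accepts key/secret keyword arguments directly
fs = fsspec.filesystem(
    "s3",
    key="MY_ACCESS_KEY",
    secret="MY_SECRET_KEY",
)
```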
See @data-engineering-storage-authentication for detailed patterns.
## Performance Tips

- Column pruning: `pd.read_parquet(columns=[...])` only reads the needed columns
- Row-group filtering: use the `filters=` parameter for partitioned data
- Cache results: wrap the filesystem with `simplecache::` or `filecache::`:

  ```python
  cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
  df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)
  ```

- Use Parquet, not CSV: Parquet supports predicate pushdown, compression, and typed storage
- For large datasets: consider PySpark or Dask instead of pandas (pandas loads everything into memory)
## Limitations

- pandas loads the entire DataFrame into memory; it is not suitable for datasets larger than RAM
- For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)
- Multi-file reads require manual iteration (use `fs.glob()` + a list comprehension, as sketched below)
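A minimal sketch of the glob-and-concatenate pattern, assuming a bucket prefix of CSV files (names are placeholders):

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("s3")

# fs.glob returns keys without the s3:// prefix, so it is added back for pandas
paths = fs.glob("bucket/daily/*.csv")
df = pd.concat(
    (pd.read_csv(f"s3://{p}") for p in paths),
    ignore_index=True,
)
```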
## Alternatives

- Polars (@data-engineering-core): faster, memory-mapped, lazy evaluation
- Dask: parallel pandas for out-of-core computation
- PySpark: distributed processing for big data
## References

- pandas I/O documentation
- fsspec documentation
- @data-engineering-storage-remote-access/libraries/fsspec