data-engineering-storage-remote-access-integrations-pandas
Pandas Integration with Remote Storage
Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.
Auto-Detection (Simplest)
Pandas automatically uses fsspec for cloud URIs:
import pandas as pd
# Read CSV/Parquet directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")
# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz") # Automatically decompressed
Note: Auto-detection uses default credentials. For explicit auth, see below.
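For example, pandas forwards storage_options to fsspec, so credentials can be passed inline without creating a filesystem object. A minimal sketch; the bucket path and credential values are placeholders:
import pandas as pd
# storage_options is handed straight to the underlying fsspec filesystem (s3fs here)
df = pd.read_csv(
    "s3://bucket/data.csv",
    storage_options={"key": "ACCESS_KEY_ID", "secret": "SECRET_ACCESS_KEY"},  # placeholders
)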
Explicit Filesystem (More Control)
import fsspec
import pandas as pd
# Create fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False) # Uses default credentials chain
# Open file through filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)
# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],  # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row-group filtering
)
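The same filesystem object also works for writes. A minimal sketch with placeholder paths; note that the filesystem= keyword on read_parquet/to_parquet requires pandas 2.1+:
# Write through an explicit file handle
with fs.open("s3://bucket/output/data.csv", "w") as f:
    df.to_csv(f, index=False)
# Or pass the filesystem to the writer directly
df.to_parquet("s3://bucket/output/data.parquet", filesystem=fs)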
PyArrow Filesystem Backend
For better Arrow integration and zero-copy transfers:
import pyarrow.fs as fs
import pandas as pd
s3_fs = fs.S3FileSystem(region="us-east-1")
# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using a pyarrow filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)
# Write to cloud storage (again no s3:// prefix with a pyarrow filesystem)
df.to_parquet(
    "bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
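For full control over the Arrow round trip, the file can also be read with pyarrow.parquet directly and converted at the end. A sketch assuming the s3_fs object from above and placeholder paths:
import pyarrow.parquet as pq
# Read into an Arrow table first; to_pandas() can avoid copies for compatible column types
table = pq.read_table("bucket/data.parquet", filesystem=s3_fs, columns=["id", "value"])
df = table.to_pandas()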
Partitioned Writes
Write partitioned datasets efficiently:
import fsspec
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0],
})
# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)
# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet
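On read, the partition columns are reconstructed from the directory names, and filters can prune whole partitions before any data is transferred. A sketch using the fs object from above:
# Only the year=2024 partitions are listed and downloaded
df_2024 = pd.read_parquet(
    "s3://bucket/output/",
    filesystem=fs,
    filters=[("year", "=", 2024)],
)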
Authentication
- Auto-detection: Uses default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
- Explicit: Pass key= and secret= to the fsspec.filesystem() constructor (sketch below)
- For S3-compatible stores (MinIO, Ceph): Set the endpoint via client_kwargs:
fs = fsspec.filesystem("s3", client_kwargs={"endpoint_url": "http://minio.local:9000"})
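A minimal sketch of explicit credentials (the values are placeholders; prefer the default chain or a secrets manager in real code):
import fsspec
# key/secret are forwarded to s3fs; never hardcode real credentials
fs = fsspec.filesystem(
    "s3",
    key="ACCESS_KEY_ID",         # placeholder
    secret="SECRET_ACCESS_KEY",  # placeholder
)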
See @data-engineering-storage-authentication for detailed patterns.
Performance Tips
- Column pruning: pd.read_parquet(columns=[...]) only reads the needed columns
- Row-group filtering: Use the filters= parameter for partitioned data
- Cache results: Wrap access with simplecache:: or filecache::, e.g. df = pd.read_parquet("simplecache::s3://bucket/data.parquet")
- Use Parquet, not CSV: Parquet supports predicate pushdown, compression, and typed storage
- For large datasets: Consider PySpark or Dask instead of pandas (pandas loads everything into memory)
Limitations
- pandas loads the entire DataFrame into memory; not suitable for datasets larger than RAM
- For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)
- Multi-file reads require manual iteration (use fs.glob() plus a list comprehension; see the sketch below)
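A minimal sketch of the multi-file pattern, assuming an existing fsspec filesystem fs and placeholder paths:
# glob returns bucket-relative paths without the s3:// prefix
files = fs.glob("bucket/data/*.parquet")
df = pd.concat(
    [pd.read_parquet(path, filesystem=fs) for path in files],
    ignore_index=True,
)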
Alternatives
- Polars (@data-engineering-core): Faster, memory-mapped, lazy evaluation
- Dask: Parallel pandas for out-of-core computation
- PySpark: Distributed processing for big data
References
- pandas I/O documentation
- fsspec documentation
- @data-engineering-storage-remote-access/libraries/fsspec