data-engineering-storage-remote-access-integrations-pyarrow
# PyArrow Remote Storage Integration
PyArrow's parquet and dataset modules work seamlessly with cloud storage through its native filesystem abstraction and fsspec compatibility.
## Native PyArrow Filesystem

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Create S3 filesystem
s3_fs = fs.S3FileSystem(region="us-east-1")

# Read single file with column filtering
table = pq.read_table(
    "bucket/file.parquet",  # Note: no s3:// prefix
    filesystem=s3_fs,
    columns=["id", "value"],  # Column pruning
)

# Dataset with filtering and partitioning
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Filter pushdown (only reads matching files/row groups)
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value", "timestamp"],
)

# Batch scanning for large datasets
scanner = dataset.scanner(
    filter=ds.field("value") > 0,
    batch_size=65536,
    use_threads=True,
)
for batch in scanner.to_batches():
    process(batch)
```
## fsspec Integration

PyArrow also accepts fsspec filesystems and open file objects when reading Parquet:

```python
import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem("s3")

# Open via fsspec and pass the file object
with fs.open("s3://bucket/file.parquet", "rb") as f:
    table = pq.read_table(f)

# Or pass the URI directly (PyArrow resolves the scheme to a filesystem)
table = pq.read_table("s3://bucket/file.parquet")
```
## obstore fsspec Wrapper

Use obstore's high-performance fsspec wrapper for concurrent operations:

```python
from obstore.fsspec import FsspecStore
import pyarrow.parquet as pq

# Create obstore-backed fsspec filesystem
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")

# Use with PyArrow
table = pq.read_table("data/file.parquet", filesystem=fs)
```
## Dataset Scanning Patterns

See @data-engineering-storage-remote-access/patterns.md for advanced patterns including:

- Incremental loading with checkpoint tracking
- Partitioned writes with Hive partitioning (sketched below)
- Cross-cloud copying
- Performance optimizations (predicate pushdown, column pruning)
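As a taste of the partitioned-write pattern, here is a minimal sketch using `ds.write_dataset` against the S3 filesystem created earlier; the target path, table contents, and partition column are illustrative assumptions:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3_fs = fs.S3FileSystem(region="us-east-1")

# Illustrative table with a "year" column to partition on
table = pa.table({"id": [1, 2], "value": [10.5, 20.0], "year": [2024, 2024]})

ds.write_dataset(
    table,
    base_dir="bucket/dataset/",  # hypothetical bucket/prefix (no s3:// scheme)
    filesystem=s3_fs,
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",  # writes year=2024/ style directories
    existing_data_behavior="overwrite_or_ignore",
)
```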
## Authentication

See @data-engineering-storage-authentication for S3, GCS, and Azure credential configuration with PyArrow filesystems.
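For quick reference, PyArrow's native `S3FileSystem` also accepts explicit credentials; a minimal sketch with placeholder values (when these are omitted, the standard AWS credential chain of environment variables, profiles, and instance roles is used):

```python
import pyarrow.fs as fs

s3_fs = fs.S3FileSystem(
    access_key="AKIA...",  # placeholder access key
    secret_key="...",      # placeholder secret
    region="us-east-1",
    # endpoint_override="https://minio.example.com",  # for S3-compatible stores
)
```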
## Performance Tips

- Column pruning: Always specify `columns=[...]` to reduce data transfer
- Filter pushdown: Use `dataset.scanner(filter=...)` for predicate pushdown
- Row group pruning: Parquet row groups enable partial file reads
- Threading: Enable `use_threads=True` in the scanner for CPU-bound ops
- Batch size: Tune `batch_size` based on downstream processing needs
- File format: Prefer Parquet over CSV/JSON for compression and pushdown
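Row group pruning works because Parquet stores per-column min/max statistics in the file footer; a minimal sketch that inspects them over the S3 filesystem from above (the path is illustrative):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3_fs = fs.S3FileSystem(region="us-east-1")

# Only the footer metadata is needed to inspect row groups
with s3_fs.open_input_file("bucket/file.parquet") as f:  # illustrative path
    metadata = pq.ParquetFile(f).metadata
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        stats = rg.column(0).statistics  # min/max for the first column, if written
        # Readers skip row groups whose min/max cannot satisfy the filter
        print(i, rg.num_rows, stats.min if stats else None, stats.max if stats else None)
```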
## References

- PyArrow Filesystems Guide
- PyArrow Dataset Guide
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow.fs library details