PyArrow.fs: Native Arrow Filesystems

PyArrow provides its own filesystem abstraction optimized for Arrow/Parquet workflows with zero-copy integration.

Installation

# Bundled with PyArrow - no extra deps
pip install pyarrow

Basic Usage

import pyarrow.fs as fs
from pyarrow import parquet as pq

# From URI - auto-detects filesystem type
s3_fs, path = fs.FileSystem.from_uri("s3://bucket/path/to/data/")
print(type(s3_fs))  # <class 'pyarrow._fs.S3FileSystem'>
print(path)         # 'bucket/path/to/data' (bucket is included in the returned path)

# GCS via URI
gcs_fs, path = fs.FileSystem.from_uri("gs://my-bucket/data/")

# Local filesystem
local_fs, path = fs.FileSystem.from_uri("file:///home/user/data/")

S3 Configuration

import pyarrow.fs as fs
from pyarrow.fs import S3FileSystem

# Method 1: Explicit configuration via the constructor
s3_fs = S3FileSystem(
    access_key='AKIA...',
    secret_key='...',
    session_token='...',           # For temporary credentials
    region='us-west-2',
    endpoint_override='https://minio.local:9000',  # S3-compatible
    scheme='https',
    proxy_options={'scheme': 'http', 'host': 'proxy.company.com', 'port': 8080},
    allow_bucket_creation=True,
    retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=5)
)

# Method 2: From URI (reads from environment/AWS config)
s3_fs, path = fs.FileSystem.from_uri("s3://my-bucket/data/")

# File operations (bucket/key paths, not s3:// URIs)
info = s3_fs.get_file_info("bucket/file.parquet")
print(info.size)           # File size in bytes
print(info.mtime)          # Modification time

# Open input stream
with s3_fs.open_input_stream("bucket/file.parquet") as f:
    data = f.read()

# Open output stream for writing
with s3_fs.open_output_stream("bucket/output.parquet") as f:
    f.write(parquet_bytes)  # parquet_bytes: previously serialized Parquet data

# Copy and delete
s3_fs.copy_file("bucket/src.parquet", "bucket/dst.parquet")
s3_fs.delete_file("bucket/old.parquet")

Working with Parquet Datasets

import pyarrow.dataset as ds
import pyarrow.fs as fs

# Create S3 filesystem
s3_fs = fs.S3FileSystem(region='us-east-1')

# Load partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover()
)

print(dataset.schema)
print(f"Rows: {dataset.count_rows()}")

# Filter pushdown (only reads relevant files)
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("month") > 6),
    columns=["id", "value", "timestamp"]  # Column pruning
)

# Scan with custom options
scanner = dataset.scanner(
    filter=ds.field("value") > 100,
    batch_size=65536,
    use_threads=True
)

for batch in scanner.to_batches():
    process(batch)

Azure Support via FSSpec Bridge

import adlfs
import pyarrow.fs as fs
import pyarrow.dataset as ds

# Create Azure filesystem via fsspec
azure_fs = adlfs.AzureBlobFileSystem(
    account_name="myaccount",
    account_key="...",
    tenant_id="...",
    client_id="...",
    client_secret="..."
)

# Wrap in PyArrow filesystem
pa_fs = fs.PyFileSystem(fs.FSSpecHandler(azure_fs))

# Use with PyArrow dataset
dataset = ds.dataset(
    "container/path/",
    filesystem=pa_fs,
    format="parquet"
)

Authentication

See @data-engineering-storage-authentication for S3, GCS, Azure credential configuration.

When to Use PyArrow.fs

Choose pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native
  • You need zero-copy integration with PyArrow datasets
  • Predicate pushdown and column pruning are critical
  • Working with partitioned Parquet datasets
  • You want minimal dependencies (included in PyArrow)

Performance Considerations

  • Column pruning: Use columns= parameter to read only needed columns
  • Predicate pushdown: Filter at dataset level to skip reading irrelevant files
  • Batch scanning: Use scanner.to_batches() for large datasets
  • Threading: Enable use_threads=True for CPU-bound operations
  • ⚠️ For ecosystem integration (pandas, Dask, etc.), fsspec may be more convenient
  • ⚠️ For maximum async performance with many small files, consider obstore

Integration

  • Polars: pl.scan_pyarrow_dataset(dataset) for lazy evaluation
  • PyArrow datasets: Native integration (this is the PyArrow API)
  • Delta Lake/Iceberg: Use PyArrow filesystem when constructing dataset objects
