Accessing Cloud Storage

Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.

Quick Comparison

| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best for | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | Up to ~9x faster for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |

When to Use Which?

Use fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)
  • Working with multiple storage backends (S3, GCS, Azure, HTTP)
  • You need protocol chaining and caching features
  • Your workflow involves diverse data formats beyond Parquet

Use pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native
  • You need zero-copy integration with PyArrow datasets
  • Predicate pushdown and column pruning are critical
  • Working with partitioned Parquet datasets

Use obstore when:

  • Performance is paramount (many small files, high concurrency)
  • You need async/await support for concurrent operations
  • You want minimal dependencies (Rust-based)
  • Working with large-scale data ingestion/egress

Skill Dependencies

Prerequisites:

  • @building-data-pipelines - Polars, DuckDB, PyArrow basics
  • AWS, GCP, Azure auth patterns (see Authentication section below)
  • @designing-data-storage - File formats (Parquet, Arrow, Lance) and lakehouse formats (Delta Lake, Iceberg, Hudi)

Related:

  • @orchestrating-data-pipelines - dbt with cloud storage

Detailed Guides

Library Deep Dives

This skill contains detailed guidance for all three libraries; see the Library Guides section below.

DataFrame Integrations

  • Polars - Native s3://, gs://, az:// URIs with lazy evaluation and predicate pushdown
  • DuckDB - HTTPFS extension for SQL queries directly on remote Parquet/JSON/CSV
  • Pandas - fsspec auto-detection for transparent cloud URI handling
  • PyArrow - Native filesystem with dataset scanning and batch processing

For detailed patterns, see DataFrame Integration below. For Delta Lake and Iceberg table formats on cloud storage:

  • @designing-data-storage - Delta Lake and Iceberg with cloud catalogs (S3/GCS/Azure)

Infrastructure Patterns

  • AWS, GCP, Azure auth patterns, IAM roles, service principals (see Authentication section below)
  • See performance.md in this skill - Caching, concurrency, async
  • See patterns.md in this skill - Incremental loading, partitioned writes, cross-cloud copy

Storage Formats

  • @designing-data-storage - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC

Quick Start Example

Library Approaches

import fsspec
import pyarrow.fs as fs
import pyarrow.parquet as pq
import obstore as obs

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    data = f.read()

# Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table("bucket/data.parquet", filesystem=s3_pa)

# Method 3: obstore (high-performance)
from obstore.store import S3Store
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()

DataFrame Approaches

import polars as pl
import duckdb

# Polars: Native cloud URI (simplest)
df = pl.read_parquet("s3://bucket/data.parquet")
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")

# DuckDB: SQL on remote files
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()

# All approaches work - choose based on your performance and ecosystem needs

Library Guides

fsspec Library Guide

fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.

Installation

# Core only (no remote support)
pip install fsspec

# With specific backends
pip install fsspec[s3]        # S3 via s3fs
pip install fsspec[gcs]       # GCS via gcsfs
pip install fsspec[s3,gcs,azure]  # Multiple backends

# Or install backends directly
pip install s3fs gcsfs adlfs

Basic Usage

import fsspec
import pandas as pd

# List available protocols
print(fsspec.available_protocols())
# ['file', 'memory', 'http', 'https', 's3', 's3a', 'gcs', 'gs', 'abfss', ...]

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)  # Uses boto3 credentials
gcs_fs = fsspec.filesystem('gcs')             # Uses GCP credentials

# Basic operations
s3_fs.ls('my-bucket/data/')                   # List files
s3_fs.exists('s3://my-bucket/data/file.csv')       # Check existence
s3_fs.mkdir('my-bucket/new-folder')           # Create directory

# Read file as bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Read CSV directly into pandas
with s3_fs.open('s3://my-bucket/data/large.csv', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')

Protocol Chaining & Caching

# SimpleCache: Cache remote files locally for faster repeated access
import fsspec

# First read downloads, subsequent reads use cache
cached_file = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache', 'compression': None}
)

# Chain multiple protocols
# Read from HTTPS, cache locally, decompress on the fly
with fsspec.open(
    "simplecache::gzip::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)

# Other useful wrappers:
# - "filecache::" - Persistent disk cache
# - "gzip::" - Decompression
# - "zip::" - Zip file access
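Because every fsspec backend implements the same interface, the in-memory backend is a convenient stand-in for exercising cloud pipeline code locally. A minimal sketch (no cloud credentials needed; the paths are placeholders):

```python
import fsspec

# The 'memory' backend exposes the same API as s3/gcs/abfs, so cloud
# pipeline code can be tested locally without credentials.
mem = fsspec.filesystem("memory")
with mem.open("/bucket/data/file.txt", "wb") as f:
    f.write(b"hello")

assert mem.exists("/bucket/data/file.txt")
content = mem.cat("/bucket/data/file.txt")  # returns bytes
```

Swapping `"memory"` for `"s3"` leaves the rest of the code unchanged, which is the main payoff of the unified API.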

Advanced S3 Features

import s3fs

# Detailed S3 configuration
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',              # Temporary session token
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',  # MinIO, etc.
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True   # Don't cache bucket listings
)

# Async operations
import asyncio

async def read_multiple():
    fs = s3fs.S3FileSystem(asynchronous=True)
    await fs.set_session()  # Establish async session

    # Concurrent reads (use _cat_file for bytes)
    data = await asyncio.gather(
        fs._cat_file('bucket/file1.parquet'),
        fs._cat_file('bucket/file2.parquet'),
        fs._cat_file('bucket/file3.parquet')
    )
    return data

# S3-specific features
fs.find('my-bucket', prefix='data/2024')  # List with prefix
fs.du('my-bucket/data')                   # Disk usage
fs.rm('my-bucket/temp/', recursive=True)  # Recursive delete

When to Use fsspec

Choose fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)
  • Working with multiple storage backends (S3, GCS, Azure, HTTP)
  • You need protocol chaining and caching features
  • Your workflow involves diverse data formats beyond Parquet

Performance Considerations

  • ✅ Use filecache:: instead of simplecache:: for persistent caching across sessions
  • ✅ Increase max_pool_connections for high concurrency
  • ✅ Use async API for many concurrent small file operations
  • ⚠️ For pure Parquet workflows with high throughput, consider pyarrow.fs instead
  • ⚠️ For maximum performance on large concurrent operations, consider obstore

PyArrow Filesystem Guide

PyArrow provides native filesystem integration optimized for Arrow and Parquet workflows.

Installation

pip install pyarrow

Basic Usage

import pyarrow.fs as fs
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Create filesystem instances
s3_fs = fs.S3FileSystem(region='us-east-1')
gcs_fs = fs.GcsFileSystem()
local_fs = fs.LocalFileSystem()

# Read Parquet with column pruning
table = pq.read_table(
    "bucket/data.parquet",
    filesystem=s3_fs,
    columns=["id", "value"]  # Only read needed columns
)

# Dataset scanning with predicate pushdown
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    partitioning=ds.HivePartitioning.discover()
)

# Filter at storage layer
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value"]
)

When to Use pyarrow.fs

Choose pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native
  • You need zero-copy integration with PyArrow datasets
  • Predicate pushdown and column pruning are critical
  • Working with partitioned Parquet datasets

Performance Considerations

  • ✅ Excellent for Parquet workflows with high throughput
  • ✅ Zero-copy data transfer with Arrow-native tools
  • ✅ Efficient predicate pushdown and column pruning
  • ⚠️ Limited async support compared to obstore
  • ⚠️ Fewer protocol options than fsspec

obstore Library Guide

obstore is a high-performance Rust-based library for cloud storage access with native async support.

Installation

pip install obstore

Basic Usage

import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore

# Create store instances
s3_store = S3Store(bucket='my-bucket', region='us-east-1')
gcs_store = GCSStore(bucket='my-bucket')
azure_store = AzureStore(container='my-container', account='myaccount')

# Get object bytes
data = obs.get(s3_store, 'path/to/file.parquet').bytes()

# List objects (obstore yields listing results in batches)
for batch in obs.list(s3_store, prefix='data/2024'):
    for meta in batch:
        print(meta["path"], meta["size"])

# Put object (accepts a bytes-like payload)
obs.put(s3_store, 'output/data.parquet', data)

Async Operations

import asyncio
import obstore as obs
from obstore.store import S3Store

async def fetch_multiple():
    store = S3Store(bucket='my-bucket', region='us-east-1')
    
    # Concurrent fetches
    results = await asyncio.gather(
        obs.get_async(store, 'file1.parquet'),
        obs.get_async(store, 'file2.parquet'),
        obs.get_async(store, 'file3.parquet')
    )
    return results

# Run async
results = asyncio.run(fetch_multiple())

When to Use obstore

Choose obstore when:

  • Performance is paramount (many small files, high concurrency)
  • You need async/await support for concurrent operations
  • You want minimal dependencies (Rust-based)
  • Working with large-scale data ingestion/egress

Performance Considerations

  • ✅ Up to ~9x faster than fsspec for concurrent operations (per obstore benchmarks)
  • ✅ Native sync/async support
  • ✅ Zero Python dependencies
  • ✅ Rust-based implementation
  • ⚠️ Newer library (2025), rapidly evolving
  • ⚠️ Smaller ecosystem than fsspec

DataFrame Integration

DataFrame libraries provide high-level abstractions for cloud storage I/O. This section covers integration patterns for Polars, DuckDB, Pandas, and PyArrow.

Quick Comparison

| Framework | Integration approach | Best for |
|---|---|---|
| Polars | Native cloud URIs (s3://) + fsspec/PyArrow bridges | High-performance, lazy evaluation |
| DuckDB | HTTPFS extension + SQL interface | Analytical queries, SQL workflows |
| Pandas | fsspec auto-detection | Simple workflows, broad compatibility |
| PyArrow | Native filesystem + dataset scanning | Arrow-native pipelines, batch processing |

When to Use Which?

  • Polars: Best for high-performance data pipelines with lazy evaluation, predicate pushdown, and memory efficiency
  • DuckDB: Best for SQL-centric workflows, analytical queries on remote data without loading into memory
  • Pandas: Best for simple scripts, small-to-medium data, maximum ecosystem compatibility
  • PyArrow: Best for Arrow-native workflows, batch processing, and as a foundation for other tools

Polars

Polars provides native cloud storage support via the Rust object_store crate, plus fsspec and PyArrow integration for broader compatibility.

Key approaches:

  • Native URIs: Direct s3://, gs://, az:// support (recommended)
  • fsspec bridge: For protocol chaining and caching
  • PyArrow dataset: For Hive-partitioned datasets with complex pushdown

import polars as pl

# Native cloud URIs (simplest, best performance)
df = pl.read_parquet("s3://bucket/data.parquet")
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")

# Lazy evaluation with predicate pushdown
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed to storage layer
    .select(["id", "value"])
    .collect()
)

# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")

# Partitioned write (Hive-style; requires a recent Polars release)
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"]
)
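When credentials cannot come from the environment, Polars' native readers accept a `storage_options` dict whose keys follow the Rust object_store convention. A sketch (all values and the bucket path are placeholders):

```python
# Explicit credentials for Polars' native readers via storage_options
# (keys follow the Rust object_store convention; values are placeholders):
storage_options = {
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "...",
    "aws_region": "us-east-1",
}

# Example call (requires network access and real credentials):
# pl.scan_parquet("s3://bucket/data/*.parquet", storage_options=storage_options)
```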

fsspec bridge for caching:

import fsspec

# Cache the remote file locally on first read; later reads hit the cache
with fsspec.open("simplecache::s3://bucket/cached.parquet", "rb") as f:
    df = pl.read_parquet(f)

See: @building-data-pipelines for Polars fundamentals.


DuckDB

DuckDB's HTTPFS extension enables direct SQL queries on remote files without loading entire datasets into memory.

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Configure credentials (or use environment variables)
con.execute("SET s3_region='us-east-1';")

# Query Parquet directly from S3
df = con.sql("""
    SELECT category, SUM(value) as total
    FROM read_parquet('s3://bucket/data/*.parquet')
    WHERE date >= '2024-01-01'
    GROUP BY category
""").pl()

# Copy operations
con.execute("""
    COPY (SELECT * FROM my_table)
    TO 's3://bucket/output.parquet'
    (FORMAT PARQUET)
""")

Environment-based auth (recommended):

import os
os.environ['AWS_REGION'] = 'us-east-1'
# DuckDB reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY automatically
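Recent DuckDB releases (0.10+) also provide a secrets manager for credentials; `CREDENTIAL_CHAIN` defers to the standard AWS provider chain. A hedged sketch (the secret name and region are placeholders, and the statement needs the httpfs/aws extensions loaded before it will run):

```python
# DuckDB >= 0.10 secrets manager: CREDENTIAL_CHAIN resolves credentials
# via env vars, config files, then IAM roles, in that order.
create_secret_sql = """
CREATE SECRET my_s3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    REGION 'us-east-1'
);
"""
# con.execute(create_secret_sql)  # requires the httpfs/aws extensions
```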

Delta Lake integration:

con.execute("INSTALL delta; LOAD delta;")
df = con.sql("SELECT * FROM delta_scan('s3://bucket/delta-table/')").pl()

See: @building-data-pipelines for DuckDB fundamentals.


Pandas

Pandas leverages fsspec for automatic cloud URI handling, making remote files as easy to work with as local ones.

import pandas as pd

# Auto-detection via fsspec
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_csv("s3://bucket/data.csv.gz")  # Compression auto-detected

# Explicit filesystem for control
import fsspec
fs = fsspec.filesystem("s3")
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],  # Column pruning
    filters=[("date", ">=", "2024-01-01")]  # Row group filtering
)

# Partitioned writes
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs
)
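Since pandas forwards any fsspec URI transparently, the in-memory backend exercises the same code path as `s3://` without credentials; a small sketch (the bucket path is a placeholder):

```python
import pandas as pd

# pandas hands any protocol URI to fsspec; memory:// behaves like s3://
# but needs no credentials, which makes it useful in tests.
df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
df.to_csv("memory://bucket/data.csv", index=False)

roundtrip = pd.read_csv("memory://bucket/data.csv")

# For real buckets, backend options go through storage_options, e.g.:
# pd.read_csv("s3://bucket/data.csv", storage_options={"anon": True})
```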

PyArrow filesystem for better performance:

import pyarrow.fs as fs
s3_fs = fs.S3FileSystem(region="us-east-1")
df = pd.read_parquet("bucket/data.parquet", filesystem=s3_fs)

See: @building-data-pipelines for pandas alternatives (Polars recommended for large data).


PyArrow

PyArrow provides the foundation for many DataFrame libraries with native filesystem integration and efficient dataset scanning.

import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Native filesystem
s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column pruning
table = pq.read_table(
    "bucket/file.parquet",
    filesystem=s3_fs,
    columns=["id", "value"]
)

# Dataset with predicate pushdown
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    partitioning=ds.HivePartitioning.discover()
)

# Filter at storage layer
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value"]
)

# Batch scanning for large datasets
scanner = dataset.scanner(
    filter=ds.field("value") > 0,
    batch_size=65536
)
for batch in scanner.to_batches():
    process(batch)

fsspec bridge:

import fsspec
fs = fsspec.filesystem("s3")
with fs.open("s3://bucket/file.parquet", "rb") as f:
    table = pq.read_table(f)

See: @building-data-pipelines for PyArrow fundamentals.


Format Considerations

For detailed information on storage formats (Parquet, Arrow, Lance, Zarr, Avro, ORC) and lakehouse table formats (Delta Lake, Iceberg, Hudi), including compression, schema evolution, and format selection guidance, see @designing-data-storage. This section focuses on I/O patterns, not format internals.


Authentication

All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.
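The environment-variable step of that chain looks the same regardless of library; a sketch with placeholder values (not real credentials):

```python
import os

# Standard environment variables recognized across the three libraries
# (all values here are placeholders, not real credentials):
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."                        # S3
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."                        # S3
os.environ["AWS_REGION"] = "us-east-1"                             # S3
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa.json"  # GCS
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "myaccount"             # Azure

# With these set, fsspec.filesystem('s3'), fs.S3FileSystem(), and
# S3Store(bucket=...) can all resolve credentials without explicit args.
```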

Performance Optimization

Key strategies:

  • Caching: fsspec's SimpleCache for repeated access
  • Concurrency: obstore async API for many small files
  • Predicate pushdown: Filter at storage layer using partitioning
  • Column pruning: Read only required columns
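The concurrency strategy above usually needs a bound so thousands of objects don't open at once; a sketch using an `asyncio.Semaphore`, where `fetch` is a stub standing in for a real call such as obstore's `get_async`:

```python
import asyncio

# Bounded concurrency for many small objects; `fetch` is a stub here.
async def fetch(path: str) -> str:
    await asyncio.sleep(0)  # placeholder for network I/O
    return path

async def fetch_all(paths, max_concurrency: int = 16):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        async with sem:  # at most max_concurrency fetches in flight
            return await fetch(path)

    return await asyncio.gather(*(bounded(p) for p in paths))

results = asyncio.run(fetch_all([f"part-{i}.parquet" for i in range(100)]))
```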

See: performance.md in this skill for detailed guidance.

