SKILL.md

Polars Integration with Remote Storage

Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.

Native Cloud Access (object_store)

Polars uses the Rust object_store crate internally for direct cloud URI access:

import polars as pl

# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

# Lazy scanning with predicate and column pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed to storage layer
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count")
    ])
    .collect()
)

# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")

# Partitioned write (Hive-style)
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"]  # Native partitioned write (recent Polars)
)

Supported protocols: s3://, gs://, az://, file://

Via fsspec

Use fsspec for broader compatibility and protocol chaining:

import fsspec
import polars as pl

# Create an fsspec filesystem (backed by s3fs)
fs = fsspec.filesystem("s3", client_kwargs={"region_name": "us-east-1"})

# Open a file through fsspec and pass the file object to Polars
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

# Use fsspec's caching wrapper via the chained-protocol syntax
df = pl.read_parquet("simplecache::s3://bucket/cached.parquet")

# Or construct the cached filesystem explicitly
cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False}
)
with cached_fs.open("s3://bucket/cached.parquet") as f:
    df = pl.read_parquet(f)

Via PyArrow Dataset (Advanced)

For Hive-partitioned datasets with complex pushdown:

import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

# Load partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover()
)

# Convert to Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)

# Query with full pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)

Authentication

Native Polars cloud access inherits credentials from:

  • AWS: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, IAM roles
  • GCP: GOOGLE_APPLICATION_CREDENTIALS, gcloud CLI, metadata server
  • Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY, managed identity

For explicit credentials, use fsspec or PyArrow filesystem constructors.

Performance Tips

  • Use native s3:// URIs for best performance (direct object_store usage)
  • Lazy evaluation with predicates for pushdown
  • Partitioned writes for large datasets (avoid huge single files)
  • Column selection in lazy queries to read only needed data
  • ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors
  • ⚠️ For caching, use fsspec's simplecache:: or filecache:: wrappers

Common Patterns

Incremental Load from Partitioned Data

from datetime import datetime, timedelta

import polars as pl

# Only read recent partitions
lazy_df = pl.scan_parquet("s3://bucket/events/**/*.parquet")
last_month = datetime.now() - timedelta(days=30)

result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)

Cross-Cloud Copy

# Copy from S3 to GCS via the PyArrow bridge (explicit filesystems
# on each side; paths omit the scheme when a filesystem is passed)
import pyarrow.fs as fs
import pyarrow.dataset as ds
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as sink:
    pq.write_table(table, sink)
