Polars Integration with Remote Storage
Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.
Native Cloud Access (object_store)
Polars uses the Rust object_store crate internally for direct cloud URI access:
import polars as pl
# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)
# Lazy scanning with predicate and column pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    # Predicate is pushed down into the Parquet scan; assumes "date" is a Date column
    .filter(pl.col("date") > pl.date(2024, 1, 1))
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)
# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")
# Partitioned write (Hive-style) through PyArrow
df.write_parquet(
    "s3://bucket/output/",
    use_pyarrow=True,  # Requires PyArrow
    pyarrow_options={"partition_cols": ["year", "month"]},
)
Supported protocols: s3://, gs://, az://, file://
Via fsspec
Use fsspec for broader compatibility and protocol chaining:
import fsspec
import polars as pl
# Create fsspec filesystem (requires s3fs; the region is passed to the boto client)
fs = fsspec.filesystem("s3", client_kwargs={"region_name": "us-east-1"})
# Open file through fsspec
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)
# Use fsspec caching wrapper
cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)
# Read through the caching filesystem so the file is downloaded once and reused
with cached_fs.open("bucket/cached.parquet", "rb") as f:
    df = pl.read_parquet(f)
Via PyArrow Dataset (Advanced)
For Hive-partitioned datasets with complex pushdown:
import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl
s3_fs = fs.S3FileSystem(region="us-east-1")
# Load partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)
# Convert to Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)
# Query with full pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)
Authentication
Native Polars cloud access inherits credentials from:
- AWS: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, IAM roles
- GCP: GOOGLE_APPLICATION_CREDENTIALS, gcloud CLI, metadata server
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY, managed identity
For explicit credentials, pass storage_options to the Polars reader or writer, or use fsspec or PyArrow filesystem constructors.
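When credentials have to be supplied explicitly to the native reader, Polars accepts a `storage_options` dictionary on calls such as `scan_parquet` and `read_parquet`. The sketch below builds that dictionary from environment-style variables; the helper name `aws_storage_options` is hypothetical, the placeholder values are not real credentials, and the scan itself is left commented out because it needs a real bucket.

```python
import os
from typing import Mapping


def aws_storage_options(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: map AWS environment variables to storage_options keys."""
    mapping = {
        "AWS_ACCESS_KEY_ID": "aws_access_key_id",
        "AWS_SECRET_ACCESS_KEY": "aws_secret_access_key",
        "AWS_SESSION_TOKEN": "aws_session_token",  # present for temporary credentials
        "AWS_REGION": "aws_region",
    }
    # Skip variables that are unset or empty.
    return {option: env[var] for var, option in mapping.items() if env.get(var)}


options = aws_storage_options(
    {
        "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID",
        "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
        "AWS_REGION": "us-east-1",
    }
)
print(options)

# Not executed here, since it needs a real bucket and valid credentials:
# import polars as pl
# lazy_df = pl.scan_parquet("s3://bucket/data/file.parquet", storage_options=options)
```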
Performance Tips
- ✅ Use native s3:// URIs for best performance (direct object_store usage)
- ✅ Lazy evaluation with predicates for pushdown
- ✅ Partitioned writes for large datasets (avoid huge single files)
- ✅ Column selection in lazy queries to read only needed data
- ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors
- ⚠️ For caching, use fsspec's simplecache:: or filecache:: wrappers
Common Patterns
Incremental Load from Partitioned Data
from datetime import datetime, timedelta
import polars as pl
# Only read recent partitions
lazy_df = pl.scan_parquet("s3://bucket/events/")
last_month = datetime.now() - timedelta(days=30)
result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
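An alternative to filtering after the scan is to list only the partition paths that need reading. The sketch below assumes a hypothetical daily layout of the form `date=YYYY-MM-DD/` under the dataset root, which the original does not specify; `recent_partition_globs` is an illustrative helper, not a Polars function.

```python
from datetime import date, timedelta


def recent_partition_globs(base_uri: str, today: date, days: int) -> list[str]:
    """Return one glob per daily partition, oldest first, covering `days` days up to `today`."""
    base = base_uri.rstrip("/")
    start = today - timedelta(days=days - 1)
    return [
        f"{base}/date={(start + timedelta(days=i)).isoformat()}/*.parquet"
        for i in range(days)
    ]


globs = recent_partition_globs("s3://bucket/events/", date(2024, 3, 2), 3)
for pattern in globs:
    print(pattern)

# Not executed here, since it needs real data and credentials:
# import polars as pl
# lazy_df = pl.scan_parquet(globs)
```

Whether this is worthwhile depends on the dataset: with Hive partitioning, the filter-based version above may already skip older partitions, so the explicit list mainly helps when listing the whole prefix is slow.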
Cross-Cloud Copy
# Read from S3, write to GCS using PyArrow filesystems as a bridge
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq
s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()
# With an explicit filesystem, paths omit the URI scheme
dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()  # Loads the whole dataset into memory
pq.write_table(table, "bucket/output.parquet", filesystem=gcs)
References
- Polars Cloud Storage Guide
- Polars File System Backends
- @data-engineering-storage-remote-access/libraries/fsspec - fsspec usage
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem