# Polars Integration with Remote Storage
Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.
## Native Cloud Access (object_store)

Polars uses the Rust `object_store` crate internally for direct cloud URI access:
```python
import polars as pl

# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)

# Lazy scanning with predicate and column pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Pushed to the storage layer
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)

# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")

# Partitioned write (Hive-style); requires a recent Polars release
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
)
```
Supported protocols: `s3://`, `gs://`, `az://`, `file://`
## Via fsspec

Use fsspec for broader compatibility and protocol chaining:
```python
import fsspec
import polars as pl

# Create an fsspec filesystem (s3fs under the hood)
fs = fsspec.filesystem("s3", client_kwargs={"region_name": "us-east-1"})

# Open a file through fsspec and hand the file object to Polars
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)

# Use fsspec's caching wrapper via protocol chaining;
# target-protocol options are passed under the protocol's name
with fsspec.open(
    "simplecache::s3://bucket/cached.parquet",
    s3={"anon": False},
) as f:
    df = pl.read_parquet(f)
```
## Via PyArrow Dataset (Advanced)

For Hive-partitioned datasets with complex pushdown:
```python
import pyarrow.dataset as ds
import pyarrow.fs as fs
import polars as pl

s3_fs = fs.S3FileSystem(region="us-east-1")

# Load partitioned dataset (path is given without the s3:// scheme
# because the filesystem is passed explicitly)
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)

# Convert to a Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)

# Query with full pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)
```
## Authentication

Native Polars cloud access inherits credentials from:

- AWS: environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), `~/.aws/credentials`, IAM roles
- GCP: `GOOGLE_APPLICATION_CREDENTIALS`, the gcloud CLI, the metadata server
- Azure: `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`, managed identity

For explicit credentials, pass `storage_options` to the Polars readers or use fsspec/PyArrow filesystem constructors.
## Performance Tips

- ✅ Use native `s3://` URIs for best performance (direct object_store usage)
- ✅ Lazy evaluation with predicates for pushdown
- ✅ Partitioned writes for large datasets (avoid huge single files)
- ✅ Column selection in lazy queries to read only needed data
- ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors
- ⚠️ For caching, use fsspec's `simplecache::` or `filecache::` wrappers
## Common Patterns

### Incremental Load from Partitioned Data
```python
from datetime import datetime, timedelta

import polars as pl

# Only read recent partitions; the date filter is pushed down to the scan
lazy_df = pl.scan_parquet("s3://bucket/events/**/*.parquet")

last_month = datetime.now() - timedelta(days=30)
result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
```
### Cross-Cloud Copy
```python
# Read from S3, write to GCS (Polars doesn't support mixed cloud URIs directly)
# Use the PyArrow bridge:
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()

# With an explicit filesystem, paths omit the scheme prefix
dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()

with gcs.open_output_stream("bucket/output.parquet") as sink:
    pq.write_table(table, sink)
```
## References

- Polars Cloud Storage Guide
- Polars File System Backends
- @data-engineering-storage-remote-access/libraries/fsspec - fsspec usage
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem