Polars Integration with Remote Storage
Polars has native cloud storage support via multiple backends, plus integration with fsspec and PyArrow filesystems.
Native Cloud Access (object_store)
Polars uses the Rust object_store crate internally for direct cloud URI access:
import polars as pl
# Read from cloud URIs directly (s3://, gs://, az://)
df = pl.read_parquet("s3://bucket/data/file.parquet")
df = pl.read_parquet("gs://bucket/data/file.parquet")
df = pl.read_csv("s3://bucket/data/file.csv.gz", infer_schema_length=1000)
# Lazy scanning with predicate and column pushdown
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")  # Predicate pushed down into the scan
    .group_by("category")
    .agg([
        pl.col("value").sum().alias("total_value"),
        pl.col("id").count().alias("count"),
    ])
    .collect()
)
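# To confirm the filter is pushed into the scan (rather than applied
# after download), inspect the optimized plan: the predicate shows up
# inside the PARQUET SCAN node, not as a separate FILTER step.
# Exact plan text varies by Polars version.
print(
    lazy_df
    .filter(pl.col("date") > "2024-01-01")
    .explain()
)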
# Write to cloud storage
df.write_parquet("s3://bucket/output/data.parquet")
# Partitioned write (Hive-style)
df.write_parquet(
    "s3://bucket/output/",
    partition_by=["year", "month"],
    use_pyarrow=True,  # Partitioned writes go through PyArrow's dataset writer
)
Supported protocols: s3://, gs://, az://, file://
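Recent Polars versions can also sink a lazy query straight to object storage, avoiding a full in-memory collect. A sketch, assuming your Polars build supports cloud sinks:
# Stream query output to S3 without collecting it first (requires a
# Polars version with cloud sink support)
(
    pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
    .filter(pl.col("value") > 0)
    .sink_parquet("s3://bucket/output/filtered.parquet")
)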
Via fsspec
Use fsspec for broader compatibility and protocol chaining:
import fsspec
import polars as pl
# Create fsspec filesystem
fs = fsspec.filesystem("s3", client_kwargs={"region_name": "us-east-1"})
# Open file through fsspec
with fs.open("s3://bucket/data.csv") as f:
    df = pl.read_csv(f)
# Use fsspec caching wrapper
cached_fs = fsspec.filesystem(
    "simplecache",
    target_protocol="s3",
    target_options={"anon": False},
)
with cached_fs.open("bucket/cached.parquet") as f:
    df = pl.read_parquet(f)
# Equivalent: chained-protocol URL, resolved through fsspec
df = pl.read_parquet("simplecache::s3://bucket/cached.parquet")
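fsspec file handles also work in the write direction, which helps for protocols the native layer does not cover. A short sketch reusing the fs object from above:
# Write through an fsspec file handle (works for any fsspec protocol)
with fs.open("s3://bucket/output/data.parquet", "wb") as f:
    df.write_parquet(f)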
Via PyArrow Dataset (Advanced)
For Hive-partitioned datasets with complex pushdown:
import pyarrow.fs as fs
import pyarrow.dataset as ds
import polars as pl
s3_fs = fs.S3FileSystem(region="us-east-1")
# Load partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    format="parquet",
    partitioning=ds.HivePartitioning.discover(),
)
# Convert to Polars lazy frame
lazy_df = pl.scan_pyarrow_dataset(dataset)
# Query with full pushdown
result = (
    lazy_df
    .filter((pl.col("year") == 2024) & (pl.col("month") <= 6))
    .select(["id", "value", "timestamp"])
    .collect()
)
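To check which files a partition predicate will actually touch before handing the dataset to Polars, you can enumerate its surviving fragments. A small debugging sketch using the dataset from above:
# List the parquet files that survive partition pruning for a predicate
fragments = dataset.get_fragments(filter=(ds.field("year") == 2024))
print([frag.path for frag in fragments][:5])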
Authentication
Native Polars cloud access inherits credentials from:
- AWS: environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, IAM roles
- GCP: GOOGLE_APPLICATION_CREDENTIALS, gcloud CLI, metadata server
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY, managed identity
For explicit credentials, pass storage_options to the Polars readers, or use fsspec/PyArrow filesystem constructors.
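The storage_options dict is forwarded to the underlying object_store backend. A minimal sketch; the key names below follow object_store conventions and should be checked against your Polars version:
# Explicit credentials via storage_options (key names follow
# object_store conventions; placeholder values)
lazy_df = pl.scan_parquet(
    "s3://bucket/data/*.parquet",
    storage_options={
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
        "aws_region": "us-east-1",
    },
)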
Performance Tips
- ✅ Use native s3:// URIs for best performance (direct object_store usage)
- ✅ Lazy evaluation with predicates for pushdown
- ✅ Partitioned writes for large datasets (avoid huge single files)
- ✅ Column selection in lazy queries to read only needed data
- ⚠️ For complex authentication (SSO, temporary creds), use fsspec/PyArrow constructors
- ⚠️ For caching, use fsspec's simplecache:: or filecache:: wrappers (see the sketch after this list)
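For the caching tip above, a concrete sketch with fsspec's filecache, which persists fetched objects to a local directory so repeated reads skip the network:
import fsspec
import polars as pl

# filecache keeps whole remote files on local disk between reads
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    cache_storage="/tmp/s3-cache",  # local cache directory
)
with fs.open("bucket/big.parquet") as f:
    df = pl.read_parquet(f)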
Common Patterns
Incremental Load from Partitioned Data
# Only read recent partitions
from datetime import datetime, timedelta

import polars as pl

lazy_df = pl.scan_parquet("s3://bucket/events/**/*.parquet")
last_month = datetime.now() - timedelta(days=30)
result = (
    lazy_df
    .filter(pl.col("date") >= last_month)
    .collect()
)
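If the events dataset is hive-partitioned (hypothetical year=/month= directories), filtering on the partition columns prunes whole directories rather than just row groups:
# hive_partitioning=True exposes partition keys as regular columns;
# filters on them skip entire directories (hypothetical layout)
recent = (
    pl.scan_parquet("s3://bucket/events/**/*.parquet", hive_partitioning=True)
    .filter((pl.col("year") == 2024) & (pl.col("month") >= 11))
    .collect()
)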
Cross-Cloud Copy
# Read from S3, write to GCS (Polars doesn't support mixed URIs directly)
# Use PyArrow as the bridge:
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3 = fs.S3FileSystem()
gcs = fs.GcsFileSystem()
# Paths are given without a scheme when an explicit filesystem is passed
dataset = ds.dataset("bucket/input/", filesystem=s3, format="parquet")
table = dataset.to_table()
with gcs.open_output_stream("bucket/output.parquet") as out:
    pq.write_table(table, out)
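When the source is too large to materialize as a single table, ds.write_dataset can stream batches from the source dataset to the destination filesystem instead:
# Stream batches to GCS without building one big in-memory table
ds.write_dataset(
    dataset,
    "bucket/output/",
    filesystem=gcs,
    format="parquet",
)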
References
- Polars Cloud Storage Guide
- Polars File System Backends
- @data-engineering-storage-remote-access/libraries/fsspec - fsspec usage
- @data-engineering-storage-remote-access/libraries/pyarrow-fs - PyArrow filesystem