# Polars: Fast DataFrame Library

A lightning-fast DataFrame library with lazy evaluation and parallel execution.
## When to Use
- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
## Lazy vs Eager Evaluation

| Mode  | Function     | Executes        | Use Case                |
|-------|--------------|-----------------|-------------------------|
| Eager | `read_csv()` | Immediately     | Small data, exploration |
| Lazy  | `scan_csv()` | On `.collect()` | Large data, pipelines   |
**Key concept:** Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
## Core Operations

### Data Selection

| Operation             | Purpose                  |
|-----------------------|--------------------------|
| `select()`            | Choose columns           |
| `filter()`            | Choose rows by condition |
| `with_columns()`      | Add/modify columns       |
| `drop()`              | Remove columns           |
| `head(n)` / `tail(n)` | First/last n rows        |
### Aggregation

| Operation                                | Purpose              |
|------------------------------------------|----------------------|
| `group_by().agg()`                       | Group and aggregate  |
| `pivot()`                                | Reshape long to wide |
| `melt()` (renamed `unpivot()` in Polars ≥ 1.0) | Reshape wide to long |
| `unique()`                               | Distinct rows        |
### Joins

| Join Type                  | Description                    |
|----------------------------|--------------------------------|
| inner                      | Matching rows only             |
| left                       | All left rows + matching right |
| full (`outer` before 1.0)  | All rows from both             |
| cross                      | Cartesian product              |
| semi                       | Left rows with a match         |
| anti                       | Left rows without a match      |
## Expression API

**Key concept:** Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.

### Common Expressions

| Expression        | Purpose                        |
|-------------------|--------------------------------|
| `pl.col("name")`  | Reference a column             |
| `pl.lit(value)`   | Literal value                  |
| `pl.all()`        | All columns                    |
| `pl.exclude(...)` | All columns except those named |
### Expression Methods

| Category    | Methods                                              |
|-------------|------------------------------------------------------|
| Aggregation | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`  |
| String      | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| DateTime    | `.dt.year()`, `.dt.month()`, `.dt.day()`             |
| Conditional | `.when().then().otherwise()`                         |
| Window      | `.over()`, `.rolling_mean()`, `.shift()`             |
## Pandas Migration

| Pandas                      | Polars                                             |
|-----------------------------|----------------------------------------------------|
| `df['col']`                 | `df.select('col')`                                 |
| `df[df['col'] > 5]`         | `df.filter(pl.col('col') > 5)`                     |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()`  | `df.group_by('col').agg(pl.all().mean())`          |
| `df.apply(func)`            | `df.map_rows(func)` (avoid if possible)            |
**Key concept:** Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.
## File I/O

| Format    | Read                              | Write             | Notes                                          |
|-----------|-----------------------------------|-------------------|------------------------------------------------|
| CSV       | `read_csv()` / `scan_csv()`       | `write_csv()`     | Human-readable                                 |
| Parquet   | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed                             |
| JSON      | `read_json()` / `scan_ndjson()`   | `write_json()`    | `scan_ndjson()` expects newline-delimited JSON |
| IPC/Arrow | `read_ipc()` / `scan_ipc()`       | `write_ipc()`     | Zero-copy                                      |
**Key concept:** Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.
## Performance Tips

| Tip                  | Why                         |
|----------------------|-----------------------------|
| Use lazy mode        | Query optimization          |
| Use Parquet          | Column-oriented, compressed |
| Select columns early | Projection pushdown         |
| Filter early         | Predicate pushdown          |
| Avoid Python UDFs    | Breaks parallelism          |
| Use expressions      | Vectorized operations       |
| Set dtypes on read   | Avoid inference overhead    |
## vs Alternatives

| Tool   | Best For                | Limitations             |
|--------|-------------------------|-------------------------|
| Polars | 1-100GB, speed critical | Must fit in RAM         |
| Pandas | Small data, ecosystem   | Slow, memory hungry     |
| Dask   | Larger than RAM         | More complex API        |
| Spark  | Cluster computing       | Infrastructure overhead |
| DuckDB | SQL interface           | Different API style     |
## Resources