# Polars: Fast DataFrame Library

A lightning-fast DataFrame library with lazy evaluation and parallel execution.
## When to Use
- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
## Lazy vs Eager Evaluation

| Mode  | Function     | Executes        | Use Case                |
|-------|--------------|-----------------|-------------------------|
| Eager | `read_csv()` | Immediately     | Small data, exploration |
| Lazy  | `scan_csv()` | On `.collect()` | Large data, pipelines   |
**Key concept:** Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
## Core Operations

### Data Selection

| Operation             | Purpose                  |
|-----------------------|--------------------------|
| `select()`            | Choose columns           |
| `filter()`            | Choose rows by condition |
| `with_columns()`      | Add/modify columns       |
| `drop()`              | Remove columns           |
| `head(n)` / `tail(n)` | First/last n rows        |
### Aggregation

| Operation                                | Purpose              |
|------------------------------------------|----------------------|
| `group_by().agg()`                       | Group and aggregate  |
| `pivot()`                                | Reshape long to wide |
| `melt()` (renamed `unpivot()` in Polars ≥ 1.0) | Reshape wide to long |
| `unique()`                               | Distinct rows        |
### Joins

| Join Type                  | Description                    |
|----------------------------|--------------------------------|
| inner                      | Matching rows only             |
| left                       | All left rows + matching right |
| full (`outer` before 1.0)  | All rows from both             |
| cross                      | Cartesian product              |
| semi                       | Left rows with a match         |
| anti                       | Left rows without a match      |
## Expression API

**Key concept:** Polars uses expressions (`pl.col()`) instead of indexing. Expressions are lazily evaluated and optimized.

### Common Expressions

| Expression        | Purpose                        |
|-------------------|--------------------------------|
| `pl.col("name")`  | Reference a column             |
| `pl.lit(value)`   | Literal value                  |
| `pl.all()`        | All columns                    |
| `pl.exclude(...)` | All columns except those named |
### Expression Methods

| Category    | Methods                                              |
|-------------|------------------------------------------------------|
| Aggregation | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`  |
| String      | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| DateTime    | `.dt.year()`, `.dt.month()`, `.dt.day()`             |
| Conditional | `.when().then().otherwise()`                         |
| Window      | `.over()`, `.rolling_mean()`, `.shift()`             |
## Pandas Migration

| Pandas                      | Polars                                             |
|-----------------------------|----------------------------------------------------|
| `df['col']`                 | `df.select('col')`                                 |
| `df[df['col'] > 5]`         | `df.filter(pl.col('col') > 5)`                     |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()`  | `df.group_by('col').agg(pl.all().mean())`          |
| `df.apply(func)`            | `df.map_rows(func)` (avoid if possible)            |
**Key concept:** Polars prefers explicit operations over implicit indexing. Use `.alias()` to name computed columns.
## File I/O

| Format    | Read                              | Write             | Notes                                          |
|-----------|-----------------------------------|-------------------|------------------------------------------------|
| CSV       | `read_csv()` / `scan_csv()`       | `write_csv()`     | Human-readable                                 |
| Parquet   | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed                             |
| JSON      | `read_json()` / `scan_ndjson()`   | `write_json()`    | `scan_ndjson()` expects newline-delimited JSON |
| IPC/Arrow | `read_ipc()` / `scan_ipc()`       | `write_ipc()`     | Zero-copy                                      |
**Key concept:** Use Parquet for performance. Use `scan_*` for large files to enable lazy optimization.
## Performance Tips

| Tip                  | Why                         |
|----------------------|-----------------------------|
| Use lazy mode        | Query optimization          |
| Use Parquet          | Column-oriented, compressed |
| Select columns early | Projection pushdown         |
| Filter early         | Predicate pushdown          |
| Avoid Python UDFs    | Breaks parallelism          |
| Use expressions      | Vectorized operations       |
| Set dtypes on read   | Avoid inference overhead    |
## vs Alternatives

| Tool   | Best For                | Limitations             |
|--------|-------------------------|-------------------------|
| Polars | 1-100GB, speed critical | Must fit in RAM         |
| Pandas | Small data, ecosystem   | Slow, memory hungry     |
| Dask   | Larger than RAM         | More complex API        |
| Spark  | Cluster computing       | Infrastructure overhead |
| DuckDB | SQL interface           | Different API style     |
## Resources