Dask Parallel and Distributed Computing
Scale pandas/NumPy workflows beyond memory and across clusters.
When to Use
- Datasets exceed available RAM
- Need to parallelize pandas or NumPy operations
- Processing multiple files efficiently (CSVs, Parquet)
- Building custom parallel workflows
- Distributing workloads across multiple cores/machines
Dask Collections
| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |
Key concept: All collections are lazy—computation happens only when you call .compute().
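A minimal sketch of the laziness; the file path and the key/value column names are placeholders:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")               # lazy: nothing is read yet
result = ddf.groupby("key")["value"].mean()   # still lazy: builds a task graph
print(result.compute())                       # execution happens only here
```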
Lazy Evaluation
| Function | Behavior | Use |
|---|---|---|
| dd.read_csv() | Lazy load | Large CSVs |
| dd.read_parquet() | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| .compute() | Execute | Get final result |
Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call .compute() once at the end, not after every operation.
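As a sketch of evaluating several results from one graph (the path and column names are assumptions), dask.compute() runs multiple lazy results in a single pass so the shared loading work is not repeated:

```python
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("events/")              # lazy load
daily = ddf.groupby("date")["amount"].sum()   # builds graph
top = ddf.nlargest(10, "amount")              # builds graph, shares the load step
daily_df, top_df = dask.compute(daily, top)   # one execution for both results
```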
Schedulers
| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/pandas (releases GIL) | Default |
| processes | Pure Python (GIL-bound) | scheduler='processes' |
| synchronous | Debugging | scheduler='synchronous' |
| distributed | Monitoring, scaling, clusters | Client() |
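A brief sketch of switching schedulers for the same computation; the file path and column name are illustrative:

```python
import dask
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")
total = ddf["value"].sum()

total.compute()                               # threaded scheduler (DataFrame default)
total.compute(scheduler="synchronous")        # single-threaded, easiest to debug
with dask.config.set(scheduler="processes"):
    total.compute()                           # process pool for GIL-bound Python code
```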
Distributed Scheduler
| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
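A minimal local setup, assuming a single machine; the worker counts and memory limit are arbitrary example values:

```python
from dask.distributed import Client

# Starts a LocalCluster; pass a scheduler address instead to join a real cluster.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")
print(client.dashboard_link)      # open in a browser for live task monitoring

# From here on, .compute() calls run on the distributed scheduler.
```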
Chunking Concepts
DataFrame Partitions
| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| npartitions | Number of partitions |
| divisions | Index boundaries between partitions |
Array Chunks
| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| chunks | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |
Key concept: Chunk size is critical. Too small = scheduling overhead. Too large = memory issues. Target ~100 MB.
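A small sketch of setting chunk and partition sizes; the shapes, paths, and exact sizes are illustrative:

```python
import dask.array as da
import dask.dataframe as dd

# Array chunks are declared at creation time (~100 MB of float64 per chunk here).
x = da.ones((1_000_000_000,), chunks=12_500_000)
print(x.chunks)

# DataFrame partitions can be adjusted after loading.
ddf = dd.read_csv("data/*.csv")
ddf = ddf.repartition(partition_size="100MB")
print(ddf.npartitions)
```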
DataFrame Operations
Supported (parallel)
| Category | Operations |
|---|---|
| Selection | filter, loc, column selection |
| Aggregation | groupby, sum, mean, count |
| Transforms | apply (row-wise), map_partitions |
| Joins | merge, join (shuffles data) |
| I/O | read_csv, read_parquet, to_parquet |
Avoid or Use Carefully
| Operation | Issue | Alternative |
|---|---|---|
| iterrows | Kills parallelism | map_partitions |
| apply(axis=1) | Slow | map_partitions |
| Repeated compute() | Inefficient | Single compute() at end |
| sort_values | Expensive shuffle | Avoid if possible |
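A sketch of the map_partitions pattern that replaces row-wise apply; the path, columns, and fee formula are made up:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_parquet("trades/")

def add_fees(pdf: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas code, run once per partition instead of once per row.
    pdf["fee"] = pdf["price"] * pdf["qty"] * 0.001
    return pdf

ddf = ddf.map_partitions(add_fees)
total_fees = ddf["fee"].sum().compute()   # single compute at the end
```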
Common Patterns
ETL Pipeline
- dd.read_* (lazy load)
- Chain filters and transforms
- Single .compute() or .to_parquet() at the end (see the sketch below)
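A minimal version of that pipeline, with hypothetical paths and column names:

```python
import dask.dataframe as dd

ddf = dd.read_csv("raw/*.csv")                    # lazy load
ddf = ddf[ddf["status"] == "ok"]                  # filter
ddf["revenue"] = ddf["price"] * ddf["quantity"]   # transform
ddf.to_parquet("clean/")                          # single materialization step
```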
Multi-File Processing
| Pattern | Description |
|---|---|
| Glob patterns | dd.read_csv('data/*.csv') |
| Partition per file | Natural parallelism |
| Partitioned output | to_parquet('output/') |
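A glob-based sketch with illustrative paths; small files typically map to one partition each, while large files are split by block size:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/2024-*.csv")   # roughly one partition per matched file
print(ddf.npartitions)
ddf.to_parquet("output/")              # one Parquet file per partition
```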
Custom Operations
| Method | Use Case |
|---|---|
| map_partitions | Apply function to each DataFrame partition |
| map_blocks | Apply function to each array block |
| delayed | Wrap arbitrary Python functions |
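A small delayed sketch; the functions and file names are placeholders:

```python
from dask import delayed

@delayed
def load(path):
    return open(path).read()

@delayed
def count_words(text):
    return len(text.split())

totals = [count_words(load(p)) for p in ["a.txt", "b.txt", "c.txt"]]
grand_total = delayed(sum)(totals)     # the reduction is itself a task
print(grand_total.compute())           # the whole graph runs in parallel
```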
Best Practices
| Practice | Why |
|---|---|
| Don't load locally first | Let Dask handle loading |
| Single compute() at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partitions to files | One partition per file |
| Check task graph size | len(ddf.__dask_graph__()) < 100k |
| Use distributed for debugging | Dashboard shows progress |
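A quick way to run those size checks before computing; the path is illustrative:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("data/")
print(ddf.npartitions)             # keep the partition count reasonable
print(len(ddf.__dask_graph__()))   # graphs above ~100k tasks schedule slowly
```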
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use dd.read_* directly |
| compute() in loops | Collect all, single compute() |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
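A sketch of fixing the compute-in-a-loop pitfall; the path, column names, and region values are assumptions:

```python
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("metrics/")

# Anti-pattern: calling .compute() inside the loop re-reads the data each time.
# Instead, collect the lazy results and compute them together.
lazy_means = [ddf[ddf["region"] == r]["value"].mean() for r in ["us", "eu", "apac"]]
means = dask.compute(*lazy_means)   # one pass over the data for all regions
```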
vs Alternatives
| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |
Resources