# Dask Parallel and Distributed Computing

Scale pandas/NumPy workflows beyond memory and across clusters.
## When to Use
- Datasets exceed available RAM
- Need to parallelize pandas or NumPy operations
- Processing multiple files efficiently (CSVs, Parquet)
- Building custom parallel workflows
- Distributing workloads across multiple cores/machines
## Dask Collections
| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |
**Key concept:** All collections are lazy; computation happens only when you call `.compute()`.
## Lazy Evaluation
| Function | Behavior | Use |
|---|---|---|
| `dd.read_csv()` | Lazy load | Large CSVs |
| `dd.read_parquet()` | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| `.compute()` | Execute | Get final result |
**Key concept:** Dask builds a task graph of operations, optimizes it, then executes in parallel. Call `.compute()` once at the end, not after every operation.
## Schedulers
| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/pandas (releases GIL) | Default |
| processes | Pure Python (GIL-bound) | `scheduler='processes'` |
| synchronous | Debugging | `scheduler='synchronous'` |
| distributed | Monitoring, scaling, clusters | `Client()` |
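The scheduler can be chosen per `compute()` call. A small sketch with a Bag of Python objects (values here are arbitrary):

```python
import dask.bag as db

# A small Bag of pure-Python work; the scheduler is picked at compute() time
bag = db.from_sequence(range(8), npartitions=4)
squares = bag.map(lambda x: x * x)

# Synchronous: single-threaded, ideal when stepping through with a debugger
debug_result = squares.compute(scheduler="synchronous")

# Threads: default for DataFrame/Array, best when NumPy/pandas release the GIL
thread_result = squares.compute(scheduler="threads")

# For GIL-bound pure-Python code, scheduler="processes" trades pickling
# overhead for real parallelism across processes
```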
### Distributed Scheduler
| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
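A minimal local sketch, assuming `dask.distributed` is installed; `processes=False` keeps everything in one process, which is enough to exercise the scheduler without a real cluster:

```python
from dask.distributed import Client
import dask.array as da

# Local "cluster" in a single process -- once a Client exists,
# compute() runs on the distributed scheduler by default
client = Client(processes=False, n_workers=1, threads_per_worker=2)

x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum().compute()  # executed by the distributed scheduler

# client.dashboard_link points at the live monitoring dashboard
client.close()
```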
## Chunking Concepts
### DataFrame Partitions
| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| npartitions | Number of partitions |
| divisions | Index boundaries between partitions |
### Array Chunks
| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| chunks | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |
**Key concept:** Chunk size is critical. Too small causes scheduling overhead; too large causes memory issues. Target ~100 MB.
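A quick illustration of choosing chunk sizes (the array shape is arbitrary): 10,000 × 10,000 float64 values are ~800 MB total, so 2,500 × 2,500 chunks are ~50 MB each, near the recommended target.

```python
import dask.array as da

# Creation is lazy -- only the chunk layout is materialized here
x = da.random.random((10_000, 10_000), chunks=(2_500, 2_500))

print(x.chunks)     # ((2500, 2500, 2500, 2500), (2500, 2500, 2500, 2500))
print(x.numblocks)  # (4, 4) -> 16 chunks in the task graph
```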
## DataFrame Operations

### Supported (parallel)
| Category | Operations |
|---|---|
| Selection | filter, loc, column selection |
| Aggregation | groupby, sum, mean, count |
| Transforms | apply (row-wise), map_partitions |
| Joins | merge, join (shuffles data) |
| I/O | read_csv, read_parquet, to_parquet |
### Avoid or Use Carefully
| Operation | Issue | Alternative |
|---|---|---|
| `iterrows` | Kills parallelism | `map_partitions` |
| `apply(axis=1)` | Slow | `map_partitions` |
| Repeated `compute()` | Inefficient | Single `compute()` at end |
| `sort_values` | Expensive shuffle | Avoid if possible |
## Common Patterns
### ETL Pipeline

1. `dd.read_*` (lazy load)
2. Chain filters and transforms
3. Single `.compute()` or `.to_parquet()`
### Multi-File Processing
| Pattern | Description |
|---|---|
| Glob patterns | `dd.read_csv('data/*.csv')` |
| Partition per file | Natural parallelism |
| Output partitioned | `to_parquet('output/')` |
### Custom Operations
| Method | Use Case |
|---|---|
| `map_partitions` | Apply function to each partition |
| `map_blocks` | Apply function to each array block |
| `delayed` | Wrap arbitrary Python functions |
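A sketch of `delayed` building a custom graph from plain Python functions (the load/clean/summarize stages are hypothetical):

```python
from dask import delayed

# Decorated calls build a graph instead of running immediately
@delayed
def load(i):
    return list(range(i))

@delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@delayed
def summarize(parts):
    return sum(len(p) for p in parts)

# A fan-out/fan-in pipeline: four independent load+clean branches
cleaned = [clean(load(i)) for i in range(1, 5)]
total = summarize(cleaned)  # still lazy

result = total.compute()    # executes the whole graph in parallel
```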
## Best Practices
| Practice | Why |
|---|---|
| Don't load locally first | Let Dask handle loading |
| Single `compute()` at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partitions to files | One partition per file |
| Check task graph size | `len(ddf.__dask_graph__()) < 100k` |
| Use distributed scheduler for debugging | Dashboard shows progress |
## Common Pitfalls
| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use `dd.read_*` directly |
| `compute()` in loops | Collect all, single `compute()` |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
## vs Alternatives
| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |
## Resources
- Docs: https://docs.dask.org/
- Best Practices: https://docs.dask.org/en/stable/best-practices.html
- Examples: https://examples.dask.org/