large-data-with-dask
SKILL.md
Large Data With Dask Skill
- Consider using Dask for larger-than-memory datasets.
Iron Laws
- ALWAYS build the full pipeline lazily and materialize it once at the end with a single `dask.compute()` call. Each intermediate `compute()` call executes the graph built so far and returns an in-memory result, which breaks up the lazy evaluation graph and removes Dask's ability to fuse and parallelize operations across the whole pipeline.
- NEVER use `df.apply(lambda ...)` on Dask DataFrames for element-wise operations. Pandas-style `apply` forces row-by-row Python execution that bypasses the vectorized pandas and NumPy code paths and can be slower than single-threaded Pandas.
- ALWAYS specify partition sizes explicitly when reading large datasets (`blocksize=` for CSV; for Parquet, `split_row_groups=` or, depending on the Dask version, `blocksize=` or `chunksize=`). Auto-detected partition sizes can produce thousands of tiny partitions (scheduler overhead dominates) or a single giant partition (no parallelism).
- NEVER call `len(df)` on a Dask DataFrame as if it were free: it triggers an immediate computation over the full dataset and negates lazy evaluation. `df.shape` keeps the row count lazy, so request it explicitly with `df.shape[0].compute()`, and only when the size is truly needed.
- ALWAYS use `dask.distributed.Client` for multi-machine or CPU-bound workloads. The default threaded scheduler effectively serializes Python-heavy operations because of the GIL; the distributed scheduler avoids this by running workers in separate processes.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Multiple `compute()` calls in a pipeline | Breaks up the lazy graph; data is materialized at every call and shared steps are recomputed | Build the complete computation graph first; call `compute()` once at the end |
| `df.apply(lambda ...)` on large DataFrames | Row-by-row Python; GIL contention; can be slower than equivalent Pandas on a single core | Use vectorized Dask operations (`map_partitions`, `assign`, arithmetic operators) |
| Default `blocksize` on large CSV files | The automatically chosen block size is small (tens of megabytes in recent releases), so a 100 GB file yields well over a thousand partitions and scheduler overhead dominates | Set `blocksize="256MB"` or `blocksize="1GB"` for large files; profile to find the optimal size |
| `len(df)` on a Dask DataFrame | Triggers an immediate read and count of the full dataset; defeats lazy evaluation | Use `df.shape[0].compute()` explicitly; only compute when the size is truly needed |
| Threaded scheduler for CPU-bound work | Python GIL serializes CPU computation across threads; no true parallelism | Use `dask.distributed.LocalCluster()` or a process-based scheduler for CPU tasks |
Memory Protocol (MANDATORY)
Before starting:
`cat .claude/context/memory/learnings.md`
After completing: Record any new patterns or exceptions discovered.
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.
Repository
oimiragieo/agent-studio
First Seen
Feb 25, 2026