data-science-notebooks
Interactive Notebooks
Use this skill to create reproducible, well-structured notebooks for data exploration, analysis, and communication.
When to use this skill
- Exploratory analysis — interactively investigate data
- Reproducible research — document methodology with code and results
- Teaching/demos — explain concepts with executable examples
- Stakeholder communication — share insights with narrative + visuals
- Prototyping — quickly iterate on data transformations or models
Tool selection
| Tool | Best For | Key Feature |
|---|---|---|
| JupyterLab | Traditional data science, extensions ecosystem | Full IDE experience |
| marimo | Reproducible notebooks, reactive execution | Python-native, version-control friendly |
| VS Code + Jupyter | IDE-native notebook experience | Intellisense, debugging, git integration |
| Google Colab | Cloud GPUs, sharing, collaboration | Free TPU/GPU, easy sharing |
Core principles
1) Structure for readability
# Title: Clear project/question description
## Setup
Imports and configuration
## Data Loading
Load and validate data
## Analysis
- Subsection per question/hypothesis
- Clear markdown explanations
- Visualizations with interpretations
## Conclusions
Key findings and next steps
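For instance, a Data Loading cell can pair the read with a few cheap validation checks so broken inputs fail fast. A minimal sketch, assuming a hypothetical data/sales.csv with date and amount columns:
# Load the data, then fail fast on obvious problems
import pandas as pd
df = pd.read_csv("data/sales.csv", parse_dates=["date"])  # hypothetical path and columns
assert not df.empty, "sales.csv loaded zero rows"
assert df["amount"].ge(0).all(), "negative amounts found"
df.head()  # quick visual check of the first rows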
2) Ensure reproducibility
# Set random seeds
import numpy as np
import random
np.random.seed(42)
random.seed(42)
# Pin versions in requirements.txt or environment.yml
# requirements.txt example:
# pandas==2.1.0
# scikit-learn==1.3.0
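In addition to pinning, it can help to record the versions actually installed in the running kernel near the top of the notebook. A minimal sketch using only the standard library (the package names are illustrative):
# Print the versions of key packages installed in the current environment
from importlib.metadata import version
for pkg in ("numpy", "pandas", "scikit-learn"):
    print(pkg, version(pkg))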
3) Keep cells focused
- One concept per cell
- Avoid cells with >50 lines
- Refactor helper functions into .py files (see the sketch below)
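As a sketch of that refactor, a hypothetical helpers.py next to the notebook might hold cleaning logic that the notebook then imports:
# helpers.py (versioned alongside the notebook)
import pandas as pd

def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names to lowercase snake_case."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
In the notebook, import it with from helpers import clean_columns; combined with the %autoreload magic shown below, edits to helpers.py are picked up without restarting the kernel.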
4) Never hardcode secrets
# ✅ Use environment variables
import os
api_key = os.environ.get("OPENAI_API_KEY")
# ❌ Never do this
api_key = "sk-abc123..."
Jupyter best practices
Magic commands (Jupyter/IPython)
# In a Jupyter cell (these are IPython magics, not standard Python)
# Auto-reload modules during development
%load_ext autoreload
%autoreload 2
# Timing
%timeit function_call()
# Debugging
%debug
# Environment info (requires the watermark package)
%watermark -v -m -p numpy,pandas,sklearn
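Magics only work inside IPython/Jupyter; if the same code must also run as a plain script, the standard library offers equivalents. A minimal sketch with timeit (function_call is a placeholder workload):
# Plain-Python timing that also works outside a notebook
import timeit

def function_call():
    return sum(range(10_000))  # placeholder workload

elapsed = timeit.timeit(function_call, number=100)
print(f"{elapsed / 100:.6f} s per call")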
Clean outputs before git
# Using nbstripout
pip install nbstripout
nbstripout --install
# Or run nbstripout as a pre-commit hook (add it to .pre-commit-config.yaml)
pip install pre-commit
pre-commit install
marimo advantages
Reactive execution
# marimo notebook - cells auto-recompute when dependencies change
import marimo as mo
slider = mo.ui.slider(1, 100, value=50)
slider # Display the slider
# This cell re-runs automatically whenever the slider value changes
df_filtered = df[df['value'] > slider.value]  # df is assumed to be defined in another cell
Version control friendly
- Pure Python (.py files); see the file-format sketch below
- No output blobs in git
- Readable diffs
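For orientation, a marimo notebook on disk is an ordinary Python file in which each cell is a function. A rough sketch of the format (the exact generated layout varies by marimo version):
# notebook.py as marimo stores it: plain Python, diff-friendly, no embedded outputs
import marimo

app = marimo.App()

@app.cell
def _():
    import marimo as mo
    return (mo,)

@app.cell
def _(mo):
    slider = mo.ui.slider(1, 100, value=50)
    slider  # last expression is the cell's displayed output
    return (slider,)

if __name__ == "__main__":
    app.run()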
Convert Jupyter to marimo
marimo convert notebook.ipynb -o notebook.py
Common anti-patterns
- ❌ Running cells out of order (Jupyter does not enforce execution order)
- ❌ Giant cells with mixed concerns
- ❌ Hardcoded file paths (see the path sketch below)
- ❌ No markdown explanations
- ❌ Committing large output files
- ❌ Inlining raw data in cells (keep it in a data/ folder)
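For the hardcoded-path item, one common fix is to resolve files relative to a configurable data directory. A minimal sketch (DATA_DIR and the filename are illustrative):
# Resolve data files relative to a configurable directory, not an absolute machine-specific path
import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))  # override via environment variable if needed
csv_path = DATA_DIR / "sales.csv"  # hypothetical file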
Progressive disclosure
- ../references/jupyter-advanced.md — Widgets, extensions, debugging
- ../references/marimo-guide.md — Reactive patterns, UI components
- ../references/notebook-testing.md — Unit tests for notebook code
- ../references/sharing-publishing.md — nbconvert, Quarto, Voilà
Related skills
- @data-science-eda — Exploration patterns for notebooks
- @data-science-interactive-apps — Convert notebooks to apps
- @data-engineering-core — Production-ready code patterns
References
- Jupyter Documentation
- marimo Documentation
- nbstripout
- Quarto (publishing)