---
name: jupyter-notebook-analysis
description: Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.
---

# Jupyter Notebook Analysis Patterns
## When to Use This Skill
- Creating multi-cell Jupyter notebooks for data analysis
- Adding correlation analyses with statistical testing
- Implementing outlier removal strategies
- Building series of related visualizations (10+ figures)
- Analyzing large datasets with multiple characteristics
- Building data update/enrichment notebooks with multi-source merging
- Generating figures for sharing with Claude or other AI tools
## Important: Image Size Constraints

Images shared with Claude must not exceed 8000 pixels in either dimension. Add this helper to your notebook imports:
```python
# Standard imports with Claude size checking
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

MAX_CLAUDE_DIM = 7999  # Claude API limit with safety margin

def save_figure(filename, dpi=300, **kwargs):
    """Save figure with automatic Claude size constraint check."""
    plt.savefig(filename, dpi=dpi, bbox_inches='tight', **kwargs)
    # Verify and auto-resize if needed
    img = Image.open(filename)
    if img.width > MAX_CLAUDE_DIM or img.height > MAX_CLAUDE_DIM:
        print(f"Auto-resizing {filename} for Claude compatibility")
        print(f"  Original: {img.width}x{img.height}")
        img.thumbnail((MAX_CLAUDE_DIM, MAX_CLAUDE_DIM), Image.Resampling.LANCZOS)
        img.save(filename)
        print(f"  Resized: {img.width}x{img.height}")
    else:
        print(f"OK {filename}: {img.width}x{img.height}")

# Safe figure sizes for Claude (300 DPI)
FIG_SIZES = {
    'small': (7, 5),    # 2100x1500 px
    'medium': (12, 9),  # 3600x2700 px
    'large': (20, 15),  # 6000x4500 px
    'max': (26, 26),    # 7800x7800 px - maximum safe
}

# Use in notebook
fig, ax = plt.subplots(figsize=FIG_SIZES['medium'])
# ... plotting code ...
save_figure('figure.png')
```
For complete image size guidance, see the data-visualization skill.
## Core Notebook Patterns

### Data Update/Enrichment Notebooks
Use structured notebook patterns for multi-source data merging and enrichment. Key principles:
- Configuration section at top with safety defaults (`ENABLE_AWS_FETCH = False`, `TEST_MODE = True`)
- Composite keys for complex merge uniqueness requirements
- Conflict resolution with configurable strategy (NEW vs OLD priority)
- Idempotent column addition -- check if columns exist before adding
- Enrichment tracking -- count what was actually saved, not just fetched
- Two-stage file workflow -- input file -> distinct output file (never in-place)
For detailed patterns including data update, enrichment, and AWS GenomeArk workflows, see notebook-patterns.md.
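A minimal sketch of these principles in pandas; the column names, file names, and the `merge_key` format are hypothetical, and the safety flags come from the list above:

```python
import pandas as pd

# Safety defaults at the top of the notebook
ENABLE_AWS_FETCH = False  # never hit remote sources by default
TEST_MODE = True          # process a small subset first

INPUT_FILE = 'inventory.csv'            # two-stage workflow:
OUTPUT_FILE = 'inventory_enriched.csv'  # never overwrite the input

# Toy inventory standing in for the loaded input file
df = pd.DataFrame({
    'species': ['A', 'A', 'B'],
    'assembly': ['v1', 'v2', 'v1'],
})

# Composite key when no single column is unique
df['merge_key'] = df['species'] + '|' + df['assembly']

# Idempotent column addition -- safe to re-run the cell
for col in ('genome_size', 'source'):
    if col not in df.columns:
        df[col] = None

# Track what was actually saved, not just fetched
enriched_count = df['genome_size'].notna().sum()
```

Because the column-addition loop checks `in df.columns`, re-executing the cell never clobbers values written by an earlier enrichment pass.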
### Notebook Editing

Always use the NotebookEdit tool for `.ipynb` file modifications -- never the Edit tool, which corrupts the JSON structure.

Three modes: `replace` (update cell content), `insert` (add new cell after target), `delete` (remove cell).
Key rules:

- Always specify `cell_type` when inserting
- Find cell IDs with `jq` or Python JSON parsing
- After programmatic edits, instruct user to "Restart & Run All"
- Update in dependency order when changing metrics across cells
For NotebookEdit usage, programmatic JSON manipulation, bulk operations, and cell newline handling, see notebook-editing.md.
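Finding cell IDs with Python JSON parsing can be sketched as follows; the inline `nb` dict stands in for `json.load()` of a real notebook, and its IDs and sources are made up:

```python
import json

# Minimal notebook JSON (stands in for: nb = json.load(open('analysis.ipynb')))
nb = {
    "cells": [
        {"id": "a1b2", "cell_type": "markdown", "source": ["## Setup"]},
        {"id": "c3d4", "cell_type": "code", "source": ["import pandas as pd"]},
    ]
}

# List index, ID, type, and first source line to pick NotebookEdit targets
ids = [cell["id"] for cell in nb["cells"]]
for i, cell in enumerate(nb["cells"]):
    first_line = cell["source"][0].strip() if cell["source"] else ""
    print(f"{i}: id={cell['id']} type={cell['cell_type']} | {first_line}")
```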
## Statistical Methods

### Required for All Correlation Analyses
- Pearson correlation with p-values using `scipy.stats.pearsonr`
- Report r, p-value, and n on every correlation plot
- Mann-Whitney U test for group comparisons
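The requirements above can be sketched with synthetic data; the variable names and the annotation format are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr, mannwhitneyu

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)  # correlated toy data

# Pearson correlation with p-value; report r, p, and n on the plot
r, p = pearsonr(x, y)
label = f"r = {r:.2f}, p = {p:.2g}, n = {len(x)}"
# e.g. ax.set_title(f"X vs Y ({label})")

# Non-parametric group comparison
group_a = rng.normal(0.0, 1, 40)
group_b = rng.normal(0.8, 1, 40)
u_stat, u_p = mannwhitneyu(group_a, group_b)
```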
### Outlier Handling
- Stage 1: Count-based outliers (IQR method) -- remove before analysis
- Stage 2: Value-based outliers (percentile) -- apply only to visualization, not statistics
- Apply characteristic-specific outlier removal separately per analysis
- Always report number of outliers removed
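A sketch of the Stage 1 count-based removal; the helper name and the 1.5 multiplier default are conventional choices, not prescribed by the text:

```python
import numpy as np

def remove_iqr_outliers(values, k=1.5):
    """Stage 1: drop count-based outliers outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    removed = int((~mask).sum())
    print(f"Removed {removed} outliers of {values.size}")  # always report the count
    return values[mask], removed

clean, n_removed = remove_iqr_outliers([1, 2, 3, 2, 2, 3, 100])
```

Calling this separately per characteristic keeps the removal characteristic-specific, as the list above requires.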
### Statistical Claim Verification (CRITICAL)
BEFORE finalizing any analysis notebook, verify ALL statistical claims against actual computed values. Text claims can become stale after data/code updates. Extract claims, rerun tests, create verification table.
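The extract-rerun-tabulate workflow might be sketched like this; the claims, values, and 0.05 tolerance are all hypothetical:

```python
# Hypothetical claims extracted from markdown cells, checked against recomputed values
claims = [
    {"claim": "mean genome size = 1.2 Gb", "claimed": 1.20, "computed": 1.19},
    {"claim": "r = 0.85 for size vs N50",  "claimed": 0.85, "computed": 0.62},
]

# Verification table: claimed vs computed, with a rounding tolerance
print(f"{'Claim':40} {'Claimed':>8} {'Computed':>9} {'OK':>4}")
for c in claims:
    c["ok"] = abs(c["claimed"] - c["computed"]) < 0.05
    print(f"{c['claim']:40} {c['claimed']:>8} {c['computed']:>9} "
          f"{'yes' if c['ok'] else 'NO':>4}")

# Any stale claim must be rewritten before finalizing
stale = [c["claim"] for c in claims if not c["ok"]]
```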
For detailed statistical methods, outlier removal code, claim verification workflow, and confounding analysis, see statistical-methods.md.
## Publication-Quality Figures

### Key Standards
- DPI: 300 for publication, 150 for digital viewing
- Font sizes: Title 18pt bold, axis labels 16pt bold, ticks 14pt, legend 12pt
- Colors: Use colorblind-safe palettes (IBM/Okabe-Ito). Blue `#0173B2` + Orange `#DE8F05` for two-group comparisons
- Data imbalance: Add prominent warnings when sample size ratio > 5x
### Image Display
- Use HTML `<img>` tags in markdown cells for responsive SVG/PNG scaling
- Crop SVGs by modifying `viewBox` attributes directly (no ImageMagick needed)
- Manage DPI to prevent "Output too large" errors (use 150 DPI default)
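The `viewBox` cropping trick can be sketched with a plain string substitution; the inline SVG and the `crop_viewbox` helper are illustrative:

```python
import re

# Toy SVG standing in for a matplotlib-exported figure
svg = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 600">'
       '<circle cx="40" cy="40" r="10"/></svg>')

def crop_viewbox(svg_text, x, y, w, h):
    """Crop an SVG by rewriting its viewBox attribute (no ImageMagick needed)."""
    return re.sub(r'viewBox="[^"]*"', f'viewBox="{x} {y} {w} {h}"',
                  svg_text, count=1)

cropped = crop_viewbox(svg, 0, 0, 100, 100)  # keep only the top-left region
```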
For detailed font size tables, color palette code, imbalance handling, SVG manipulation, and DPI management, see visualization-guide.md.
## Notebook Organization

### Large Notebooks (60+ cells)
- Use markdown section headers with cell pairing pattern
- Consistent naming for figures, variables, and functions
- Progressive enhancement from basic to complex analyses
### Dual-Notebook System
For analyses with 5+ figures preparing for publication:
- Code notebook: Executable analysis, figure generation, statistical tests
- Presentation notebook: Figure displays, captions, interpretations, methods
### Splitting and Deprecation
When splitting notebooks, recreate all calculated columns and variable definitions in each split. When deprecating, create dated directories with documentation.
For figure usage analysis, splitting strategies, dual-notebook workflow, publication notebook structure, TOC generation, deprecation workflow, and migration guides, see notebook-organization.md.
## Sharing and Export

### Key Rules
- Preserve outputs when preparing sharing packages (outputs ARE the documentation)
- Use relative paths (never absolute) for portability
- HTML export is best for sharing (self-contained, no software needed)
- Update paths programmatically when moving notebooks to subdirectories
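Programmatic path updates might look like this; the inline `nb` dict stands in for a loaded `.ipynb`, and the `data/` prefix is a made-up example of a notebook moved one directory down:

```python
import json

# Minimal notebook dict (stands in for json.load of the real .ipynb)
nb = {"cells": [
    {"cell_type": "code", "source": ["df = pd.read_csv('data/inventory.csv')\n"]},
    {"cell_type": "markdown", "source": ["See `figures/plot.png`\n"]},
]}

def rewrite_paths(nb, old_prefix, new_prefix):
    """Rewrite relative path prefixes in every cell's source lines."""
    for cell in nb["cells"]:
        cell["source"] = [line.replace(old_prefix, new_prefix)
                          for line in cell["source"]]
    return nb

# Notebook moved into a subdirectory: point relative paths one level up
nb = rewrite_paths(nb, "data/", "../data/")
```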
For path management, HTML/PDF/LaTeX export, sharing package structure, and output preservation guidelines, see sharing-and-export.md.
## Template and Helper Patterns

### Template Generation
For creating multiple similar analysis cells:
```python
# Template for generating repeated per-characteristic analysis cells
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} ({unit})...\\n')
    species_data = {{}}
    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
]

for char in characteristics:
    code = template.format(**char)
```
### Helper Function Pattern
Define once, reuse throughout:
```python
def safe_float_convert(value):
    """Convert string to float, handling comma separators."""
    if value is None or not str(value).strip():  # treat None/empty as missing, keep 0 valid
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None
```
## Troubleshooting
Key pitfalls to watch for:
- Variable shadowing: Never use `data` as a loop variable (shadows global)
- Column name mismatches: Always print `df.columns.tolist()` before processing
- Cell execution order: After NotebookEdit inserts, "Restart & Run All"
- Notebook size: Use `jq` for notebooks > 256 KB
For detailed troubleshooting, variable validation, debugging techniques, and environment setup, see troubleshooting.md.
## Best Practices Summary
- Always check data availability before creating analyses
- Document outlier removal clearly in titles and comments
- Use consistent naming for variables and figures
- Include statistical testing for all correlations
- Separate visualization from statistics when filtering outliers
- Create templates for repetitive analyses
- Use helper functions consistently across cells
- Organize with markdown headers for navigation
- Test with small datasets before running full analyses
- Save intermediate results for expensive computations
- Use the NotebookEdit tool for all `.ipynb` file modifications
## Supporting Files Reference
| File | Contents |
|---|---|
| notebook-patterns.md | Data update, enrichment, AWS GenomeArk patterns |
| notebook-editing.md | NotebookEdit tool, programmatic manipulation, metrics updates |
| visualization-guide.md | Publication figures, colors, image display, SVG, DPI |
| statistical-methods.md | Outlier handling, statistical rigor, claim verification |
| notebook-organization.md | Splitting, dual-notebook, deprecation, figure analysis |
| sharing-and-export.md | Paths, HTML/PDF export, sharing packages |
| troubleshooting.md | Common pitfalls, debugging, validation, environment |