# Jupyter Notebook Analysis Patterns
Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.
## When to Use This Skill
- Creating multi-cell Jupyter notebooks for data analysis
- Adding correlation analyses with statistical testing
- Implementing outlier removal strategies
- Building series of related visualizations (10+ figures)
- Analyzing large datasets with multiple characteristics
## Common Pitfalls

### Variable Shadowing in Loops

Problem: Using a common name like `data` as a loop variable overwrites the global variable of the same name:
# BAD - shadows the global 'data' variable
for i, (sp, data) in enumerate(species_by_gc_content[:10], 1):
    val = data['gc_content']
    print(f'{sp}: {val}')

After this loop, `data` is no longer your dataset list - it's the last species dict!
Solution: Use descriptive loop variable names:

# GOOD - uses a specific name
for i, (sp, sp_data) in enumerate(species_by_gc_content[:10], 1):
    val = sp_data['gc_content']
    print(f'{sp}: {val}')
Detection: If you see errors like "Type: <class 'dict'>" when expecting a list, check for variable shadowing in recent cells.
Prevention:
- Never use generic names (`data`, `item`, `value`) as loop variables
- Use prefixed names (`sp_data`, `row_data`, `inv_data`)
- Add validation cells that check variable types
- Run "Restart & Run All" regularly to catch issues early
Common shadowing patterns to avoid:
for data in dataset:               # Shadows 'data'
for i, data in enumerate(items):   # Shadows 'data'
for key, data in lookup.items():   # Shadows 'data'
### Verify Column Names Before Processing
Problem: Assuming column names without checking actual DataFrame structure leads to immediate failures. Column names may use different capitalization, spacing, or naming conventions than expected.
Example error:
# Assumed column name
df_filtered = df[df['scientific_name'] == target] # KeyError!
# Actual column name was 'Scientific Name' (capitalized with space)
Solution: Always check actual columns first:
import pandas as pd
df = pd.read_csv('data.csv')
# ALWAYS print columns before processing
print("Available columns:")
print(df.columns.tolist())
# Then write filtering code with correct names
df_filtered = df[df['Scientific Name'] == target_species] # Correct
Best practice for data processing scripts:
# At the start of your script
import sys

def verify_required_columns(df, required_cols):
    """Verify the DataFrame has the required columns."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        print(f"ERROR: Missing columns: {missing}")
        print(f"Available columns: {df.columns.tolist()}")
        sys.exit(1)

# Use it
required = ['Scientific Name', 'tolid', 'accession']
verify_required_columns(df, required)
Common column name variations to watch for:
- `scientific_name` vs `Scientific Name` vs `ScientificName`
- `species_id` vs `species` vs `Species ID`
- `genome_size` vs `Genome size` vs `GenomeSize`
Debugging tip: Include column listing in all data processing scripts:
# Add at script start for easy debugging
import sys

if '--debug' in sys.argv or len(df.columns) < 10:
    print(f"Columns ({len(df.columns)}): {df.columns.tolist()}")
## Outlier Handling Best Practices

### Two-Stage Outlier Removal
For analyses correlating characteristics across aggregated entities (e.g., species-level summaries):
1. Stage 1: Count-based outliers (IQR method)
   - Remove entities with abnormally high sample counts
   - Prevents over-represented entities from skewing correlations
   - Apply BEFORE other analyses

import numpy as np

# 'id' would shadow the builtin, so use a distinct loop name
workflow_counts = [entity_data[eid]['workflow_count'] for eid in entity_data]
q1 = np.percentile(workflow_counts, 25)
q3 = np.percentile(workflow_counts, 75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
outliers = [eid for eid in entity_data
            if entity_data[eid]['workflow_count'] > upper_bound]
for eid in outliers:
    del entity_data[eid]
2. Stage 2: Value-based outliers (percentile)
   - Remove extreme values for visualization clarity
   - Apply ONLY to visualization data, not statistics
   - Typically the top 5% for highly skewed distributions

values = [entity_data[eid]['metric'] for eid in entity_data]
threshold = np.percentile(values, 95)
viz_entities = [eid for eid in entity_data
                if entity_data[eid]['metric'] <= threshold]
# Use viz_entities for plotting
# Use the full entity_data for statistics
### Characteristic-Specific Outlier Removal
When analyzing genome characteristics vs metrics, remove outliers for the characteristic being analyzed:
# After removing workflow count outliers, also remove heterozygosity outliers
heterozygosity_values = [species_data[sp]['heterozygosity'] for sp in species_data]
het_q1 = np.percentile(heterozygosity_values, 25)
het_q3 = np.percentile(heterozygosity_values, 75)
het_iqr = het_q3 - het_q1
het_upper_bound = het_q3 + 1.5 * het_iqr
het_outliers = [sp for sp in species_data
                if species_data[sp]['heterozygosity'] > het_upper_bound]
for sp in het_outliers:
    del species_data[sp]

# Recompute the range from the remaining values
remaining = [species_data[sp]['heterozygosity'] for sp in species_data]
print(f'Removed {len(het_outliers)} heterozygosity outliers (>{het_upper_bound:.2f}%)')
print(f'New heterozygosity range: {min(remaining):.2f}% - {max(remaining):.2f}%')
Apply separately for each characteristic (a reusable helper is sketched below):
- Genome size outliers for genome size analysis
- Heterozygosity outliers for heterozygosity analysis
- Repeat content outliers for repeat content analysis
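Since the same IQR logic repeats for every characteristic, it can be factored into a helper. A minimal sketch, assuming the dict-of-dicts layout used above; the function name and `factor` default are illustrative:

```python
import numpy as np

def remove_iqr_outliers(entity_data, key, factor=1.5):
    """Drop entries whose `key` value exceeds Q3 + factor*IQR; return the removed ids."""
    values = [d[key] for d in entity_data.values()]
    q1, q3 = np.percentile(values, [25, 75])
    upper = q3 + factor * (q3 - q1)
    removed = [eid for eid, d in entity_data.items() if d[key] > upper]
    for eid in removed:
        del entity_data[eid]
    print(f'Removed {len(removed)} {key} outliers (>{upper:.2f})')
    return removed

# Apply once per characteristic, e.g.:
# remove_iqr_outliers(species_data, 'genome_size')
# remove_iqr_outliers(species_data, 'heterozygosity')
```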
### When to Skip Outlier Removal
- Memory usage plots when investigating over-allocation patterns
- Comparison plots (allocated vs used) where outliers reveal problems
- User explicitly requests to see all data
- Data is already limited (< 10 points)
Document clearly in plot titles and code comments which outlier removal is applied.
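For example, one way to make the filtering visible on the figure itself; the title text, `n_outliers`, and the method string are placeholders:

```python
# State the applied filtering in the title so readers can't miss it
ax.set_title(f'Memory vs Genome Size\n'
             f'({n_outliers} outliers removed, 1.5×IQR on workflow counts)',
             fontsize=14)
```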
### IQR-Based Outlier Removal for Visualization
Standard Method: 1.5×IQR (Interquartile Range)
Implementation:
# Calculate IQR
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries (standard: 1.5×IQR)
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
# Filter outliers
outlier_mask = (data >= lower_bound) & (data <= upper_bound)
data_filtered = data[outlier_mask]
n_outliers = (~outlier_mask).sum()
# IMPORTANT: Report outliers removed
print(f"Removed {n_outliers} outliers for visualization")
# Add to figure: f"({n_outliers} outliers removed)"
Multi-dimensional Outlier Removal:
# For scatter plots filtered on two dimensions (e.g., size ratio AND absolute size),
# compute Q1/Q3/IQR separately for each dimension as above, then combine the masks
outlier_mask = (
    (ratio >= Q1_ratio - 1.5*IQR_ratio) &
    (ratio <= Q3_ratio + 1.5*IQR_ratio) &
    (size >= Q1_size - 1.5*IQR_size) &
    (size <= Q3_size + 1.5*IQR_size)
)
Best Practice: Always report number of outliers removed in figure statistics or caption.
When to Use: For visualization clarity when extreme values compress the main distribution. Not for removing "bad" data - use for display only.
## Statistical Rigor

### Required for Correlation Analyses
1. Pearson correlation with p-values:

from scipy import stats

correlation, p_value = stats.pearsonr(x_values, y_values)
sig_text = 'significant' if p_value < 0.05 else 'not significant'
2. Report all three:
   - Correlation coefficient (r) - strength and direction
   - P-value - statistical significance (α = 0.05)
   - Sample size (n)
3. Display on plots:

ax.text(0.98, 0.02,
        f'r = {correlation:.3f}\np = {p_value:.2e}\n({sig_text})\nn = {len(data)} species',
        transform=ax.transAxes, ...)
### Adding Mann-Whitney U Tests to Figures
When to Use: Comparing continuous metrics between two groups (e.g., Dual vs Pri/alt curation)
Standard Implementation:
from scipy import stats
import numpy as np

# Calculate the test
data_group1 = df[df['group'] == 'Group1']['metric']
data_group2 = df[df['group'] == 'Group2']['metric']
if len(data_group1) > 0 and len(data_group2) > 0:
    stat, pval = stats.mannwhitneyu(data_group1, data_group2, alternative='two-sided')
else:
    pval = np.nan

# Add to the stats text
if not np.isnan(pval):
    stats_text += f"\nMann-Whitney p: {pval:.2e}"
Display in Figures: Include the p-value in the statistics box with the format `Mann-Whitney p: 1.23e-04`
Consistency: Ensure all quantitative comparison figures include this test for statistical rigor.
## Large-Scale Analysis Structure

### Control Analyses: Checking for Confounding
When comparing methods (e.g., Method A vs Method B), always check if observed differences could be explained by characteristics of the samples rather than the methods themselves.
Critical control analysis:
import pandas as pd
from scipy import stats

def check_confounding(df, method_col, characteristics):
    """
    Compare sample characteristics between methods to check for confounding.

    Args:
        df: DataFrame with samples
        method_col: Column indicating method ('Method_A', 'Method_B')
        characteristics: List of column names to compare

    Returns:
        DataFrame with statistical comparison
    """
    results = []
    for char in characteristics:
        # Get data for each method
        method_a = df[df[method_col] == 'Method_A'][char].dropna()
        method_b = df[df[method_col] == 'Method_B'][char].dropna()
        if len(method_a) < 5 or len(method_b) < 5:
            continue
        # Statistical test
        stat, pval = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
        # Calculate effect size (% difference in medians)
        pooled_median = pd.concat([method_a, method_b]).median()
        effect_pct = (method_a.median() - method_b.median()) / pooled_median * 100
        results.append({
            'Characteristic': char,
            'Method_A_median': method_a.median(),
            'Method_A_n': len(method_a),
            'Method_B_median': method_b.median(),
            'Method_B_n': len(method_b),
            'p_value': pval,
            'effect_pct': effect_pct,
            'significant': pval < 0.05
        })
    return pd.DataFrame(results)

# Example usage
characteristics = ['genome_size', 'gc_content', 'heterozygosity',
                   'repeat_content', 'sequencing_coverage']
confounding_check = check_confounding(df, 'curation_method', characteristics)
print(confounding_check)
Interpretation guide:
- No significant differences: Methods compared equivalent samples → valid comparison
- Method A has "easier" samples (smaller genomes, lower complexity): Quality differences may be due to sample properties, not method
- Method A has "harder" samples (larger genomes, higher complexity): Strengthens conclusion that Method A is better despite challenges
- Limited data (n<10): Cannot rule out confounding, note as limitation
Present in notebook:
## Genome Characteristics Comparison
**Control Analysis**: Are quality differences due to method or sample properties?
[Table comparing characteristics]
**Conclusion**:
- If no differences → Valid method comparison
- If Method A works with harder samples → Strengthens conclusions
- If Method A works with easier samples → Potential confounding
Why critical: Reviewers will ask this question. Preemptive control analysis demonstrates scientific rigor and prevents major revisions.
### Organizing 60+ Cell Notebooks
1. Section headers (markdown cells):
   - Main sections: "## CPU Runtime Analysis", "## Memory Analysis"
   - Subsections: "### Genome Size vs CPU Runtime"

2. Cell pairing pattern:
   - Markdown header + code cell for each analysis
   - Keeps related content together
   - Easier to navigate and debug

3. Consistent naming:
   - Figure files: `fig18_genome_size_vs_cpu_hours.png`
   - Variables: `species_data`, `genome_sizes_full`, `genome_sizes_viz`
   - Functions: `safe_float_convert()` defined consistently

4. Progressive enhancement (a PCA sketch follows this list):
   - Start with basic analyses
   - Add enriched data (Cell 7 pattern)
   - Build increasingly complex correlations
   - End with multivariate analyses (PCA)
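For the final multivariate step, a minimal PCA sketch, assuming scikit-learn is available and a numeric feature matrix `X` (rows = species, columns = characteristics) has already been assembled:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: rows = species, columns = characteristics (genome size, GC content, ...)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
```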
### Template Generation Pattern
For creating multiple similar analysis cells:
# Create a template with placeholder variables
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} vs {metric}...\\n')
    # Aggregate data per species
    species_data = {{}}
    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

# Generate multiple cells from a characteristics list
characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
    # ...
]
for char in characteristics:
    # Every placeholder must be supplied; 'metric' is not in the dicts above
    code = template.format(metric='cpu_hours', **char)
    # Write to notebook or temp file
### Helper Function Pattern
Define once, reuse throughout:
def safe_float_convert(value):
    """Convert a string to float, handling comma separators."""
    if not value or not str(value).strip():
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None
Include in Cell 7 (enrichment) and reference: "# Helper function (same as Cell 7)"
## Publication-Quality Figures

Standard settings (pulled together in the sketch below):
- DPI: 300
- Figure size: (12, 8) for single plots, (16, 7) for side-by-side
- Grid: `alpha=0.3, linestyle='--'`
- Point size: proportional to sample count (`s=[50 + count*20 for count in counts]`)
- Colormap: 'viridis' for workflow counts
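A minimal sketch applying these settings together; `x`, `y`, and `counts` are placeholders for your data:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
sizes = [50 + count * 20 for count in counts]          # point size ∝ sample count
sc = ax.scatter(x, y, s=sizes, c=counts, cmap='viridis', alpha=0.7)
ax.grid(alpha=0.3, linestyle='--')
fig.colorbar(sc, ax=ax, label='Workflow count')
fig.savefig('fig18_genome_size_vs_cpu_hours.png', dpi=300, bbox_inches='tight')
```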
### Publication-Ready Font Sizes
Problem: Default matplotlib fonts are designed for screen viewing, not print publication.
Solution: Use larger, bold fonts for print readability.
Recommended sizes (for standard 10-12 cm wide figures):
| Element | Default | Publication | Code |
|---|---|---|---|
| Title | 11-12pt | 18pt (bold) | `fontsize=18, fontweight='bold'` |
| Axis labels | 10-11pt | 16pt (bold) | `fontsize=16, fontweight='bold'` |
| Tick labels | 9-10pt | 14pt | `tick_params(labelsize=14)` |
| Legend | 8-10pt | 12pt | `legend(fontsize=12)` |
| Annotations | 8-10pt | 11-13pt | `fontsize=12` |
| Data points | 20-36 | 60-100 | `s=80` (scatter) |
Implementation example:
fig, ax = plt.subplots(figsize=(10, 8))
# Plot data
ax.scatter(x, y, s=80, alpha=0.6) # Larger points
# Titles and labels - BOLD
ax.set_title('Your Title Here', fontsize=18, fontweight='bold')
ax.set_xlabel('X Axis Label', fontsize=16, fontweight='bold')
ax.set_ylabel('Y Axis Label', fontsize=16, fontweight='bold')
# Tick labels
ax.tick_params(axis='both', which='major', labelsize=14)
# Legend
ax.legend(fontsize=12, loc='best')
# Stats box
stats_text = "Statistics:\nMean: 42.5"
ax.text(0.02, 0.98, stats_text, transform=ax.transAxes,
fontsize=13, family='monospace',
bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))
# Reference lines - thicker
ax.axhline(y=1.0, linewidth=2.5, linestyle='--', alpha=0.6)
Quick check: If you have to squint to read the figure on screen at 100% zoom, fonts are too small for print.
Special cases (a notebook-wide rcParams sketch follows):
- Multi-panel figures: increase sizes a further 10-15%
- Posters: increase 50-100%
- Presentations: increase 30-50%
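One way to apply such sizes notebook-wide rather than per-axes is a global rcParams update; a sketch using the publication values from the table above (scale further for posters or slides):

```python
import matplotlib.pyplot as plt

# Set sizes once per notebook instead of per-axes; bump the values
# further (e.g., +50%) for posters or slides
plt.rcParams.update({
    'font.size': 14,
    'axes.titlesize': 18,
    'axes.labelsize': 16,
    'xtick.labelsize': 14,
    'ytick.labelsize': 14,
    'legend.fontsize': 12,
})
```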
### Accessibility: Colorblind-Safe Palettes
Problem: Standard color schemes (green vs blue, red vs green) are difficult or impossible to distinguish for people with color vision deficiencies, affecting ~8% of males and ~0.5% of females.
Solution: Use colorblind-safe palettes from validated sources.
Recommended colorblind-safe palette (the hex values below match seaborn's 'colorblind' palette):
# For comparing two groups/conditions
colors = {
'Group_A': '#0173B2', # Blue
'Group_B': '#DE8F05' # Orange
}
Why this works:
- ✅ Maximum contrast for all color vision types (deuteranopia, protanopia, tritanopia, achromatopsia)
- ✅ Professional appearance for scientific publications
- ✅ Clear distinction even in grayscale printing
- ✅ Cultural neutrality (no red/green traffic light associations)
Other colorblind-safe combinations:
- Blue + Orange (best overall)
- Blue + Red (good for most types)
- Blue + Yellow (good but lower contrast)
Avoid:
- ❌ Green + Red (most common color blindness)
- ❌ Green + Blue (confusing for many)
- ❌ Blue + Purple (too similar)
Implementation in matplotlib:
import matplotlib.pyplot as plt
# Define colorblind-safe palette
CB_COLORS = {
'blue': '#0173B2',
'orange': '#DE8F05',
'green': '#029E73',
'red': '#D55E00',
'purple': '#CC78BC',
'brown': '#CA9161'
}
# Use in plots
plt.scatter(x, y, color=CB_COLORS['blue'], label='Treatment')
plt.scatter(x2, y2, color=CB_COLORS['orange'], label='Control')
Testing your colors (a grayscale check is sketched below):
- Use an online simulator: https://www.color-blindness.com/coblis-color-blindness-simulator/
- Check in grayscale: convert the figure to grayscale to confirm the groups remain distinguishable
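A quick programmatic version of the grayscale check, assuming Pillow is installed and the figure was saved as `figure.png` (both are assumptions, not part of the workflow above):

```python
from PIL import Image

# If groups become indistinguishable in grayscale, the palette
# relies too heavily on hue alone
img = Image.open('figure.png').convert('L')
img.save('figure_grayscale_check.png')
```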
### Handling Severe Data Imbalance in Comparisons
Problem: Comparing groups with very different sample sizes (e.g., 84 vs 10) can lead to misleading conclusions.
Solution: Add prominent warnings both visually and in documentation.
Visual warning on figure:
import matplotlib.pyplot as plt
# After creating your plot
n_group_a = len(df[df['group'] == 'A'])
n_group_b = len(df[df['group'] == 'B'])
total_a = 200
total_b = 350
warning_text = f"⚠️ DATA LIMITATION\n"
warning_text += f"Data availability:\n"
warning_text += f" Group A: {n_group_a}/{total_a} ({n_group_a/total_a*100:.1f}%)\n"
warning_text += f" Group B: {n_group_b}/{total_b} ({n_group_b/total_b*100:.1f}%)\n"
warning_text += f"Severe imbalance limits\nstatistical comparability"
ax.text(0.98, 0.02, warning_text, transform=ax.transAxes,
fontsize=11, verticalalignment='bottom', horizontalalignment='right',
bbox=dict(boxstyle='round', facecolor='red', alpha=0.2,
edgecolor='red', linewidth=2),
family='monospace', color='darkred', fontweight='bold')
# Update title to indicate limitation
ax.set_title('Your Title\n(SUPPLEMENTARY - Limited Data Availability)',
fontsize=14, fontweight='bold')
Text warning in notebook/paper:
**⚠️ CRITICAL DATA LIMITATION**: This figure suffers from severe data availability bias:
- Group A: 84/200 (42%)
- Group B: 10/350 (3%)
This **8-fold imbalance** severely limits statistical comparability. The 10 Group B
samples are unlikely to be representative of all 350.
**Interpretation**: Comparisons should be interpreted with extreme caution. This
figure is provided for completeness but should be considered **supplementary**.
Guidelines for sample size imbalance:
- < 2× imbalance: Generally acceptable, note in caption
- 2-5× imbalance: Add note about limitations
- > 5× imbalance: Add prominent warnings (visual + text)
- > 10× imbalance: Consider excluding figure or supplementary-only
Alternative: If possible, subset the larger group to match sample size:
# Random subset to balance groups
if n_group_a > n_group_b * 2:
    group_a_subset = df[df['group'] == 'A'].sample(n=n_group_b * 2, random_state=42)
    # Use the subset for a balanced comparison
## Creating Analysis Notebooks for Scientific Publications
When creating Jupyter notebooks to accompany manuscript figures:
### Structure Pattern
1. Title and metadata - date, dataset info, sample sizes
2. Overview - context from the paper abstract/intro
3. Figure-by-figure analysis:
   - Code cell to display the image
   - Detailed figure legend (publication-ready)
   - Comprehensive analysis paragraph explaining:
     - What the metric measures
     - Statistical results
     - Mechanistic explanation
     - Biological/technical implications
4. Methods section - complete reproducibility information
5. Conclusions - summary of findings
### Table of Contents
For analysis notebooks >10 cells, add a navigable table of contents at the top:
Benefits:
- Quick navigation to specific analyses
- Clear overview of notebook structure
- Professional presentation
- Easier for collaborators
Implementation (Markdown cell):
# Analysis Name
## Table of Contents
1. [Data Loading](#data-loading)
2. [Data Quality Metrics](#data-quality-metrics)
3. [Figure 1: Completeness](#figure-1-completeness)
4. [Figure 2: Contiguity](#figure-2-contiguity)
5. [Figure 3: Scaffold Validation](#figure-3-scaffold-validation)
...
10. [Methods](#methods)
11. [References](#references)
---
Section Headers (Markdown cells):
## Data Loading
[Your code/analysis]
---
## Data Quality Metrics
[Your code/analysis]
Auto-generation: For large notebooks, consider generating TOC programmatically:
from IPython.display import Markdown, display

sections = ['Introduction', 'Data Loading', 'Analysis', ...]
toc = "## Table of Contents\n\n"
for i, section in enumerate(sections, 1):
    anchor = section.lower().replace(' ', '-')
    toc += f"{i}. [{section}](#{anchor})\n"
display(Markdown(toc))
### Methods Documentation
Always include a Methods section documenting:
- Data sources with accession numbers
- Key algorithms and formulas
- Statistical approaches
- Software versions
- Special adjustments (e.g., sex chromosome correction)
- Literature citations
Example:
## Methods
### Karyotype Data
Karyotype data (diploid 2n and haploid n chromosome numbers) was manually curated from peer-reviewed literature for 97 species representing 17.8% of the VGP Phase 1 dataset (n = 545 assemblies).
#### Sex Chromosome Adjustment
When both sex chromosomes are present in the main haplotype, the expected number of chromosome-level scaffolds is:
**expected_scaffolds = n + 1**
For example:
- Asian elephant: 2n=56, n=28, has X+Y → expected 29 scaffolds
- White-throated sparrow: 2n=82, n=41, has Z+W → expected 42 scaffolds
This adjustment accounts for the biological reality that X and Y (or Z and W) are distinct chromosomes.
### Writing Style Matching
To match manuscript style:
- Read draft paper PDF to extract tone and terminology
- Use same technical vocabulary
- Match paragraph structure (observation → mechanism → implication)
- Include specific details (tool names, file formats, software versions)
- Use first-person plural ("we") if paper does
- Maintain consistent bullet point/list formatting
### Example Code Pattern
# Display figure
from IPython.display import Image, display
from pathlib import Path
FIG_DIR = Path('figures/analysis_name')
display(Image(filename=str(FIG_DIR / 'figure_01.png')))
### Figure Legend Format
Figure N. [Short title]. [Complete description of panels and what's shown]. [Statistical tests used]. [Sample sizes]. [Scale information]. [Color coding].
### Analysis Paragraph Structure
- What it measures - Define the metric/comparison
- Statistical result - Quantitative findings with p-values
- Mechanistic explanation - Why this result occurs
- Implications - What this means for conclusions
### Methods Section Must Include
- Dataset source and filtering criteria
- Metric definitions
- Outlier handling approach
- Statistical methods with justification
- Software versions and tools
- Reproducibility information
- Known limitations
This approach creates notebooks that serve both as analysis documentation and as supplementary material for publications.
## Environment Setup
For CLI-based workflows (Claude Code, SSH sessions):
# Run in background with token authentication
/path/to/conda/envs/ENV_NAME/bin/jupyter lab --no-browser --port=8888
Parameters:
- `--no-browser`: don't auto-open a browser (for remote sessions)
- `--port=8888`: specify the port (default; change if occupied)
- Run in background: use `run_in_background=true` in the Bash tool
Access URL format:
http://localhost:8888/lab?token=TOKEN_STRING
To stop later:
- Find shell ID from BashOutput tool
- Use KillShell with that ID
Installation if missing:
/path/to/conda/envs/ENV_NAME/bin/pip install jupyterlab
## Notebook Size Management

For notebooks > 256 KB:
- Read specific cells with `jq`: `cat notebook.ipynb | jq '.cells[10:20]'`
- Count cells: `cat notebook.ipynb | jq '.cells | length'`
- Check section headers: `cat notebook.ipynb | jq '.cells[75:81] | .[].source[:2]'`
## Data Enrichment Pattern
When linking external metadata with analysis data:
# Cell 6: Load genome metadata
import csv

genome_data = []
with open('genome_metadata.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    genome_data = list(reader)

genome_lookup = {}
for row in genome_data:
    species_id = row['species_id']
    if species_id not in genome_lookup:
        genome_lookup[species_id] = []
    genome_lookup[species_id].append(row)

# Cell 7: Enrich workflow data with genome characteristics
for inv in data:
    species_id = inv.get('species_id')
    if species_id and species_id in genome_lookup:
        genome_info = genome_lookup[species_id][0]
        # Add genome characteristics
        inv['genome_size'] = genome_info.get('Genome size', '')
        inv['heterozygosity'] = genome_info.get('Heterozygosity', '')
        # ... other characteristics
    else:
        # Set to None for missing data
        inv['genome_size'] = None
        inv['heterozygosity'] = None

# Create the filtered dataset
data_with_species = [inv for inv in data if inv.get('species_id') and inv.get('genome_size')]
## Data Backup Strategy

### The Problem
Long-running data enrichment projects risk:
- Losing days of work to accidental overwrites
- Being unable to revert to previous data states
- Having no record of what changed and when
- Running out of disk space from ad hoc manual backups
### Solution: Automated Two-Tier Backup System
Architecture:
- Daily backups - Rolling 7-day window (auto-cleanup)
- Milestone backups - Permanent, compressed (gzip ~80% reduction)
- CHANGELOG - Automatic documentation of all changes
Implementation:
# Daily backup (start of each work session)
./backup_table.sh
# Milestone backup (after major changes)
./backup_table.sh milestone "added genomescope data for 21 species"
# List all backups
./backup_table.sh list
# Restore from backup (with safety backup)
./backup_table.sh restore 2026-01-23
Directory structure:
backups/
├── daily/ # Rolling 7-day backups (~770KB each)
│ ├── backup_2026-01-17.csv
│ └── backup_2026-01-23.csv
├── milestones/ # Permanent compressed backups (~200KB each)
│ ├── milestone_2026-01-20_initial_enrichment.csv.gz
│ └── milestone_2026-01-23_recovered_accessions.csv.gz
├── CHANGELOG.md # Auto-generated change log
└── README.md # User documentation
Storage efficiency:
- Daily backups: ~5.4 MB (7 days × 770KB)
- Milestone backups: ~200KB each compressed (80% size reduction)
- Total: <10 MB for complete project history
- Old daily backups auto-delete after 7 days
When to create milestones:
- After adding new data sources (GenomeScope, karyotypes, etc.)
- Before major data transformations
- When completing analysis sections
- Before submitting/publishing
Global installer available:
# Install backup system in any repository
install-backup-system -f your_data_file.csv
Key features:
- Never overwrites without confirmation
- Creates safety backup before restore
- Complete audit trail in CHANGELOG
- Color-coded terminal output
- Handles both CSV and TSV files
Benefits for data analysis:
- Data provenance - CHANGELOG documents every modification
- Confidence to experiment - Easy rollback encourages trying approaches
- Professional workflow - Matches publication standards
- Collaboration-ready - Team members can understand data history
## Debugging Data Availability
Before creating correlation plots, verify data overlap:
# Check how many entities have both metrics
species_with_metric_a = set(inv.get('species_id') for inv in data
                            if inv.get('metric_a'))
species_with_metric_b = set(inv.get('species_id') for inv in data
                            if inv.get('metric_b'))
overlap = species_with_metric_a.intersection(species_with_metric_b)
print(f"Species with both metrics: {len(overlap)}")
if len(overlap) < 10:
    print("⚠️ Warning: Limited data for correlation analysis")
    print(f"  Metric A: {len(species_with_metric_a)} species")
    print(f"  Metric B: {len(species_with_metric_b)} species")
    print(f"  Overlap: {len(overlap)} species")
## Variable State Validation
When debugging notebook errors, add validation cells to check variable integrity:
# Validation cell - place before error-prone sections
print('=== VARIABLE VALIDATION ===')
print(f'Type of data: {type(data)}')
print(f'Is data a list? {isinstance(data, list)}')
if isinstance(data, list):
    print(f'Length: {len(data)}')
    if len(data) > 0:
        print(f'First item type: {type(data[0])}')
        print(f'First item keys: {list(data[0].keys())[:10]}')
elif isinstance(data, dict):
    print('⚠️ WARNING: data is a dict, not a list!')
    print(f'Dict keys: {list(data.keys())[:10]}')
    print('This suggests variable shadowing occurred.')
When to use:
- After "Restart & Run All" produces errors
- When error messages suggest wrong variable type
- Before cells that fail intermittently
- In notebooks with 50+ cells
Best practice: Include automatic validation in cells that depend on critical global variables.
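A lighter-weight variant is an assertion that fails fast instead of printing a report; a minimal sketch:

```python
# Fail immediately if a critical global was shadowed upstream
assert isinstance(data, list), (
    f"'data' should be a list but is {type(data).__name__}; "
    "check recent loops for variable shadowing"
)
```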
## Programmatic Notebook Manipulation
When inserting cells into large notebooks:
import json

# Read the notebook
with open('notebook.ipynb', 'r') as f:
    notebook = json.load(f)

# Create a new cell
new_cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [],
    "source": [line + '\n' for line in code.split('\n')]
}

# Insert at a position
insert_position = 50
notebook['cells'] = (notebook['cells'][:insert_position] +
                     [new_cell] +
                     notebook['cells'][insert_position:])

# Write back
with open('notebook.ipynb', 'w') as f:
    json.dump(notebook, f, indent=1)
## Synchronizing Figure Code and Notebook Documentation
Pattern: Code changes to figure generation → Must update notebook text
Common Scenario: Updated figure filtering/outlier removal/statistical tests
Workflow:
- Update figure generation Python script
- Regenerate figures
- CRITICAL: Update Jupyter notebook markdown cells documenting the figure
- Use
NotebookEdittool (NOTEdittool) for.ipynbfiles
Example:
# After adding Mann-Whitney test to figure generation:
NotebookEdit(
notebook_path="/path/to/notebook.ipynb",
cell_id="cell-14", # Found via grep or Read
cell_type="markdown",
new_source="Updated description mentioning Mann-Whitney test..."
)
Finding Figure Cells:
# Locate figure references
grep -n "figure_name.png" notebook.ipynb
# Or use Glob + Grep
grep -n "Figure 4" notebook.ipynb
Why Critical: Outdated documentation causes confusion. Notebook text saying "Limited data" when data is now complete, or not mentioning new statistical tests, misleads readers.
## Best Practices Summary
- Always check data availability before creating analyses
- Document outlier removal clearly in titles and comments
- Use consistent naming for variables and figures
- Include statistical testing for all correlations
- Separate visualization from statistics when filtering outliers
- Create templates for repetitive analyses
- Use helper functions consistently across cells
- Organize with markdown headers for navigation
- Test with small datasets before running full analyses
- Save intermediate results for expensive computations (a caching sketch follows)
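For the last point, a minimal caching sketch using pickle; the cache path and `expensive_enrichment()` are placeholders, not part of any pipeline above:

```python
import pickle
from pathlib import Path

CACHE = Path('cache/enriched_species_data.pkl')  # hypothetical path

if CACHE.exists():
    species_data = pickle.loads(CACHE.read_bytes())
else:
    species_data = expensive_enrichment()        # placeholder for the slow step
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_bytes(pickle.dumps(species_data))
```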
## Common Tasks

### Removing Panels from Multi-Panel Figures
Scenario: Convert 2-panel figure to 1-panel after removing unavailable data.
Steps:

1. Update the subplot layout:

# Before: 2 panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# After: 1 panel
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

2. Remove panel code: delete all code for the removed panel (ax2)

3. Update the figure filename:

# Before
plt.savefig('06_scaffold_l50_l90_comparison.png')
# After
plt.savefig('06_scaffold_l50_comparison.png')

4. Update notebook references:
   - Image display: `display(Image(...'06_scaffold_l50_comparison.png'))`
   - Title: remove references to the removed data
   - Description: add a note explaining why the panel is excluded

5. Clean up old files:

rm figures/*_l50_l90_*.png