scanpy
Scanpy - Single-Cell Analysis
Scanpy processes high-dimensional biological data, reducing it via PCA/UMAP to identify rare cell populations in tissues or microbiomes.
When to Use
- Analyzing single-cell RNA sequencing (scRNA-seq) data.
- Identifying cell types and states in heterogeneous tissues.
- Reconstructing developmental trajectories.
- Comparing cell populations between conditions.
- Discovering rare cell types.
Core Principles
AnnData Format
Scanpy uses AnnData objects that store expression matrix, cell metadata, and gene annotations together.
Dimensionality Reduction
High-dimensional gene expression (20,000+ genes) is reduced to 2D/3D for visualization (PCA → UMAP/t-SNE).
Clustering
Cells are grouped by similarity in gene expression space to identify cell types.
Quick Reference
Standard Imports
import scanpy as sc
import pandas as pd
import numpy as np
Basic Patterns
# 1. Load dataset (AnnData object)
adata = sc.read_h5ad("cells.h5ad")
# Or: adata = sc.read_10x_mtx("path/to/mtx")
# 2. QC and Normalization
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# 3. Dimensionality Reduction & Visualization
sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['cell_type', 'gene_A'])
# 4. Clustering
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden')
Critical Rules
✅ DO
- Set scanpy settings - Use
sc.settings.verbosity = 3for progress info. - Filter low-quality cells - Remove cells with too few genes or high mitochondrial content.
- Normalize before analysis - Account for sequencing depth differences.
- Use highly variable genes - Focus analysis on informative genes.
❌ DON'T
- Don't skip QC - Low-quality cells can dominate clustering.
- Don't use raw counts for PCA - Always normalize and log-transform first.
- Don't ignore batch effects - Use batch correction (e.g.,
sc.pp.harmony_integrate) when combining datasets.
Advanced Patterns
Trajectory Inference
import cellrank as cr
# Reconstruct developmental trajectories
sc.tl.paga(adata)
sc.pl.paga(adata)
adata.uns['iroot'] = np.flatnonzero(adata.obs['cell_type'] == 'stem')[0]
sc.tl.dpt(adata)
Differential Expression
# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=20)
Scanpy has revolutionized single-cell biology, enabling researchers to map the cellular diversity of tissues and understand how cells differentiate and function.
More from tondevrel/scientific-agent-skills
xgboost-lightgbm
Industry-standard gradient boosting libraries for tabular data and structured datasets. XGBoost and LightGBM excel at classification and regression tasks on tables, CSVs, and databases. Use when working with tabular machine learning, gradient boosting trees, Kaggle competitions, feature importance analysis, hyperparameter tuning, or when you need state-of-the-art performance on structured data.
194opencv
Open Source Computer Vision Library (OpenCV) for real-time image processing, video analysis, object detection, face recognition, and camera calibration. Use when working with images, videos, cameras, edge detection, contours, feature detection, image transformations, object tracking, optical flow, or any computer vision task.
143ortools
Google Optimization Tools. An open-source software suite for optimization, specialized in vehicle routing, flows, integer and linear programming, and constraint programming. Features the world-class CP-SAT solver. Use for vehicle routing problems (VRP), scheduling, bin packing, knapsack problems, linear programming (LP), integer programming (MIP), network flows, constraint programming, combinatorial optimization, resource allocation, shift scheduling, job-shop scheduling, and discrete optimization problems.
75matplotlib
The foundational library for creating static, animated, and interactive visualizations in Python. Highly customizable and the industry standard for publication-quality figures. Use for 2D plotting, scientific data visualization, heatmaps, contours, vector fields, multi-panel figures, LaTeX-formatted plots, custom visualization tools, and plotting from NumPy arrays or Pandas DataFrames.
73plotly
A high-level interactive graphing library for Python. Ideal for web-based visualizations, 3D plots, and complex interactive dashboards. Built on plotly.js, it allows users to zoom, pan, and hover over data points in a browser-based environment. Use for interactive charts, web applications, Jupyter notebooks, 3D data visualization, geographic maps, financial charts, animations, time-series analysis, and building production-ready dashboards with Dash.
51scipy
Comprehensive guide for SciPy - the fundamental library for scientific and technical computing in Python. Use for integration, optimization, interpolation, linear algebra, signal processing, statistics, ODEs, Fourier transforms, and advanced scientific algorithms. Built on NumPy and essential for research and engineering.
51