bulk-rna-seq-batch-correction-with-combat
Bulk RNA-seq batch correction with ComBat
Overview
Apply this skill when a user has multiple bulk expression matrices measured across different batches and needs to harmonise them
before downstream analysis. It follows t_bulk_combat.ipynb, w
hich demonstrates the pyComBat workflow on ovarian cancer microarray cohorts.
Instructions
- Import core libraries
- Load
omicverse as ov,anndata,pandas as pd, andmatplotlib.pyplot as plt. - Call
ov.ov_plot_set()(aliasedov.plot_set()in some releases) to align figures with omicverse styling.
- Load
- Load each batch separately
- Read the prepared pickled matrices (or user-provided expression tables) with
pd.read_pickle(...)/pd.read_csv(...). - Transpose to gene × sample before wrapping them in
anndata.AnnDataobjects soadata.obsstores sample metadata. - Assign a
batchcolumn for every cohort (adata.obs['batch'] = '1','2', ...). Encourage descriptive labels when availa ble.
- Read the prepared pickled matrices (or user-provided expression tables) with
- Concatenate on shared genes
- Use
anndata.concat([adata1, adata2, adata3], merge='same')to retain the intersection of genes across batches. - Confirm the combined
adatareports balanced sample counts per batch; if not, prompt users to re-check inputs.
- Use
- Run ComBat batch correction
- Execute
ov.bulk.batch_correction(adata, batch_key='batch'). - Explain that corrected values are stored in
adata.layers['batch_correction']while the original counts remain inadata.X.
- Execute
- Export corrected and raw matrices
- Obtain DataFrames via
adata.to_df().T(raw) andadata.to_df(layer='batch_correction').T(corrected). - Encourage saving both tables (
.to_csv(...)) plus the harmonised AnnData (adata.write_h5ad('adata_batch.h5ad', compressio n='gzip')).
- Obtain DataFrames via
- Benchmark the correction
- For per-sample variance checks, draw before/after boxplots and recolour boxes using
ov.pl.red_color,blue_color,gree n_colorpalettes to match batches. - Copy raw counts to a named layer with
adata.layers['raw'] = adata.X.copy()before PCA. - Run
ov.pp.pca(adata, layer='raw', n_pcs=50)andov.pp.pca(adata, layer='batch_correction', n_pcs=50). - Visualise embeddings with
ov.pl.embedding(..., basis='raw|original|X_pca', color='batch', frameon='small')and repeat fo r the corrected layer to verify mixing.
- For per-sample variance checks, draw before/after boxplots and recolour boxes using
- Defensive validation
# Before ComBat: verify batch column exists and has >1 batch assert 'batch' in adata.obs.columns, "adata.obs must contain a 'batch' column" n_batches = adata.obs['batch'].nunique() assert n_batches > 1, f"Only {n_batches} batch — need >1 for batch correction" # Verify gene overlap after concatenation if adata.n_vars < 100: print(f"WARNING: Only {adata.n_vars} shared genes after concat — check gene ID harmonization") - Troubleshooting tips
- Mismatched gene identifiers cause dropped features—remind users to harmonise feature names (e.g., gene symbols) before conca tenation.
- pyComBat expects log-scale intensities or similarly distributed counts; recommend log-transforming strongly skewed matrices.
- If
batch_correctionlayer is missing, ensure thebatch_keymatches the column name inadata.obs.
Examples
- "Combine three GEO ovarian cohorts, run ComBat, and export both the raw and corrected CSV matrices."
- "Plot PCA embeddings before and after batch correction to confirm that batches 1–3 overlap."
- "Save the harmonised AnnData file so I can reload it later for downstream DEG analysis."
References
- Tutorial notebook:
t_bulk_combat.ipynb - Example inputs:
omicverse_guide/docs/Tutorials-bulk/data/combat/ - Quick copy/paste commands:
reference.md
More from starlitnightly/omicverse
single-cell-downstream-analysis
AUCell pathway scoring, metacell DEG, scDrug response, SCENIC regulons, cNMF programs, and NOCD community detection in OmicVerse.
46single-cell-annotation-skills-with-omicverse
Cell type annotation: SCSA, MetaTiME, CellVote consensus, CellMatch, GPTAnno, weighted KNN label transfer in OmicVerse.
45bulk-rna-seq-deseq2-analysis-with-omicverse
PyDESeq2 differential expression: ID mapping, DE testing, fold-change thresholding, and GSEA enrichment visualization in OmicVerse.
44single-cell-preprocessing-with-omicverse
Single-cell QC, normalization, HVG detection, PCA, neighbor graph, UMAP/tSNE embedding pipelines in OmicVerse (CPU/GPU).
40data-export-pdf
Create professional PDF reports with text, tables, and embedded images using reportlab. Works with ANY LLM provider (GPT, Gemini, Claude, etc.).
38single-cell-multi-omics-integration
Multi-omics integration: MOFA factor analysis, GLUE unpaired alignment, SIMBA batch correction, TOSICA label transfer, StaVIA trajectory. Covers scRNA+scATAC paired/unpaired workflows.
38