skills/mims-harvard/tooluniverse/tooluniverse-rnaseq-deseq2

tooluniverse-rnaseq-deseq2

SKILL.md

RNA-seq Differential Expression Analysis (DESeq2)

Differential expression analysis of RNA-seq count data using PyDESeq2, with enrichment analysis (gseapy) and gene annotation via ToolUniverse.

BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects.

Core Principles

  1. Data-first - Load and validate count data and metadata BEFORE any analysis
  2. Statistical rigor - Proper normalization, dispersion estimation, multiple testing correction
  3. Flexible design - Single-factor, multi-factor, and interaction designs
  4. Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean)
  5. Reproducible - Set random seeds, document all parameters
  6. Question-driven - Parse what the user is actually asking; extract the specific answer
  7. Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested

When to Use

  • RNA-seq count matrices needing differential expression analysis
  • DESeq2, DEGs, padj, log2FC questions
  • Dispersion estimates or diagnostics
  • GO, KEGG, Reactome enrichment on DEGs
  • Specific gene expression changes between conditions
  • Batch effect correction in RNA-seq

Required Packages

import pandas as pd, numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import gseapy as gp          # enrichment (optional)
from tooluniverse import ToolUniverse  # annotation (optional)

Analysis Workflow

Step 1: Parse the Question

Extract: data files, thresholds (padj/log2FC/baseMean), design factors, contrast, direction, enrichment type, specific genes. See question_parsing.md.

Step 2: Load & Validate Data

Load counts + metadata, ensure samples-as-rows/genes-as-columns, verify integer counts, align sample names, remove zero-count genes. See data_loading.md.

Step 2.5: Inspect Metadata (REQUIRED)

List ALL metadata columns and levels. Categorize as biological interest vs batch/block. Build design formula with covariates first, factor of interest last. See design_formula_guide.md.

Step 3: Run PyDESeq2

Set reference level via pd.Categorical, create DeseqDataSet, call dds.deseq2(), extract DeseqStats with contrast, run Wald test, optionally apply LFC shrinkage. See pydeseq2_workflow.md.

Tool boundaries:

  • Python (PyDESeq2): ALL DESeq2 analysis
  • ToolUniverse: ONLY gene annotation (ID conversion, pathway context)
  • gseapy: Enrichment analysis (GO/KEGG/Reactome)

Step 4: Filter Results

Apply padj, log2FC, baseMean thresholds. Split by direction if needed. See result_filtering.md.

Step 5: Dispersion Analysis (if asked)

Key columns: genewise_dispersions, fitted_dispersions, MAP_dispersions, dispersions. See dispersion_analysis.md.

Step 6: Enrichment (optional)

Use gseapy enrich() with appropriate gene set library. See enrichment_analysis.md.

Step 7: Gene Annotation (optional)

Use ToolUniverse for ID conversion and gene context only. See output_formatting.md.

Common Patterns

Pattern Type Key Operation
1 DEG count len(results[(padj<0.05) & (abs(lfc)>0.5)])
2 Gene value results.loc['GENE', 'log2FoldChange']
3 Direction Filter log2FoldChange > 0 or < 0
4 Set ops degs_A - degs_B for unique DEGs
5 Dispersion (dds.var['genewise_dispersions'] < thr).sum()

See bixbench_examples.md for all 10 patterns with examples.

Error Quick Reference

Error Fix
No matching samples Transpose counts; strip whitespace
Dispersion trend no converge fit_type='mean'
Contrast not found Check metadata['factor'].unique()
Non-integer counts Round to int OR use t-test
NaN in padj Independent filtering removed genes

See troubleshooting.md for full debugging guide.

Known Limitations

  • PyDESeq2 vs R DESeq2: Numerical differences exist for very low dispersion genes (<1e-05). For exact R reproducibility, use rpy2.
  • gseapy vs R clusterProfiler: Results may differ. See r_clusterprofiler_guide.md.

Reference Files

Utility Scripts

Weekly Installs
113
GitHub Stars
1.1K
First Seen
Feb 19, 2026
Installed on
gemini-cli110
codex110
opencode109
github-copilot109
cursor107
amp106