tooluniverse-rnaseq-deseq2

Installation

SKILL.md

RNA-seq Differential Expression Analysis (DESeq2)

Comprehensive differential expression analysis of RNA-seq count data using PyDESeq2, with integrated enrichment analysis (gseapy) and gene annotation via ToolUniverse.

BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects covering RNA-seq, miRNA-seq, and differential expression analysis tasks.

Core Principles

Data-first approach - Load and validate count data and metadata BEFORE any analysis
Statistical rigor - Always use proper normalization, dispersion estimation, and multiple testing correction
Flexible design - Support single-factor, multi-factor, and interaction designs
Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean)
Reproducible - Set random seeds, document all parameters, output complete results
Question-driven - Parse what the user is actually asking and extract the specific answer
Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested
English-first queries - Use English gene/pathway names in all tool calls

When to Use This Skill

Apply when users:

Have RNA-seq count matrices and want differential expression analysis
Ask about DESeq2, DEGs, differential expression, padj, log2FC
Need dispersion estimates or diagnostics
Want enrichment analysis (GO, KEGG, Reactome) on DEGs
Ask about specific gene expression changes between conditions
Need to compare multiple strains/conditions/treatments
Ask about batch effect correction in RNA-seq
Questions mention "count data", "count matrix", "RNA-seq", "transcriptomics"

Required Packages

# Core (MUST be installed)
import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Enrichment (optional, for GO/KEGG/Reactome)
import gseapy as gp

# ToolUniverse (optional, for gene annotation)
from tooluniverse import ToolUniverse

Installation:

pip install pydeseq2 gseapy pandas numpy scipy anndata

Analysis Workflow

Step 1: Question Parsing

CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify:

Data files: Look for *counts*.csv, *metadata*.csv, *.h5ad
Thresholds: Extract padj (default 0.05), log2FC (default 0), baseMean (default 0)
Design: Identify factors mentioned ("strain", "condition", "batch")
Contrast: Determine comparison ("A vs B", "mutant vs wildtype")
Direction: Check if "upregulated", "downregulated", or both
Enrichment: Look for "GO", "KEGG", "Reactome", "pathway"
Specific genes: Check if asking about individual genes

See references/question_parsing.md for detailed parsing patterns.

Step 1.5: Design Formula Decision Tree ⚠️ CRITICAL

ALWAYS inspect metadata for ALL variables, not just what the question mentions!

Many experiments have hidden batch effects (media conditions, sequencing batches, time points) that MUST be included as covariates. Failing to account for these reduces statistical power and can lead to incorrect results.

Decision process:

List ALL metadata columns (don't skip any!)
Categorize each column:
- Biological interest: The factor you're testing (strain, treatment, genotype, condition)
- Batch/Block: Systematic covariates (media, batch, sequencing_run, time, plate)
- Irrelevant: Sample IDs, notes, file names
Design formula:
- Single factor: ~condition (only if no batch variables exist)
- With covariates: ~batch1 + batch2 + condition (covariates first!)
- Interaction: ~batch + factor1 + factor2 + factor1:factor2

Example (critical real-world case):

Metadata columns: [Strain, Media, Replicate]
Question asks: "strain effects"

❌ WRONG: design="~Strain" (ignores Media!)
✅ CORRECT: design="~Media + Strain" (accounts for media variation)

Why: Even though question doesn't mention media, it systematically
affects expression. DESeq2 must remove media effects to properly test strain.

Rule of thumb: If a column has 2+ levels and represents an experimental condition (not sample ID), include it in design.

Step 2: Data Loading & Validation

Load count matrix and metadata, then validate alignment:

import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputs

# Load data
counts = load_count_matrix("counts.csv")
metadata = pd.read_csv("metadata.csv", index_col=0)

# Validate and align
counts, metadata, issues = validate_inputs(counts, metadata)

Key considerations:

Ensure samples as rows, genes as columns (PyDESeq2 requirement)
Verify integer counts (round if needed)
Align sample names between counts and metadata
Remove zero-count genes

See references/data_loading.md for detailed data handling.

Step 2.5: Inspect Metadata Structure ⚠️ REQUIRED

Before choosing design formula, ALWAYS inspect metadata to identify all experimental factors:

# Print metadata structure
print("Metadata columns and levels:")
for col in metadata.columns:
    unique_vals = metadata[col].unique()
    print(f"  {col}: {len(unique_vals)} levels → {list(unique_vals)[:5]}")

# Example output:
#   Strain: 4 levels → ['1', '97', '98', '99']
#   Media: 3 levels → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']
#   Replicate: 3 levels → ['A', 'B', 'C']

# Decision:
#   - Strain: Biological factor (testing)
#   - Media: Batch/covariate (MUST include!)
#   - Replicate: Biological replicate (don't include as factor)
# Design: ~Media + Strain

This step prevents missing hidden batch effects that could invalidate your analysis.

Step 3: Run PyDESeq2

Execute differential expression analysis:

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Setup design (set reference level first)
metadata['condition'] = pd.Categorical(
    metadata['condition'],
    categories=['control', 'treatment']  # First = reference
)

# Run DESeq2
dds = DeseqDataSet(
    counts=counts,
    metadata=metadata,
    design="~condition",
    quiet=True
)
dds.deseq2()

# Extract results
stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()

# Apply LFC shrinkage (if needed)
stat_res.lfc_shrink(coeff='condition[T.treatment]')

results = stat_res.results_df

Multi-Factor Design (Common Real-World Case)

When metadata has multiple experimental variables (e.g., strain + media conditions), ALWAYS include covariates:

# Inspect metadata (from Step 2.5) showed:
#   - Strain: 4 levels (factor of interest)
#   - Media: 3 levels (batch effect - MUST include!)

# Set reference levels for BOTH factors
metadata['media'] = pd.Categorical(
    metadata['media'],
    categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate']  # First = reference
)
metadata['strain'] = pd.Categorical(
    metadata['strain'],
    categories=['1', '97', '98', '99']  # First = reference (JBX1)
)

# Include covariate in design formula
dds = DeseqDataSet(
    counts=counts,
    metadata=metadata,
    design="~media + strain",  # Covariate first, then factor!
    quiet=True
)
dds.deseq2()

# Extract strain effect (controlling for media)
stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
results = stat_res.results_df

Why this matters: DESeq2 will model media effects separately, removing media-driven variance before testing strain differences. This increases statistical power and prevents false positives/negatives.

When to use what:

Python (PyDESeq2): Use for ALL DESeq2 analysis (normalization, testing, filtering)
ToolUniverse: Use ONLY for gene annotation (ID conversion, pathway context)
gseapy: Use for enrichment analysis (GO/KEGG/Reactome)

See references/pydeseq2_workflow.md for complete PyDESeq2 patterns including batch effects, interaction terms, and complex designs.

Step 4: Filter Results

Apply thresholds from the question:

# Filter DEGs
sig_genes = results[
    (results['padj'] < 0.05) &
    (results['log2FoldChange'].abs() > 0.5) &
    (results['baseMean'] > 10)
]

# Direction-specific filtering
up_genes = sig_genes[sig_genes['log2FoldChange'] > 0]
down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]

Common filtering patterns:

Basic DEG count: len(sig_genes)
Specific gene value: results.loc['GENE_NAME', 'log2FoldChange']
Set operations: Use Python sets for "unique", "shared", "overlap"

See references/result_filtering.md for advanced filtering.

Step 5: Dispersion Analysis (if asked)

Access dispersion estimates:

# Get dispersion data
disp_data = dds.var  # Dispersions stored here

# Common question: "genes below threshold prior to fitting"
genewise = dds.var['genewise_dispersions']
count_below = (genewise < 1e-5).sum()

Dispersion column mapping:

"prior to fitting" → genewise_dispersions
"fitted dispersions" → fitted_dispersions
"after shrinkage" / "MAP" → MAP_dispersions
"final" → dispersions

See references/dispersion_analysis.md for diagnostics.

Step 6: Enrichment Analysis (optional)

Run pathway enrichment on DEGs:

import gseapy as gp

# Prepare gene list
gene_list = sig_genes.index.tolist()

# Run enrichment
enr = gp.enrich(
    gene_list=gene_list,
    gene_sets='GO_Biological_Process_2023',  # or KEGG, Reactome
    background=None,  # or specify background
    outdir=None,
    cutoff=0.05,
    no_plot=True,
    verbose=False
)

# Extract results
top_pathways = enr.results.head(10)

Library selection:

Human GO: GO_Biological_Process_2023
Mouse KEGG: KEGG_2019_Mouse
Human KEGG: KEGG_2021_Human
Reactome: Reactome_2022

See references/enrichment_analysis.md for complete enrichment workflows.

Step 7: Gene Annotation with ToolUniverse (optional)

Use ToolUniverse ONLY for gene annotation, not analysis:

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Gene ID conversion
result = tu.tools.MyGene_query_genes(query="TP53")

# Gene details
result = tu.tools.ensembl_lookup_gene(
    gene_id="ENSG00000141510",
    species="homo_sapiens"
)

Do NOT use ToolUniverse for:

Differential expression (use PyDESeq2)
Statistical testing (use scipy.stats)
Enrichment analysis (use gseapy)

Output Formatting

Match the question's requested format:

# Numeric precision
round(value, 2)  # "2 decimal points"
f"{value:.2E}"   # "scientific notation"

# Percentages
f"{value * 100:.1f}%"  # "as percentage"

# Counts (no decimals)
int(len(sig_genes))  # "how many genes"

See references/output_formatting.md for all format patterns.

Common BixBench Patterns

Pattern 1: Basic DEG Count

Question: "How many genes show significant DE (padj < 0.05, |log2FC| > 0.5)?"

degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)

Pattern 2: Specific Gene Value

Question: "What is the log2FC of gene X?"

answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)

Pattern 3: Direction-Specific

Question: "How many genes are upregulated?"

up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)

Pattern 4: Set Operations

Question: "How many genes are uniquely DE in condition A?"

degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)

Pattern 5: Dispersion Count

Question: "How many genes have dispersion below 1e-5 prior to fitting?"

genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()

See references/bixbench_examples.md for all 10 patterns with examples.

Error Handling

Error	Solution
"No matching samples"	Check if counts need transposing; strip whitespace
"Dispersion trend did not converge"	Use `fit_type='mean'`
"Contrast not found"	Check `metadata['factor'].unique()` for exact names
"Non-integer counts"	Round to integers OR use t-test for normalized data
"NaN in padj"	Independent filtering removed genes; exclude from counts

See references/troubleshooting.md for complete debugging guide.

Validation Checklist

Every analysis MUST include:

Data Loading:

Count matrix oriented correctly (samples as rows, genes as columns)
Metadata aligned with counts
Integer counts validated

DESeq2 Analysis:

Design formula matches question
Reference level set correctly (first in Categorical)
Correct contrast extracted

Results:

Thresholds match question exactly
Direction filter applied if specified
Answer formatted correctly (decimal places, notation)

Quality Checks:

DEG count is reasonable
P-values between 0 and 1
Log2FC values are finite

Known Limitations

PyDESeq2 vs R DESeq2 Differences

This skill uses PyDESeq2 (Python implementation) for differential expression analysis. While PyDESeq2 faithfully implements the DESeq2 algorithm, numerical differences exist between Python and R implementations:

Dispersion Estimation:

Dispersion estimates may differ from R DESeq2, especially for very low dispersion genes (< 1e-05)
This is due to different numerical optimization methods between Python and R statistical libraries
Results are still statistically valid and biologically meaningful
If you need exact R DESeq2 dispersions for benchmark reproducibility, consider using R DESeq2 directly via rpy2

For Most Analyses: PyDESeq2 provides accurate, publication-quality results. The differences are in numerical precision, not statistical validity.

GO/KEGG Enrichment (gseapy vs clusterProfiler)

Enrichment analysis uses Python gseapy by default. Results may differ from R clusterProfiler due to:

Different enrichment algorithms and statistical tests
Different GO/KEGG database versions
Different term simplification/redundancy removal methods

For R clusterProfiler Compatibility: If you need results matching R clusterProfiler exactly (e.g., for benchmark reproducibility):

# Install rpy2 and R packages first:
# pip install rpy2
# In R: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))

from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri, vectors
pandas2ri.activate()

# Load R packages
clusterProfiler = importr('clusterProfiler')
orgdb = importr('org.Hs.eg.db')

# Convert Python gene list to R vector
gene_list = sig_genes.index.tolist()
r_genes = vectors.StrVector(gene_list)

# Run enrichGO (R's clusterProfiler)
enrich_result = clusterProfiler.enrichGO(
    gene=r_genes,
    OrgDb=orgdb.org_Hs_eg_db,
    ont='BP',  # Biological Process
    pAdjustMethod='BH',
    pvalueCutoff=0.05,
    qvalueCutoff=0.2
)

# Apply simplify to remove redundant terms
simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')

# Convert R result back to pandas
results_df = pandas2ri.rpy2py(simplified)

Default behavior: Uses gseapy (no R dependencies required). This is sufficient for most analyses and provides valid enrichment results.

When to use R clusterProfiler:

When exact reproducibility with R-based benchmarks is required
When collaborating with R users who need identical results
When specific clusterProfiler features (GSEA, compareCluster) are needed

References

Detailed documentation for advanced topics:

question_parsing.md - Extract parameters from questions
data_loading.md - Data loading and validation patterns
pydeseq2_workflow.md - Complete PyDESeq2 code examples
result_filtering.md - Advanced filtering and extraction
dispersion_analysis.md - Dispersion diagnostics
enrichment_analysis.md - GO/KEGG/Reactome workflows
output_formatting.md - Format answers correctly
bixbench_examples.md - All 10 question patterns
troubleshooting.md - Common issues and debugging

Utility Scripts

Helper scripts for common tasks:

format_deseq2_output.py - Output formatters
load_count_matrix.py - Data loading utilities

Related skills

More from wu-yc/labclaw

Installs

Repository

wu-yc/labclaw

GitHub Stars

981

First Seen

Mar 15, 2026