bio-experimental-design-multiple-testing
Version Compatibility
Reference examples tested with: R stats (base), statsmodels 0.14+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
- R: packageVersion('<pkg>'), then ?function_name to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Multiple Testing Correction
"Correct p-values for multiple testing" → Adjust raw p-values from thousands of simultaneous tests to control false discovery rate or family-wise error rate.
- R: p.adjust(pvalues, method = 'BH'), qvalue::qvalue()
- Python: statsmodels.stats.multitest.multipletests()
The Problem
Testing 20,000 genes at p < 0.05 yields about 1,000 false positives by chance alone when no gene is truly differential (20,000 × 0.05 = 1,000). Correction is essential.
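The arithmetic behind that estimate can be checked with a small simulation. This is a minimal sketch that assumes every null hypothesis is true, so each p-value is uniform on [0, 1]; the function name and seed are illustrative, not part of any library.

```python
import random

def count_false_positives(n_tests=20_000, alpha=0.05, seed=0):
    """Count p-values below alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under a true null, a valid p-value is uniformly distributed on [0, 1].
    pvalues = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in pvalues)

expected = 20_000 * 0.05            # expected count: n_tests * alpha
observed = count_false_positives()  # one simulated experiment
print(f"expected: {expected:.0f}, observed in one simulation: {observed}")
```

The observed count varies from run to run around the expected value, which is why an uncorrected threshold cannot separate real signal from chance hits.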
Common Methods
Bonferroni (Most Conservative)
# Strict family-wise error rate control
p_adj <- p.adjust(pvalues, method = 'bonferroni')
# Threshold: alpha / n_tests
# Use for: small gene sets, confirmatory studies
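To make the adjustment concrete, here is a minimal pure-Python sketch of what the Bonferroni method does: multiply each p-value by the number of tests and cap the result at 1. The input p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * n_tests, capped at 1."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

pvalues = [0.001, 0.008, 0.02, 0.045, 0.5]  # hypothetical raw p-values
alpha = 0.05

p_adj = bonferroni(pvalues)
# Comparing p_adj to alpha is equivalent to comparing raw p to alpha / n_tests.
significant = [p <= alpha for p in p_adj]
print(p_adj)
print(significant)
```

Only the two smallest p-values survive here, which illustrates why the method is considered the most conservative option.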
Benjamini-Hochberg FDR (Standard)
# Controls false discovery rate
p_adj <- p.adjust(pvalues, method = 'BH')
# Most common for genomics
# FDR 0.05 = expect 5% of significant results to be false
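The Benjamini-Hochberg step-up adjustment can be sketched in a few lines of pure Python. This is an illustration of the procedure, not the R source: each p-value is scaled by n / rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone. The input p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """BH-adjusted p-values via the step-up procedure."""
    n = len(pvalues)
    # Visit p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    for position, idx in enumerate(order):
        rank = n - position  # rank n for the largest p-value, 1 for the smallest
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

pvalues = [0.001, 0.008, 0.02, 0.045, 0.5]  # hypothetical raw p-values
p_adj = benjamini_hochberg(pvalues)
n_significant = sum(p <= 0.05 for p in p_adj)
print(p_adj)
print(n_significant)
```

On these five values BH keeps three results at FDR 0.05, one more than a Bonferroni correction of the same input would.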
q-value (Recommended for Large-Scale)
Goal: Estimate the false discovery rate for each gene in a genome-wide test while maximizing detection power by estimating the proportion of true nulls.
Approach: Fit the q-value model to the p-value distribution, which estimates pi0 (fraction of true null hypotheses) and converts each p-value to a q-value representing the minimum FDR at which that gene would be called significant.
library(qvalue)
qobj <- qvalue(pvalues)
qvalues <- qobj$qvalues
pi0 <- qobj$pi0 # Estimated proportion of true nulls
# Each q-value is the minimum FDR at which that gene would be called significant
# More powerful than BH when pi0 is well below 1 (many true positives exist)
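The approach described above can be sketched in pure Python under a simplifying assumption: pi0 is estimated at a single tuning value (lam = 0.5, chosen here for illustration), whereas the qvalue package estimates pi0 across a range of such values. The function names and the toy p-values are hypothetical, and ten p-values are far too few for a stable pi0 estimate in practice.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey-style pi0 estimate at one tuning value lam (a simplification)."""
    n = len(pvalues)
    # p-values above lam are assumed to come mostly from true nulls.
    above = sum(p > lam for p in pvalues)
    return min(1.0, above / (n * (1.0 - lam)))

def qvalues_from_pvalues(pvalues, lam=0.5):
    """q-values as pi0 times the BH step-up adjusted p-values."""
    n = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    qvals = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank n for the largest p-value
        running_min = min(running_min, pi0 * pvalues[idx] * n / rank)
        qvals[idx] = running_min
    return qvals, pi0

# Toy input: six small p-values plus four spread over the rest of [0, 1].
pvalues = [0.0001, 0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.4, 0.6, 0.9]
qvals, pi0 = qvalues_from_pvalues(pvalues)
print(f"pi0 estimate: {pi0:.2f}")
print([round(q, 4) for q in qvals])
```

Because pi0 is capped at 1, each q-value from this sketch is at most the corresponding BH-adjusted p-value, which is where the extra power comes from.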
Method Selection Guide
| Scenario | Recommended Method | Threshold |
|---|---|---|
| Genome-wide DE | BH or q-value | FDR < 0.05 |
| Candidate genes | Bonferroni | p < 0.05/n |
| Exploratory | BH | FDR < 0.10 |
| Validation study | Bonferroni | p < 0.05/n |
| GWAS | Bonferroni | p < 5e-8 |
Python Equivalent
from statsmodels.stats.multitest import multipletests
# Benjamini-Hochberg
rejected, pvals_corrected, _, _ = multipletests(pvalues, method='fdr_bh')
# Bonferroni
rejected, pvals_corrected, _, _ = multipletests(pvalues, method='bonferroni')
Interpreting Results
- FDR 0.05: Among genes called significant, ~5% are false positives
- FDR 0.01: More stringent, fewer false positives but more false negatives
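- padj vs qvalue: Both target FDR; BH padj implicitly treats every test as a true null (pi0 = 1), while q-values scale by the estimated pi0, so they are slightly more powerful when many genes are truly differential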
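As a worked example of these thresholds, the expected number of false positives in a hit list is the list size times the FDR level. The list size of 200 below is hypothetical, and in real data the number of significant genes would itself change with the threshold.

```python
def expected_false_discoveries(n_significant, fdr):
    """Expected number of false positives among genes called significant."""
    return n_significant * fdr

n_significant = 200  # hypothetical size of a significant gene list
results = {fdr: expected_false_discoveries(n_significant, fdr)
           for fdr in (0.10, 0.05, 0.01)}
for fdr, n_false in results.items():
    print(f"FDR {fdr:.2f}: about {n_false:.0f} of {n_significant} hits expected to be false")
```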
Related Skills
- differential-expression/de-results - Applying corrections to DE output
- population-genetics/association-testing - GWAS significance thresholds
- pathway-analysis/go-enrichment - Correcting enrichment p-values