bio-pathway-gsea
SKILL.md
Gene Set Enrichment Analysis (GSEA)
Core Concept
GSEA uses all genes ranked by a statistic (log2FC, signed p-value) rather than a subset of significant genes. It finds gene sets where members are enriched at the top or bottom of the ranked list.
Prepare Ranked Gene List
library(clusterProfiler)
library(org.Hs.eg.db)
de_results <- read.csv('de_results.csv')
# Create named vector: values = statistic, names = gene IDs
gene_list <- de_results$log2FoldChange
names(gene_list) <- de_results$gene_id
# Sort in decreasing order (REQUIRED)
gene_list <- sort(gene_list, decreasing = TRUE)
Convert Gene IDs for GSEA
# Convert symbols to Entrez IDs
gene_ids <- bitr(names(gene_list), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# Create ranked list with Entrez IDs
gene_list_entrez <- gene_list[names(gene_list) %in% gene_ids$SYMBOL]
names(gene_list_entrez) <- gene_ids$ENTREZID[match(names(gene_list_entrez), gene_ids$SYMBOL)]
gene_list_entrez <- sort(gene_list_entrez, decreasing = TRUE)
Alternative Ranking Statistics
# Signed p-value (recommended for detecting both up and down)
gene_list <- -log10(de_results$pvalue) * sign(de_results$log2FoldChange)
names(gene_list) <- de_results$gene_id
gene_list <- sort(gene_list, decreasing = TRUE)
# Wald statistic (from DESeq2)
gene_list <- de_results$stat
names(gene_list) <- de_results$gene_id
gene_list <- sort(gene_list, decreasing = TRUE)
GSEA with GO
gse_go <- gseGO(
geneList = gene_list_entrez,
OrgDb = org.Hs.eg.db,
ont = 'BP', # BP, MF, CC, or ALL
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE,
pAdjustMethod = 'BH'
)
# Make readable
gse_go <- setReadable(gse_go, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with KEGG
gse_kegg <- gseKEGG(
geneList = gene_list_entrez,
organism = 'hsa',
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE
)
# Make readable
gse_kegg <- setReadable(gse_kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with Custom Gene Sets
# Read GMT file (Gene Matrix Transposed)
gene_sets <- read.gmt('msigdb_hallmarks.gmt')
gse_custom <- GSEA(
geneList = gene_list_entrez,
TERM2GENE = gene_sets,
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05
)
MSigDB Gene Sets
# Use msigdbr package for MSigDB gene sets
library(msigdbr)
# Hallmark gene sets
hallmarks <- msigdbr(species = 'Homo sapiens', category = 'H')
hallmarks_t2g <- hallmarks[, c('gs_name', 'entrez_gene')]
gse_hallmark <- GSEA(
geneList = gene_list_entrez,
TERM2GENE = hallmarks_t2g,
pvalueCutoff = 0.05
)
# Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)
Understanding Results
# View results
head(gse_go)
results <- as.data.frame(gse_go)
# Key columns:
# - NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated)
# - pvalue: Nominal p-value
# - p.adjust: FDR-adjusted p-value
# - core_enrichment: Leading edge genes
Interpreting NES (Normalized Enrichment Score)
| NES | Interpretation |
|---|---|
| Positive (> 0) | Gene set enriched in upregulated genes |
| Negative (< 0) | Gene set enriched in downregulated genes |
| NES |
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| geneList | required | Named, sorted numeric vector |
| OrgDb | required | Organism database (for gseGO) |
| organism | hsa | KEGG organism code (for gseKEGG) |
| ont | BP | Ontology: BP, MF, CC, ALL |
| minGSSize | 10 | Min genes in gene set |
| maxGSSize | 500 | Max genes in gene set |
| pvalueCutoff | 0.05 | P-value threshold |
| pAdjustMethod | BH | Adjustment method |
| nPerm | 10000 | Permutations (if permutation test used) |
| eps | 1e-10 | Boundary for p-value calculation |
Export Results
results_df <- as.data.frame(gse_go)
write.csv(results_df, 'gsea_go_results.csv', row.names = FALSE)
# Get leading edge genes for a term
leading_edge <- strsplit(results_df$core_enrichment[1], '/')[[1]]
Notes
- Must be sorted - gene list must be sorted in decreasing order
- Named vector - names are gene IDs, values are statistics
- No arbitrary cutoffs - uses all genes, not just significant ones
- NES sign matters - positive = upregulated enrichment
- Leading edge - core_enrichment contains driving genes
Related Skills
- go-enrichment - Over-representation analysis for GO
- kegg-pathways - Over-representation analysis for KEGG
- enrichment-visualization - GSEA plots, ridge plots
Weekly Installs
3
Repository
gptomics/bioskillsGitHub Stars
349
First Seen
Jan 24, 2026
Security Audits
Installed on
trae2
windsurf1
opencode1
codex1
claude-code1
antigravity1