skills/gexijin/vitiligo/rnaseq-analysis

rnaseq-analysis

Installation
SKILL.md

Bulk RNA-seq Analysis Skill

This skill guides analysis of bulk RNA-seq gene expression count data using DESeq2, following established patterns from this project.

Quick Start

When asked to analyze RNA-seq data:

  1. Check if a CLAUDE.md file exists in the study directory - it contains study-specific metadata
  2. Determine the data format (count matrix, featureCounts output, RSEM, etc.)
  3. Create numbered scripts in scripts/ directory (01_, 02_, etc.)
  4. Update analysis.md after each analysis step
  5. Generate all plots in results/plots/ with plots/plots.md documentation

Directory Structure

Create this structure for each analysis:

analyses/[FirstAuthor]_[GSE#]/
├── CLAUDE.md           # Study metadata (samples, platform, design)
├── analysis.md         # Progressive documentation (update after each step)
├── scripts/
│   ├── 01_load_and_qc.R
│   ├── 02_deseq2_analysis.R
│   ├── 03_differential_expression.R
│   ├── 04_annotate_genes.R
│   └── 05_go_enrichment.R
├── data/               # Raw and processed data (gitignored)
└── results/
    ├── tables/         # CSV output files
    └── plots/          # PNG/PDF figures
        └── plots.md    # Plot documentation

Gene Tracking

IMPORTANT: Track gene counts at each processing step and report a summary table at the end of QC and DESeq2 scripts. This provides transparency about data filtering.

Initialize Tracking

# Initialize gene tracking at the start of script
gene_tracking <- data.frame(
  step = character(),
  n_genes = integer(),
  n_removed = integer(),
  description = character(),
  stringsAsFactors = FALSE
)

Track at Each Step

Add tracking after each filtering/processing step:

# After loading raw counts
gene_tracking <- rbind(gene_tracking, data.frame(
  step = "1. Raw counts loaded",
  n_genes = nrow(counts),
  n_removed = 0,
  description = "Total genes in count matrix"
))

# After pre-filtering low counts
gene_tracking <- rbind(gene_tracking, data.frame(
  step = "2. Pre-filtering",
  n_genes = nrow(counts_filtered),
  n_removed = nrow(counts) - nrow(counts_filtered),
  description = sprintf("Genes with >= %d counts in >= %d samples", min_count, min_samples)
))

# After DESeq2 independent filtering
gene_tracking <- rbind(gene_tracking, data.frame(
  step = "3. DESeq2 independent filtering",
  n_genes = sum(!is.na(res$padj)),
  n_removed = sum(is.na(res$padj)),
  description = "Genes with sufficient counts for testing"
))

# After annotation (genes with valid IDs)
gene_tracking <- rbind(gene_tracking, data.frame(
  step = "4. Gene annotation",
  n_genes = sum(!is.na(res_annotated$gene_symbol)),
  n_removed = sum(is.na(res_annotated$gene_symbol)),
  description = "Genes with valid gene symbols"
))

Report Summary

At the end of analysis script:

# Calculate percentage remaining
gene_tracking$pct_remaining <- round(100 * gene_tracking$n_genes / gene_tracking$n_genes[1], 1)

cat("\n=== Gene Tracking Summary ===\n\n")
print(gene_tracking)

# Save to file
write.csv(gene_tracking, file.path(tables_dir, "gene_tracking.csv"), row.names = FALSE)

Expected Output

Step Genes Removed % Remaining Description
1. Raw counts loaded 60,623 0 100.0% Total genes in count matrix
2. Pre-filtering 18,432 42,191 30.4% Genes with >= 10 counts in >= 3 samples
3. DESeq2 independent filtering 15,821 2,611 26.1% Genes with sufficient counts for testing
4. Gene annotation 14,956 865 24.7% Genes with valid gene symbols

Data Loading

Standard Script Header with Relative Paths

Use the here package for portable, relative paths:

.libPaths(c("~/R/library", .libPaths()))
suppressPackageStartupMessages({
  library(DESeq2)
  library(ggplot2)
  library(dplyr)
  library(tidyr)
  library(pheatmap)
  library(RColorBrewer)
  library(here)
})

# Source utility functions
source(here("scripts/utils/filter_counts.R"))

# Set analysis directory using here package
analysis_dir <- here("analyses", "Author_GSE#####")

# Create output directories
dir.create(file.path(analysis_dir, "results/tables"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(analysis_dir, "results/plots"),
           recursive = TRUE, showWarnings = FALSE)

Count Matrix Format

DESeq2 expects raw (non-normalized) integer counts:

# Load count matrix (genes as rows, samples as columns)
counts <- read.csv(file.path(analysis_dir, "data/counts.csv"), row.names = 1)
counts <- as.matrix(counts)

# Ensure integer counts
storage.mode(counts) <- "integer"

# Load sample metadata
sample_info <- read.csv(file.path(analysis_dir, "data/sample_metadata.csv"))
rownames(sample_info) <- sample_info$sample_id

# Ensure sample order matches
counts <- counts[, rownames(sample_info)]

From GEO Downloads

GEO data may come in various formats:

# Typical GEO series matrix
geo_data <- read.delim("GSE#####_count_matrix.txt.gz", row.names = 1)

# featureCounts output
fc_data <- read.delim("featureCounts_output.txt", comment.char = "#")
counts <- fc_data[, 7:ncol(fc_data)]
rownames(counts) <- fc_data$Geneid

# RSEM expected counts (round to integers)
rsem_data <- read.delim("rsem_genes.txt", row.names = 1)
counts <- round(rsem_data[, grep("expected_count", colnames(rsem_data))])

Quality Control

Pre-filtering

Use the filter_counts() utility function for CPM-based and total count filtering:

# Source the utility function
source(here("scripts/utils/filter_counts.R"))

# Define filtering parameters
FILTER_MIN_CPM <- 2              # Minimum CPM threshold
FILTER_MIN_SAMPLE_PERCENT <- 0.3 # Min % of samples (30%)
FILTER_MIN_TOTAL_COUNTS <- 20    # Min total counts across all samples

# Filter using CPM and total count thresholds
counts_filtered <- filter_counts(
  counts,
  min_cpm = FILTER_MIN_CPM,
  min_sample_percent = FILTER_MIN_SAMPLE_PERCENT,
  min_total_counts = FILTER_MIN_TOTAL_COUNTS
)

The filter_counts() function (in scripts/utils/filter_counts.R) filters genes that meet both:

  1. CPM threshold in a minimum percentage of samples
  2. Minimum total counts across all samples

This produces informative output:

Filtering statistics:
  Total samples: 10
  Min samples required: 3 (30%)
  CPM threshold: 2
  Total count threshold: 20
  Total genes: 60623
  Genes passing CPM filter: 15432
  Genes passing total counts filter: 18234
  Genes passing both filters: 14521
  Genes removed: 46102

QC Metrics

# Library sizes (total counts per sample)
lib_sizes <- colSums(counts_filtered)

# Number of detected genes per sample
genes_detected <- colSums(counts_filtered > 0)

# Create QC summary
qc_summary <- data.frame(
  sample_id = colnames(counts_filtered),
  library_size = lib_sizes,
  genes_detected = genes_detected,
  condition = sample_info$condition
)

# Z-score for outlier detection
qc_summary$lib_size_zscore <- scale(qc_summary$library_size)[, 1]
qc_summary$genes_zscore <- scale(qc_summary$genes_detected)[, 1]

# Flag potential outliers
qc_summary$outlier_flag <- abs(qc_summary$lib_size_zscore) > 2 |
                           abs(qc_summary$genes_zscore) > 2

Quality Control Plots

Generate these QC plots (numbered sequentially):

  1. Library size barplot - Total counts per sample
  2. Genes detected barplot - Number of genes with >0 counts per sample
  3. Count distribution boxplot - Log2(count+1) distribution per sample
  4. Count density plot - Overlapping density curves
  5. Sample correlation heatmap - Hierarchical clustering
  6. PCA plot - PC1 vs PC2 colored by condition
  7. Sample distance heatmap - Euclidean distance after VST

Post-normalization visualization:

  • Overall expression heatmap (e.g., 08b) - Top 2000 most variable genes
    • Use VST-transformed counts
    • Median-centered, ±3 SD clipping for color utilization
    • 1 - Pearson correlation distance, average linkage for gene clustering
    • Samples ordered by condition (not clustered)
    • Green-black-red color scheme with condition color bars
  • Sample dendrogram (e.g., 08c) - Hierarchical clustering of all samples
    • Uses all genes from VST-transformed counts
    • 1 - Pearson correlation distance, average linkage
    • Horizontal layout with branches and labels colored by condition

Outlier Detection Criteria

Metric Threshold Action
Library size z-score z
Genes detected z-score z
PCA extreme Visual outlier Investigate
Low correlation < 0.8 with group Investigate

After excluding outliers: Re-create DESeqDataSet with filtered samples.

DESeq2 Analysis

Create DESeqDataSet

library(DESeq2)

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(
  countData = counts_filtered,
  colData = sample_info,
  design = ~ condition  # Basic unpaired design
)

# Set reference level
dds$condition <- relevel(dds$condition, ref = "Control")

Paired Design (same subjects, multiple conditions)

For matched samples (e.g., lesional vs non-lesional from same patient):

# Ensure factors are properly set
dds$condition <- factor(sample_info$condition, levels = c("NonLesional", "Lesional"))
dds$patient <- factor(sample_info$patient_id)

# Design with patient as blocking factor
design(dds) <- ~ patient + condition

Unpaired Design (independent groups)

For independent samples:

# Simple two-group comparison
dds$condition <- factor(sample_info$condition, levels = c("Control", "Treatment"))
design(dds) <- ~ condition

Multi-factor Design

For multiple variables:

# Treatment and batch effects
dds$treatment <- factor(sample_info$treatment)
dds$batch <- factor(sample_info$batch)
design(dds) <- ~ batch + treatment

# Interaction model
dds$genotype <- factor(sample_info$genotype)
dds$treatment <- factor(sample_info$treatment)
design(dds) <- ~ genotype + treatment + genotype:treatment

Run DESeq2

# Run the differential expression pipeline
dds <- DESeq(dds)

# Check available coefficients
resultsNames(dds)

Extracting Results

Basic Results Extraction

# Extract results for a specific contrast
res <- results(dds,
               name = "condition_Treatment_vs_Control",  # Or use contrast
               alpha = 0.05)

# Or use contrast argument
res <- results(dds,
               contrast = c("condition", "Treatment", "Control"),
               alpha = 0.05)

# Summary
summary(res)

# Convert to data frame and sort
res_df <- as.data.frame(res)
res_df$gene_id <- rownames(res_df)
res_df <- res_df[order(res_df$pvalue), ]

Log Fold Change Shrinkage

Always apply LFC shrinkage for better estimates and ranking:

# Use apeglm method (recommended)
library(apeglm)
res_shrunk <- lfcShrink(dds,
                        coef = "condition_Treatment_vs_Control",
                        type = "apeglm")

# Alternative: ashr method
res_shrunk <- lfcShrink(dds,
                        contrast = c("condition", "Treatment", "Control"),
                        type = "ashr")

Multiple Contrasts

# Define all contrasts of interest
contrasts <- list(
  Treatment_vs_Control = c("condition", "Treatment", "Control"),
  TimePoint2_vs_TimePoint1 = c("timepoint", "T2", "T1")
)

# Extract all results
all_results <- lapply(names(contrasts), function(name) {
  res <- results(dds, contrast = contrasts[[name]], alpha = 0.05)
  res_df <- as.data.frame(res)
  res_df$gene_id <- rownames(res_df)
  res_df$contrast <- name
  return(res_df)
})

names(all_results) <- names(contrasts)

Variance Stabilizing Transformation

Use VST or rlog for visualization (PCA, heatmaps):

# VST (faster, recommended for n > 30)
vsd <- vst(dds, blind = FALSE)

# rlog (better for small sample sizes n < 30)
rld <- rlog(dds, blind = FALSE)

# Extract transformed values
vst_mat <- assay(vsd)

Model Diagnostics

Always generate these diagnostic plots:

  1. P-value histogram - Should show uniform + spike near 0
  2. MA plot - Check for intensity-dependent bias
  3. Dispersion plot - Check model fit
# P-value histogram
hist(res$pvalue, breaks = 50,
     main = "P-value Distribution", xlab = "P-value")
abline(h = nrow(res) / 50, col = "red", lty = 2)

# MA plot
plotMA(res, main = "MA Plot", ylim = c(-5, 5))

# Dispersion plot
plotDispEsts(dds, main = "Dispersion Estimates")

Gene ID Mapping

Using org.Hs.eg.db

library(AnnotationDbi)
library(org.Hs.eg.db)

# Get gene IDs from results
gene_ids <- rownames(res)

# Map Ensembl to symbols (if starting with Ensembl IDs)
# Remove version numbers if present
gene_ids_clean <- gsub("\\.\\d+$", "", gene_ids)

mapping <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = gene_ids_clean,
  columns = c("SYMBOL", "ENTREZID"),
  keytype = "ENSEMBL"
)

colnames(mapping) <- c("ensembl_id", "gene_symbol", "entrez_id")

# Handle duplicates (one Ensembl -> multiple mappings)
mapping_unique <- mapping %>%
  group_by(ensembl_id) %>%
  summarize(
    gene_symbol = first(na.omit(gene_symbol)),
    entrez_id = first(na.omit(entrez_id))
  )

Using biomaRt

library(biomaRt)

ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

mapping <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
  filters = "ensembl_gene_id",
  values = gene_ids_clean,
  mart = ensembl
)

Output Format

Differential Expression Results

Produce output files with both unshrunken and shrunken log2 fold changes. The shrunken LFC should be used for ranking and visualization, while unshrunken values are retained for reference.

Required columns in output CSV:

Column Description
gene_id Original gene ID, Ensembl, or Entrez
baseMean Mean of normalized counts
log2FoldChange Shrunken log2 fold change (use for ranking/visualization)
log2FoldChange_unshrunken Unshrunken log2 fold change (for reference)
lfcSE Standard error of shrunken LFC
stat Wald statistic (from unshrunken results)
pvalue Raw p-value
padj BH-adjusted p-value
entrez_id Entrez gene ID (if annotated)
gene_symbol Gene symbol (if annotated)

Important: Use type = "normal" for LFC shrinkage when using paired designs, as apeglm may not work with all coefficient names:

# Apply LFC shrinkage
res_shrunk <- lfcShrink(dds,
                        coef = "condition_Lesional_vs_NonLesional",
                        type = "normal")

# Combine shrunken and unshrunken results
res_df <- as.data.frame(res_shrunk)
res_df$gene_id <- rownames(res_df)
res_df$log2FoldChange_unshrunken <- res_unshrunken$log2FoldChange
res_df$stat <- res_unshrunken$stat

For multiple gene IDs mapped to the same symbol, keep the most significant (lowest P-value). Unmapped genes can be deleted from annotated output files.

Similarily,

File Naming Convention

results/tables/
├── sample_metadata.csv
├── qc_metrics.csv
├── normalized_counts.csv              # DESeq2 normalized counts
├── vst_counts.csv                     # VST transformed counts
├── DE_[Comparison]_full.csv           # Full results
├── DE_[Comparison]_annotated.csv      # With gene annotation
├── DE_[Comparison]_significant.csv    # FDR < 0.05 only
├── genes_upregulated.csv
├── genes_downregulated.csv
├── GO_BP_[Comparison]_UP.csv
├── GO_BP_[Comparison]_DOWN.csv
└── GO_BP_summary.csv

Progressive Documentation (analysis.md)

Update analysis.md after EACH analysis step with:

# [GSE#] RNA-seq Analysis Report

## Study Information

| Field | Value |
|-------|-------|
| **GEO Accession** | GSE##### |
| **Data Type** | Bulk RNA-seq |
| **Analysis Date** | YYYY-MM-DD |

## Methods

### Software Environment

| Package | Version | Purpose |
|---------|---------|---------|
| DESeq2 | X.X.X | Differential expression |
| apeglm | X.X.X | LFC shrinkage |

### Key Parameters
- Pre-filtering: Genes with ≥10 counts in ≥3 samples
- FDR threshold: 0.05
- LFC shrinkage method: apeglm

## Results

### Quality Control
- [Summarize QC findings]
- [Document any outliers removed]

### Differential Expression
| Comparison | Design | Total DEGs | Up | Down |
|------------|--------|------------|-----|------|
| ... | ... | ... | ... | ... |

### Key Findings
- [Biological interpretation]

## Output Files
[List all generated files]

Plot Documentation (plots/plots.md)

Create results/plots/plots.md with embedded image references:

# Plot Documentation - [GSE#]

## Quality Control

### 01_library_sizes.png
Library size (total counts) per sample.

![Library sizes](./01_library_sizes.png)

### 02_genes_detected.png
Number of genes detected (count > 0) per sample.

![Genes detected](./02_genes_detected.png)

### 03_count_distribution.png
Boxplot of log2(count+1) distribution per sample.

![Count distribution](./03_count_distribution.png)

[Continue for all plots...]

## Differential Expression

### 15_volcano_plot.png
Volcano plot showing -log10(p-value) vs log2 fold change.
- Red: Upregulated (FDR < 0.05, |logFC| > 1)
- Blue: Downregulated (FDR < 0.05, |logFC| > 1)

![Volcano plot](./15_volcano_plot.png)

GO and KEGG Enrichment Analysis

Use clusterProfiler for pathway analysis. Run both GO Biological Process and KEGG pathway enrichment for both upregulated AND downregulated genes in each comparison.

Background Gene List

Important: Use all genes that passed the filtering step as background:

library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(dplyr)

# Background = all genes tested (from annotated results)
background_genes <- res_annotated %>%
  filter(!is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

GO Biological Process Enrichment (Both Directions)

# Get Entrez IDs for UPREGULATED genes
sig_up <- res_annotated %>%
  filter(padj < 0.05 & log2FoldChange > 0 & !is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

# Get Entrez IDs for DOWNREGULATED genes
sig_down <- res_annotated %>%
  filter(padj < 0.05 & log2FoldChange < 0 & !is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

# GO enrichment - Upregulated
ego_up <- enrichGO(
  gene = sig_up,
  universe = background_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)
ego_up_simp <- simplify(ego_up, cutoff = 0.7, by = "p.adjust")

# GO enrichment - Downregulated
ego_down <- enrichGO(
  gene = sig_down,
  universe = background_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)
ego_down_simp <- simplify(ego_down, cutoff = 0.7, by = "p.adjust")

KEGG Pathway Enrichment (Both Directions)

# KEGG - Upregulated
kegg_up <- enrichKEGG(
  gene = sig_up,
  universe = background_genes,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

# KEGG - Downregulated
kegg_down <- enrichKEGG(
  gene = sig_down,
  universe = background_genes,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

Combined Plots with Fold Enrichment

Create combined diverging plots showing both upregulated and downregulated pathways on the same figure, with fold enrichment on the x-axis:

# Helper function to calculate fold enrichment
calculate_fold_enrichment <- function(enrichResult) {
  if (is.null(enrichResult) || nrow(enrichResult@result) == 0) return(NULL)

  df <- as.data.frame(enrichResult)
  df <- df %>%
    mutate(
      gene_ratio_num = as.numeric(sub("/.*", "", GeneRatio)),
      gene_ratio_denom = as.numeric(sub(".*/", "", GeneRatio)),
      bg_ratio_num = as.numeric(sub("/.*", "", BgRatio)),
      bg_ratio_denom = as.numeric(sub(".*/", "", BgRatio)),
      GeneRatioValue = gene_ratio_num / gene_ratio_denom,
      BgRatioValue = bg_ratio_num / bg_ratio_denom,
      FoldEnrichment = GeneRatioValue / BgRatioValue
    )
  return(df)
}

# Helper function to create combined plot
create_combined_enrichment_plot <- function(df_up, df_down, title, n_terms = 10) {
  plot_data <- data.frame()

  if (!is.null(df_up) && nrow(df_up) > 0) {
    up_top <- df_up %>%
      filter(p.adjust < 0.05) %>%
      arrange(p.adjust) %>%
      head(n_terms) %>%
      mutate(Direction = "Upregulated")
    plot_data <- bind_rows(plot_data, up_top)
  }

  if (!is.null(df_down) && nrow(df_down) > 0) {
    down_top <- df_down %>%
      filter(p.adjust < 0.05) %>%
      arrange(p.adjust) %>%
      head(n_terms) %>%
      mutate(Direction = "Downregulated")
    plot_data <- bind_rows(plot_data, down_top)
  }

  if (nrow(plot_data) == 0) return(NULL)

  # Truncate descriptions and create signed fold enrichment
  plot_data <- plot_data %>%
    mutate(
      Description_short = ifelse(nchar(Description) > 45,
                                  paste0(substr(Description, 1, 42), "..."),
                                  Description),
      FoldEnrichment_signed = ifelse(Direction == "Downregulated",
                                      -FoldEnrichment, FoldEnrichment)
    ) %>%
    arrange(Direction, FoldEnrichment) %>%
    mutate(Description_short = factor(Description_short,
                                       levels = unique(Description_short)))

  # Create diverging dot plot
  ggplot(plot_data, aes(x = FoldEnrichment_signed, y = Description_short,
                         fill = Direction, size = Count)) +
    geom_point(shape = 21, alpha = 0.8) +
    geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
    scale_fill_manual(values = c("Upregulated" = "#E74C3C",
                                  "Downregulated" = "#3498DB")) +
    scale_size_continuous(range = c(3, 10), name = "Gene Count") +
    labs(title = title,
         subtitle = sprintf("Top %d terms per direction (FDR < 0.05)", n_terms),
         x = "Fold Enrichment", y = "") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"),
          axis.text.y = element_text(size = 9))
}

# Generate combined plots
go_up_fe <- calculate_fold_enrichment(ego_up_simp)
go_down_fe <- calculate_fold_enrichment(ego_down_simp)
kegg_up_fe <- calculate_fold_enrichment(kegg_up)
kegg_down_fe <- calculate_fold_enrichment(kegg_down)

p_go <- create_combined_enrichment_plot(go_up_fe, go_down_fe,
                                         "GO Biological Process")
p_kegg <- create_combined_enrichment_plot(kegg_up_fe, kegg_down_fe,
                                           "KEGG Pathways")

ggsave("results/plots/XX_GO_BP_combined.png", p_go, width = 10, height = 8)
ggsave("results/plots/XX_KEGG_combined.png", p_kegg, width = 10, height = 8)

Output Files

For each comparison, produce:

  • One combined GO BP plot: [num]_GO_BP_combined.png (up + down on same figure)
  • One combined KEGG plot: [num]_KEGG_combined.png (up + down on same figure)
  • CSV tables: Separate files for UP and DOWN directions
    • GO_BP_[comparison]_UP.csv
    • GO_BP_[comparison]_DOWN.csv
    • KEGG_[comparison]_UP.csv
    • KEGG_[comparison]_DOWN.csv

GSEA Analysis

For ranked gene set enrichment:

library(fgsea)
library(msigdbr)

# Create ranked gene list (by stat or signed -log10(p))
gene_ranks <- res_annotated %>%
  filter(!is.na(gene_symbol)) %>%
  arrange(desc(stat)) %>%
  distinct(gene_symbol, .keep_all = TRUE)

ranks <- setNames(gene_ranks$stat, gene_ranks$gene_symbol)

# Get gene sets
hallmark <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(hallmark$gene_symbol, hallmark$gs_name)

# Run GSEA
fgsea_res <- fgsea(
  pathways = hallmark_list,
  stats = ranks,
  minSize = 15,
  maxSize = 500,
  nperm = 10000
)

Contrast Selection Guidelines

Only extract biologically meaningful contrasts:

Good contrasts:

  • Treatment vs Control
  • Disease vs Healthy
  • Lesional vs Non-Lesional (paired)
  • Timepoint comparisons in time series

Avoid:

  • Redundant comparisons
  • Comparisons without biological interpretation

Common Issues and Solutions

Issue Solution
Few DEGs with FDR < 0.05 Check sample variability; consider relaxing threshold for exploratory analysis
Batch effects visible in PCA Add batch to design formula; consider ComBat-seq for severe effects
Low mapping rate annotation Check gene ID format; remove version numbers from Ensembl IDs
Cook's distance warnings Investigate flagged samples for potential outliers
Dispersion outliers May indicate problematic genes or samples

Reference Files

See these files in the project for complete examples:

  • analyses/Brunner_bulk/ - Pseudobulk RNA-seq with DESeq2 (paired design)
  • analyses/Shiu_bulk/ - Pseudobulk from scRNA-seq
  • scripts/utils/filter_counts.R - CPM-based filtering utility function
Weekly Installs
1
GitHub Stars
1
First Seen
Apr 3, 2026
Installed on
mcpjam1
claude-code1
kilo1
junie1
windsurf1
zencoder1