skills/gexijin/vitiligo/microarray-analysis

microarray-analysis

Installation
SKILL.md

DNA Microarray Analysis Skill

This skill guides analysis of DNA microarray gene expression data following established patterns from this project.

Quick Start

When asked to analyze microarray data:

  1. Check if a CLAUDE.md file exists in the study directory - it contains study-specific metadata
  2. Determine the platform (Affymetrix vs Illumina) to select appropriate methods
  3. Create numbered scripts in scripts/ directory (01_, 02_, etc.)
  4. Update analysis.md after each analysis step
  5. Generate all plots in results/plots/ with plots/plots.md documentation

Directory Structure

Create this structure for each analysis:

analyses/[FirstAuthor]_[GSE#]/
├── CLAUDE.md           # Study metadata (samples, platform, design)
├── analysis.md         # Progressive documentation (update after each step)
├── scripts/
│   ├── 01_load_and_qc.R
│   ├── 02_normalize.R (or 02_normalize_filtered.R)
│   ├── 03_differential_expression.R
│   ├── 04_annotate_genes.R
│   └── 05_go_enrichment.R
├── data/               # Raw and processed data (gitignored)
└── results/
    ├── tables/         # CSV output files
    └── plots/          # PNG/PDF figures
        └── plots.md    # Plot documentation

Script Portability

Use the here package for all file paths to ensure scripts work regardless of working directory:

library(here)

# Set paths relative to project root
base_dir <- here("analyses", "Author_GSE#####")
data_dir <- file.path(base_dir, "data")
results_dir <- file.path(base_dir, "results")
plots_dir <- file.path(results_dir, "plots")
tables_dir <- file.path(results_dir, "tables")

# Source utility functions from project root
source(here("scripts", "utils", "collapse_to_gene.R"))
source(here("scripts", "utils", "enrichment_dotplot.R"))

Never use hardcoded absolute paths like /Users/name/path/to/file. Always use here() for portability.

Probe Tracking

IMPORTANT: Track probe counts at each processing step and report a summary table at the end of QC and normalization scripts. This provides transparency about data filtering.

Initialize Tracking

# Initialize probe tracking at the start of script
probe_tracking <- data.frame(
  step = character(),
  n_probes = integer(),
  n_removed = integer(),
  description = character(),
  stringsAsFactors = FALSE
)

Track at Each Step

Add tracking after each filtering/processing step:

# After loading raw data
probe_tracking <- rbind(probe_tracking, data.frame(
  step = "1. Raw data loaded",
  n_probes = nrow(raw_data),
  n_removed = 0,
  description = "Total probes in raw data file"
))

# After creating expression matrix
probe_tracking <- rbind(probe_tracking, data.frame(
  step = "2. Expression matrix created",
  n_probes = nrow(expr_mat),
  n_removed = nrow(raw_data) - nrow(expr_mat),
  description = "Probes with valid signal columns"
))

# After detection filtering (Illumina)
probe_tracking <- rbind(probe_tracking, data.frame(
  step = "3. Detection filter",
  n_probes = nrow(expr_filtered),
  n_removed = nrow(expr_mat) - nrow(expr_filtered),
  description = sprintf("Probes detected (p < 0.05) in >= %d samples", min_samples_detected)
))

# After low-expression filtering
probe_tracking <- rbind(probe_tracking, data.frame(
  step = "4. Expression filter",
  n_probes = nrow(expr_filtered),
  n_removed = n_before - nrow(expr_filtered),
  description = "Probes with expression above threshold"
))

Report Summary

At the end of QC script:

# Calculate percentage remaining
probe_tracking$pct_remaining <- round(100 * probe_tracking$n_probes / probe_tracking$n_probes[1], 1)

cat("\n=== Probe Tracking Summary ===\n\n")
print(probe_tracking)

# Save to file
write.csv(probe_tracking, file.path(tables_dir, "probe_tracking.csv"), row.names = FALSE)

Expected Output

Step Probes Removed % Remaining Description
1. Raw data loaded 29,377 0 100.0% Total probes in raw data file
2. Expression matrix 29,377 0 100.0% Probes with valid signal columns
3. Detection filter 20,233 9,144 68.9% Probes detected in >= 3 samples

NA Value Checking

IMPORTANT: Check for NA values in both signal and detection p-value matrices early in the QC script. This identifies data quality issues before they cause downstream problems.

# Check for NAs in signal data
na_signal <- is.na(expr_mat)
n_na_signal_total <- sum(na_signal)
n_na_signal_by_sample <- colSums(na_signal)
n_na_signal_by_probe <- rowSums(na_signal)

cat(sprintf("Signal data - Total NA values: %d (%.2f%%)\n",
            n_na_signal_total,
            100 * n_na_signal_total / length(expr_mat)))

if (n_na_signal_total > 0) {
  cat("NA counts by sample (signal):\n")
  for (i in seq_along(sample_names)) {
    if (n_na_signal_by_sample[i] > 0) {
      cat(sprintf("  %s: %d NAs (%.2f%%)\n",
                  sample_names[i],
                  n_na_signal_by_sample[i],
                  100 * n_na_signal_by_sample[i] / nrow(expr_mat)))
    }
  }
  cat(sprintf("Probes with at least one NA: %d (%.2f%%)\n",
              sum(n_na_signal_by_probe > 0),
              100 * sum(n_na_signal_by_probe > 0) / nrow(expr_mat)))
}

# Check for NAs in detection p-values (Illumina)
na_detect <- is.na(detect_mat)
n_na_detect_total <- sum(na_detect)
# ... similar reporting

# Create NA summary table
na_summary <- data.frame(
  sample = sample_names,
  na_signal_count = n_na_signal_by_sample,
  na_signal_pct = 100 * n_na_signal_by_sample / nrow(expr_mat),
  na_detect_count = n_na_detect_by_sample,
  na_detect_pct = 100 * n_na_detect_by_sample / nrow(detect_mat)
)
write.csv(na_summary, file.path(tables_dir, "01_na_value_summary.csv"), row.names = FALSE)

NA Visualization

If NA values are present, generate diagnostic plots:

if (n_na_signal_total > 0 || n_na_detect_total > 0) {
  # Bar plot comparing signal vs detection NAs by sample
  na_summary_long <- na_summary %>%
    pivot_longer(cols = c(na_signal_count, na_detect_count),
                 names_to = "data_type", values_to = "na_count")

  p_na <- ggplot(na_summary_long, aes(x = sample, y = na_count, fill = data_type)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

  ggsave(file.path(plots_dir, "01_na_comparison.png"), p_na, width = 12, height = 6)

  # Optional: Heatmap of NA patterns if not too many probes affected
  if (sum(n_na_signal_by_probe > 0) < 1000) {
    # Create binary NA heatmap with pheatmap
  }
}

Handling NAs in Downstream Analysis

  • Use na.rm = TRUE for summary statistics: colMeans(expr_mat, na.rm = TRUE)
  • Clean data before correlation/PCA: expr_clean <- expr_mat[complete.cases(expr_mat), ]
  • Document the number of probes removed due to NA values in the probe tracking table

Platform Detection and Data Loading

Affymetrix Arrays (CEL files)

Use affy and affyPLM packages:

library(affy)
library(affyPLM)

# Load CEL files
raw_data <- ReadAffy(filenames = cel_files)

# Basic QC
pm_data <- pm(raw_data)
mm_data <- mm(raw_data)
avg_pm <- colMeans(pm_data)

# MAS5 present/absent calls
calls <- mas5calls(raw_data)
percent_present <- colSums(exprs(calls) == "P") / nrow(exprs(calls)) * 100

# PLM QC (NUSE and RLE)
plm_fit <- fitPLM(raw_data)
nuse_vals <- NUSE(plm_fit, type = "values")
rle_vals <- RLE(plm_fit, type = "values")

# RMA normalization
eset <- rma(raw_data)

Annotation packages by platform:

  • GPL570 (HG-U133_Plus_2): hgu133plus2.db
  • GPL96 (HG-U133A): hgu133a.db
  • GPL97 (HG-U133B): hgu133b.db

Illumina BeadChip Arrays

Use limma with detection p-values:

library(limma)

# Data typically has AVG_Signal and Detection Pval columns
expr_mat <- as.matrix(data[, signal_cols])
detect_mat <- as.matrix(data[, pval_cols])

# Filter by detection (p < 0.05 in >= 3 samples)
probes_detected <- rowSums(detect_mat < 0.05) >= 3
expr_filtered <- expr_mat[probes_detected, ]

# Log2 transform and quantile normalize
expr_log2 <- log2(expr_filtered + 1)
expr_norm <- normalizeBetweenArrays(expr_log2, method = "quantile")

Annotation packages by platform:

  • GPL14951 (HumanHT-12 v4): illuminaHumanv4.db
  • GPL10558 (HumanHT-12 v4.0): illuminaHumanv4.db
  • GPL6884 (HumanHT-12 v3): illuminaHumanv3.db

Quality Control Plots

Generate these QC plots (numbered sequentially):

  1. Raw intensity boxplot - Distribution per sample
  2. Raw intensity density - Overlapping density curves
  3. Percent present/detected - Bar plot per sample
  4. Average intensity - Bar plot per sample
  5. Sample correlation heatmap - Hierarchical clustering
  6. PCA plot - PC1 vs PC2 colored by condition
  7. QC metrics by condition - Boxplots comparing groups

Post-normalization visualization:

  • Overall expression heatmap (e.g., 19b) - Top 2000 most variable genes
    • Median-centered, ±3 SD clipping for color utilization
    • 1 - Pearson correlation distance, average linkage for gene clustering
    • Samples ordered by condition (not clustered)
    • Green-black-red color scheme with condition color bars
  • Sample dendrogram (e.g., 19c) - Hierarchical clustering of all samples
    • Uses all genes with 1 - Pearson correlation distance
    • Average linkage, horizontal layout
    • Branches and labels colored by sample condition

Affymetrix-specific:

  • RNA degradation plot - 5' to 3' bias
  • NUSE boxplot - Outliers if median > 1.05
  • RLE boxplot - Outliers if |median| > 0.1

Outlier Detection Criteria

Metric Threshold Action
NUSE median > 1.05 Flag as outlier
NUSE IQR > 0.15 Flag as outlier
RLE median value
RLE IQR > 0.3 Flag as outlier
Intensity z-score z
PCA extreme Visual outlier Investigate

After excluding outliers: Re-run normalization (especially RMA which uses quantile normalization).

Statistical Modeling with limma

Paired Design (same subjects, multiple conditions)

For matched samples (e.g., lesional vs non-lesional from same patient):

# Create factors
tissue <- factor(sample_info$condition, levels = c("NonLesional", "Lesional"))
patient <- factor(sample_info$patient_id)

# Design with patient as fixed effect (n >= 10 patients)
design <- model.matrix(~ patient + tissue)

# Fit model
fit <- lmFit(expr_filtered, design)

# Contrasts (tissue effects are relative to baseline)
contrast_matrix <- makeContrasts(
  Lesional_vs_NonLesional = tissueLesional,
  levels = design
)

fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)

Unpaired Design (independent groups)

For independent samples:

# Create factor
group <- factor(sample_info$condition, levels = c("Control", "Treatment"))

# Cell means model (no intercept)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Fit model
fit <- lmFit(expr_filtered, design)

# Define contrast
contrast_matrix <- makeContrasts(
  Treatment_vs_Control = Treatment - Control,
  levels = design
)

fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)

Interaction Terms (genotype x treatment)

For factorial designs:

# Create combined factor
group <- factor(paste(sample_info$genotype, sample_info$treatment, sep = "_"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Define biologically meaningful contrasts
contrast_matrix <- makeContrasts(
  Treatment_in_WT = WT_Treated - WT_Control,
  Treatment_in_Mut = Mut_Treated - Mut_Control,
  Interaction = (Mut_Treated - Mut_Control) - (WT_Treated - WT_Control),
  levels = design
)

Model Diagnostics

Always generate these diagnostic plots:

  1. P-value histogram - Should show uniform + spike near 0
  2. MA plot - Check for intensity-dependent bias
  3. Mean-variance trend (plotSA) - Validates eBayes assumptions
# P-value histogram
png(file.path(plots_dir, "03_pvalue_histogram.png"), width = 800, height = 600)
hist(results$P.Value, breaks = 50,
     main = "P-value Distribution", xlab = "P-value",
     col = "steelblue", border = "white")
abline(h = nrow(results) / 50, col = "red", lty = 2)
legend("topright", "Expected under null", col = "red", lty = 2)
dev.off()

# MA plot
png(file.path(plots_dir, "03_MA_plot.png"), width = 800, height = 600)
plotMA(fit2, coef = "contrast_name", main = "MA Plot")
dev.off()

# Mean-variance trend (validates eBayes assumptions)
png(file.path(plots_dir, "03_mean_variance_trend.png"), width = 800, height = 600)
plotSA(fit2, main = "Mean-Variance Trend (eBayes)")
dev.off()

Volcano Plot Best Practices

Important: The y-axis should show FDR (adjusted p-value), not raw p-value:

p_volcano <- ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  labs(title = "Volcano Plot",
       x = "log2 Fold Change",
       y = "-log10(FDR)")  # Use FDR, not P-value

Gene ID Mapping

Using the collapse_to_gene Utility Function

Use the shared utility function scripts/utils/collapse_to_gene.R for probe-to-gene mapping. This function:

  1. Queries the annotation database for the specified gene ID type
  2. Removes probes that map to multiple gene IDs (ambiguous mapping)
  3. Keeps the most significant probe per gene (by P.Value, |logFC| tiebreaker)
source(here("scripts", "utils", "collapse_to_gene.R"))

# Basic usage - collapse to gene symbols
de_annotated <- collapse_to_gene(
  de_results,
  annot_pkg = "illuminaHumanv4.db",  # Platform-specific annotation package
  id_column = "SYMBOL",
  probe_column = "ProbeID",
  keytype = "PROBEID",
  collapse = TRUE,    # Set FALSE to keep all probes with annotations
  verbose = TRUE
) %>%
  rename(gene_symbol = SYMBOL) %>%
  relocate(ProbeID, gene_symbol)

# Get probe-level annotations without collapsing
de_probe_level <- collapse_to_gene(
  de_results,
  annot_pkg = "illuminaHumanv4.db",
  id_column = "SYMBOL",
  probe_column = "ProbeID",
  collapse = FALSE  # Keep all probes
)

Annotation Packages by Platform

Platform GPL ID Annotation Package
Illumina HumanHT-12 v4 GPL14951, GPL10558 illuminaHumanv4.db
Illumina HumanHT-12 v3 GPL6884 illuminaHumanv3.db
Affymetrix HG-U133 Plus 2 GPL570 hgu133plus2.db
Affymetrix HG-U133A GPL96 hgu133a.db

Manual Probe-to-Gene Mapping (Alternative)

If not using the utility function:

library(AnnotationDbi)

# Get mapping (platform-specific)
mapping <- AnnotationDbi::select(
  platform.db,
  keys = probe_ids,
  columns = c("SYMBOL", "ENTREZID", "ENSEMBL", "GENENAME"),
  keytype = "PROBEID"
)

# Join with DE results
de_annotated <- de_results |>
  left_join(mapping, by = c("ProbeID" = "PROBEID"))

# Collapse to unique gene symbols (keep most significant probe)
de_clean_symbol <- de_annotated %>%
  filter(!is.na(gene_symbol)) %>%
  arrange(P.Value, -abs(logFC)) %>%
  filter(!duplicated(gene_symbol)) %>%
  arrange(adj.P.Val)

Output Format

Differential Expression Results

Required columns in output CSV:

Column Description
ensembl_id Ensembl gene ID (primary identifier)
gene_symbol HGNC gene symbol
logFC Log2 fold change
P.Value Raw p-value
adj.P.Val FDR-adjusted p-value
ProbeID Original probe/probeset ID
AveExpr Average expression
t Moderated t-statistic
B Log-odds of differential expression

File Naming Convention

Use numbered prefixes (01_, 02_, etc.) to indicate the script that generated each file:

results/tables/
├── 01_sample_metadata.csv                 # From 01_load_and_qc.R
├── 01_qc_metrics.csv
├── 01_expression_matrix_raw.csv
├── 01_probe_tracking.csv
├── 01_na_value_summary.csv                # NA diagnostics
├── 02_expression_matrix_normalized.csv    # From 02_normalize.R
├── 02_normalization_summary.csv
├── 03_DE_[Comparison].csv                 # From 03_differential_expression.R
├── 04_DE_[Comparison]_annotated.csv       # From 04_annotate_genes.R
├── 04_DE_[Comparison]_clean_symbol.csv    # Unique gene symbols
├── 04_genes_upregulated.csv
├── 04_genes_downregulated.csv
├── 05_GO_BP_upregulated.csv               # From 05_go_enrichment.R
├── 05_GO_BP_downregulated.csv
├── 05_KEGG_upregulated.csv
├── 05_KEGG_downregulated.csv
└── 05_GSEA_GO_BP.csv
results/plots/
├── 01_raw_intensity_boxplot.png           # From 01_load_and_qc.R
├── 01_raw_intensity_density.png
├── 01_PCA_raw.png
├── 01_sample_correlation_heatmap.png
├── 01_na_comparison.png                   # If NAs present
├── 02_normalized_boxplot.png              # From 02_normalize.R
├── 02_PCA_normalized.png
├── 03_pvalue_histogram.png                # From 03_differential_expression.R
├── 03_MA_plot.png
├── 03_mean_variance_trend.png
├── 03_volcano_plot.png
├── 03_heatmap_top_DE.png
├── 05_GO_BP_combined.png                  # From 05_go_enrichment.R
├── 05_KEGG_combined.png
├── 05_GSEA_GO_BP_combined.png
├── 05_GSEA_GO_BP_dotplot.png
└── 05_GSEA_GO_BP_ridgeplot.png

Note: The _clean_symbol.csv file contains one row per gene (collapsed from multiple probesets), selected by lowest P.Value. Use this for downstream analysis.

Progressive Documentation (analysis.md)

Update analysis.md after EACH analysis step with:

# [GSE#] Microarray Analysis Report

## Study Information

| Field | Value |
|-------|-------|
| **GEO Accession** | GSE##### |
| **Platform** | GPL### [Platform Name] |
| **Analysis Date** | YYYY-MM-DD |

## Methods

### Software Environment

| Package | Version | Purpose |
|---------|---------|---------|
| limma | X.X.X | Differential expression |
| affy | X.X.X | CEL file processing |

### Key Parameters
- [List all thresholds and parameters used]

## Results

### Quality Control
- [Summarize QC findings]
- [Document any outliers removed]

### Differential Expression
| Comparison | Total DEGs | Up | Down |
|------------|------------|-----|------|
| ... | ... | ... | ... |

### Key Findings
- [Biological interpretation]

## Output Files
[List all generated files]

Plot Documentation (plots/plots.md)

Create results/plots/plots.md with embedded image references so plots render in markdown viewers:

# Plot Documentation - [GSE#]

## Quality Control

### 01_raw_intensity_boxplot.png
Raw probe intensity distribution by sample before normalization.

![Raw intensity boxplot](./01_raw_intensity_boxplot.png)

### 02_raw_intensity_density.png
Density curves of raw intensities colored by condition.

![Raw intensity density](./02_raw_intensity_density.png)

[Continue for all plots with image references...]

## Differential Expression

### 15_volcano_plot.png
Volcano plot showing -log10(p-value) vs log2 fold change.
- Red: Upregulated (FDR < 0.05, |logFC| > 1)
- Blue: Downregulated (FDR < 0.05, |logFC| > 1)

![Volcano plot](./15_volcano_plot.png)

Important: Use ./ prefix for relative paths (e.g., ![description](./filename.png)) to ensure images render correctly in VS Code and other markdown viewers.

GO and KEGG Enrichment Analysis

Use clusterProfiler for pathway analysis. Run both GO Biological Process and KEGG pathway enrichment for both upregulated AND downregulated genes in each comparison.

Background Gene List

Important: Use all genes that passed the filtering step as background, loaded from the annotation file:

library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(dplyr)

# Load background from annotation file (all filtered genes)
annotation_df <- read.csv(file.path(tables_dir, "probeset_annotation.csv"))
background_genes <- annotation_df %>%
  filter(!is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

GO Biological Process Enrichment (Both Directions)

# Get Entrez IDs for UPREGULATED genes
sig_up <- de_results %>%
  filter(adj.P.Val < 0.05 & logFC > 0 & !is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

# Get Entrez IDs for DOWNREGULATED genes
sig_down <- de_results %>%
  filter(adj.P.Val < 0.05 & logFC < 0 & !is.na(entrez_id)) %>%
  pull(entrez_id) %>%
  unique() %>%
  as.character()

# GO enrichment - Upregulated
ego_up <- enrichGO(
  gene = sig_up,
  universe = background_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)
ego_up_simp <- simplify(ego_up, cutoff = 0.7, by = "p.adjust")

# GO enrichment - Downregulated
ego_down <- enrichGO(
  gene = sig_down,
  universe = background_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)
ego_down_simp <- simplify(ego_down, cutoff = 0.7, by = "p.adjust")

KEGG Pathway Enrichment (Both Directions)

# KEGG - Upregulated
kegg_up <- enrichKEGG(
  gene = sig_up,
  universe = background_genes,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

# KEGG - Downregulated
kegg_down <- enrichKEGG(
  gene = sig_down,
  universe = background_genes,
  organism = "hsa",
  keyType = "ncbi-geneid",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

Combined Plots with Fold Enrichment

Use the utility functions from scripts/utils/enrichment_dotplot.R for creating combined enrichment plots:

source(here("scripts/utils/enrichment_dotplot.R"))

# Helper function to prepare ORA data with fold enrichment
prepare_ora_data <- function(enrichResult, direction_label) {
  if (is.null(enrichResult) || nrow(as.data.frame(enrichResult)) == 0) return(NULL)

  df <- as.data.frame(enrichResult)
  df <- df %>%
    mutate(
      gene_ratio_num = as.numeric(sub("/.*", "", GeneRatio)),
      gene_ratio_denom = as.numeric(sub(".*/", "", GeneRatio)),
      bg_ratio_num = as.numeric(sub("/.*", "", BgRatio)),
      bg_ratio_denom = as.numeric(sub(".*/", "", BgRatio)),
      GeneRatioValue = gene_ratio_num / gene_ratio_denom,
      BgRatioValue = bg_ratio_num / bg_ratio_denom,
      FoldEnrichment = GeneRatioValue / BgRatioValue,
      direction = direction_label
    )
  return(df)
}

# Prepare data for plotting
go_up_ora <- prepare_ora_data(ego_up_simp, "UP")
go_down_ora <- prepare_ora_data(ego_down_simp, "DOWN")
go_combined_ora <- bind_rows(go_up_ora, go_down_ora)

# Create combined GO BP plot using utility function
p_go_combined <- plot_ora_dotplot_combined(
  go_combined_ora,
  n_show = 10,
  fdr_cutoff = 0.05,
  title = "GO Biological Process: Comparison",
  width = 10,
  height = 8
)
ggsave(file.path(plots_dir, "05_GO_BP_combined.png"), p_go_combined,
       width = attr(p_go_combined, "width"),
       height = attr(p_go_combined, "height"), dpi = 300)

# Create combined KEGG plot
kegg_up_ora <- prepare_ora_data(kegg_up, "UP")
kegg_down_ora <- prepare_ora_data(kegg_down, "DOWN")
kegg_combined_ora <- bind_rows(kegg_up_ora, kegg_down_ora)

p_kegg_combined <- plot_ora_dotplot_combined(
  kegg_combined_ora,
  n_show = 10,
  fdr_cutoff = 0.05,
  title = "KEGG Pathways: Comparison"
)
ggsave(file.path(plots_dir, "05_KEGG_combined.png"), p_kegg_combined,
       width = attr(p_kegg_combined, "width"),
       height = attr(p_kegg_combined, "height"), dpi = 300)

GSEA Combined Plots

For GSEA results, use plot_gsea_dotplot_combined():

# Run GSEA
gsea_go <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db, ...)
gsea_df <- as.data.frame(gsea_go)

# Create combined GSEA plot
p_gsea_combined <- plot_gsea_dotplot_combined(
  gsea_df,
  n_show = 10,
  fdr_cutoff = 0.05,
  title = "GSEA GO BP: Comparison",
  size_label = "Set Size"
)
ggsave(file.path(plots_dir, "05_GSEA_GO_BP_combined.png"), p_gsea_combined,
       width = attr(p_gsea_combined, "width"),
       height = attr(p_gsea_combined, "height"), dpi = 300)

Output Files

For each comparison, produce:

  • ORA combined plots:
    • 05_GO_BP_combined.png - Up + down on same figure
    • 05_KEGG_combined.png - Up + down on same figure
  • GSEA plots:
    • 05_GSEA_GO_BP_combined.png - Activated + suppressed
    • 05_GSEA_GO_BP_dotplot.png - Standard clusterProfiler dotplot
    • 05_GSEA_GO_BP_ridgeplot.png - Distribution ridge plot
    • 05_GSEA_GO_BP_running.png - Running enrichment for top terms
  • CSV tables:
    • 05_GO_BP_upregulated.csv / 05_GO_BP_downregulated.csv
    • 05_KEGG_upregulated.csv / 05_KEGG_downregulated.csv
    • 05_GSEA_GO_BP.csv

Contrast Selection Guidelines

Only extract biologically meaningful contrasts:

Good contrasts:

  • Treatment vs Control
  • Disease vs Healthy
  • Lesional vs Non-Lesional (paired)
  • Time point comparisons in time series

Avoid:

  • Redundant comparisons (if A-B and B-C, you may not need A-C)
  • Comparisons without biological interpretation
  • All pairwise when only specific comparisons matter

Common Issues and Solutions

Issue Solution
Few DEGs with FDR < 0.05 Use nominal P < 0.05 for exploratory GO analysis
Batch effects visible in PCA Consider ComBat or limma's removeBatchEffect
Low annotation rate Try alternative ID mapping sources (biomaRt)
High proportion of outlier samples Review sample preparation notes; may need to exclude

Utility Functions

The project includes reusable utility functions in scripts/utils/:

Function File Description
collapse_to_gene() collapse_to_gene.R Maps probes to genes, removes ambiguous mappings, collapses to unique genes
plot_ora_dotplot_combined() enrichment_dotplot.R Combined ORA dotplot for up/downregulated genes
plot_gsea_dotplot_combined() enrichment_dotplot.R Combined GSEA dotplot for activated/suppressed terms

Always source these utilities using here():

source(here("scripts", "utils", "collapse_to_gene.R"))
source(here("scripts", "utils", "enrichment_dotplot.R"))

Reference Files

See these files in the project for complete examples:

  • analyses/Rashighi_GSE53146/ - Illumina unpaired design (most current patterns)
  • analyses/Regazzetti_GSE65127/ - Affymetrix paired design
  • analyses/Natarajan_GSE75819/ - Illumina paired design
Weekly Installs
1
GitHub Stars
1
First Seen
Apr 3, 2026
Installed on
mcpjam1
claude-code1
kilo1
junie1
windsurf1
zencoder1