microarray-analysis
DNA Microarray Analysis Skill
This skill guides analysis of DNA microarray gene expression data following established patterns from this project.
Quick Start
When asked to analyze microarray data:
- Check if a
CLAUDE.mdfile exists in the study directory - it contains study-specific metadata - Determine the platform (Affymetrix vs Illumina) to select appropriate methods
- Create numbered scripts in
scripts/directory (01_, 02_, etc.) - Update
analysis.mdafter each analysis step - Generate all plots in
results/plots/withplots/plots.mddocumentation
Directory Structure
Create this structure for each analysis:
analyses/[FirstAuthor]_[GSE#]/
├── CLAUDE.md # Study metadata (samples, platform, design)
├── analysis.md # Progressive documentation (update after each step)
├── scripts/
│ ├── 01_load_and_qc.R
│ ├── 02_normalize.R (or 02_normalize_filtered.R)
│ ├── 03_differential_expression.R
│ ├── 04_annotate_genes.R
│ └── 05_go_enrichment.R
├── data/ # Raw and processed data (gitignored)
└── results/
├── tables/ # CSV output files
└── plots/ # PNG/PDF figures
└── plots.md # Plot documentation
Script Portability
Use the here package for all file paths to ensure scripts work regardless of working directory:
library(here)
# Set paths relative to project root
base_dir <- here("analyses", "Author_GSE#####")
data_dir <- file.path(base_dir, "data")
results_dir <- file.path(base_dir, "results")
plots_dir <- file.path(results_dir, "plots")
tables_dir <- file.path(results_dir, "tables")
# Source utility functions from project root
source(here("scripts", "utils", "collapse_to_gene.R"))
source(here("scripts", "utils", "enrichment_dotplot.R"))
Never use hardcoded absolute paths like /Users/name/path/to/file. Always use here() for portability.
Probe Tracking
IMPORTANT: Track probe counts at each processing step and report a summary table at the end of QC and normalization scripts. This provides transparency about data filtering.
Initialize Tracking
# Initialize probe tracking at the start of script
probe_tracking <- data.frame(
step = character(),
n_probes = integer(),
n_removed = integer(),
description = character(),
stringsAsFactors = FALSE
)
Track at Each Step
Add tracking after each filtering/processing step:
# After loading raw data
probe_tracking <- rbind(probe_tracking, data.frame(
step = "1. Raw data loaded",
n_probes = nrow(raw_data),
n_removed = 0,
description = "Total probes in raw data file"
))
# After creating expression matrix
probe_tracking <- rbind(probe_tracking, data.frame(
step = "2. Expression matrix created",
n_probes = nrow(expr_mat),
n_removed = nrow(raw_data) - nrow(expr_mat),
description = "Probes with valid signal columns"
))
# After detection filtering (Illumina)
probe_tracking <- rbind(probe_tracking, data.frame(
step = "3. Detection filter",
n_probes = nrow(expr_filtered),
n_removed = nrow(expr_mat) - nrow(expr_filtered),
description = sprintf("Probes detected (p < 0.05) in >= %d samples", min_samples_detected)
))
# After low-expression filtering
probe_tracking <- rbind(probe_tracking, data.frame(
step = "4. Expression filter",
n_probes = nrow(expr_filtered),
n_removed = n_before - nrow(expr_filtered),
description = "Probes with expression above threshold"
))
Report Summary
At the end of QC script:
# Calculate percentage remaining
probe_tracking$pct_remaining <- round(100 * probe_tracking$n_probes / probe_tracking$n_probes[1], 1)
cat("\n=== Probe Tracking Summary ===\n\n")
print(probe_tracking)
# Save to file
write.csv(probe_tracking, file.path(tables_dir, "probe_tracking.csv"), row.names = FALSE)
Expected Output
| Step | Probes | Removed | % Remaining | Description |
|---|---|---|---|---|
| 1. Raw data loaded | 29,377 | 0 | 100.0% | Total probes in raw data file |
| 2. Expression matrix | 29,377 | 0 | 100.0% | Probes with valid signal columns |
| 3. Detection filter | 20,233 | 9,144 | 68.9% | Probes detected in >= 3 samples |
NA Value Checking
IMPORTANT: Check for NA values in both signal and detection p-value matrices early in the QC script. This identifies data quality issues before they cause downstream problems.
# Check for NAs in signal data
na_signal <- is.na(expr_mat)
n_na_signal_total <- sum(na_signal)
n_na_signal_by_sample <- colSums(na_signal)
n_na_signal_by_probe <- rowSums(na_signal)
cat(sprintf("Signal data - Total NA values: %d (%.2f%%)\n",
n_na_signal_total,
100 * n_na_signal_total / length(expr_mat)))
if (n_na_signal_total > 0) {
cat("NA counts by sample (signal):\n")
for (i in seq_along(sample_names)) {
if (n_na_signal_by_sample[i] > 0) {
cat(sprintf(" %s: %d NAs (%.2f%%)\n",
sample_names[i],
n_na_signal_by_sample[i],
100 * n_na_signal_by_sample[i] / nrow(expr_mat)))
}
}
cat(sprintf("Probes with at least one NA: %d (%.2f%%)\n",
sum(n_na_signal_by_probe > 0),
100 * sum(n_na_signal_by_probe > 0) / nrow(expr_mat)))
}
# Check for NAs in detection p-values (Illumina)
na_detect <- is.na(detect_mat)
n_na_detect_total <- sum(na_detect)
# ... similar reporting
# Create NA summary table
na_summary <- data.frame(
sample = sample_names,
na_signal_count = n_na_signal_by_sample,
na_signal_pct = 100 * n_na_signal_by_sample / nrow(expr_mat),
na_detect_count = n_na_detect_by_sample,
na_detect_pct = 100 * n_na_detect_by_sample / nrow(detect_mat)
)
write.csv(na_summary, file.path(tables_dir, "01_na_value_summary.csv"), row.names = FALSE)
NA Visualization
If NA values are present, generate diagnostic plots:
if (n_na_signal_total > 0 || n_na_detect_total > 0) {
# Bar plot comparing signal vs detection NAs by sample
na_summary_long <- na_summary %>%
pivot_longer(cols = c(na_signal_count, na_detect_count),
names_to = "data_type", values_to = "na_count")
p_na <- ggplot(na_summary_long, aes(x = sample, y = na_count, fill = data_type)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggsave(file.path(plots_dir, "01_na_comparison.png"), p_na, width = 12, height = 6)
# Optional: Heatmap of NA patterns if not too many probes affected
if (sum(n_na_signal_by_probe > 0) < 1000) {
# Create binary NA heatmap with pheatmap
}
}
Handling NAs in Downstream Analysis
- Use
na.rm = TRUEfor summary statistics:colMeans(expr_mat, na.rm = TRUE) - Clean data before correlation/PCA:
expr_clean <- expr_mat[complete.cases(expr_mat), ] - Document the number of probes removed due to NA values in the probe tracking table
Platform Detection and Data Loading
Affymetrix Arrays (CEL files)
Use affy and affyPLM packages:
library(affy)
library(affyPLM)
# Load CEL files
raw_data <- ReadAffy(filenames = cel_files)
# Basic QC
pm_data <- pm(raw_data)
mm_data <- mm(raw_data)
avg_pm <- colMeans(pm_data)
# MAS5 present/absent calls
calls <- mas5calls(raw_data)
percent_present <- colSums(exprs(calls) == "P") / nrow(exprs(calls)) * 100
# PLM QC (NUSE and RLE)
plm_fit <- fitPLM(raw_data)
nuse_vals <- NUSE(plm_fit, type = "values")
rle_vals <- RLE(plm_fit, type = "values")
# RMA normalization
eset <- rma(raw_data)
Annotation packages by platform:
- GPL570 (HG-U133_Plus_2):
hgu133plus2.db - GPL96 (HG-U133A):
hgu133a.db - GPL97 (HG-U133B):
hgu133b.db
Illumina BeadChip Arrays
Use limma with detection p-values:
library(limma)
# Data typically has AVG_Signal and Detection Pval columns
expr_mat <- as.matrix(data[, signal_cols])
detect_mat <- as.matrix(data[, pval_cols])
# Filter by detection (p < 0.05 in >= 3 samples)
probes_detected <- rowSums(detect_mat < 0.05) >= 3
expr_filtered <- expr_mat[probes_detected, ]
# Log2 transform and quantile normalize
expr_log2 <- log2(expr_filtered + 1)
expr_norm <- normalizeBetweenArrays(expr_log2, method = "quantile")
Annotation packages by platform:
- GPL14951 (HumanHT-12 v4):
illuminaHumanv4.db - GPL10558 (HumanHT-12 v4.0):
illuminaHumanv4.db - GPL6884 (HumanHT-12 v3):
illuminaHumanv3.db
Quality Control Plots
Generate these QC plots (numbered sequentially):
- Raw intensity boxplot - Distribution per sample
- Raw intensity density - Overlapping density curves
- Percent present/detected - Bar plot per sample
- Average intensity - Bar plot per sample
- Sample correlation heatmap - Hierarchical clustering
- PCA plot - PC1 vs PC2 colored by condition
- QC metrics by condition - Boxplots comparing groups
Post-normalization visualization:
- Overall expression heatmap (e.g., 19b) - Top 2000 most variable genes
- Median-centered, ±3 SD clipping for color utilization
- 1 - Pearson correlation distance, average linkage for gene clustering
- Samples ordered by condition (not clustered)
- Green-black-red color scheme with condition color bars
- Sample dendrogram (e.g., 19c) - Hierarchical clustering of all samples
- Uses all genes with 1 - Pearson correlation distance
- Average linkage, horizontal layout
- Branches and labels colored by sample condition
Affymetrix-specific:
- RNA degradation plot - 5' to 3' bias
- NUSE boxplot - Outliers if median > 1.05
- RLE boxplot - Outliers if |median| > 0.1
Outlier Detection Criteria
| Metric | Threshold | Action |
|---|---|---|
| NUSE median | > 1.05 | Flag as outlier |
| NUSE IQR | > 0.15 | Flag as outlier |
| RLE median | value | |
| RLE IQR | > 0.3 | Flag as outlier |
| Intensity z-score | z | |
| PCA extreme | Visual outlier | Investigate |
After excluding outliers: Re-run normalization (especially RMA which uses quantile normalization).
Statistical Modeling with limma
Paired Design (same subjects, multiple conditions)
For matched samples (e.g., lesional vs non-lesional from same patient):
# Create factors
tissue <- factor(sample_info$condition, levels = c("NonLesional", "Lesional"))
patient <- factor(sample_info$patient_id)
# Design with patient as fixed effect (n >= 10 patients)
design <- model.matrix(~ patient + tissue)
# Fit model
fit <- lmFit(expr_filtered, design)
# Contrasts (tissue effects are relative to baseline)
contrast_matrix <- makeContrasts(
Lesional_vs_NonLesional = tissueLesional,
levels = design
)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)
Unpaired Design (independent groups)
For independent samples:
# Create factor
group <- factor(sample_info$condition, levels = c("Control", "Treatment"))
# Cell means model (no intercept)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
# Fit model
fit <- lmFit(expr_filtered, design)
# Define contrast
contrast_matrix <- makeContrasts(
Treatment_vs_Control = Treatment - Control,
levels = design
)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2)
Interaction Terms (genotype x treatment)
For factorial designs:
# Create combined factor
group <- factor(paste(sample_info$genotype, sample_info$treatment, sep = "_"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
# Define biologically meaningful contrasts
contrast_matrix <- makeContrasts(
Treatment_in_WT = WT_Treated - WT_Control,
Treatment_in_Mut = Mut_Treated - Mut_Control,
Interaction = (Mut_Treated - Mut_Control) - (WT_Treated - WT_Control),
levels = design
)
Model Diagnostics
Always generate these diagnostic plots:
- P-value histogram - Should show uniform + spike near 0
- MA plot - Check for intensity-dependent bias
- Mean-variance trend (plotSA) - Validates eBayes assumptions
# P-value histogram
png(file.path(plots_dir, "03_pvalue_histogram.png"), width = 800, height = 600)
hist(results$P.Value, breaks = 50,
main = "P-value Distribution", xlab = "P-value",
col = "steelblue", border = "white")
abline(h = nrow(results) / 50, col = "red", lty = 2)
legend("topright", "Expected under null", col = "red", lty = 2)
dev.off()
# MA plot
png(file.path(plots_dir, "03_MA_plot.png"), width = 800, height = 600)
plotMA(fit2, coef = "contrast_name", main = "MA Plot")
dev.off()
# Mean-variance trend (validates eBayes assumptions)
png(file.path(plots_dir, "03_mean_variance_trend.png"), width = 800, height = 600)
plotSA(fit2, main = "Mean-Variance Trend (eBayes)")
dev.off()
Volcano Plot Best Practices
Important: The y-axis should show FDR (adjusted p-value), not raw p-value:
p_volcano <- ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
geom_point(alpha = 0.5, size = 1.5) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
labs(title = "Volcano Plot",
x = "log2 Fold Change",
y = "-log10(FDR)") # Use FDR, not P-value
Gene ID Mapping
Using the collapse_to_gene Utility Function
Use the shared utility function scripts/utils/collapse_to_gene.R for probe-to-gene mapping. This function:
- Queries the annotation database for the specified gene ID type
- Removes probes that map to multiple gene IDs (ambiguous mapping)
- Keeps the most significant probe per gene (by P.Value, |logFC| tiebreaker)
source(here("scripts", "utils", "collapse_to_gene.R"))
# Basic usage - collapse to gene symbols
de_annotated <- collapse_to_gene(
de_results,
annot_pkg = "illuminaHumanv4.db", # Platform-specific annotation package
id_column = "SYMBOL",
probe_column = "ProbeID",
keytype = "PROBEID",
collapse = TRUE, # Set FALSE to keep all probes with annotations
verbose = TRUE
) %>%
rename(gene_symbol = SYMBOL) %>%
relocate(ProbeID, gene_symbol)
# Get probe-level annotations without collapsing
de_probe_level <- collapse_to_gene(
de_results,
annot_pkg = "illuminaHumanv4.db",
id_column = "SYMBOL",
probe_column = "ProbeID",
collapse = FALSE # Keep all probes
)
Annotation Packages by Platform
| Platform | GPL ID | Annotation Package |
|---|---|---|
| Illumina HumanHT-12 v4 | GPL14951, GPL10558 | illuminaHumanv4.db |
| Illumina HumanHT-12 v3 | GPL6884 | illuminaHumanv3.db |
| Affymetrix HG-U133 Plus 2 | GPL570 | hgu133plus2.db |
| Affymetrix HG-U133A | GPL96 | hgu133a.db |
Manual Probe-to-Gene Mapping (Alternative)
If not using the utility function:
library(AnnotationDbi)
# Get mapping (platform-specific)
mapping <- AnnotationDbi::select(
platform.db,
keys = probe_ids,
columns = c("SYMBOL", "ENTREZID", "ENSEMBL", "GENENAME"),
keytype = "PROBEID"
)
# Join with DE results
de_annotated <- de_results |>
left_join(mapping, by = c("ProbeID" = "PROBEID"))
# Collapse to unique gene symbols (keep most significant probe)
de_clean_symbol <- de_annotated %>%
filter(!is.na(gene_symbol)) %>%
arrange(P.Value, -abs(logFC)) %>%
filter(!duplicated(gene_symbol)) %>%
arrange(adj.P.Val)
Output Format
Differential Expression Results
Required columns in output CSV:
| Column | Description |
|---|---|
| ensembl_id | Ensembl gene ID (primary identifier) |
| gene_symbol | HGNC gene symbol |
| logFC | Log2 fold change |
| P.Value | Raw p-value |
| adj.P.Val | FDR-adjusted p-value |
| ProbeID | Original probe/probeset ID |
| AveExpr | Average expression |
| t | Moderated t-statistic |
| B | Log-odds of differential expression |
File Naming Convention
Use numbered prefixes (01_, 02_, etc.) to indicate the script that generated each file:
results/tables/
├── 01_sample_metadata.csv # From 01_load_and_qc.R
├── 01_qc_metrics.csv
├── 01_expression_matrix_raw.csv
├── 01_probe_tracking.csv
├── 01_na_value_summary.csv # NA diagnostics
├── 02_expression_matrix_normalized.csv # From 02_normalize.R
├── 02_normalization_summary.csv
├── 03_DE_[Comparison].csv # From 03_differential_expression.R
├── 04_DE_[Comparison]_annotated.csv # From 04_annotate_genes.R
├── 04_DE_[Comparison]_clean_symbol.csv # Unique gene symbols
├── 04_genes_upregulated.csv
├── 04_genes_downregulated.csv
├── 05_GO_BP_upregulated.csv # From 05_go_enrichment.R
├── 05_GO_BP_downregulated.csv
├── 05_KEGG_upregulated.csv
├── 05_KEGG_downregulated.csv
└── 05_GSEA_GO_BP.csv
results/plots/
├── 01_raw_intensity_boxplot.png # From 01_load_and_qc.R
├── 01_raw_intensity_density.png
├── 01_PCA_raw.png
├── 01_sample_correlation_heatmap.png
├── 01_na_comparison.png # If NAs present
├── 02_normalized_boxplot.png # From 02_normalize.R
├── 02_PCA_normalized.png
├── 03_pvalue_histogram.png # From 03_differential_expression.R
├── 03_MA_plot.png
├── 03_mean_variance_trend.png
├── 03_volcano_plot.png
├── 03_heatmap_top_DE.png
├── 05_GO_BP_combined.png # From 05_go_enrichment.R
├── 05_KEGG_combined.png
├── 05_GSEA_GO_BP_combined.png
├── 05_GSEA_GO_BP_dotplot.png
└── 05_GSEA_GO_BP_ridgeplot.png
Note: The _clean_symbol.csv file contains one row per gene (collapsed from multiple probesets), selected by lowest P.Value. Use this for downstream analysis.
Progressive Documentation (analysis.md)
Update analysis.md after EACH analysis step with:
# [GSE#] Microarray Analysis Report
## Study Information
| Field | Value |
|-------|-------|
| **GEO Accession** | GSE##### |
| **Platform** | GPL### [Platform Name] |
| **Analysis Date** | YYYY-MM-DD |
## Methods
### Software Environment
| Package | Version | Purpose |
|---------|---------|---------|
| limma | X.X.X | Differential expression |
| affy | X.X.X | CEL file processing |
### Key Parameters
- [List all thresholds and parameters used]
## Results
### Quality Control
- [Summarize QC findings]
- [Document any outliers removed]
### Differential Expression
| Comparison | Total DEGs | Up | Down |
|------------|------------|-----|------|
| ... | ... | ... | ... |
### Key Findings
- [Biological interpretation]
## Output Files
[List all generated files]
Plot Documentation (plots/plots.md)
Create results/plots/plots.md with embedded image references so plots render in markdown viewers:
# Plot Documentation - [GSE#]
## Quality Control
### 01_raw_intensity_boxplot.png
Raw probe intensity distribution by sample before normalization.

### 02_raw_intensity_density.png
Density curves of raw intensities colored by condition.

[Continue for all plots with image references...]
## Differential Expression
### 15_volcano_plot.png
Volcano plot showing -log10(p-value) vs log2 fold change.
- Red: Upregulated (FDR < 0.05, |logFC| > 1)
- Blue: Downregulated (FDR < 0.05, |logFC| > 1)

Important: Use ./ prefix for relative paths (e.g., ) to ensure images render correctly in VS Code and other markdown viewers.
GO and KEGG Enrichment Analysis
Use clusterProfiler for pathway analysis. Run both GO Biological Process and KEGG pathway enrichment for both upregulated AND downregulated genes in each comparison.
Background Gene List
Important: Use all genes that passed the filtering step as background, loaded from the annotation file:
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(dplyr)
# Load background from annotation file (all filtered genes)
annotation_df <- read.csv(file.path(tables_dir, "probeset_annotation.csv"))
background_genes <- annotation_df %>%
filter(!is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
GO Biological Process Enrichment (Both Directions)
# Get Entrez IDs for UPREGULATED genes
sig_up <- de_results %>%
filter(adj.P.Val < 0.05 & logFC > 0 & !is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
# Get Entrez IDs for DOWNREGULATED genes
sig_down <- de_results %>%
filter(adj.P.Val < 0.05 & logFC < 0 & !is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
# GO enrichment - Upregulated
ego_up <- enrichGO(
gene = sig_up,
universe = background_genes,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
readable = TRUE
)
ego_up_simp <- simplify(ego_up, cutoff = 0.7, by = "p.adjust")
# GO enrichment - Downregulated
ego_down <- enrichGO(
gene = sig_down,
universe = background_genes,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
readable = TRUE
)
ego_down_simp <- simplify(ego_down, cutoff = 0.7, by = "p.adjust")
KEGG Pathway Enrichment (Both Directions)
# KEGG - Upregulated
kegg_up <- enrichKEGG(
gene = sig_up,
universe = background_genes,
organism = "hsa",
keyType = "ncbi-geneid",
pAdjustMethod = "BH",
pvalueCutoff = 0.05
)
# KEGG - Downregulated
kegg_down <- enrichKEGG(
gene = sig_down,
universe = background_genes,
organism = "hsa",
keyType = "ncbi-geneid",
pAdjustMethod = "BH",
pvalueCutoff = 0.05
)
Combined Plots with Fold Enrichment
Use the utility functions from scripts/utils/enrichment_dotplot.R for creating combined enrichment plots:
source(here("scripts/utils/enrichment_dotplot.R"))
# Helper function to prepare ORA data with fold enrichment
prepare_ora_data <- function(enrichResult, direction_label) {
if (is.null(enrichResult) || nrow(as.data.frame(enrichResult)) == 0) return(NULL)
df <- as.data.frame(enrichResult)
df <- df %>%
mutate(
gene_ratio_num = as.numeric(sub("/.*", "", GeneRatio)),
gene_ratio_denom = as.numeric(sub(".*/", "", GeneRatio)),
bg_ratio_num = as.numeric(sub("/.*", "", BgRatio)),
bg_ratio_denom = as.numeric(sub(".*/", "", BgRatio)),
GeneRatioValue = gene_ratio_num / gene_ratio_denom,
BgRatioValue = bg_ratio_num / bg_ratio_denom,
FoldEnrichment = GeneRatioValue / BgRatioValue,
direction = direction_label
)
return(df)
}
# Prepare data for plotting
go_up_ora <- prepare_ora_data(ego_up_simp, "UP")
go_down_ora <- prepare_ora_data(ego_down_simp, "DOWN")
go_combined_ora <- bind_rows(go_up_ora, go_down_ora)
# Create combined GO BP plot using utility function
p_go_combined <- plot_ora_dotplot_combined(
go_combined_ora,
n_show = 10,
fdr_cutoff = 0.05,
title = "GO Biological Process: Comparison",
width = 10,
height = 8
)
ggsave(file.path(plots_dir, "05_GO_BP_combined.png"), p_go_combined,
width = attr(p_go_combined, "width"),
height = attr(p_go_combined, "height"), dpi = 300)
# Create combined KEGG plot
kegg_up_ora <- prepare_ora_data(kegg_up, "UP")
kegg_down_ora <- prepare_ora_data(kegg_down, "DOWN")
kegg_combined_ora <- bind_rows(kegg_up_ora, kegg_down_ora)
p_kegg_combined <- plot_ora_dotplot_combined(
kegg_combined_ora,
n_show = 10,
fdr_cutoff = 0.05,
title = "KEGG Pathways: Comparison"
)
ggsave(file.path(plots_dir, "05_KEGG_combined.png"), p_kegg_combined,
width = attr(p_kegg_combined, "width"),
height = attr(p_kegg_combined, "height"), dpi = 300)
GSEA Combined Plots
For GSEA results, use plot_gsea_dotplot_combined():
# Run GSEA
gsea_go <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db, ...)
gsea_df <- as.data.frame(gsea_go)
# Create combined GSEA plot
p_gsea_combined <- plot_gsea_dotplot_combined(
gsea_df,
n_show = 10,
fdr_cutoff = 0.05,
title = "GSEA GO BP: Comparison",
size_label = "Set Size"
)
ggsave(file.path(plots_dir, "05_GSEA_GO_BP_combined.png"), p_gsea_combined,
width = attr(p_gsea_combined, "width"),
height = attr(p_gsea_combined, "height"), dpi = 300)
Output Files
For each comparison, produce:
- ORA combined plots:
05_GO_BP_combined.png- Up + down on same figure05_KEGG_combined.png- Up + down on same figure
- GSEA plots:
05_GSEA_GO_BP_combined.png- Activated + suppressed05_GSEA_GO_BP_dotplot.png- Standard clusterProfiler dotplot05_GSEA_GO_BP_ridgeplot.png- Distribution ridge plot05_GSEA_GO_BP_running.png- Running enrichment for top terms
- CSV tables:
05_GO_BP_upregulated.csv/05_GO_BP_downregulated.csv05_KEGG_upregulated.csv/05_KEGG_downregulated.csv05_GSEA_GO_BP.csv
Contrast Selection Guidelines
Only extract biologically meaningful contrasts:
Good contrasts:
- Treatment vs Control
- Disease vs Healthy
- Lesional vs Non-Lesional (paired)
- Time point comparisons in time series
Avoid:
- Redundant comparisons (if A-B and B-C, you may not need A-C)
- Comparisons without biological interpretation
- All pairwise when only specific comparisons matter
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Few DEGs with FDR < 0.05 | Use nominal P < 0.05 for exploratory GO analysis |
| Batch effects visible in PCA | Consider ComBat or limma's removeBatchEffect |
| Low annotation rate | Try alternative ID mapping sources (biomaRt) |
| High proportion of outlier samples | Review sample preparation notes; may need to exclude |
Utility Functions
The project includes reusable utility functions in scripts/utils/:
| Function | File | Description |
|---|---|---|
collapse_to_gene() |
collapse_to_gene.R |
Maps probes to genes, removes ambiguous mappings, collapses to unique genes |
plot_ora_dotplot_combined() |
enrichment_dotplot.R |
Combined ORA dotplot for up/downregulated genes |
plot_gsea_dotplot_combined() |
enrichment_dotplot.R |
Combined GSEA dotplot for activated/suppressed terms |
Always source these utilities using here():
source(here("scripts", "utils", "collapse_to_gene.R"))
source(here("scripts", "utils", "enrichment_dotplot.R"))
Reference Files
See these files in the project for complete examples:
analyses/Rashighi_GSE53146/- Illumina unpaired design (most current patterns)analyses/Regazzetti_GSE65127/- Affymetrix paired designanalyses/Natarajan_GSE75819/- Illumina paired design