rnaseq-analysis
Bulk RNA-seq Analysis Skill
This skill guides analysis of bulk RNA-seq gene expression count data using DESeq2, following established patterns from this project.
Quick Start
When asked to analyze RNA-seq data:
- Check if a
CLAUDE.mdfile exists in the study directory - it contains study-specific metadata - Determine the data format (count matrix, featureCounts output, RSEM, etc.)
- Create numbered scripts in
scripts/directory (01_, 02_, etc.) - Update
analysis.mdafter each analysis step - Generate all plots in
results/plots/withplots/plots.mddocumentation
Directory Structure
Create this structure for each analysis:
analyses/[FirstAuthor]_[GSE#]/
├── CLAUDE.md # Study metadata (samples, platform, design)
├── analysis.md # Progressive documentation (update after each step)
├── scripts/
│ ├── 01_load_and_qc.R
│ ├── 02_deseq2_analysis.R
│ ├── 03_differential_expression.R
│ ├── 04_annotate_genes.R
│ └── 05_go_enrichment.R
├── data/ # Raw and processed data (gitignored)
└── results/
├── tables/ # CSV output files
└── plots/ # PNG/PDF figures
└── plots.md # Plot documentation
Gene Tracking
IMPORTANT: Track gene counts at each processing step and report a summary table at the end of QC and DESeq2 scripts. This provides transparency about data filtering.
Initialize Tracking
# Initialize gene tracking at the start of script
gene_tracking <- data.frame(
step = character(),
n_genes = integer(),
n_removed = integer(),
description = character(),
stringsAsFactors = FALSE
)
Track at Each Step
Add tracking after each filtering/processing step:
# After loading raw counts
gene_tracking <- rbind(gene_tracking, data.frame(
step = "1. Raw counts loaded",
n_genes = nrow(counts),
n_removed = 0,
description = "Total genes in count matrix"
))
# After pre-filtering low counts
gene_tracking <- rbind(gene_tracking, data.frame(
step = "2. Pre-filtering",
n_genes = nrow(counts_filtered),
n_removed = nrow(counts) - nrow(counts_filtered),
description = sprintf("Genes with >= %d counts in >= %d samples", min_count, min_samples)
))
# After DESeq2 independent filtering
gene_tracking <- rbind(gene_tracking, data.frame(
step = "3. DESeq2 independent filtering",
n_genes = sum(!is.na(res$padj)),
n_removed = sum(is.na(res$padj)),
description = "Genes with sufficient counts for testing"
))
# After annotation (genes with valid IDs)
gene_tracking <- rbind(gene_tracking, data.frame(
step = "4. Gene annotation",
n_genes = sum(!is.na(res_annotated$gene_symbol)),
n_removed = sum(is.na(res_annotated$gene_symbol)),
description = "Genes with valid gene symbols"
))
Report Summary
At the end of analysis script:
# Calculate percentage remaining
gene_tracking$pct_remaining <- round(100 * gene_tracking$n_genes / gene_tracking$n_genes[1], 1)
cat("\n=== Gene Tracking Summary ===\n\n")
print(gene_tracking)
# Save to file
write.csv(gene_tracking, file.path(tables_dir, "gene_tracking.csv"), row.names = FALSE)
Expected Output
| Step | Genes | Removed | % Remaining | Description |
|---|---|---|---|---|
| 1. Raw counts loaded | 60,623 | 0 | 100.0% | Total genes in count matrix |
| 2. Pre-filtering | 18,432 | 42,191 | 30.4% | Genes with >= 10 counts in >= 3 samples |
| 3. DESeq2 independent filtering | 15,821 | 2,611 | 26.1% | Genes with sufficient counts for testing |
| 4. Gene annotation | 14,956 | 865 | 24.7% | Genes with valid gene symbols |
Data Loading
Standard Script Header with Relative Paths
Use the here package for portable, relative paths:
.libPaths(c("~/R/library", .libPaths()))
suppressPackageStartupMessages({
library(DESeq2)
library(ggplot2)
library(dplyr)
library(tidyr)
library(pheatmap)
library(RColorBrewer)
library(here)
})
# Source utility functions
source(here("scripts/utils/filter_counts.R"))
# Set analysis directory using here package
analysis_dir <- here("analyses", "Author_GSE#####")
# Create output directories
dir.create(file.path(analysis_dir, "results/tables"),
recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(analysis_dir, "results/plots"),
recursive = TRUE, showWarnings = FALSE)
Count Matrix Format
DESeq2 expects raw (non-normalized) integer counts:
# Load count matrix (genes as rows, samples as columns)
counts <- read.csv(file.path(analysis_dir, "data/counts.csv"), row.names = 1)
counts <- as.matrix(counts)
# Ensure integer counts
storage.mode(counts) <- "integer"
# Load sample metadata
sample_info <- read.csv(file.path(analysis_dir, "data/sample_metadata.csv"))
rownames(sample_info) <- sample_info$sample_id
# Ensure sample order matches
counts <- counts[, rownames(sample_info)]
From GEO Downloads
GEO data may come in various formats:
# Typical GEO series matrix
geo_data <- read.delim("GSE#####_count_matrix.txt.gz", row.names = 1)
# featureCounts output
fc_data <- read.delim("featureCounts_output.txt", comment.char = "#")
counts <- fc_data[, 7:ncol(fc_data)]
rownames(counts) <- fc_data$Geneid
# RSEM expected counts (round to integers)
rsem_data <- read.delim("rsem_genes.txt", row.names = 1)
counts <- round(rsem_data[, grep("expected_count", colnames(rsem_data))])
Quality Control
Pre-filtering
Use the filter_counts() utility function for CPM-based and total count filtering:
# Source the utility function
source(here("scripts/utils/filter_counts.R"))
# Define filtering parameters
FILTER_MIN_CPM <- 2 # Minimum CPM threshold
FILTER_MIN_SAMPLE_PERCENT <- 0.3 # Min % of samples (30%)
FILTER_MIN_TOTAL_COUNTS <- 20 # Min total counts across all samples
# Filter using CPM and total count thresholds
counts_filtered <- filter_counts(
counts,
min_cpm = FILTER_MIN_CPM,
min_sample_percent = FILTER_MIN_SAMPLE_PERCENT,
min_total_counts = FILTER_MIN_TOTAL_COUNTS
)
The filter_counts() function (in scripts/utils/filter_counts.R) filters genes that meet both:
- CPM threshold in a minimum percentage of samples
- Minimum total counts across all samples
This produces informative output:
Filtering statistics:
Total samples: 10
Min samples required: 3 (30%)
CPM threshold: 2
Total count threshold: 20
Total genes: 60623
Genes passing CPM filter: 15432
Genes passing total counts filter: 18234
Genes passing both filters: 14521
Genes removed: 46102
QC Metrics
# Library sizes (total counts per sample)
lib_sizes <- colSums(counts_filtered)
# Number of detected genes per sample
genes_detected <- colSums(counts_filtered > 0)
# Create QC summary
qc_summary <- data.frame(
sample_id = colnames(counts_filtered),
library_size = lib_sizes,
genes_detected = genes_detected,
condition = sample_info$condition
)
# Z-score for outlier detection
qc_summary$lib_size_zscore <- scale(qc_summary$library_size)[, 1]
qc_summary$genes_zscore <- scale(qc_summary$genes_detected)[, 1]
# Flag potential outliers
qc_summary$outlier_flag <- abs(qc_summary$lib_size_zscore) > 2 |
abs(qc_summary$genes_zscore) > 2
Quality Control Plots
Generate these QC plots (numbered sequentially):
- Library size barplot - Total counts per sample
- Genes detected barplot - Number of genes with >0 counts per sample
- Count distribution boxplot - Log2(count+1) distribution per sample
- Count density plot - Overlapping density curves
- Sample correlation heatmap - Hierarchical clustering
- PCA plot - PC1 vs PC2 colored by condition
- Sample distance heatmap - Euclidean distance after VST
Post-normalization visualization:
- Overall expression heatmap (e.g., 08b) - Top 2000 most variable genes
- Use VST-transformed counts
- Median-centered, ±3 SD clipping for color utilization
- 1 - Pearson correlation distance, average linkage for gene clustering
- Samples ordered by condition (not clustered)
- Green-black-red color scheme with condition color bars
- Sample dendrogram (e.g., 08c) - Hierarchical clustering of all samples
- Uses all genes from VST-transformed counts
- 1 - Pearson correlation distance, average linkage
- Horizontal layout with branches and labels colored by condition
Outlier Detection Criteria
| Metric | Threshold | Action |
|---|---|---|
| Library size z-score | z | |
| Genes detected z-score | z | |
| PCA extreme | Visual outlier | Investigate |
| Low correlation | < 0.8 with group | Investigate |
After excluding outliers: Re-create DESeqDataSet with filtered samples.
DESeq2 Analysis
Create DESeqDataSet
library(DESeq2)
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(
countData = counts_filtered,
colData = sample_info,
design = ~ condition # Basic unpaired design
)
# Set reference level
dds$condition <- relevel(dds$condition, ref = "Control")
Paired Design (same subjects, multiple conditions)
For matched samples (e.g., lesional vs non-lesional from same patient):
# Ensure factors are properly set
dds$condition <- factor(sample_info$condition, levels = c("NonLesional", "Lesional"))
dds$patient <- factor(sample_info$patient_id)
# Design with patient as blocking factor
design(dds) <- ~ patient + condition
Unpaired Design (independent groups)
For independent samples:
# Simple two-group comparison
dds$condition <- factor(sample_info$condition, levels = c("Control", "Treatment"))
design(dds) <- ~ condition
Multi-factor Design
For multiple variables:
# Treatment and batch effects
dds$treatment <- factor(sample_info$treatment)
dds$batch <- factor(sample_info$batch)
design(dds) <- ~ batch + treatment
# Interaction model
dds$genotype <- factor(sample_info$genotype)
dds$treatment <- factor(sample_info$treatment)
design(dds) <- ~ genotype + treatment + genotype:treatment
Run DESeq2
# Run the differential expression pipeline
dds <- DESeq(dds)
# Check available coefficients
resultsNames(dds)
Extracting Results
Basic Results Extraction
# Extract results for a specific contrast
res <- results(dds,
name = "condition_Treatment_vs_Control", # Or use contrast
alpha = 0.05)
# Or use contrast argument
res <- results(dds,
contrast = c("condition", "Treatment", "Control"),
alpha = 0.05)
# Summary
summary(res)
# Convert to data frame and sort
res_df <- as.data.frame(res)
res_df$gene_id <- rownames(res_df)
res_df <- res_df[order(res_df$pvalue), ]
Log Fold Change Shrinkage
Always apply LFC shrinkage for better estimates and ranking:
# Use apeglm method (recommended)
library(apeglm)
res_shrunk <- lfcShrink(dds,
coef = "condition_Treatment_vs_Control",
type = "apeglm")
# Alternative: ashr method
res_shrunk <- lfcShrink(dds,
contrast = c("condition", "Treatment", "Control"),
type = "ashr")
Multiple Contrasts
# Define all contrasts of interest
contrasts <- list(
Treatment_vs_Control = c("condition", "Treatment", "Control"),
TimePoint2_vs_TimePoint1 = c("timepoint", "T2", "T1")
)
# Extract all results
all_results <- lapply(names(contrasts), function(name) {
res <- results(dds, contrast = contrasts[[name]], alpha = 0.05)
res_df <- as.data.frame(res)
res_df$gene_id <- rownames(res_df)
res_df$contrast <- name
return(res_df)
})
names(all_results) <- names(contrasts)
Variance Stabilizing Transformation
Use VST or rlog for visualization (PCA, heatmaps):
# VST (faster, recommended for n > 30)
vsd <- vst(dds, blind = FALSE)
# rlog (better for small sample sizes n < 30)
rld <- rlog(dds, blind = FALSE)
# Extract transformed values
vst_mat <- assay(vsd)
Model Diagnostics
Always generate these diagnostic plots:
- P-value histogram - Should show uniform + spike near 0
- MA plot - Check for intensity-dependent bias
- Dispersion plot - Check model fit
# P-value histogram
hist(res$pvalue, breaks = 50,
main = "P-value Distribution", xlab = "P-value")
abline(h = nrow(res) / 50, col = "red", lty = 2)
# MA plot
plotMA(res, main = "MA Plot", ylim = c(-5, 5))
# Dispersion plot
plotDispEsts(dds, main = "Dispersion Estimates")
Gene ID Mapping
Using org.Hs.eg.db
library(AnnotationDbi)
library(org.Hs.eg.db)
# Get gene IDs from results
gene_ids <- rownames(res)
# Map Ensembl to symbols (if starting with Ensembl IDs)
# Remove version numbers if present
gene_ids_clean <- gsub("\\.\\d+$", "", gene_ids)
mapping <- AnnotationDbi::select(
org.Hs.eg.db,
keys = gene_ids_clean,
columns = c("SYMBOL", "ENTREZID"),
keytype = "ENSEMBL"
)
colnames(mapping) <- c("ensembl_id", "gene_symbol", "entrez_id")
# Handle duplicates (one Ensembl -> multiple mappings)
mapping_unique <- mapping %>%
group_by(ensembl_id) %>%
summarize(
gene_symbol = first(na.omit(gene_symbol)),
entrez_id = first(na.omit(entrez_id))
)
Using biomaRt
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
mapping <- getBM(
attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
filters = "ensembl_gene_id",
values = gene_ids_clean,
mart = ensembl
)
Output Format
Differential Expression Results
Produce output files with both unshrunken and shrunken log2 fold changes. The shrunken LFC should be used for ranking and visualization, while unshrunken values are retained for reference.
Required columns in output CSV:
| Column | Description |
|---|---|
| gene_id | Original gene ID, Ensembl, or Entrez |
| baseMean | Mean of normalized counts |
| log2FoldChange | Shrunken log2 fold change (use for ranking/visualization) |
| log2FoldChange_unshrunken | Unshrunken log2 fold change (for reference) |
| lfcSE | Standard error of shrunken LFC |
| stat | Wald statistic (from unshrunken results) |
| pvalue | Raw p-value |
| padj | BH-adjusted p-value |
| entrez_id | Entrez gene ID (if annotated) |
| gene_symbol | Gene symbol (if annotated) |
Important: Use type = "normal" for LFC shrinkage when using paired designs, as apeglm may not work with all coefficient names:
# Apply LFC shrinkage
res_shrunk <- lfcShrink(dds,
coef = "condition_Lesional_vs_NonLesional",
type = "normal")
# Combine shrunken and unshrunken results
res_df <- as.data.frame(res_shrunk)
res_df$gene_id <- rownames(res_df)
res_df$log2FoldChange_unshrunken <- res_unshrunken$log2FoldChange
res_df$stat <- res_unshrunken$stat
For multiple gene IDs mapped to the same symbol, keep the most significant (lowest P-value). Unmapped genes can be deleted from annotated output files.
Similarily,
File Naming Convention
results/tables/
├── sample_metadata.csv
├── qc_metrics.csv
├── normalized_counts.csv # DESeq2 normalized counts
├── vst_counts.csv # VST transformed counts
├── DE_[Comparison]_full.csv # Full results
├── DE_[Comparison]_annotated.csv # With gene annotation
├── DE_[Comparison]_significant.csv # FDR < 0.05 only
├── genes_upregulated.csv
├── genes_downregulated.csv
├── GO_BP_[Comparison]_UP.csv
├── GO_BP_[Comparison]_DOWN.csv
└── GO_BP_summary.csv
Progressive Documentation (analysis.md)
Update analysis.md after EACH analysis step with:
# [GSE#] RNA-seq Analysis Report
## Study Information
| Field | Value |
|-------|-------|
| **GEO Accession** | GSE##### |
| **Data Type** | Bulk RNA-seq |
| **Analysis Date** | YYYY-MM-DD |
## Methods
### Software Environment
| Package | Version | Purpose |
|---------|---------|---------|
| DESeq2 | X.X.X | Differential expression |
| apeglm | X.X.X | LFC shrinkage |
### Key Parameters
- Pre-filtering: Genes with ≥10 counts in ≥3 samples
- FDR threshold: 0.05
- LFC shrinkage method: apeglm
## Results
### Quality Control
- [Summarize QC findings]
- [Document any outliers removed]
### Differential Expression
| Comparison | Design | Total DEGs | Up | Down |
|------------|--------|------------|-----|------|
| ... | ... | ... | ... | ... |
### Key Findings
- [Biological interpretation]
## Output Files
[List all generated files]
Plot Documentation (plots/plots.md)
Create results/plots/plots.md with embedded image references:
# Plot Documentation - [GSE#]
## Quality Control
### 01_library_sizes.png
Library size (total counts) per sample.

### 02_genes_detected.png
Number of genes detected (count > 0) per sample.

### 03_count_distribution.png
Boxplot of log2(count+1) distribution per sample.

[Continue for all plots...]
## Differential Expression
### 15_volcano_plot.png
Volcano plot showing -log10(p-value) vs log2 fold change.
- Red: Upregulated (FDR < 0.05, |logFC| > 1)
- Blue: Downregulated (FDR < 0.05, |logFC| > 1)

GO and KEGG Enrichment Analysis
Use clusterProfiler for pathway analysis. Run both GO Biological Process and KEGG pathway enrichment for both upregulated AND downregulated genes in each comparison.
Background Gene List
Important: Use all genes that passed the filtering step as background:
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(dplyr)
# Background = all genes tested (from annotated results)
background_genes <- res_annotated %>%
filter(!is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
GO Biological Process Enrichment (Both Directions)
# Get Entrez IDs for UPREGULATED genes
sig_up <- res_annotated %>%
filter(padj < 0.05 & log2FoldChange > 0 & !is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
# Get Entrez IDs for DOWNREGULATED genes
sig_down <- res_annotated %>%
filter(padj < 0.05 & log2FoldChange < 0 & !is.na(entrez_id)) %>%
pull(entrez_id) %>%
unique() %>%
as.character()
# GO enrichment - Upregulated
ego_up <- enrichGO(
gene = sig_up,
universe = background_genes,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
readable = TRUE
)
ego_up_simp <- simplify(ego_up, cutoff = 0.7, by = "p.adjust")
# GO enrichment - Downregulated
ego_down <- enrichGO(
gene = sig_down,
universe = background_genes,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
readable = TRUE
)
ego_down_simp <- simplify(ego_down, cutoff = 0.7, by = "p.adjust")
KEGG Pathway Enrichment (Both Directions)
# KEGG - Upregulated
kegg_up <- enrichKEGG(
gene = sig_up,
universe = background_genes,
organism = "hsa",
keyType = "ncbi-geneid",
pAdjustMethod = "BH",
pvalueCutoff = 0.05
)
# KEGG - Downregulated
kegg_down <- enrichKEGG(
gene = sig_down,
universe = background_genes,
organism = "hsa",
keyType = "ncbi-geneid",
pAdjustMethod = "BH",
pvalueCutoff = 0.05
)
Combined Plots with Fold Enrichment
Create combined diverging plots showing both upregulated and downregulated pathways on the same figure, with fold enrichment on the x-axis:
# Helper function to calculate fold enrichment
calculate_fold_enrichment <- function(enrichResult) {
if (is.null(enrichResult) || nrow(enrichResult@result) == 0) return(NULL)
df <- as.data.frame(enrichResult)
df <- df %>%
mutate(
gene_ratio_num = as.numeric(sub("/.*", "", GeneRatio)),
gene_ratio_denom = as.numeric(sub(".*/", "", GeneRatio)),
bg_ratio_num = as.numeric(sub("/.*", "", BgRatio)),
bg_ratio_denom = as.numeric(sub(".*/", "", BgRatio)),
GeneRatioValue = gene_ratio_num / gene_ratio_denom,
BgRatioValue = bg_ratio_num / bg_ratio_denom,
FoldEnrichment = GeneRatioValue / BgRatioValue
)
return(df)
}
# Helper function to create combined plot
create_combined_enrichment_plot <- function(df_up, df_down, title, n_terms = 10) {
plot_data <- data.frame()
if (!is.null(df_up) && nrow(df_up) > 0) {
up_top <- df_up %>%
filter(p.adjust < 0.05) %>%
arrange(p.adjust) %>%
head(n_terms) %>%
mutate(Direction = "Upregulated")
plot_data <- bind_rows(plot_data, up_top)
}
if (!is.null(df_down) && nrow(df_down) > 0) {
down_top <- df_down %>%
filter(p.adjust < 0.05) %>%
arrange(p.adjust) %>%
head(n_terms) %>%
mutate(Direction = "Downregulated")
plot_data <- bind_rows(plot_data, down_top)
}
if (nrow(plot_data) == 0) return(NULL)
# Truncate descriptions and create signed fold enrichment
plot_data <- plot_data %>%
mutate(
Description_short = ifelse(nchar(Description) > 45,
paste0(substr(Description, 1, 42), "..."),
Description),
FoldEnrichment_signed = ifelse(Direction == "Downregulated",
-FoldEnrichment, FoldEnrichment)
) %>%
arrange(Direction, FoldEnrichment) %>%
mutate(Description_short = factor(Description_short,
levels = unique(Description_short)))
# Create diverging dot plot
ggplot(plot_data, aes(x = FoldEnrichment_signed, y = Description_short,
fill = Direction, size = Count)) +
geom_point(shape = 21, alpha = 0.8) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
scale_fill_manual(values = c("Upregulated" = "#E74C3C",
"Downregulated" = "#3498DB")) +
scale_size_continuous(range = c(3, 10), name = "Gene Count") +
labs(title = title,
subtitle = sprintf("Top %d terms per direction (FDR < 0.05)", n_terms),
x = "Fold Enrichment", y = "") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.y = element_text(size = 9))
}
# Generate combined plots
go_up_fe <- calculate_fold_enrichment(ego_up_simp)
go_down_fe <- calculate_fold_enrichment(ego_down_simp)
kegg_up_fe <- calculate_fold_enrichment(kegg_up)
kegg_down_fe <- calculate_fold_enrichment(kegg_down)
p_go <- create_combined_enrichment_plot(go_up_fe, go_down_fe,
"GO Biological Process")
p_kegg <- create_combined_enrichment_plot(kegg_up_fe, kegg_down_fe,
"KEGG Pathways")
ggsave("results/plots/XX_GO_BP_combined.png", p_go, width = 10, height = 8)
ggsave("results/plots/XX_KEGG_combined.png", p_kegg, width = 10, height = 8)
Output Files
For each comparison, produce:
- One combined GO BP plot:
[num]_GO_BP_combined.png(up + down on same figure) - One combined KEGG plot:
[num]_KEGG_combined.png(up + down on same figure) - CSV tables: Separate files for UP and DOWN directions
GO_BP_[comparison]_UP.csvGO_BP_[comparison]_DOWN.csvKEGG_[comparison]_UP.csvKEGG_[comparison]_DOWN.csv
GSEA Analysis
For ranked gene set enrichment:
library(fgsea)
library(msigdbr)
# Create ranked gene list (by stat or signed -log10(p))
gene_ranks <- res_annotated %>%
filter(!is.na(gene_symbol)) %>%
arrange(desc(stat)) %>%
distinct(gene_symbol, .keep_all = TRUE)
ranks <- setNames(gene_ranks$stat, gene_ranks$gene_symbol)
# Get gene sets
hallmark <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(hallmark$gene_symbol, hallmark$gs_name)
# Run GSEA
fgsea_res <- fgsea(
pathways = hallmark_list,
stats = ranks,
minSize = 15,
maxSize = 500,
nperm = 10000
)
Contrast Selection Guidelines
Only extract biologically meaningful contrasts:
Good contrasts:
- Treatment vs Control
- Disease vs Healthy
- Lesional vs Non-Lesional (paired)
- Timepoint comparisons in time series
Avoid:
- Redundant comparisons
- Comparisons without biological interpretation
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Few DEGs with FDR < 0.05 | Check sample variability; consider relaxing threshold for exploratory analysis |
| Batch effects visible in PCA | Add batch to design formula; consider ComBat-seq for severe effects |
| Low mapping rate annotation | Check gene ID format; remove version numbers from Ensembl IDs |
| Cook's distance warnings | Investigate flagged samples for potential outliers |
| Dispersion outliers | May indicate problematic genes or samples |
Reference Files
See these files in the project for complete examples:
analyses/Brunner_bulk/- Pseudobulk RNA-seq with DESeq2 (paired design)analyses/Shiu_bulk/- Pseudobulk from scRNA-seqscripts/utils/filter_counts.R- CPM-based filtering utility function