bio-read-alignment-star-alignment
STAR RNA-seq Alignment
Generate Genome Index
# Basic index generation
STAR --runMode genomeGenerate \
--runThreadN 8 \
--genomeDir star_index/ \
--genomeFastaFiles reference.fa \
--sjdbGTFfile annotation.gtf \
--sjdbOverhang 100 # Read length - 1
Index with Specific Read Length
# For 150bp reads, use sjdbOverhang=149
STAR --runMode genomeGenerate \
--runThreadN 8 \
--genomeDir star_index_150/ \
--genomeFastaFiles reference.fa \
--sjdbGTFfile annotation.gtf \
--sjdbOverhang 149
Basic Alignment
# Paired-end alignment
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn reads_1.fq.gz reads_2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate
Single-End Alignment
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn reads.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate
Two-Pass Mode
# Two-pass mode for better novel junction detection
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn r1.fq.gz r2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic
Quantification Mode
# Output gene counts (like featureCounts)
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn r1.fq.gz r2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
Output: sample_ReadsPerGene.out.tab with columns:
- Gene ID
- Unstranded counts
- Forward strand counts
- Reverse strand counts
ENCODE Options
# ENCODE recommended settings
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn r1.fq.gz r2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes NH HI AS NM MD \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1
Fusion Detection
# For chimeric/fusion detection
STAR --runThreadN 8 \
--genomeDir star_index/ \
--readFilesIn r1.fq.gz r2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--chimSegmentMin 12 \
--chimJunctionOverhangMin 8 \
--chimOutType Junctions WithinBAM SoftClip \
--chimMainSegmentMultNmax 1
Output Files
| File | Description |
|---|---|
| *Aligned.sortedByCoord.out.bam | Sorted BAM file |
| *Log.final.out | Alignment summary statistics |
| *Log.out | Detailed log |
| *SJ.out.tab | Splice junctions |
| *ReadsPerGene.out.tab | Gene counts (if --quantMode) |
| *Chimeric.out.junction | Fusion candidates (if chimeric) |
Memory Requirements
# Reduce memory for limited systems
STAR --genomeLoad NoSharedMemory \
--limitBAMsortRAM 10000000000 \ # 10GB for sorting
...
# For very large genomes, limit during index generation
STAR --runMode genomeGenerate \
--limitGenomeGenerateRAM 31000000000 \ # 31GB
...
Shared Memory Mode
# Load genome into shared memory (for multiple samples)
STAR --genomeLoad LoadAndExit --genomeDir star_index/
# Run alignments (faster startup)
STAR --genomeLoad LoadAndKeep --genomeDir star_index/ ...
# Remove from memory when done
STAR --genomeLoad Remove --genomeDir star_index/
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --runThreadN | 1 | Number of threads |
| --sjdbOverhang | 100 | Read length - 1 |
| --outFilterMultimapNmax | 10 | Max multi-mapping |
| --alignIntronMax | 0 | Max intron size |
| --outFilterMismatchNmax | 10 | Max mismatches |
| --outSAMtype | SAM | Output format |
| --quantMode | - | GeneCounts for counting |
| --twopassMode | None | Basic for two-pass |
Related Skills
- rna-quantification/featurecounts-counting - Alternative counting
- rna-quantification/alignment-free-quant - Salmon/kallisto alternative
- differential-expression/deseq2-basics - Downstream DE analysis
- read-qc/fastp-workflow - Preprocess reads
More from gptomics/bioskills
bioskills
Installs 425 bioinformatics skills covering sequence analysis, RNA-seq, single-cell, variant calling, metagenomics, structural biology, and 56 more categories. Use when setting up bioinformatics capabilities or when a bioinformatics task requires specialized skills not yet installed.
100bio-single-cell-batch-integration
Integrate multiple scRNA-seq samples/batches using Harmony, scVI, Seurat anchors, and fastMNN. Remove technical variation while preserving biological differences. Use when integrating multiple scRNA-seq batches or datasets.
5bio-epitranscriptomics-merip-preprocessing
Align and QC MeRIP-seq IP and input samples for m6A analysis. Use when preparing MeRIP-seq data for peak calling or differential methylation analysis.
5bio-data-visualization-multipanel-figures
Combine multiple plots into publication-ready multi-panel figures using patchwork, cowplot, or matplotlib GridSpec with shared legends and panel labels. Use when combining multiple plots into publication figures.
5bio-data-visualization-specialized-omics-plots
Reusable plotting functions for common omics visualizations. Custom ggplot2/matplotlib implementations of volcano, MA, PCA, enrichment dotplots, boxplots, and survival curves. Use when creating volcano, MA, or enrichment plots.
5bio-read-qc-fastp-workflow
All-in-one read preprocessing with fastp including adapter trimming, quality filtering, deduplication, base correction, and HTML report generation. Use when preprocessing Illumina data and wanting a single fast tool instead of separate Cutadapt, Trimmomatic, and FastQC steps.
5