bio-longread-qc
Long-Read Quality Control
NanoPlot - Visualization
# From FASTQ
NanoPlot --fastq reads.fastq.gz -o nanoplot_output -t 4
# From BAM
NanoPlot --bam aligned.bam -o nanoplot_output -t 4
# From sequencing summary (fastest)
NanoPlot --summary sequencing_summary.txt -o nanoplot_output
NanoPlot - Common Options
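# --N50 marks the N50 on length plots; --plots and --format select plot types and output formats;
# --maxlength/--minlength bound the read lengths shown in plots.
# (Comments cannot follow a trailing backslash, so they are listed here.)
NanoPlot --fastq reads.fastq.gz \
    -o nanoplot_output \
    -t 8 \
    --N50 \
    --title "Sample QC" \
    --plots hex dot \
    --format png pdf \
    --color darkblue \
    --maxlength 50000 \
    --minlength 500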
NanoStat - Statistics Only
# Quick statistics (no plots)
NanoStat --fastq reads.fastq.gz --threads 4
# From BAM
NanoStat --bam aligned.bam --threads 4
# Output to file
NanoStat --fastq reads.fastq.gz --threads 4 > qc_stats.txt
chopper - Filter Reads
# Keep reads with mean quality >= 10 and length >= 1000 bp
gunzip -c reads.fastq.gz | chopper -q 10 -l 1000 | gzip > filtered.fastq.gz
chopper - Common Options
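# Options, in order: min mean quality, min length, max length,
# bases trimmed from the start, bases trimmed from the end, threads.
# (Comments cannot follow a trailing backslash, so they are listed here.)
gunzip -c reads.fastq.gz | chopper \
    --quality 10 \
    --minlength 1000 \
    --maxlength 50000 \
    --headcrop 50 \
    --tailcrop 50 \
    --threads 4 \
    | gzip > filtered.fastq.gz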
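To make the filter semantics concrete, here is a minimal pure-Python sketch of the same thresholds. It is not chopper's implementation: `passes_filter` and `crop` are hypothetical helper names, and applying the thresholds to the uncropped read before trimming is an assumption of this sketch.

```python
def passes_filter(length, mean_q, min_q=10.0, min_len=1000, max_len=50000):
    """Return True if a read meets the length and mean-quality thresholds."""
    return min_len <= length <= max_len and mean_q >= min_q


def crop(seq, qual, headcrop=0, tailcrop=0):
    """Trim bases from both ends, keeping sequence and quality in sync."""
    end = len(seq) - tailcrop
    return seq[headcrop:end], qual[headcrop:end]


# Hypothetical read: 1200 bases with a mean quality of 12.5
seq, qual = "ACGT" * 300, "I" * 1200
if passes_filter(len(seq), 12.5):
    seq, qual = crop(seq, qual, headcrop=50, tailcrop=50)
print(len(seq))
```

A read shorter than `headcrop + tailcrop` comes back empty from `crop`, so a real pipeline would want to drop such reads explicitly.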
NanoFilt - Alternative Filter
# Filter with NanoFilt
gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 1000 | gzip > filtered.fastq.gz
# With more options
gunzip -c reads.fastq.gz | NanoFilt \
--quality 10 \
--length 1000 \
--maxlength 50000 \
--headcrop 50 \
| gzip > filtered.fastq.gz
Porechop - Adapter Trimming
# Trim adapters
porechop -i reads.fastq.gz -o trimmed.fastq.gz --threads 8
# With barcode splitting
porechop -i reads.fastq.gz -b output_dir/ --threads 8
Generate Summary Statistics
# Quick summary with seqkit
seqkit stats reads.fastq.gz
# Detailed stats
seqkit stats -a reads.fastq.gz
# Watch stats during basecalling
seqkit watch --fields ReadLen,MeanQual reads.fastq.gz
PycoQC - From Basecalling
# Generate QC report from sequencing_summary.txt
pycoQC -f sequencing_summary.txt -o pycoqc_report.html
# With BAM for alignment stats
pycoQC -f sequencing_summary.txt -a aligned.bam -o pycoqc_report.html
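For a quick look at a summary file without generating a report, the tab-separated table can be read directly. This is a rough sketch, not part of pycoQC: `summarize` is a hypothetical helper, and the column names `sequence_length_template` and `mean_qscore_template` are assumptions that should be checked against the header of your own file.

```python
import csv
import io
import statistics


def summarize(handle, len_col="sequence_length_template",
              q_col="mean_qscore_template"):
    """Summarize read lengths and mean Q-scores from a tab-separated summary."""
    lengths, quals = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        lengths.append(int(row[len_col]))
        quals.append(float(row[q_col]))
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "median_length": statistics.median(lengths),
        "median_qscore": statistics.median(quals),
    }


# Tiny made-up table standing in for a real sequencing_summary.txt
demo = io.StringIO(
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "r1\t1200\t11.5\n"
    "r2\t5400\t13.0\n"
    "r3\t800\t8.2\n"
)
print(summarize(demo))
```

For a real file, pass an open handle instead, e.g. `with open('sequencing_summary.txt') as fh: summarize(fh)`. An empty table raises `statistics.StatisticsError`, which the sketch does not handle.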
Calculate N50
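# With seqkit (-T gives tab-separated output; select the N50 column by header name,
# since grep N50 would only match the header row)
seqkit stats -a -T reads.fastq.gz | \
    awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="N50") col=i} NR==2 {print "N50:", $col}'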
# Manual calculation (fx2tab -l puts the length in the last column, so take $NF)
seqkit fx2tab -n -l reads.fastq.gz | awk -F'\t' '{print $NF}' | sort -rn | \
    awk '{sum+=$1; len[NR]=$1} END {
        target=sum/2; cumsum=0;
        for(i=1; i<=NR; i++) {
            cumsum+=len[i];
            if(cumsum>=target) {print "N50:", len[i]; break}
        }
    }'
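The same calculation can be sketched in Python: sort lengths from longest to shortest and return the length at which the running total first reaches half of all bases. The function name `n50` is a hypothetical helper, not part of seqkit or NanoPlot.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Total is 54 bases; 10 + 9 + 8 = 27 reaches the halfway point
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```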
Parse FASTQ Quality in Python
import numpy as np
from Bio import SeqIO
lengths = []
qualities = []
for record in SeqIO.parse('reads.fastq', 'fastq'):
    lengths.append(len(record))
    # Arithmetic mean of per-base Phred scores (simple, but biased upward on the log scale)
    qualities.append(np.mean(record.letter_annotations['phred_quality']))
print(f'Total reads: {len(lengths)}')
print(f'Total bases: {sum(lengths):,}')
print(f'Mean length: {np.mean(lengths):.0f}')
print(f'Median length: {np.median(lengths):.0f}')
print(f'Mean quality: {np.mean(qualities):.1f}')
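Because Phred scores are logarithmic, a plain arithmetic mean lets a few low-quality bases hide behind many good ones. A common alternative is to average the error probabilities and convert back to the Phred scale. The sketch below assumes that approach; `mean_qscore` is a hypothetical helper name and the scores are made up.

```python
import math


def mean_qscore(phred_scores):
    """Average Phred scores via their error probabilities."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))


# Three good bases and one poor base
quals = [30, 30, 30, 5]
print(f"Arithmetic mean:        {sum(quals) / len(quals):.1f}")
print(f"Probability-based mean: {mean_qscore(quals):.1f}")
```

Here the single Q5 base pulls the probability-based mean down to roughly Q11, while the arithmetic mean stays near Q24.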
NanoPlot Output Files
| File | Description |
|---|---|
| NanoStats.txt | Summary statistics |
| NanoPlot-report.html | Interactive report |
| LengthvsQualityScatterPlot | Length vs Q plot |
| WeightedHistogramReadlength | Read length distribution |
| Yield_By_Length | Cumulative yield |
Key Parameters - NanoPlot
| Parameter | Description |
|---|---|
| --fastq | Input FASTQ |
| --bam | Input BAM |
| --summary | Sequencing summary |
| -o | Output directory |
| -t | Threads |
| --N50 | Show N50 line |
| --plots | Plot types |
| --format | Output formats |
Key Parameters - chopper
| Parameter | Default | Description |
|---|---|---|
| -q | 0 | Min quality |
| -l | 0 | Min length |
| --maxlength | inf | Max length |
| --headcrop | 0 | Trim from start |
| --tailcrop | 0 | Trim from end |
| -t | 4 | Threads |
Quality Thresholds
| Q Score | Accuracy | Typical Use |
|---|---|---|
| Q7 | ~80% | Very low quality |
| Q10 | ~90% | Basic filtering |
| Q15 | ~97% | Moderate filtering |
| Q20 | ~99% | High quality (SUP) |
| Q30 | ~99.9% | Very high (HiFi) |
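The accuracy column follows from the Phred definition, Q = -10 * log10(P_error). A short sketch of the conversion in both directions (the function names are hypothetical):

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred score to per-base accuracy (1 - error probability)."""
    return 1 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (below 1.0) back to a Phred score."""
    return -10 * math.log10(1 - accuracy)


for q in (7, 10, 15, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.2%}")
```

The printed values round to the approximate accuracies listed in the table.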
Related Skills
- long-read-alignment - Align filtered reads
- sequence-io - FASTQ handling
- medaka-polishing - Polish with filtered reads