Long-Read Quality Control

NanoPlot - Visualization

# From FASTQ
NanoPlot --fastq reads.fastq.gz -o nanoplot_output -t 4

# From BAM
NanoPlot --bam aligned.bam -o nanoplot_output -t 4

# From sequencing summary (fastest)
NanoPlot --summary sequencing_summary.txt -o nanoplot_output

NanoPlot - Common Options

# --N50 marks the N50 in the plots; --plots selects plot types; --format sets
# output formats; --maxlength/--minlength hide reads outside that length range.
# (Comments cannot follow a trailing backslash, so they are collected here.)
NanoPlot --fastq reads.fastq.gz \
    -o nanoplot_output \
    -t 8 \
    --N50 \
    --title "Sample QC" \
    --plots hex dot \
    --format png pdf \
    --color darkblue \
    --maxlength 50000 \
    --minlength 500

NanoStat - Statistics Only

# Quick statistics (no plots)
NanoStat --fastq reads.fastq.gz --threads 4

# From BAM
NanoStat --bam aligned.bam --threads 4

# Output to file
NanoStat --fastq reads.fastq.gz --threads 4 > qc_stats.txt

chopper - Filter Reads

# Filter by length and quality: keep reads with quality >= 10 and length >= 1000 bp
gunzip -c reads.fastq.gz | chopper -q 10 -l 1000 | gzip > filtered.fastq.gz

chopper - Common Options

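# --quality/--minlength/--maxlength set the filter thresholds;
# --headcrop/--tailcrop remove that many bases from the start/end of each read.
# (Comments cannot follow a trailing backslash, so they are collected here.)
gunzip -c reads.fastq.gz | chopper \
    --quality 10 \
    --minlength 1000 \
    --maxlength 50000 \
    --headcrop 50 \
    --tailcrop 50 \
    --threads 4 \
    | gzip > filtered.fastq.gz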
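
To make the effect of these thresholds concrete, here is a minimal Python sketch of a length-and-quality filter over in-memory reads. It is an illustration only and not chopper's implementation: the function names, the crop-before-filter order, and the Phred+33 offset are assumptions made for this example.

```python
import math

def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string, averaged via error probability."""
    errors = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(errors) / len(errors))

def filter_reads(records, min_q=10, min_len=1000, max_len=50000,
                 headcrop=0, tailcrop=0):
    """Yield (name, seq, qual) tuples that pass the thresholds.

    Assumption for this sketch: reads are cropped first, then filtered.
    """
    for name, seq, qual in records:
        end = max(len(seq) - tailcrop, 0)
        seq, qual = seq[headcrop:end], qual[headcrop:end]
        if not seq:
            continue  # nothing left after cropping
        if min_len <= len(seq) <= max_len and mean_quality(qual) >= min_q:
            yield name, seq, qual

# Hypothetical toy reads: 'I' encodes Q40, '#' encodes Q2.
reads = [
    ("long_good", "ACGT" * 300, "I" * 1200),
    ("too_short", "ACGT" * 10, "I" * 40),
    ("low_qual", "ACGT" * 300, "#" * 1200),
]
kept = [name for name, _, _ in filter_reads(reads)]
print(kept)
```

Only the first read survives: the second is shorter than 1000 bp and the third has a mean quality below 10.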

NanoFilt - Alternative Filter

# Filter with NanoFilt
gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 1000 | gzip > filtered.fastq.gz

# With more options
gunzip -c reads.fastq.gz | NanoFilt \
    --quality 10 \
    --length 1000 \
    --maxlength 50000 \
    --headcrop 50 \
    | gzip > filtered.fastq.gz

Porechop - Adapter Trimming

# Trim adapters
porechop -i reads.fastq.gz -o trimmed.fastq.gz --threads 8

# With barcode splitting
porechop -i reads.fastq.gz -b output_dir/ --threads 8

Generate Summary Statistics

# Quick summary with seqkit
seqkit stats reads.fastq.gz

# Detailed stats
seqkit stats -a reads.fastq.gz

# Watch stats during basecalling
seqkit watch --fields ReadLen,MeanQual reads.fastq.gz

PycoQC - From Basecalling

# Generate QC report from sequencing_summary.txt
pycoQC -f sequencing_summary.txt -o pycoqc_report.html

# With BAM for alignment stats
pycoQC -f sequencing_summary.txt -a aligned.bam -o pycoqc_report.html

Calculate N50

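# With seqkit (-T gives tab-separated output; select the N50 column by header name,
# since grep N50 would only match the header row)
seqkit stats -a -T reads.fastq.gz | \
    awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="N50") col=i; next} {print "N50:", $col}'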

# Manual calculation (-n -l prints read names plus length; the length is the last column)
seqkit fx2tab -n -l reads.fastq.gz | awk -F'\t' '{print $NF}' | sort -rn | \
    awk '{sum+=$1; len[NR]=$1} END {
        target=sum/2; cumsum=0;
        for(i=1; i<=NR; i++) {
            cumsum+=len[i];
            if(cumsum>=target) {print "N50:", len[i]; break}
        }
    }'
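
The same N50 logic can be sketched in Python, which may be easier to reuse inside a QC script. The function name `n50` is chosen for this example; it follows the same definition as the awk version above (the length at which the running sum of lengths, sorted longest first, reaches half the total bases).

```python
def n50(lengths):
    """Return the N50 of an iterable of read lengths (0 if there are no reads)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0

# Toy example: total is 20 bases, so half is 10; 6 + 5 = 11 crosses it at length 5.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```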

Parse FASTQ Quality in Python

import numpy as np
from Bio import SeqIO

lengths = []
qualities = []

for record in SeqIO.parse('reads.fastq', 'fastq'):
    lengths.append(len(record))
    qualities.append(np.mean(record.letter_annotations['phred_quality']))

print(f'Total reads: {len(lengths)}')
print(f'Total bases: {sum(lengths):,}')
print(f'Mean length: {np.mean(lengths):.0f}')
print(f'Median length: {np.median(lengths):.0f}')
print(f'Mean quality: {np.mean(qualities):.1f}')
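
One caveat with the snippet above: taking the arithmetic mean of Phred scores tends to overstate quality, because Phred is a log scale. Long-read QC tools generally average the underlying error probabilities instead, so the numbers above may not match NanoStat output. A minimal sketch of that calculation follows; the helper name `mean_phred` is an assumption for this example.

```python
import math

def mean_phred(scores):
    """Average Phred scores through their error probabilities."""
    if not scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)

scores = [10, 30]
print(f"Arithmetic mean: {sum(scores) / len(scores):.1f}")   # 20.0
print(f"Probability-based mean: {mean_phred(scores):.1f}")   # 13.0
```

To use it with the loop above, pass `record.letter_annotations['phred_quality']` to `mean_phred` instead of `np.mean`.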

NanoPlot Output Files

File                           Description
NanoStats.txt                  Summary statistics
NanoPlot-report.html           Interactive report
LengthvsQualityScatterPlot     Length vs Q plot
WeightedHistogramReadlength    Read length distribution
Yield_By_Length                Cumulative yield

Key Parameters - NanoPlot

Parameter    Description
--fastq      Input FASTQ
--bam        Input BAM
--summary    Sequencing summary
-o           Output directory
-t           Threads
--N50        Show N50 line
--plots      Plot types
--format     Output formats

Key Parameters - chopper

Parameter      Default    Description
-q             0          Min quality
-l             0          Min length
--maxlength    inf        Max length
--headcrop     0          Trim from start
--tailcrop     0          Trim from end
-t             4          Threads

Quality Thresholds

Q Score    Accuracy    Typical Use
Q7         ~80%        Very low quality
Q10        ~90%        Basic filtering
Q15        ~97%        Moderate filtering
Q20        ~99%        High quality (SUP)
Q30        ~99.9%      Very high (HiFi)
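
The accuracy column follows from the standard Phred definition, Q = -10 * log10(P_error), so accuracy = 1 - 10^(-Q/10). A short sketch that reproduces the table values (the function name `phred_to_accuracy` is chosen for this example):

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (a fraction from 0 to 1)."""
    return 1 - 10 ** (-q / 10)

for q in (7, 10, 15, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q) * 100:.1f}%")
```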

Related Skills

  • long-read-alignment - Align filtered reads
  • sequence-io - FASTQ handling
  • medaka-polishing - Polish with filtered reads