skills/starlitnightly/omicverse/fastq-analysis-pipeline

fastq-analysis-pipeline

SKILL.md

Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:

  • SRA data acquisition: prefetch and fqdump (fasterq-dump wrapper)
  • Quality control: fastp for adapter trimming and QC reports
  • RNA-seq alignment: STAR aligner with auto-index building
  • Gene quantification: featureCount (subread featureCounts wrapper)
  • Single-cell path: ref and count via kb-python (kallisto/bustools)
  • Parallel SRA download: parallel_fastq_dump

All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.

Instructions

  1. Environment setup

    • Bioinformatics tools are resolved automatically from PATH or the active conda environment.
    • If auto_install=True (default), missing tools are installed via mamba/conda on demand.
    • Supported tools: prefetch, vdb-validate, fasterq-dump, fastp, STAR, samtools, featureCounts, pigz, gzip.
    • For the single-cell path, ensure kb-python is installed: pip install kb-python.
  2. SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)

    • Use prefetch first for reliable downloads with integrity validation (vdb-validate).
    • Then convert to FASTQ with fqdump. It auto-detects single-end vs paired-end.
    • fqdump can also work directly from SRR accessions without prefetch.
    • Both support retry with exponential backoff for network errors.
    import omicverse as ov
    
    # Step 1: Prefetch SRA files (optional but recommended)
    pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
    
    # Step 2: Convert to FASTQ
    fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                              output_dir='fastq', sra_dir='prefetch',
                              gzip=True, threads=8, jobs=4)
    
  3. FASTQ quality control (ov.alignment.fastp)

    • Runs fastp for adapter trimming, quality filtering, and QC reporting.
    • Supports single-end and paired-end reads.
    • Produces per-sample JSON and HTML QC reports.
    • Sample format: tuple of (sample_name, fq1_path, fq2_path_or_None).
    samples = [
        ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
        ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
    ]
    clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
    
  4. STAR alignment (ov.alignment.STAR)

    • Aligns FASTQ reads using the STAR aligner.
    • Auto-index building: set auto_index=True (default) with genome_fasta_files and gtf to build index automatically if missing.
    • Produces coordinate-sorted BAM files.
    • Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
    • Use strict=False (default) for graceful error handling per sample.
    # Prepare samples from fastp output
    star_samples = [
        ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
        ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
    ]
    bams = ov.alignment.STAR(
        star_samples,
        genome_dir='star_index',
        output_dir='star_out',
        gtf='genes.gtf',
        genome_fasta_files=['genome.fa'],
        threads=8,
        memory='50G',
    )
    
  5. Gene quantification (ov.alignment.featureCount)

    • Counts aligned reads per gene using featureCounts (subread).
    • Auto-detects paired-end from BAM headers (via pysam or samtools).
    • auto_fix=True (default) retries with corrected paired-end flag on error.
    • gene_mapping=True maps gene_id to gene_name from the GTF.
    • merge_matrix=True produces a combined count matrix across all samples.
    bam_items = [
        ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
        ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
    ]
    counts = ov.alignment.featureCount(
        bam_items,
        gtf='genes.gtf',
        output_dir='counts',
        gene_mapping=True,
        merge_matrix=True,
        threads=8,
    )
    # counts is a pandas DataFrame (gene_id x samples)
    
  6. Single-cell path (ov.alignment.ref + ov.alignment.count)

    • Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
    • ref() builds a kallisto index and transcript-to-gene mapping.
    • count() quantifies single-cell data with barcode/UMI handling.
    • Supports technologies: 10XV2, 10XV3, BULK, and custom.
    • Output formats: h5ad, loom, cellranger MTX.
    # Build reference index
    ref_result = ov.alignment.ref(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        fasta_paths=['genome.fa'],
        gtf_paths=['genes.gtf'],
        threads=8,
    )
    
    # Quantify 10x v3 data
    count_result = ov.alignment.count(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        technology='10XV3',
        fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
        output_path='kb_out',
        h5ad=True,
        filter_barcodes=True,
        threads=8,
    )
    
  7. Wiring fastp output into STAR input

    • fastp output is a list of dicts with keys: sample, clean1, clean2, json, html.
    • Convert to STAR sample tuples:
    star_samples = [
        (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
        for r in (clean if isinstance(clean, list) else [clean])
    ]
    
  8. Wiring STAR output into featureCount input

    • STAR output is a list of dicts with keys: sample, bam (or error).
    • Convert to featureCount items:
    bam_items = [
        (r['sample'], r['bam'])
        for r in (bams if isinstance(bams, list) else [bams])
        if 'bam' in r
    ]
    
  9. Skipping completed steps

    • All functions check for existing outputs and skip if overwrite=False (default).
    • Set overwrite=True to force re-execution.
  10. Troubleshooting

    • If a tool is not found, check auto_install=True and that conda/mamba is accessible.
    • For STAR index errors, ensure genome_fasta_files points to uncompressed or gzip FASTA files.
    • For featureCounts paired-end detection errors, auto_fix=True handles most cases automatically.
    • GTF files can be gzip-compressed; they are auto-decompressed as needed.

Critical API Reference

Sample Format Convention

All alignment functions use a consistent sample tuple format:

  • FASTQ samples: (sample_name, fq1_path, fq2_path_or_None)
  • BAM items: (sample_name, bam_path) or (sample_name, bam_path, is_paired_bool)
  • Single samples can be passed as a single tuple; multiple as a list of tuples.
  • When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.

Auto-installation

# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)

Examples

  • Bulk RNA-seq from SRA: prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame
  • Single-cell 10x v3: ref -> count with technology='10XV3' -> h5ad AnnData
  • Local FASTQ files: Skip download steps, start directly with fastp -> STAR -> featureCount

References

Weekly Installs
13
GitHub Stars
852
First Seen
12 days ago
Installed on
gemini-cli13
opencode13
codebuddy13
github-copilot13
codex13
kimi-cli13