fastq-analysis-pipeline
SKILL.md
Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:
- SRA data acquisition:
prefetchandfqdump(fasterq-dump wrapper) - Quality control:
fastpfor adapter trimming and QC reports - RNA-seq alignment:
STARaligner with auto-index building - Gene quantification:
featureCount(subread featureCounts wrapper) - Single-cell path:
refandcountvia kb-python (kallisto/bustools) - Parallel SRA download:
parallel_fastq_dump
All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
Instructions
-
Environment setup
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If
auto_install=True(default), missing tools are installed via mamba/conda on demand. - Supported tools:
prefetch,vdb-validate,fasterq-dump,fastp,STAR,samtools,featureCounts,pigz,gzip. - For the single-cell path, ensure
kb-pythonis installed:pip install kb-python.
-
SRA data download (
ov.alignment.prefetch+ov.alignment.fqdump)- Use
prefetchfirst for reliable downloads with integrity validation (vdb-validate). - Then convert to FASTQ with
fqdump. It auto-detects single-end vs paired-end. fqdumpcan also work directly from SRR accessions without prefetch.- Both support retry with exponential backoff for network errors.
import omicverse as ov # Step 1: Prefetch SRA files (optional but recommended) pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4) # Step 2: Convert to FASTQ fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'], output_dir='fastq', sra_dir='prefetch', gzip=True, threads=8, jobs=4) - Use
-
FASTQ quality control (
ov.alignment.fastp)- Runs fastp for adapter trimming, quality filtering, and QC reporting.
- Supports single-end and paired-end reads.
- Produces per-sample JSON and HTML QC reports.
- Sample format: tuple of
(sample_name, fq1_path, fq2_path_or_None).
samples = [ ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'), ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'), ] clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2) -
STAR alignment (
ov.alignment.STAR)- Aligns FASTQ reads using the STAR aligner.
- Auto-index building: set
auto_index=True(default) withgenome_fasta_filesandgtfto build index automatically if missing. - Produces coordinate-sorted BAM files.
- Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
- Use
strict=False(default) for graceful error handling per sample.
# Prepare samples from fastp output star_samples = [ ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'), ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'), ] bams = ov.alignment.STAR( star_samples, genome_dir='star_index', output_dir='star_out', gtf='genes.gtf', genome_fasta_files=['genome.fa'], threads=8, memory='50G', ) -
Gene quantification (
ov.alignment.featureCount)- Counts aligned reads per gene using featureCounts (subread).
- Auto-detects paired-end from BAM headers (via pysam or samtools).
auto_fix=True(default) retries with corrected paired-end flag on error.gene_mapping=Truemaps gene_id to gene_name from the GTF.merge_matrix=Trueproduces a combined count matrix across all samples.
bam_items = [ ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'), ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'), ] counts = ov.alignment.featureCount( bam_items, gtf='genes.gtf', output_dir='counts', gene_mapping=True, merge_matrix=True, threads=8, ) # counts is a pandas DataFrame (gene_id x samples) -
Single-cell path (
ov.alignment.ref+ov.alignment.count)- Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
ref()builds a kallisto index and transcript-to-gene mapping.count()quantifies single-cell data with barcode/UMI handling.- Supports technologies: 10XV2, 10XV3, BULK, and custom.
- Output formats: h5ad, loom, cellranger MTX.
# Build reference index ref_result = ov.alignment.ref( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', fasta_paths=['genome.fa'], gtf_paths=['genes.gtf'], threads=8, ) # Quantify 10x v3 data count_result = ov.alignment.count( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', technology='10XV3', fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'], output_path='kb_out', h5ad=True, filter_barcodes=True, threads=8, ) -
Wiring fastp output into STAR input
- fastp output is a list of dicts with keys:
sample,clean1,clean2,json,html. - Convert to STAR sample tuples:
star_samples = [ (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None) for r in (clean if isinstance(clean, list) else [clean]) ] - fastp output is a list of dicts with keys:
-
Wiring STAR output into featureCount input
- STAR output is a list of dicts with keys:
sample,bam(orerror). - Convert to featureCount items:
bam_items = [ (r['sample'], r['bam']) for r in (bams if isinstance(bams, list) else [bams]) if 'bam' in r ] - STAR output is a list of dicts with keys:
-
Skipping completed steps
- All functions check for existing outputs and skip if
overwrite=False(default). - Set
overwrite=Trueto force re-execution.
- All functions check for existing outputs and skip if
-
Troubleshooting
- If a tool is not found, check
auto_install=Trueand that conda/mamba is accessible. - For STAR index errors, ensure
genome_fasta_filespoints to uncompressed or gzip FASTA files. - For featureCounts paired-end detection errors,
auto_fix=Truehandles most cases automatically. - GTF files can be gzip-compressed; they are auto-decompressed as needed.
- If a tool is not found, check
Critical API Reference
Sample Format Convention
All alignment functions use a consistent sample tuple format:
- FASTQ samples:
(sample_name, fq1_path, fq2_path_or_None) - BAM items:
(sample_name, bam_path)or(sample_name, bam_path, is_paired_bool) - Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
Auto-installation
# All functions support these parameters:
auto_install=True # Auto-install missing tools via conda/mamba
overwrite=False # Skip if outputs already exist
threads=8 # Per-tool thread count
jobs=None # Concurrent job count (auto-detected from CPU count)
Examples
- Bulk RNA-seq from SRA:
prefetch->fqdump->fastp->STAR->featureCount-> pandas DataFrame - Single-cell 10x v3:
ref->countwithtechnology='10XV3'-> h5ad AnnData - Local FASTQ files: Skip download steps, start directly with
fastp->STAR->featureCount
References
- See reference.md for copy-paste-ready code templates.
Weekly Installs
13
Repository
starlitnightly/omicverseGitHub Stars
852
First Seen
12 days ago
Security Audits
Installed on
gemini-cli13
opencode13
codebuddy13
github-copilot13
codex13
kimi-cli13