bio-metagenomics-kraken
Kraken2 Classification
Basic Classification
# Classify reads against standard database
kraken2 --db /path/to/kraken2_db \
--output output.kraken \
--report report.txt \
reads.fastq.gz
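The file written by `--output` holds one tab-separated line per read. In Kraken2's standard output format the fields are: a classified flag (`C` or `U`), the sequence ID, the assigned taxon ID, the sequence length, and a space-separated list of `taxid:count` k-mer mappings. The sketch below parses that layout, assuming the default format (no `--use-names`). `KrakenRead` and `parse_kraken_line` are illustrative names, not part of Kraken2, and the example line uses made-up values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KrakenRead:
    classified: bool
    read_id: str
    taxid: int
    length: str                       # kept as text: paired reads report "len1|len2"
    kmer_hits: List[Tuple[str, int]]  # (taxid, or "A" for ambiguous k-mers; count)


def parse_kraken_line(line: str) -> KrakenRead:
    """Parse one per-read output line (assumes 5 tab-separated fields)."""
    flag, read_id, taxid, length, lca = line.rstrip("\n").split("\t")
    hits = []
    for token in lca.split():
        if token == "|:|":            # separator between the two mates of a pair
            continue
        key, count = token.split(":")
        hits.append((key, int(count)))
    return KrakenRead(flag == "C", read_id, int(taxid), length, hits)


# Illustrative line in the layout described above (values are invented)
example = "C\tread_1\t562\t150\t562:13 561:4 A:31 0:1 562:3"
read = parse_kraken_line(example)
print(read.classified, read.taxid, read.kmer_hits[:2])
```

Keeping `length` as a string avoids guessing how to combine the two mate lengths of a paired-end record.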
Paired-End Reads
kraken2 --db /path/to/kraken2_db \
--paired \
--output output.kraken \
--report report.txt \
reads_R1.fastq.gz reads_R2.fastq.gz
Common Options
# --threads: CPU threads
# --confidence: confidence threshold (0-1)
# --minimum-base-quality: minimum Phred base quality
# --use-names: add taxon names to output
# --gzip-compressed: input is gzipped
kraken2 --db /path/to/kraken2_db \
--threads 8 \
--confidence 0.1 \
--minimum-base-quality 20 \
--output output.kraken \
--report report.txt \
--use-names \
--gzip-compressed \
reads.fastq.gz
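The Kraken2 manual describes the value compared against `--confidence` as a fraction C/Q. C is the number of k-mers mapped to taxa in the clade rooted at the read's label, and Q is the number of k-mers that contain no ambiguous nucleotide. The sketch below computes that fraction for one read. The function names and the `parent` map (child taxid to parent taxid) are assumptions for illustration, and the toy taxonomy is collapsed to three nodes rather than taken from NCBI.

```python
from typing import Dict, List, Tuple


def in_clade(taxid: int, root: int, parent: Dict[int, int]) -> bool:
    """True if taxid equals root or descends from it in the parent map."""
    while True:
        if taxid == root:
            return True
        if taxid not in parent or parent[taxid] == taxid:
            return False              # reached the top (or an unknown taxid)
        taxid = parent[taxid]


def confidence(label: int, kmer_hits: List[Tuple[str, int]],
               parent: Dict[int, int]) -> float:
    """C / Q: k-mers in the clade rooted at label over non-ambiguous k-mers."""
    q = sum(n for key, n in kmer_hits if key != "A")
    c = sum(n for key, n in kmer_hits
            if key != "A" and in_clade(int(key), label, parent))
    return c / q if q else 0.0


# Toy taxonomy (assumption): 562 -> 561 -> 1, with the root pointing to itself
parent = {562: 561, 561: 1, 1: 1}
# k-mer hits as (taxid or "A", count); "A" marks k-mers with ambiguous bases
hits = [("562", 13), ("561", 4), ("A", 31), ("0", 1), ("562", 3)]

print(round(confidence(562, hits, parent), 3))  # score at the species label
print(round(confidence(561, hits, parent), 3))  # score one level up
```

The score can only grow as the label moves toward the root, which is why a stricter threshold tends to push reads to higher ranks or leave them unclassified.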
Memory-Efficient Mode
# For systems with limited RAM
# --memory-mapping: read the database from disk instead of loading it into RAM
kraken2 --db /path/to/kraken2_db \
--memory-mapping \
--output output.kraken \
--report report.txt \
reads.fastq.gz
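One way to decide whether `--memory-mapping` is worth trying is to compare the on-disk size of the database with the RAM you can spare. The sketch below sums the `*.k2d` files in the database directory and applies a simple size-versus-budget rule. That rule and the function names are assumptions for illustration, not Kraken2 behavior.

```python
from pathlib import Path


def database_bytes(db_dir: str) -> int:
    """Total size in bytes of the *.k2d files in a Kraken2 database directory."""
    return sum(p.stat().st_size for p in Path(db_dir).glob("*.k2d"))


def needs_memory_mapping(db_dir: str, ram_budget_bytes: int) -> bool:
    """Heuristic (assumption): suggest --memory-mapping if the database exceeds the budget."""
    return database_bytes(db_dir) > ram_budget_bytes


db = "/path/to/kraken2_db"     # placeholder path, as in the commands above
budget = 16 * 1024 ** 3        # hypothetical 16 GiB RAM budget
print(database_bytes(db), needs_memory_mapping(db, budget))
```

With the placeholder path the directory does not exist, so the size is reported as 0. Point `db` at a real database to get a meaningful answer.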
Report Only (No Per-Read Output)
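# Save space by not writing per-read classifications
# --output -: suppress per-read output (otherwise it is printed to stdout)
# --report-zero-counts: include taxa with 0 counts
kraken2 --db /path/to/kraken2_db \
--output - \
--report report.txt \
--report-zero-counts \
reads.fastq.gz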
Classified/Unclassified Output
# Separate classified and unclassified reads
# With --paired, the # in each filename is replaced by _1 and _2
kraken2 --db /path/to/kraken2_db \
--paired \
--classified-out classified#.fq \
--unclassified-out unclassified#.fq \
--output output.kraken \
--report report.txt \
reads_R1.fastq.gz reads_R2.fastq.gz
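According to the Kraken2 manual, the `#` in a paired-end output pattern is replaced with `_1` and `_2`, so `classified#.fq` yields `classified_1.fq` and `classified_2.fq`. The sketch below predicts those filenames for downstream scripts. `paired_output_names` is a hypothetical helper, not a Kraken2 function.

```python
from typing import Tuple


def paired_output_names(pattern: str) -> Tuple[str, str]:
    """Expand the '#' placeholder of a paired-end output pattern into both filenames."""
    if "#" not in pattern:
        raise ValueError("paired-end output patterns must contain '#'")
    # Replace only the first '#': this sketch assumes a single placeholder
    return pattern.replace("#", "_1", 1), pattern.replace("#", "_2", 1)


print(paired_output_names("classified#.fq"))
print(paired_output_names("unclassified#.fq"))
```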
Build Custom Database
# Download taxonomy
kraken2-build --download-taxonomy --db custom_db
# Download specific libraries
kraken2-build --download-library bacteria --db custom_db
kraken2-build --download-library archaea --db custom_db
kraken2-build --download-library viral --db custom_db
# Build database
kraken2-build --build --db custom_db --threads 8
# Clean up intermediate files
kraken2-build --clean --db custom_db
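The build steps above run in a fixed order: taxonomy, libraries, build, then clean. The sketch below assembles them as a list of argument vectors that a pipeline can log or execute. It uses only the `kraken2-build` flags shown above, and `build_plan` is a hypothetical helper name.

```python
from typing import List


def build_plan(db: str, libraries: List[str], threads: int = 8) -> List[List[str]]:
    """Return the kraken2-build invocations, in order, as argv lists."""
    plan = [["kraken2-build", "--download-taxonomy", "--db", db]]
    for lib in libraries:
        plan.append(["kraken2-build", "--download-library", lib, "--db", db])
    plan.append(["kraken2-build", "--build", "--db", db, "--threads", str(threads)])
    plan.append(["kraken2-build", "--clean", "--db", db])
    return plan


# Dry run: print the commands instead of executing them
for argv in build_plan("custom_db", ["bacteria", "archaea", "viral"]):
    print(" ".join(argv))
    # To execute a step for real: subprocess.run(argv, check=True)
```

Building argv lists rather than shell strings avoids quoting problems if a database path contains spaces.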
Add Custom Sequences
# Add FASTA sequences to library
# (each header needs an NCBI accession or an explicit kraken:taxid|<taxid> tag)
kraken2-build --add-to-library custom_genomes.fasta --db custom_db
# Then build
kraken2-build --build --db custom_db
Inspect Database
# View database contents
kraken2-inspect --db /path/to/kraken2_db | head -50
Report Format
17.45 1745 1745 U 0 unclassified
82.55 8255 48 R 1 root
82.07 8207 2 R1 131567 cellular organisms
81.99 8199 132 D 2 Bacteria
76.23 7623 178 P 1224 Proteobacteria
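Columns (tab-separated):
- Percentage of reads in the clade rooted at this taxon
- Number of reads in the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- Rank code (U, R, D, K, P, C, O, F, G, S); a trailing number marks an intermediate rank, e.g. R1
- NCBI taxon ID
- Scientific name, indented to show taxonomic depth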
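The columns listed above can also be parsed without pandas. The sketch below splits one report line, separates a rank code such as `R1` into its letter and number, and derives tree depth from the indentation of the name. The two-spaces-per-level indentation is an assumption about the report layout, `ReportRow` and `parse_report_line` are illustrative names, and the sample lines reuse the values from the report excerpt with tabs and indentation filled in.

```python
import re
from typing import NamedTuple


class ReportRow(NamedTuple):
    pct: float
    reads_clade: int
    reads_taxon: int
    rank: str        # rank letter, e.g. "R" for the code "R1"
    sub_rank: int    # trailing number of the rank code, 0 when absent
    taxid: int
    name: str
    depth: int       # indentation level (assumes two spaces per level)


def parse_report_line(line: str) -> ReportRow:
    """Parse one tab-separated report line with the six standard columns."""
    pct, clade, taxon, rank_code, taxid, name = line.rstrip("\n").split("\t")
    match = re.fullmatch(r"([A-Z])(\d*)", rank_code)
    if match is None:
        raise ValueError(f"unexpected rank code: {rank_code!r}")
    indent = len(name) - len(name.lstrip(" "))
    return ReportRow(float(pct), int(clade), int(taxon), match.group(1),
                     int(match.group(2) or 0), int(taxid), name.strip(), indent // 2)


sample = [
    " 17.45\t1745\t1745\tU\t0\tunclassified",
    " 82.55\t8255\t48\tR\t1\troot",
    " 82.07\t8207\t2\tR1\t131567\t  cellular organisms",
    " 81.99\t8199\t132\tD\t2\t    Bacteria",
]
for row in map(parse_report_line, sample):
    print(row.depth, row.rank, row.sub_rank, row.name, row.reads_clade)
```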
Parse Kraken Output in Python
import pandas as pd
report = pd.read_csv('report.txt', sep='\t', header=None,
names=['pct', 'reads_clade', 'reads_taxon', 'rank', 'taxid', 'name'])
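# Names are indented to show tree depth; strip the padding
report['name'] = report['name'].str.strip()
# Exact match on 'S' keeps species only (sub-ranks such as 'S1' are excluded)
species = report[report['rank'] == 'S']
species_sorted = species.sort_values('pct', ascending=False)
print(species_sorted.head(20))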
Filter Report by Rank
# Get only species-level classifications
awk '$4 == "S"' report.txt > species_report.txt
# Get genus level
awk '$4 == "G"' report.txt > genus_report.txt
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --db | required | Database path |
| --threads | 1 | CPU threads |
| --confidence | 0.0 | Confidence threshold (0-1) |
| --minimum-base-quality | 0 | Phred quality threshold |
| --memory-mapping | false | Read the database from disk instead of loading it into RAM |
| --paired | false | Paired-end mode |
| --use-names | false | Include taxon names |
| --report-zero-counts | false | Include 0-count taxa |
Database Libraries
| Library | Content |
|---|---|
| bacteria | RefSeq complete bacterial genomes |
| archaea | RefSeq complete archaeal genomes |
| viral | RefSeq complete viral genomes |
| plasmid | RefSeq plasmid nucleotide sequences |
| human | GRCh38 human genome |
| fungi | RefSeq fungi |
| protozoa | RefSeq protozoa |
| UniVec_Core | Common vector sequences |
Related Skills
- abundance-estimation - Estimate abundances with Bracken
- metaphlan-profiling - Alternative marker-based profiling
- metagenome-visualization - Visualize results
Repository
gptomics/bioskills