scikit-bio
scikit-bio
A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.
When to Apply
Use this skill when users need to:
| Task Category | Examples |
|---|---|
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |
Installation
uv pip install scikit-bio
Sequences
Work with biological sequences through specialized DNA, RNA, and Protein classes.
import skbio
# Load from file
seq = skbio.DNA.read('gene.fasta')
# Common operations
complement = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()
# Pattern search
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')
# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()
Metadata types:
- Sequence-level: ID, description, source organism
- Positional: Per-base quality scores (from FASTQ)
- Interval: Feature annotations, gene boundaries
Sequence Alignment
Pairwise and multiple alignment using dynamic programming.
from skbio.alignment import local_pairwise_align_ssw, TabularMSA
# Local alignment (Smith-Waterman)
result = local_pairwise_align_ssw(query_seq, target_seq)
# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)
# Derive consensus
consensus_seq = alignment.consensus()
Notes:
local_pairwise_align_sswprovides fast SSW-based local alignmentStripedSmithWatermanhandles protein sequences with substitution matrices- Affine gap penalties suit biological sequences best
Phylogenetic Trees
Construct and analyze evolutionary trees.
from skbio import TreeNode
from skbio.tree import nj, upgma
# Build from distances
phylogeny = nj(distance_matrix)
# Load existing tree
phylogeny = TreeNode.read('species.nwk')
# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])
# Enumerate leaf nodes
leaves = list(phylogeny.tips())
# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])
# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))
# Pairwise distances for all tips
pairwise_dm = phylogeny.cophenetic_matrix()
# Topology comparison
rf_diff = phylogeny.robinson_foulds(other_tree)
Tree construction methods:
| Method | Use case |
|---|---|
nj() |
Standard neighbor-joining |
upgma() |
Assumes molecular clock |
bme() |
Scalable for large datasets |
Diversity Analysis
Calculate ecological diversity metrics.
Alpha Diversity (within-sample)
from skbio.diversity import alpha_diversity
# Sample abundance matrix
abundances = np.array([
[45, 12, 0, 8],
[5, 0, 33, 17],
[20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']
# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)
# Phylogenetic diversity (requires tree)
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
tree=phylogeny, otu_ids=feature_names)
Beta Diversity (between-sample)
from skbio.diversity import beta_diversity
# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
tree=phylogeny, otu_ids=feature_names)
Key points:
- Input must be integer counts, not proportions
- Phylogenetic metrics require a tree matching feature IDs
partial_beta_diversity()computes specific sample pairs efficiently
Ordination
Project high-dimensional data to visualizable spaces.
from skbio.stats.ordination import pcoa, cca
# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained
# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)
Methods:
| Function | Input | Purpose |
|---|---|---|
pcoa() |
Distance matrix | Unconstrained ordination |
cca() |
Abundance + environment | Constrained ordination (unimodal) |
rda() |
Abundance + environment | Constrained ordination (linear) |
Statistical Tests
Hypothesis testing for ecological data.
from skbio.stats.distance import permanova, anosim, mantel
# Group comparison
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")
# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)
# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")
Test overview:
| Test | Purpose | Key output |
|---|---|---|
| PERMANOVA | Group differences | F-statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Dispersion homogeneity | Tests PERMANOVA assumption |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
File I/O
Read and write 19+ biological formats.
import skbio
# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')
# Memory-efficient iteration
for record in skbio.io.read('reads.fastq', format='fastq', constructor=skbio.DNA):
if record.positional_metadata['quality'].mean() > 30:
process(record)
# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')
Supported formats:
| Category | Formats |
|---|---|
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
Distance Matrices
Store and manipulate pairwise distances.
from skbio import DistanceMatrix
import numpy as np
# Create from array
distances = np.array([
[0.0, 0.3, 0.7],
[0.3, 0.0, 0.5],
[0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])
# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']
# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
Feature Tables (BIOM)
Handle OTU/ASV abundance tables.
from skbio import Table
# Load table
tbl = Table.read('features.biom')
# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')
# Filter by abundance
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500, axis='sample')
# Convert to pandas
df = tbl.to_dataframe()
Protein Embeddings
Bridge language model outputs with scikit-bio analysis.
from skbio.embedding import ProteinEmbedding
# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)
# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')
# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')
Typical Workflows
Microbiome diversity study:
- Load BIOM table and phylogenetic tree
- Calculate alpha diversity per sample
- Compute beta diversity (UniFrac)
- Ordinate with PCoA
- Test group differences with PERMANOVA
Phylogenetic inference:
- Read sequences from FASTA
- Perform multiple alignment
- Calculate pairwise distances
- Construct tree with neighbor-joining
- Analyze clade relationships
Sequence processing:
- Read FASTQ with quality scores
- Filter low-quality reads
- Search for motifs
- Translate to protein
- Export as FASTA
Performance Tips
- Use generators for large sequence files
- Prefer BIOM HDF5 over JSON for big tables
- Apply
partial_beta_diversity()when computing only specific pairs - Choose BME for very large phylogenies
Ecosystem Integration
| Library | Integration |
|---|---|
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |
Reference Files
| File | Contents |
|---|---|
| references/api-reference.md | Complete method signatures, parameters, extended examples, and troubleshooting |