scikit-bio

SKILL.md

scikit-bio

A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.

When to Apply

Use this skill when users need to:

Task Category Examples
Sequence work DNA/RNA/protein manipulation, motif finding, translation
File handling FASTA, FASTQ, GenBank, Newick, BIOM I/O
Alignments Pairwise or multiple sequence alignment
Phylogenetics Tree construction, manipulation, distance calculations
Diversity metrics Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac)
Ordination PCoA, CCA, RDA for dimensionality reduction
Statistical tests PERMANOVA, ANOSIM, Mantel tests
Microbiome analysis Feature tables, rarefaction, community comparisons

Installation

uv pip install scikit-bio

Sequences

Work with biological sequences through specialized DNA, RNA, and Protein classes.

import skbio

# Load from file
seq = skbio.DNA.read('gene.fasta')

# Common operations
complement = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()

# Pattern search
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')

# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()

Metadata types:

  • Sequence-level: ID, description, source organism
  • Positional: Per-base quality scores (from FASTQ)
  • Interval: Feature annotations, gene boundaries

Sequence Alignment

Pairwise and multiple alignment using dynamic programming.

from skbio.alignment import local_pairwise_align_ssw, TabularMSA

# Local alignment (Smith-Waterman)
result = local_pairwise_align_ssw(query_seq, target_seq)

# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)

# Derive consensus
consensus_seq = alignment.consensus()

Notes:

  • local_pairwise_align_ssw provides fast SSW-based local alignment
  • StripedSmithWaterman handles protein sequences with substitution matrices
  • Affine gap penalties suit biological sequences best

Phylogenetic Trees

Construct and analyze evolutionary trees.

from skbio import TreeNode
from skbio.tree import nj, upgma

# Build from distances
phylogeny = nj(distance_matrix)

# Load existing tree
phylogeny = TreeNode.read('species.nwk')

# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])

# Enumerate leaf nodes
leaves = list(phylogeny.tips())

# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])

# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))

# Pairwise distances for all tips
pairwise_dm = phylogeny.cophenetic_matrix()

# Topology comparison
rf_diff = phylogeny.robinson_foulds(other_tree)

Tree construction methods:

Method Use case
nj() Standard neighbor-joining
upgma() Assumes molecular clock
bme() Scalable for large datasets

Diversity Analysis

Calculate ecological diversity metrics.

Alpha Diversity (within-sample)

from skbio.diversity import alpha_diversity

# Sample abundance matrix
abundances = np.array([
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']

# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)

# Phylogenetic diversity (requires tree)
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
                             tree=phylogeny, otu_ids=feature_names)

Beta Diversity (between-sample)

from skbio.diversity import beta_diversity

# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
                            tree=phylogeny, otu_ids=feature_names)

Key points:

  • Input must be integer counts, not proportions
  • Phylogenetic metrics require a tree matching feature IDs
  • partial_beta_diversity() computes specific sample pairs efficiently

Ordination

Project high-dimensional data to visualizable spaces.

from skbio.stats.ordination import pcoa, cca

# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained

# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)

Methods:

Function Input Purpose
pcoa() Distance matrix Unconstrained ordination
cca() Abundance + environment Constrained ordination (unimodal)
rda() Abundance + environment Constrained ordination (linear)

Statistical Tests

Hypothesis testing for ecological data.

from skbio.stats.distance import permanova, anosim, mantel

# Group comparison
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")

# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)

# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")

Test overview:

Test Purpose Key output
PERMANOVA Group differences F-statistic, p-value
ANOSIM Group differences (alternative) R-statistic, p-value
PERMDISP Dispersion homogeneity Tests PERMANOVA assumption
Mantel Matrix correlation Correlation coefficient, p-value

File I/O

Read and write 19+ biological formats.

import skbio

# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')

# Memory-efficient iteration
for record in skbio.io.read('reads.fastq', format='fastq', constructor=skbio.DNA):
    if record.positional_metadata['quality'].mean() > 30:
        process(record)

# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')

Supported formats:

Category Formats
Sequences FASTA, FASTQ, GenBank, EMBL, QSeq
Alignments Clustal, PHYLIP, Stockholm
Trees Newick
Tables BIOM (HDF5/JSON)
Distances Delimited matrices

Distance Matrices

Store and manipulate pairwise distances.

from skbio import DistanceMatrix
import numpy as np

# Create from array
distances = np.array([
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])

# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']

# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])

Feature Tables (BIOM)

Handle OTU/ASV abundance tables.

from skbio import Table

# Load table
tbl = Table.read('features.biom')

# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')

# Filter by abundance
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500, axis='sample')

# Convert to pandas
df = tbl.to_dataframe()

Protein Embeddings

Bridge language model outputs with scikit-bio analysis.

from skbio.embedding import ProteinEmbedding

# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)

# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')

# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')

Typical Workflows

Microbiome diversity study:

  1. Load BIOM table and phylogenetic tree
  2. Calculate alpha diversity per sample
  3. Compute beta diversity (UniFrac)
  4. Ordinate with PCoA
  5. Test group differences with PERMANOVA

Phylogenetic inference:

  1. Read sequences from FASTA
  2. Perform multiple alignment
  3. Calculate pairwise distances
  4. Construct tree with neighbor-joining
  5. Analyze clade relationships

Sequence processing:

  1. Read FASTQ with quality scores
  2. Filter low-quality reads
  3. Search for motifs
  4. Translate to protein
  5. Export as FASTA

Performance Tips

  • Use generators for large sequence files
  • Prefer BIOM HDF5 over JSON for big tables
  • Apply partial_beta_diversity() when computing only specific pairs
  • Choose BME for very large phylogenies

Ecosystem Integration

Library Integration
pandas DataFrames from distance matrices, diversity results
numpy Array conversions throughout
matplotlib/seaborn Plot ordination results, heatmaps
scikit-learn Distance matrices as input
QIIME 2 Native BIOM, tree, distance matrix compatibility

Reference Files

File Contents
references/api-reference.md Complete method signatures, parameters, extended examples, and troubleshooting
Weekly Installs
26
First Seen
Feb 25, 2026
Installed on
mcpjam26
claude-code26
replit26
junie26
windsurf26
zencoder26