scikit-bio
scikit-bio
A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.
When to Apply
Use this skill when users need to:
| Task Category | Examples |
|---|---|
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |
Installation
uv pip install scikit-bio
Sequences
Work with biological sequences through specialized DNA, RNA, and Protein classes.
import skbio
# Load from file
seq = skbio.DNA.read('gene.fasta')
# Common operations
complement = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()
# Pattern search
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')
# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()
Metadata types:
- Sequence-level: ID, description, source organism
- Positional: Per-base quality scores (from FASTQ)
- Interval: Feature annotations, gene boundaries
Sequence Alignment
Pairwise and multiple alignment using dynamic programming.
from skbio.alignment import local_pairwise_align_ssw, TabularMSA
# Local alignment (Smith-Waterman)
result = local_pairwise_align_ssw(query_seq, target_seq)
# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)
# Derive consensus
consensus_seq = alignment.consensus()
Notes:
local_pairwise_align_sswprovides fast SSW-based local alignmentStripedSmithWatermanhandles protein sequences with substitution matrices- Affine gap penalties suit biological sequences best
Phylogenetic Trees
Construct and analyze evolutionary trees.
from skbio import TreeNode
from skbio.tree import nj, upgma
# Build from distances
phylogeny = nj(distance_matrix)
# Load existing tree
phylogeny = TreeNode.read('species.nwk')
# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])
# Enumerate leaf nodes
leaves = list(phylogeny.tips())
# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])
# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))
# Pairwise distances for all tips
pairwise_dm = phylogeny.cophenetic_matrix()
# Topology comparison
rf_diff = phylogeny.robinson_foulds(other_tree)
Tree construction methods:
| Method | Use case |
|---|---|
nj() |
Standard neighbor-joining |
upgma() |
Assumes molecular clock |
bme() |
Scalable for large datasets |
Diversity Analysis
Calculate ecological diversity metrics.
Alpha Diversity (within-sample)
from skbio.diversity import alpha_diversity
# Sample abundance matrix
abundances = np.array([
[45, 12, 0, 8],
[5, 0, 33, 17],
[20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']
# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)
# Phylogenetic diversity (requires tree)
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
tree=phylogeny, otu_ids=feature_names)
Beta Diversity (between-sample)
from skbio.diversity import beta_diversity
# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
tree=phylogeny, otu_ids=feature_names)
Key points:
- Input must be integer counts, not proportions
- Phylogenetic metrics require a tree matching feature IDs
partial_beta_diversity()computes specific sample pairs efficiently
Ordination
Project high-dimensional data to visualizable spaces.
from skbio.stats.ordination import pcoa, cca
# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained
# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)
Methods:
| Function | Input | Purpose |
|---|---|---|
pcoa() |
Distance matrix | Unconstrained ordination |
cca() |
Abundance + environment | Constrained ordination (unimodal) |
rda() |
Abundance + environment | Constrained ordination (linear) |
Statistical Tests
Hypothesis testing for ecological data.
from skbio.stats.distance import permanova, anosim, mantel
# Group comparison
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")
# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)
# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")
Test overview:
| Test | Purpose | Key output |
|---|---|---|
| PERMANOVA | Group differences | F-statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Dispersion homogeneity | Tests PERMANOVA assumption |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
File I/O
Read and write 19+ biological formats.
import skbio
# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')
# Memory-efficient iteration
for record in skbio.io.read('reads.fastq', format='fastq', constructor=skbio.DNA):
if record.positional_metadata['quality'].mean() > 30:
process(record)
# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')
Supported formats:
| Category | Formats |
|---|---|
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
Distance Matrices
Store and manipulate pairwise distances.
from skbio import DistanceMatrix
import numpy as np
# Create from array
distances = np.array([
[0.0, 0.3, 0.7],
[0.3, 0.0, 0.5],
[0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])
# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']
# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
Feature Tables (BIOM)
Handle OTU/ASV abundance tables.
from skbio import Table
# Load table
tbl = Table.read('features.biom')
# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')
# Filter by abundance
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500, axis='sample')
# Convert to pandas
df = tbl.to_dataframe()
Protein Embeddings
Bridge language model outputs with scikit-bio analysis.
from skbio.embedding import ProteinEmbedding
# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)
# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')
# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')
Typical Workflows
Microbiome diversity study:
- Load BIOM table and phylogenetic tree
- Calculate alpha diversity per sample
- Compute beta diversity (UniFrac)
- Ordinate with PCoA
- Test group differences with PERMANOVA
Phylogenetic inference:
- Read sequences from FASTA
- Perform multiple alignment
- Calculate pairwise distances
- Construct tree with neighbor-joining
- Analyze clade relationships
Sequence processing:
- Read FASTQ with quality scores
- Filter low-quality reads
- Search for motifs
- Translate to protein
- Export as FASTA
Performance Tips
- Use generators for large sequence files
- Prefer BIOM HDF5 over JSON for big tables
- Apply
partial_beta_diversity()when computing only specific pairs - Choose BME for very large phylogenies
Ecosystem Integration
| Library | Integration |
|---|---|
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |
Reference Files
| File | Contents |
|---|---|
| references/api-reference.md | Complete method signatures, parameters, extended examples, and troubleshooting |
More from aminoanalytica/amina-skills
uniprot-database
Query and retrieve protein sequences, annotations, and functional data from UniProt. Supports text search, ID mapping between databases, batch downloads, and access to Swiss-Prot (reviewed) and TrEMBL (predicted) entries.
29rdkit
Python cheminformatics library for molecular manipulation and analysis. Parse SMILES/SDF/MOL formats, compute descriptors (MW, LogP, TPSA), generate fingerprints (Morgan, MACCS), perform substructure queries with SMARTS, create 2D/3D geometries, calculate similarity, and run chemical reactions.
28biorxiv-database
Search and retrieve preprints from bioRxiv. Use when asked to "search bioRxiv", "find preprints", "look up bioRxiv papers", or retrieve life sciences literature.
28pdb-database
Query and retrieve protein/nucleic acid structures from RCSB PDB. Use when you need to search the PDB database for structures or metadata. Supports text, sequence, and structure-based searches, coordinate downloads, and metadata retrieval for structural biology workflows.
28chembl-database
Query the ChEMBL database for bioactive compounds, drug targets, and bioactivity data. Use this skill when searching for small molecules, finding inhibitors for protein targets, or analyzing drug mechanisms of action.
28amina-init
>
28