tooluniverse-phylogenetics

Originally from mims-harvard/tooluniverse

Phylogenetics and Sequence Analysis

Comprehensive phylogenetics and sequence analysis using PhyKIT, Biopython, and DendroPy. Designed for bioinformatics questions about multiple sequence alignments, phylogenetic trees, parsimony, molecular evolution, and comparative genomics.

IMPORTANT: This skill handles complex phylogenetic workflows. Most implementation details have been moved to references/ for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.

When to Use This Skill

Apply when users:

Have FASTA alignment files and ask about parsimony informative sites, gaps, or alignment quality
Have Newick tree files and ask about treeness, tree length, evolutionary rate, or DVMC (degree of violation of the molecular clock)
Ask about treeness/RCV or RCV (relative composition variability)
Need to compare phylogenetic metrics between groups (fungi vs animals, etc.)
Ask about PhyKIT functions (treeness, rcv, dvmc, evo_rate, parsimony_informative, tree_length)
Have gene family data with paired alignments and trees
Need Mann-Whitney U tests or other statistical comparisons of phylogenetic metrics
Ask about bootstrap support, branch lengths, or tree topology
Need to build trees (NJ, UPGMA, parsimony) from alignments
Ask about Robinson-Foulds distance or tree comparison

BixBench Coverage: 33 questions across 8 projects (bix-4, bix-11, bix-12, bix-25, bix-35, bix-38, bix-45, bix-60)

NOT for (use the external tools listed instead):

Multiple sequence alignment generation → Use external tools (MUSCLE, MAFFT, ClustalW)
Maximum Likelihood tree construction → Use IQ-TREE, RAxML, or PhyML
Bayesian phylogenetics → Use MrBayes or BEAST
Ancestral state reconstruction → Use separate tools

Core Principles

Data-first approach - Discover and validate all input files (alignments, trees) before any analysis
PhyKIT-compatible - Use PhyKIT functions for treeness, RCV, DVMC, parsimony, evolutionary rate (matches BixBench expected outputs)
Format-flexible - Support FASTA, PHYLIP, Nexus, Newick, and auto-detect formats
Batch processing - Process hundreds of gene alignments/trees in a single analysis
Statistical rigor - Mann-Whitney U, medians, percentiles, standard deviations with scipy.stats
Precision awareness - Match rounding to 4 decimal places (PhyKIT default) or as requested
Group comparison - Compare metrics between taxa groups (e.g., fungi vs animals)
Question-driven - Parse exactly what is asked and return the specific number/statistic

Required Python Packages

# Core (MUST be installed)
import numpy as np
import pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# PhyKIT (primary computation engine)
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability

# DendroPy (for advanced tree operations)
import dendropy

# ToolUniverse (for sequence retrieval)
from tooluniverse import ToolUniverse

Installation:

pip install phykit dendropy biopython pandas numpy scipy tooluniverse

High-Level Workflow Decision Tree

START: User question about phylogenetic data
│
├─ Q1: What type of analysis is needed?
│  │
│  ├─ ALIGNMENT ANALYSIS (FASTA/PHYLIP files)
│  │  ├─ Parsimony informative sites → phykit_parsimony_informative()
│  │  ├─ RCV score → phykit_rcv()
│  │  ├─ Gap percentage → alignment_gap_percentage()
│  │  ├─ GC content → alignment_statistics()
│  │  └─ See: references/sequence_alignment.md
│  │
│  ├─ TREE ANALYSIS (Newick files)
│  │  ├─ Treeness → phykit_treeness()
│  │  ├─ Tree length → phykit_tree_length()
│  │  ├─ Evolutionary rate → phykit_evolutionary_rate()
│  │  ├─ DVMC → phykit_dvmc()
│  │  ├─ Bootstrap support → extract_bootstrap_support()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ COMBINED ANALYSIS (alignment + tree)
│  │  └─ Treeness/RCV → phykit_treeness_over_rcv()
│  │
│  ├─ TREE CONSTRUCTION (build from alignment)
│  │  ├─ Neighbor-Joining → build_nj_tree()
│  │  ├─ UPGMA → build_upgma_tree()
│  │  ├─ Parsimony → build_parsimony_tree()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ GROUP COMPARISON (fungi vs animals, etc.)
│  │  ├─ Batch compute metrics per group
│  │  ├─ Mann-Whitney U test
│  │  ├─ Summary statistics (median, mean, percentiles)
│  │  └─ See: references/parsimony_analysis.md
│  │
│  └─ TREE COMPARISON
│     ├─ Robinson-Foulds distance → robinson_foulds_distance()
│     └─ Bootstrap consensus → bootstrap_analysis()
│
├─ Q2: What data format is available?
│  ├─ FASTA (.fa, .fasta, .faa, .fna)
│  ├─ PHYLIP (.phy, .phylip) - Use phylip-relaxed for long names
│  ├─ Nexus (.nex, .nexus)
│  ├─ Newick (.nwk, .newick, .tre, .tree)
│  └─ Auto-detect with load_alignment() or load_tree()
│
└─ Q3: Is this a batch analysis?
   ├─ Single gene → Run metric function once
   ├─ Multiple genes → Use batch_compute_metric()
   └─ Group comparison → Use discover_gene_files() + compare_groups()

Quick Reference: Common Metrics

Metric	Function	Input	Description
Treeness	`phykit_treeness(tree_file)`	Newick	Internal branch length / Total branch length
RCV	`phykit_rcv(aln_file)`	FASTA/PHYLIP	Relative Composition Variability
Treeness/RCV	`phykit_treeness_over_rcv(tree, aln)`	Both	Treeness divided by RCV
Tree Length	`phykit_tree_length(tree_file)`	Newick	Sum of all branch lengths
Evolutionary Rate	`phykit_evolutionary_rate(tree_file)`	Newick	Total branch length / num terminals
DVMC	`phykit_dvmc(tree_file)`	Newick	Degree of Violation of Molecular Clock
Parsimony Sites	`phykit_parsimony_informative(aln_file)`	FASTA/PHYLIP	Sites with ≥2 character states, each appearing ≥2 times
Gap Percentage	`alignment_gap_percentage(aln_file)`	FASTA/PHYLIP	Percentage of gap characters

See scripts/tree_statistics.py for implementation.
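
As a rough illustration of the "Parsimony Sites" row above, the sketch below counts alignment columns in which at least two character states each occur at least twice. It is not PhyKIT's implementation: the name `count_parsimony_informative` and the decision to skip `-` and `?` are assumptions made for this example, so counts may differ from `phykit_parsimony_informative()` on alignments with ambiguous characters.

```python
from collections import Counter


def count_parsimony_informative(sequences, ignore="-?"):
    """Count columns with >=2 character states that each appear >=2 times.

    `sequences` is a list of equal-length aligned strings. Characters in
    `ignore` are skipped (an assumption for this sketch, not PhyKIT's rule).
    """
    sequences = list(sequences)
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    informative = 0
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column if c not in ignore)
        # A state "qualifies" when it is seen in at least two sequences.
        qualifying_states = sum(1 for n in counts.values() if n >= 2)
        if qualifying_states >= 2:
            informative += 1
    return informative


if __name__ == "__main__":
    alignment = ["AAGT", "AAGT", "ACCT", "ACC-"]
    # Columns 2 and 3 each have two states seen twice; columns 1 and 4 do not.
    print(count_parsimony_informative(alignment))
```

A column such as `A, A, A, C` is variable but not parsimony-informative, because `C` appears only once.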

Common Analysis Patterns (BixBench)

Pattern 1: Single Metric Across Groups

Question: "What is the median DVMC for fungi vs animals?"

Workflow:

# 1. Discover files
fungi_genes = discover_gene_files("data/fungi")
animal_genes = discover_gene_files("data/animals")

# 2. Compute metric
fungi_dvmc = batch_dvmc(fungi_genes)
animal_dvmc = batch_dvmc(animal_genes)

# 3. Compare
fungi_values = list(fungi_dvmc.values())
animal_values = list(animal_dvmc.values())

print(f"Fungi median DVMC: {np.median(fungi_values):.4f}")
print(f"Animal median DVMC: {np.median(animal_values):.4f}")

See: references/parsimony_analysis.md for full implementation

Pattern 2: Statistical Comparison

Question: "What is the Mann-Whitney U statistic comparing treeness between groups?"

Workflow:

from scipy import stats

# Compute treeness for both groups
group1_treeness = batch_treeness(group1_genes)
group2_treeness = batch_treeness(group2_genes)

# Mann-Whitney U test (two-sided)
u_stat, p_value = stats.mannwhitneyu(
    list(group1_treeness.values()),
    list(group2_treeness.values()),
    alternative='two-sided'
)

print(f"U statistic: {u_stat:.0f}")
print(f"P-value: {p_value:.4e}")
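
To make clear which number the U statistic refers to, here is a dependency-free sketch of the rank-sum calculation. The helper name `mann_whitney_u` is hypothetical and no p-value is computed. It returns both U1 (for the first sample) and U2 (for the second). Recent SciPy releases document the statistic returned by `stats.mannwhitneyu` as the one for the first sample; older versions may behave differently, so compare against your installed version.

```python
def mann_whitney_u(x, y):
    """Return (U1, U2) for samples x and y using average ranks for ties."""
    x, y = list(x), list(y)
    pooled = sorted((value, index) for index, value in enumerate(x + y))

    # Assign 1-based ranks; tied values share the mean of their rank positions.
    ranks = [0.0] * len(pooled)
    pos = 0
    while pos < len(pooled):
        end = pos
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[pooled[k][1]] = average_rank
        pos = end + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[:n1])          # ranks of the first sample
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                      # U1 + U2 always equals n1 * n2
    return u1, u2


if __name__ == "__main__":
    print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

Because U1 + U2 = n1 × n2, swapping the group order changes which value is reported, so keep the order consistent with the question wording.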

Pattern 3: Filtering + Metric

Question: "What is the treeness/RCV for alignments with <5% gaps?"

Workflow:

# 1. Filter by gap percentage
valid_genes = []
for entry in gene_files:
    if 'aln_file' in entry:
        gap_pct = alignment_gap_percentage(entry['aln_file'])
        if gap_pct < 5.0:
            valid_genes.append(entry)

# 2. Compute metric on filtered set
results = batch_treeness_over_rcv(valid_genes)

# 3. Report
values = [r[0] for r in results.values()]  # treeness/rcv ratio
print(f"Median treeness/RCV: {np.median(values):.4f}")
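
The gap filter in step 1 depends on how gap percentage is defined. A minimal sketch, assuming it means gap characters divided by all alignment characters, is shown below. `read_fasta` and `gap_percentage` are hypothetical stand-ins written for this example, not the skill's `alignment_gap_percentage()`; treating only `-` as a gap is also an assumption.

```python
def read_fasta(path):
    """Read a FASTA file into {record_id: sequence} using only the stdlib."""
    chunks, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                # Fall back to a positional name if the header is empty.
                name = parts[0] if parts else f"seq{len(chunks) + 1}"
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {key: "".join(value) for key, value in chunks.items()}


def gap_percentage(sequences, gap_chars="-"):
    """Percentage of characters across all sequences that are gaps."""
    sequences = list(sequences)
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    gaps = sum(s.count(ch) for s in sequences for ch in gap_chars)
    return 100.0 * gaps / total


if __name__ == "__main__":
    print(gap_percentage(["AC-T", "A--T"]))  # 3 gaps out of 8 characters
```

If a question defines gaps per column or per sequence instead, the denominator changes and the 5% threshold will select a different set of genes.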

Pattern 4: Specific Gene Lookup

Question: "What is the evolutionary rate for gene X?"

Workflow:

# Find gene file
gene_files = discover_gene_files("data/")
gene_entry = [g for g in gene_files if g['gene_id'] == 'X'][0]

# Compute metric
evo_rate = phykit_evolutionary_rate(gene_entry['tree_file'])

print(f"Evolutionary rate for gene X: {evo_rate:.4f}")
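
For orientation, the tree metrics in the Quick Reference table can be reproduced by hand on a small Newick string. The sketch below uses regular expressions rather than a real parser, so it assumes unquoted labels, no bracketed comments, and no branch length on the root. The helper names are hypothetical, and PhyKIT remains the reference implementation.

```python
import re

_NUMBER = r"([0-9.]+(?:[eE][+-]?[0-9]+)?)"
# A terminal branch: a leaf label directly after "(" or ",", then ":length".
_TERMINAL = re.compile(r"[(,]\s*[^:(),;\s][^:(),;]*:" + _NUMBER)
# An internal branch: ")" with an optional support value, then ":length".
_INTERNAL = re.compile(r"\)[^:(),;]*:" + _NUMBER)


def branch_lengths(newick):
    """Return (terminal_lengths, internal_lengths) as lists of floats."""
    terminal = [float(x) for x in _TERMINAL.findall(newick)]
    internal = [float(x) for x in _INTERNAL.findall(newick)]
    return terminal, internal


def tree_metrics(newick):
    """Compute tree length, treeness, and evolutionary rate."""
    terminal, internal = branch_lengths(newick)
    total = sum(terminal) + sum(internal)
    return {
        # Sum of all branch lengths.
        "tree_length": total,
        # Internal branch length / total branch length.
        "treeness": sum(internal) / total if total else float("nan"),
        # Total branch length / number of terminals.
        "evolutionary_rate": total / len(terminal) if terminal else float("nan"),
    }


if __name__ == "__main__":
    metrics = tree_metrics("((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);")
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

For the example tree, the terminal branches sum to 1.2 and the internal branches to 0.9, so tree length is 2.1, treeness is 0.9 / 2.1, and evolutionary rate is 2.1 / 4.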

Choosing Methods: When to Use What

Alignment Methods

When building alignments (use external tools, not this skill):

Method	Speed	Accuracy	Use Case
ClustalW	Slow	Medium	Small datasets (<100 sequences), educational
MUSCLE	Fast	High	Medium datasets (100-1000 sequences)
MAFFT	Very Fast	Very High	Recommended - Large datasets (>1000 sequences)

For this skill: Work with pre-aligned sequences. Use load_alignment() to read any format.

Tree Building Methods

When to use which tree method:

Method	Speed	Accuracy	Use Case
Neighbor-Joining	Fast	Medium	Quick trees, large datasets, exploratory
UPGMA	Fast	Low	Assumes molecular clock, special cases only
Maximum Parsimony	Medium	Medium	Small datasets, discrete characters
Maximum Likelihood	Slow	High	Use external tools (IQ-TREE, RAxML) for production

Implementation in this skill:

# Fast distance-based trees
tree = build_nj_tree("alignment.fa")  # Neighbor-Joining
tree = build_upgma_tree("alignment.fa")  # UPGMA

# Parsimony (for small alignments)
tree = build_parsimony_tree("alignment.fa")

For production ML trees: Use IQ-TREE or RAxML externally, then analyze with this skill.

See references/tree_building.md for detailed implementations.

Batch Processing

Discovering Gene Files

# Auto-discover paired alignment + tree files
gene_files = discover_gene_files("data/")

# Result: list of dicts with 'gene_id', 'aln_file', 'tree_file'
# [
#   {'gene_id': 'gene1', 'aln_file': 'gene1.fa', 'tree_file': 'gene1.nwk'},
#   {'gene_id': 'gene2', 'aln_file': 'gene2.fa', 'tree_file': 'gene2.nwk'},
#   ...
# ]
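
One plausible way to build such a list is to pair files by their shared stem. The sketch below is an assumption about how discovery could work, not the skill's actual `discover_gene_files()`. The name `pair_gene_files` is hypothetical, and the extension sets are taken from the format list in the decision tree.

```python
from pathlib import Path

ALIGNMENT_EXTENSIONS = {".fa", ".fasta", ".faa", ".fna", ".phy", ".phylip", ".nex", ".nexus"}
TREE_EXTENSIONS = {".nwk", ".newick", ".tre", ".tree"}


def pair_gene_files(directory):
    """Group alignment and tree files in `directory` by file stem.

    Returns a list of dicts with 'gene_id' plus 'aln_file' and/or 'tree_file',
    depending on which files exist for that stem.
    """
    entries = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in ALIGNMENT_EXTENSIONS:
            key = "aln_file"
        elif suffix in TREE_EXTENSIONS:
            key = "tree_file"
        else:
            continue  # ignore unrelated files
        entry = entries.setdefault(path.stem, {"gene_id": path.stem})
        entry[key] = str(path)
    return [entries[gene_id] for gene_id in sorted(entries)]


if __name__ == "__main__":
    for entry in pair_gene_files("."):
        print(entry)
```

Entries may lack one of the two keys when only an alignment or only a tree is present, which is why Pattern 3 checks `'aln_file' in entry` before using it.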

Computing Metrics in Batch

# Tree metrics
treeness_results = batch_treeness(gene_files)
tree_length_results = batch_tree_length(gene_files)
dvmc_results = batch_dvmc(gene_files)
evo_rate_results = batch_evolutionary_rate(gene_files)

# Alignment metrics
rcv_results = batch_rcv(gene_files)
pi_results = batch_parsimony_informative(gene_files)
gap_results = batch_gap_percentage(gene_files)

# Combined metrics
treeness_rcv_results = batch_treeness_over_rcv(gene_files)

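# All return dict: {gene_id: value}
# For batch_treeness_over_rcv, each value is indexable with the ratio at index 0 (as used in Pattern 3)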

Statistical Analysis

# Summary statistics
treeness_stats = summary_stats(list(treeness_results.values()))  # named so it does not shadow scipy's `stats` module
# Returns: {'mean': ..., 'median': ..., 'std': ..., 'min': ..., 'max': ...}

# Group comparison
comparison = compare_groups(
    list(fungi_treeness.values()),
    list(animal_treeness.values()),
    group1_name="Fungi",
    group2_name="Animals"
)
# Returns: {'u_statistic': ..., 'p_value': ..., 'Fungi': {...}, 'Animals': {...}}

See references/parsimony_analysis.md for full workflow.
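
A summary of the kind returned above can be sketched with the standard library alone. The helper `summarize` is hypothetical. It uses the population standard deviation (the same convention as NumPy's default `ddof=0`) and the "inclusive" quantile method; both are assumptions, so confirm them against the real `summary_stats()` before comparing numbers.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a sequence of numbers."""
    values = list(values)
    # Quartiles via linear interpolation between data points; this call
    # expects at least two values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population SD (assumption)
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }


if __name__ == "__main__":
    for name, value in summarize([1, 2, 3, 4]).items():
        print(f"{name}: {value:.4f}")
```

When a question asks for a sample standard deviation, swap `statistics.pstdev` for `statistics.stdev`.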

Answer Extraction for BixBench

Question Pattern	Extraction Method
"What is the median X?"	`np.median(values)`
"What is the maximum X?"	`np.max(values)`
"What is the difference between median X for A vs B?"	`abs(np.median(a) - np.median(b))`
"What percentage of X have Y above Z?"	`sum(v > Z for v in values) / len(values) * 100`
"What is the Mann-Whitney U statistic?"	`stats.mannwhitneyu(a, b)[0]`
"What is the p-value?"	`stats.mannwhitneyu(a, b)[1]`
"What is the X value for gene Y?"	`results[gene_id]`
"What is the fold-change in median X?"	`np.median(a) / np.median(b)`
"multiplied by 1000"	`round(value * 1000)`

Rounding Rules

PhyKIT default: 4 decimal places
Percentages: Match question format (e.g., "35%" → integer, "3.5%" → 1 decimal)
P-values: Scientific notation for very small values
U statistics: Integer (no decimals)
Always check question wording: "rounded to 3 decimal places" overrides defaults
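
The extraction and rounding conventions above can be collected into a few small helpers. This is a sketch only: the function names and the `1e-4` cut-off for switching p-values to scientific notation are assumptions chosen for illustration, not rules defined by PhyKIT or BixBench.

```python
import statistics


def percent_above(values, threshold):
    """Percentage of values strictly greater than `threshold`."""
    values = list(values)
    return 100.0 * sum(v > threshold for v in values) / len(values)


def median_fold_change(a, b):
    """Ratio of the median of `a` to the median of `b`."""
    return statistics.median(a) / statistics.median(b)


def format_metric(value, decimals=4):
    """Fixed-point string; 4 decimals mirrors the PhyKIT default noted above."""
    return f"{value:.{decimals}f}"


def format_p_value(p, sci_below=1e-4, digits=4):
    """Scientific notation for very small p-values, fixed-point otherwise."""
    if p < sci_below:
        return f"{p:.{digits}e}"
    return f"{p:.{digits}f}"


def scaled_integer(value, factor=1000):
    """Handle questions phrased as 'multiplied by 1000'."""
    return round(value * factor)


if __name__ == "__main__":
    print(percent_above([0.1, 0.4, 0.6, 0.9], 0.5))
    print(format_metric(0.123456))
    print(format_p_value(3.2e-7))
    print(scaled_integer(0.0123))
```

Pass `decimals` explicitly whenever the question states a precision, for example `format_metric(value, 3)` for "rounded to 3 decimal places".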

BixBench Question Coverage

Project	Questions	Metrics
bix-4	7	DVMC analysis (fungi vs animals)
bix-11	6	Treeness analysis (median, percentages, Mann-Whitney U)
bix-12	5	Parsimony informative sites (counts, percentages, ratios)
bix-25	2	Treeness/RCV with gap filtering
bix-35	4	Evolutionary rate (specific genes, comparisons)
bix-38	5	Tree length (fold-change, variance, paired ratios)
bix-45	4	RCV (Mann-Whitney U, medians, paired differences)
bix-60	1	Average treeness across multiple trees

ToolUniverse Integration

Sequence Retrieval

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Get sequences from NCBI
result = tu.tools.NCBI_get_sequence(accession="NP_000546")

# Get gene tree from Ensembl
tree_result = tu.tools.EnsemblCompara_get_gene_tree(gene="ENSG00000141510")

# Get species tree from OpenTree
tree_result = tu.tools.OpenTree_get_induced_subtree(ott_ids="770315,770319")

File Structure

tooluniverse-phylogenetics/
├── SKILL.md                           # This file (workflow orchestration)
├── QUICK_START.md                     # Quick reference
├── test_phylogenetics.py             # Comprehensive test suite
├── references/
│   ├── sequence_alignment.md         # Alignment analysis details
│   ├── tree_building.md              # Tree construction methods
│   ├── parsimony_analysis.md         # Statistical comparison workflows
│   └── troubleshooting.md            # Common issues and solutions
└── scripts/
    ├── format_alignment.py           # Alignment format conversion
    └── tree_statistics.py            # Core metric implementations

Completeness Checklist

Before returning your answer, verify:

Next Steps

For detailed alignment analysis workflows → See references/sequence_alignment.md
For tree construction methods → See references/tree_building.md
For statistical comparison examples → See references/parsimony_analysis.md
For common errors and solutions → See references/troubleshooting.md
For script implementations → See scripts/tree_statistics.py

Support

For issues with:

PhyKIT functions: Check PhyKIT documentation at https://jlsteenwyk.com/PhyKIT/
Biopython tree/alignment parsing: See https://biopython.org/wiki/Phylo
DendroPy operations: See https://dendropy.org/
ToolUniverse integration: Check ToolUniverse documentation

License

Same as ToolUniverse framework license.

Repository

wu-yc/labclaw

GitHub Stars

981

First Seen

Mar 15, 2026
