biopython

SKILL.md

Biopython: Python Tools for Computational Biology

Summary

Biopython (v1.85+) delivers a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy, providing modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.

Applicable Scenarios

This skill applies when you need to:

Task Category Examples
Sequence Operations Create, modify, translate DNA/RNA/protein sequences
File Format Handling Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF
NCBI Database Access Query GenBank, PubMed, Protein, Gene, Taxonomy
Similarity Searches Execute BLAST locally or via NCBI, parse results
Alignment Work Pairwise or multiple sequence alignments
Structural Analysis Parse PDB files, compute distances, DSSP assignment
Tree Construction Build, manipulate, visualize phylogenetic trees
Motif Discovery Find and score sequence patterns
Sequence Statistics GC content, molecular weight, melting temperature

Module Organization

Module Purpose Reference
Bio.Seq / Bio.SeqIO Sequence objects and file I/O references/sequence-io.md
Bio.Align / Bio.AlignIO Pairwise and multiple alignments references/alignment.md
Bio.Entrez NCBI database programmatic access references/databases.md
Bio.Blast BLAST execution and result parsing references/blast.md
Bio.PDB 3D structure manipulation references/structure.md
Bio.Phylo Phylogenetic tree operations references/phylogenetics.md
Bio.motifs, Bio.SeqUtils, etc. Motifs, utilities, restriction sites references/advanced.md

Setup

Install via pip:

uv pip install biopython

Configure NCBI access (mandatory for Entrez operations):

from Bio import Entrez

Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key"  # Optional: increases rate limit to 10 req/s

Quick Reference

Parse Sequences

from Bio import SeqIO

records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")

Translate DNA

from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()

Query NCBI

from Bio import Entrez

Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()

Run BLAST

from Bio.Blast import NCBIWWW, NCBIXML

result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)

Parse Protein Structure

from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)

Build Phylogenetic Tree

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)

Reference Files

File Contents
references/sequence-io.md Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion
references/alignment.md Pairwise alignment, BLOSUM matrices, AlignIO, external aligners
references/databases.md NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax
references/blast.md Remote/local BLAST, XML parsing, result filtering, batch queries
references/structure.md Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries
references/phylogenetics.md Tree I/O, distance matrices, tree construction, consensus, visualization
references/advanced.md Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram

Implementation Patterns

Retrieve and Analyze GenBank Record

from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "researcher@institution.edu"

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")

Batch Sequence Processing

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)

SeqIO.write(output_records, "filtered.fasta", "fasta")

BLAST with Result Filtering

from Bio.Blast import NCBIWWW, NCBIXML

query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)

for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")

Phylogeny from Alignment

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt

alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()

fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax)
fig.savefig("phylogeny.png", dpi=150)

Guidelines

Imports: Use explicit imports

from Bio import SeqIO, Entrez
from Bio.Seq import Seq

File Handling: Always close handles or use context managers

with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)

Memory Efficiency: Use iterators for large datasets

# Correct: iterate without loading all
for record in SeqIO.parse("huge.fasta", "fasta"):
    if meets_criteria(record):
        yield record

# Avoid: loading entire file
all_records = list(SeqIO.parse("huge.fasta", "fasta"))

Error Handling: Wrap network operations

from urllib.error import HTTPError

try:
    handle = Entrez.efetch(db="nucleotide", id=accession)
    record = SeqIO.read(handle, "genbank")
except HTTPError as e:
    print(f"Fetch failed: {e.code}")

NCBI Compliance: Set email, respect rate limits, cache downloads locally

Troubleshooting

Issue Resolution
"No handlers could be found for logger 'Bio.Entrez'" Set Entrez.email before any queries
HTTP 400 from NCBI Verify accession/ID format is correct
"ValueError: EOF" during parse Confirm file format matches format string
Alignment length mismatch Sequences must be pre-aligned for AlignIO
Slow BLAST queries Use local BLAST for large-scale searches
PDB parser warnings Use PDBParser(QUIET=True) or check structure quality

External Resources

Weekly Installs
26
First Seen
Feb 25, 2026
Installed on
mcpjam26
claude-code26
replit26
junie26
windsurf26
zencoder26