biopython
Biopython: Python Tools for Computational Biology
Summary
Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy, and provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.
Applicable Scenarios
This skill applies when you need to:
| Task Category | Examples |
|---|---|
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |
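As a rough illustration of the sequence statistics in the last row, the sketch below computes GC content, molecular weight, and a nearest-neighbor melting temperature. The 20-nt primer is made up for the example, and Tm_NN runs with its default salt and concentration settings, which may not match your experimental conditions.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils import MeltingTemp as mt

# Hypothetical 20-nt primer, used only for illustration
primer = Seq("ATGCATGCATGCATGCATGC")

gc = gc_fraction(primer)       # fraction between 0 and 1
mw = molecular_weight(primer)  # treated as single-stranded DNA by default
tm = mt.Tm_NN(primer)          # nearest-neighbor estimate with default parameters

print(f"GC: {gc:.1%}, MW: {mw:.1f} Da, Tm: {tm:.1f} C")
```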
Module Organization
| Module | Purpose | Reference |
|---|---|---|
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | references/sequence-io.md |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | references/alignment.md |
| Bio.Entrez | NCBI database programmatic access | references/databases.md |
| Bio.Blast | BLAST execution and result parsing | references/blast.md |
| Bio.PDB | 3D structure manipulation | references/structure.md |
| Bio.Phylo | Phylogenetic tree operations | references/phylogenetics.md |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | references/advanced.md |
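The Bio.Align row can be illustrated with a small pairwise alignment. This is a minimal sketch: the two sequences and all scoring values are assumptions chosen for the example and are set explicitly rather than relying on library defaults.

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "global"
# Illustrative scoring scheme (an assumption, not a recommendation)
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

# Align two short toy sequences and inspect the top-scoring alignment
alignments = aligner.align("ACGT", "ACGA")
best = alignments[0]
print(best.score)
print(best)
```

With this scheme the ungapped alignment (three matches, one mismatch) scores higher than any gapped alternative.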
Setup
Install with uv (plain pip works the same way):
uv pip install biopython
Configure NCBI access (mandatory for Entrez operations):
from Bio import Entrez
Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key" # Optional: increases rate limit to 10 req/s
Quick Reference
Parse Sequences
from Bio import SeqIO
records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
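When records need to be looked up by ID rather than streamed, SeqIO.to_dict can wrap the same parser. The sketch below uses an in-memory FASTA string with made-up IDs so it runs without a file; for large files this loads everything into memory, so it suits small inputs only.

```python
from io import StringIO
from Bio import SeqIO

# Toy FASTA content (hypothetical records) held in memory instead of on disk
fasta_text = ">seq1\nATGCGT\n>seq2\nGGGCCCAAA\n"

# Build a dict keyed by record ID
records = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(sorted(records))
print(len(records["seq2"]))
```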
Translate DNA
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
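The sequence above contains two in-frame stop codons, so the plain translation includes "*" characters. A short sketch of the related Seq methods, using the same sequence:

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

full = dna.translate()             # stop codons appear as "*"
orf = dna.translate(to_stop=True)  # truncates at the first in-frame stop
mrna = dna.transcribe()            # replaces T with U
revcomp = dna.reverse_complement()

print(full)  # MAIVMGR*KGAR*
print(orf)   # MAIVMGR
```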
Query NCBI
from Bio import Entrez
Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()
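The parsed esearch result behaves like a dictionary; the "Count" and "IdList" keys hold the total number of hits and the returned IDs. The sketch below wraps the query in two helpers. The names build_term and search_ids are hypothetical, and the network call is left commented out so the snippet runs offline.

```python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"  # placeholder address, replace with your own

def build_term(gene, organism):
    """Compose an Entrez query using the [Gene] and [Organism] field tags."""
    return f"{gene}[Gene] AND {organism}[Organism]"

def search_ids(term, retmax=5):
    """Return (total_count, id_list) for a nucleotide search. Needs network access."""
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    results = Entrez.read(handle)
    handle.close()
    return int(results["Count"]), list(results["IdList"])

term = build_term("insulin", "human")
print(term)
# count, ids = search_ids(term)  # uncomment to query NCBI
```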
Run BLAST
from Bio.Blast import NCBIWWW, NCBIXML
result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)
Parse Protein Structure
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
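Distances between atoms can be computed by subtracting one Atom from another. The sketch below builds a two-residue structure in memory so it runs without a PDB file; the atom_line helper and the coordinates are hypothetical and exist only to produce correctly aligned fixed-width ATOM records.

```python
from io import StringIO
from Bio.PDB import PDBParser

def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-width PDB ATOM record (hypothetical helper for this demo)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )

# Two alpha carbons placed 3.8 angstroms apart along the x axis (made-up coordinates)
pdb_text = "\n".join([
    atom_line(1, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0, "C"),
    atom_line(2, "CA", "GLY", "A", 2, 3.8, 0.0, 0.0, "C"),
]) + "\n"

structure = PDBParser(QUIET=True).get_structure("demo", StringIO(pdb_text))

# Navigate structure -> model -> chain -> residue -> atom
chain = structure[0]["A"]
ca1 = chain[1]["CA"]
ca2 = chain[2]["CA"]

distance = ca1 - ca2  # subtraction returns the Euclidean distance
print(f"CA-CA distance: {distance:.2f} A")
```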
Build Phylogenetic Tree
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
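To see what the distance step produces, the same pipeline can be run on a tiny alignment built in memory. The three sequences and their IDs below are invented for illustration; with the "identity" model each distance is the fraction of mismatched columns.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy pre-aligned sequences of equal length (hypothetical data)
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="alpha"),
    SeqRecord(Seq("ACGTACGA"), id="beta"),
    SeqRecord(Seq("TCGAACTA"), id="gamma"),
])

dm = DistanceCalculator("identity").get_distance(alignment)
print(dm)

# Neighbor joining on the distance matrix
tree = DistanceTreeConstructor().nj(dm)
print(tree.count_terminals())
```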
Reference Files
| File | Contents |
|---|---|
| references/sequence-io.md | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| references/alignment.md | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| references/databases.md | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| references/blast.md | Remote/local BLAST, XML parsing, result filtering, batch queries |
| references/structure.md | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| references/phylogenetics.md | Tree I/O, distance matrices, tree construction, consensus, visualization |
| references/advanced.md | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |
Implementation Patterns
Retrieve and Analyze GenBank Record
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction
Entrez.email = "researcher@institution.edu"
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")
Batch Sequence Processing
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)
SeqIO.write(output_records, "filtered.fasta", "fasta")
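For inputs too large to hold in a list, the same filter can be streamed, since SeqIO.write accepts any iterable of records. A minimal sketch, using in-memory handles and invented records so it runs without files; the keep helper is a hypothetical name for the filter condition above.

```python
from io import StringIO
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

def keep(record, min_length=200, min_gc=0.4):
    """Same thresholds as the list-based version (hypothetical helper)."""
    return len(record) >= min_length and gc_fraction(record.seq) > min_gc

# Toy input: one short record, one long GC-rich record, one long AT-rich record
fasta_in = StringIO(
    ">short\nGGCC\n>long_gc\n" + "GC" * 150 + "\n>long_at\n" + "AT" * 150 + "\n"
)
fasta_out = StringIO()

# Generator expression: records are filtered and written one at a time
selected = (rec for rec in SeqIO.parse(fasta_in, "fasta") if keep(rec))
written = SeqIO.write(selected, fasta_out, "fasta")
print(written)
```

SeqIO.write returns the number of records written, which gives a quick sanity check on the filter.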
BLAST with Result Filtering
from Bio.Blast import NCBIWWW, NCBIXML
query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)
for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")
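The filtering loop can be factored into a reusable function that returns sorted hits instead of printing them. This sketch assumes only the attributes used above (alignments, hsps, accession, identities, align_length, expect). The significant_hits name is hypothetical, and the mock record with invented accessions stands in for a parsed BLAST result so the example runs offline.

```python
from types import SimpleNamespace

def significant_hits(record, max_evalue=1e-10, min_identity=0.0):
    """Collect (accession, percent_identity, evalue) for HSPs passing both cutoffs."""
    hits = []
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            identity_pct = 100.0 * hsp.identities / hsp.align_length
            if hsp.expect < max_evalue and identity_pct >= min_identity:
                hits.append((alignment.accession, identity_pct, hsp.expect))
    # Most significant (lowest E-value) first
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical stand-in for a parsed BLAST record; accessions are made up
mock = SimpleNamespace(alignments=[
    SimpleNamespace(accession="P1", hsps=[SimpleNamespace(identities=45, align_length=50, expect=1e-30)]),
    SimpleNamespace(accession="P2", hsps=[SimpleNamespace(identities=20, align_length=50, expect=1e-3)]),
    SimpleNamespace(accession="P3", hsps=[SimpleNamespace(identities=50, align_length=50, expect=1e-40)]),
])

hits = significant_hits(mock)
for accession, identity_pct, evalue in hits:
    print(f"{accession}: {identity_pct:.1f}% identity, E={evalue:.2e}")
```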
Phylogeny from Alignment
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt
alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax)
fig.savefig("phylogeny.png", dpi=150)
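Besides rendering a figure, a tree can be saved in Newick format for reuse in other tools. The sketch below round-trips a small hand-written Newick string (made up for the example) through Phylo.read and Phylo.write using in-memory handles.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical four-taxon tree with branch lengths
newick = "((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);"
tree = Phylo.read(StringIO(newick), "newick")
tree.ladderize()

# Patristic distance between two tips (sum of branch lengths on the path)
d_ab = tree.distance("A", "B")
print(d_ab)

# Serialize back to Newick; a file path works in place of the StringIO handle
out = StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue().strip())
```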
Guidelines
Imports: Use explicit imports
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
File Handling: Always close handles or use context managers
with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)
Memory Efficiency: Use iterators for large datasets
# Correct: a generator yields matching records one at a time
def filtered_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):
            yield record
# Avoid: loading the entire file into memory
all_records = list(SeqIO.parse("huge.fasta", "fasta"))
Error Handling: Wrap network operations
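from urllib.error import HTTPError
from Bio import Entrez, SeqIO
try:
    # rettype/retmode request GenBank flat-file text that SeqIO can parse
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: {e.code}")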
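Transient failures such as rate-limit responses are often worth retrying rather than just reporting. The following is a generic sketch, not part of Biopython: with_retries is a hypothetical helper, and the flaky function simulates a request that fails twice before succeeding.

```python
import time
from urllib.error import HTTPError, URLError

def with_retries(func, attempts=3, delay=1.0):
    """Call func(); on HTTPError/URLError wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (HTTPError, URLError):
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay)
            delay *= 2

# Hypothetical flaky callable standing in for a network request
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPError("https://example.invalid", 429, "Too Many Requests", None, None)
    return "ok"

result = with_retries(flaky, attempts=3, delay=0.01)
print(result, calls["n"])
```

In practice the callable would wrap the fetch itself, for example a lambda around the Entrez.efetch call.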
NCBI Compliance: Set email, respect rate limits, cache downloads locally
Troubleshooting
| Issue | Resolution |
|---|---|
| Warning "Email address is not specified" from Bio.Entrez | Set Entrez.email before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use PDBParser(QUIET=True) or check structure quality |
External Resources
- Biopython Documentation: https://biopython.org/docs/latest/
- Biopython Tutorial: https://biopython.org/docs/latest/Tutorial/
- GitHub Repository: https://github.com/biopython/biopython