bio-alignment-pairwise
SKILL.md
Pairwise Sequence Alignment
Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).
Required Import
from Bio.Align import PairwiseAligner
from Bio.Seq import Seq
from Bio import SeqIO
Core Concepts
| Mode | Algorithm | Use Case |
|---|---|---|
global |
Needleman-Wunsch | Full-length alignment, similar-length sequences |
local |
Smith-Waterman | Find best matching regions, different-length sequences |
Creating an Aligner
# Basic aligner with defaults
aligner = PairwiseAligner()
# Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
# For protein alignment with substitution matrix
from Bio.Align import substitution_matrices
aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))
Performing Alignments
seq1 = Seq('ACCGGTAACGTAG')
seq2 = Seq('ACCGTTAACGAAG')
# Get all optimal alignments
alignments = aligner.align(seq1, seq2)
print(f'Found {len(alignments)} optimal alignments')
print(alignments[0]) # Print first alignment
# Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)
Alignment Output Format
target 0 ACCGGTAACGTAG 13
0 |||||.||||.|| 13
query 0 ACCGTTAACGAAG 13
Accessing Alignment Data
alignment = alignments[0]
# Basic properties
print(alignment.score) # Alignment score
print(alignment.shape) # (num_seqs, alignment_length)
print(len(alignment)) # Alignment length
# Get aligned sequences with gaps
target_aligned = alignment[0, :] # First sequence (target) with gaps
query_aligned = alignment[1, :] # Second sequence (query) with gaps
# Get coordinate mapping
print(alignment.aligned) # Array of aligned segment coordinates
print(alignment.coordinates) # Full coordinate array
Alignment Counts (Identities, Mismatches, Gaps)
alignment = alignments[0]
counts = alignment.counts()
print(f'Identities: {counts.identities}')
print(f'Mismatches: {counts.mismatches}')
print(f'Gaps: {counts.gaps}')
# Calculate percent identity
total_aligned = counts.identities + counts.mismatches
percent_identity = counts.identities / total_aligned * 100
print(f'Percent identity: {percent_identity:.1f}%')
Common Scoring Configurations
DNA/RNA Alignment
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
Protein Alignment
from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load('BLOSUM62')
aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)
Local Alignment (Find Best Region)
aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
Semiglobal (Overlap/Extension)
# Allow free end gaps on query (useful for primer alignment)
aligner = PairwiseAligner(mode='global')
aligner.query_left_open_gap_score = 0
aligner.query_left_extend_gap_score = 0
aligner.query_right_open_gap_score = 0
aligner.query_right_extend_gap_score = 0
Available Substitution Matrices
from Bio.Align import substitution_matrices
print(substitution_matrices.load()) # List all available matrices
# Common matrices
blosum62 = substitution_matrices.load('BLOSUM62') # General protein
blosum80 = substitution_matrices.load('BLOSUM80') # Closely related proteins
pam250 = substitution_matrices.load('PAM250') # Distantly related proteins
Working with SeqRecord Objects
from Bio import SeqIO
records = list(SeqIO.parse('sequences.fasta', 'fasta'))
seq1, seq2 = records[0].seq, records[1].seq
aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)
Iterating Over Multiple Alignments
# Limit number of alignments returned (memory efficient)
aligner.max_alignments = 100
for i, alignment in enumerate(alignments):
print(f'Alignment {i+1}: score={alignment.score}')
if i >= 4:
break
Substitution Matrix from Alignment
alignment = alignments[0]
substitutions = alignment.substitutions
# View as array (rows=target, cols=query)
print(substitutions)
# Access specific substitution counts
# substitutions['A', 'T'] gives count of A aligned to T
Export Alignment to Different Formats
alignment = alignments[0]
# Various output formats
print(format(alignment, 'fasta')) # FASTA format
print(format(alignment, 'clustal')) # Clustal format
print(format(alignment, 'psl')) # PSL format (BLAT)
print(format(alignment, 'sam')) # SAM format
Quick Reference: Scoring Parameters
| Parameter | Description | Typical DNA | Typical Protein |
|---|---|---|---|
match_score |
Score for identical bases | 1-2 | Use matrix |
mismatch_score |
Penalty for mismatches | -1 to -3 | Use matrix |
open_gap_score |
Cost to start a gap | -5 to -15 | -10 to -12 |
extend_gap_score |
Cost per gap extension | -0.5 to -2 | -0.5 to -1 |
substitution_matrix |
Scoring matrix | N/A | BLOSUM62 |
Common Errors
| Error | Cause | Solution |
|---|---|---|
OverflowError |
Too many optimal alignments | Set aligner.max_alignments |
| Low scores | Wrong scoring scheme | Use substitution matrix for proteins |
| No alignments in local mode | Scores all negative | Ensure match_score > 0 |
Decision Tree: Choosing Alignment Mode
Need full-length comparison?
├── Yes → Use mode='global'
│ └── Sequences similar length?
│ ├── Yes → Standard global
│ └── No → Consider semiglobal (free end gaps)
└── No → Use mode='local'
└── Find best matching regions only
Related Skills
- alignment-io - Save alignments to files in various formats
- msa-parsing - Work with multiple sequence alignments
- msa-statistics - Calculate identity, similarity metrics
- sequence-manipulation/motif-search - Pattern matching in sequences
Weekly Installs
3
Repository
gptomics/bioskillsInstalled on
windsurf2
trae2
opencode2
codex2
claude-code2
antigravity2