bio-format-conversion
Format Conversion
Convert sequence files between formats using Biopython's Bio.SeqIO module.
Required Import
from Bio import SeqIO
Core Function
SeqIO.convert() - Direct Conversion
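Convert between formats in a single call. This is usually the most efficient method, because records are streamed from the reader to the writer instead of being collected in memory first.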
count = SeqIO.convert('input.gb', 'genbank', 'output.fasta', 'fasta')
print(f'Converted {count} records')
Parameters:
- in_file - Input filename or handle
- in_format - Input format string
- out_file - Output filename or handle
- out_format - Output format string
Returns: Number of records converted
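Because both file arguments accept handles as well as filenames, a conversion can also run entirely in memory. The following is a minimal sketch, assuming a made-up two-read FASTQ string used purely for illustration:

```python
from io import StringIO

from Bio import SeqIO

# Made-up two-read FASTQ input (Sanger / Phred+33 quality encoding).
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"

in_handle = StringIO(fastq_text)
out_handle = StringIO()

# Handles are passed in place of filenames; the return value is the record count.
count = SeqIO.convert(in_handle, "fastq", out_handle, "fasta")
fasta_text = out_handle.getvalue()

print(f"Converted {count} records")
print(fasta_text)
```

The quality lines do not appear in the FASTA output, which matches the "loses quality scores" note for this conversion.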
Common Conversions
| From | To | Notes |
|---|---|---|
| GenBank | FASTA | Loses annotations, keeps sequence |
| FASTA | GenBank | Need to add molecule_type |
| FASTQ | FASTA | Loses quality scores |
| FASTA | FASTQ | Need to add quality scores |
| GenBank | EMBL | Usually works directly |
| Stockholm | FASTA | Alignment to sequences |
Code Patterns
Simple Conversion
SeqIO.convert('input.gb', 'genbank', 'output.fasta', 'fasta')
GenBank to FASTA
SeqIO.convert('sequence.gb', 'genbank', 'sequence.fasta', 'fasta')
FASTQ to FASTA (drop quality)
SeqIO.convert('reads.fastq', 'fastq', 'reads.fasta', 'fasta')
FASTA to GenBank (requires molecule_type)
def add_molecule_type(records):
    # The GenBank writer needs a molecule_type annotation on every record
    for record in records:
        record.annotations['molecule_type'] = 'DNA'
        yield record

records = SeqIO.parse('input.fasta', 'fasta')
SeqIO.write(add_molecule_type(records), 'output.gb', 'genbank')
FASTA to FASTQ (add dummy quality)
def add_quality(records, quality=30):
    # Assign the same placeholder Phred score to every base
    for record in records:
        record.letter_annotations['phred_quality'] = [quality] * len(record.seq)
        yield record

records = SeqIO.parse('input.fasta', 'fasta')
SeqIO.write(add_quality(records), 'output.fastq', 'fastq')
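To see what the dummy scores look like on disk, the same pattern can be run against in-memory handles. This is a small sketch with a made-up one-record FASTA string; in the default 'fastq' (Sanger, Phred+33) encoding, a score of 30 is written as the character "?" (ASCII 33 + 30 = 63).

```python
from io import StringIO

from Bio import SeqIO


def add_quality(records, quality=30):
    """Attach the same placeholder Phred score to every base of every record."""
    for record in records:
        record.letter_annotations["phred_quality"] = [quality] * len(record.seq)
        yield record


# Made-up input used only for illustration.
records = SeqIO.parse(StringIO(">seq1\nACGT\n"), "fasta")

out_handle = StringIO()
SeqIO.write(add_quality(records), out_handle, "fastq")
fastq_lines = out_handle.getvalue().splitlines()

print(fastq_lines)
```

These scores are placeholders, not measurements, so downstream tools that filter or weight by quality will treat every base as equally reliable.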
Batch Convert Multiple Files
from pathlib import Path

for gb_file in Path('.').glob('*.gb'):
    fasta_file = gb_file.with_suffix('.fasta')
    count = SeqIO.convert(str(gb_file), 'genbank', str(fasta_file), 'fasta')
    print(f'{gb_file.name}: {count} records')
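When a directory mixes several input types, the format string can be derived from the file extension before calling SeqIO.convert(). The sketch below is one possible approach: the `EXTENSION_TO_FORMAT` table and the `guess_format` helper are hypothetical names introduced here, not part of Biopython, and the extension list is an assumption that should be adjusted to local naming conventions.

```python
from pathlib import Path

# Hypothetical mapping from file extension to Biopython format string.
EXTENSION_TO_FORMAT = {
    ".gb": "genbank",
    ".gbk": "genbank",
    ".embl": "embl",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fastq": "fastq",
    ".fq": "fastq",
}


def guess_format(path):
    """Return the format string for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"No known sequence format for extension {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]


print(guess_format("plasmid.GBK"))
print(guess_format("reads/sample1.fq"))
```

The returned string can then be passed as the in_format argument inside the batch loop. Extension-based guessing does not inspect file contents, so a mislabeled file will still fail at parse time.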
Convert with Modifications
from Bio.SeqRecord import SeqRecord

def uppercase_record(rec):
    # Builds a new record; only id and description are carried over
    return SeqRecord(rec.seq.upper(), id=rec.id, description=rec.description)

records = SeqIO.parse('input.fasta', 'fasta')
modified = (uppercase_record(rec) for rec in records)
SeqIO.write(modified, 'output.fasta', 'fasta')
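Building a new SeqRecord by hand drops everything except the fields passed to the constructor. As an alternative sketch, assuming a Biopython release that provides SeqRecord.upper(), the record's own method returns an upper-cased copy that keeps its annotations. The input string here is made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up lower-case FASTA input.
records = SeqIO.parse(StringIO(">seq1 demo\nacgt\n"), "fasta")

# SeqRecord.upper() returns a new record with an upper-case sequence.
modified = (rec.upper() for rec in records)

out_handle = StringIO()
SeqIO.write(modified, out_handle, "fasta")
fasta_lines = out_handle.getvalue().splitlines()

print(fasta_lines)
```

For FASTA output the difference is mostly cosmetic, but it matters when the target format stores annotations or features.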
Alignment Format Conversion
from Bio import AlignIO
AlignIO.convert('alignment.sto', 'stockholm', 'alignment.phy', 'phylip')
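AlignIO.convert() takes the same four arguments as SeqIO.convert(), but it returns the number of alignments rather than the number of sequences. A minimal in-memory sketch of the Stockholm to FASTA row from the conversions table, using a made-up two-sequence alignment:

```python
from io import StringIO

from Bio import AlignIO

# Made-up Stockholm alignment with two aligned sequences.
stockholm_text = "# STOCKHOLM 1.0\nseq1 ACGT-A\nseq2 ACGTTA\n//\n"

in_handle = StringIO(stockholm_text)
out_handle = StringIO()

# The return value counts alignments, not individual sequences.
alignment_count = AlignIO.convert(in_handle, "stockholm", out_handle, "fasta")
aligned_fasta = out_handle.getvalue()

print(alignment_count)
print(aligned_fasta)
```

Gap characters are kept in the output, so the result is an aligned FASTA file rather than a set of raw sequences.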
Format Compatibility Matrix
Can convert directly (no modifications needed):
- GenBank <-> EMBL
- FASTA -> formats that need only an ID and sequence (e.g. tab); richer targets such as GenBank or FASTQ need data added first
- Any format -> FASTA (always works, may lose data)
- FASTQ -> FASTA
Requires adding data:
- FASTA -> FASTQ (need quality scores)
- FASTA -> GenBank (need molecule_type)
May lose data:
- GenBank -> FASTA (loses features, annotations)
- FASTQ -> FASTA (loses quality scores)
- Any rich format -> FASTA
Common Errors
| Error | Cause | Solution |
|---|---|---|
| ValueError: missing molecule_type | FASTA to GenBank | Add molecule_type annotation |
| ValueError: missing quality scores | FASTA to FASTQ | Add phred_quality to letter_annotations |
| KeyError: 'phred_quality' | Wrong FASTQ variant | Try 'fastq-sanger', 'fastq-illumina' |
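The two ValueError rows can be reproduced and fixed in a few lines. This sketch assumes a recent Biopython release (1.78 or later, where GenBank output relies on the molecule_type annotation); the exact wording of the error messages may differ between versions. `try_write` is a hypothetical helper defined only for this example.

```python
from io import StringIO

from Bio import SeqIO


def try_write(record, fmt):
    """Return the ValueError message raised while writing, or None on success."""
    try:
        SeqIO.write(record, StringIO(), fmt)
    except ValueError as exc:
        return str(exc)
    return None


# A record parsed from FASTA carries no molecule_type and no quality scores.
record = next(SeqIO.parse(StringIO(">seq1\nACGT\n"), "fasta"))

genbank_error = try_write(record, "genbank")
fastq_error = try_write(record, "fastq")

# Apply the fixes listed in the table.
record.annotations["molecule_type"] = "DNA"
record.letter_annotations["phred_quality"] = [30] * len(record.seq)

genbank_fixed = try_write(record, "genbank")
fastq_fixed = try_write(record, "fastq")

print(genbank_error, fastq_error, genbank_fixed, fastq_fixed, sep="\n")
```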
Decision Tree
Converting formats?
├── Simple conversion (no data changes)?
│ └── Use SeqIO.convert() directly
├── Need to add annotations?
│ └── Parse, modify records, then write
├── Need to transform sequences?
│ └── Parse, apply transformation, then write
└── Multiple files?
└── Loop with SeqIO.convert() or batch generator
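The first three branches of the tree can be folded into one helper. The sketch below is one way to do it: `convert_records` and its `modify` parameter are hypothetical names introduced for this example, and the inputs are made-up in-memory strings. GenBank output again assumes Biopython 1.78 or later.

```python
from io import StringIO

from Bio import SeqIO


def convert_records(in_file, in_format, out_file, out_format, modify=None):
    """Use SeqIO.convert() when nothing changes, else parse, modify, and write."""
    if modify is None:
        return SeqIO.convert(in_file, in_format, out_file, out_format)
    records = (modify(rec) for rec in SeqIO.parse(in_file, in_format))
    return SeqIO.write(records, out_file, out_format)


def set_dna(record):
    """Add the annotation that GenBank output requires."""
    record.annotations["molecule_type"] = "DNA"
    return record


# Simple conversion: no modification needed.
plain_out = StringIO()
plain_count = convert_records(
    StringIO("@read1\nACGT\n+\nIIII\n"), "fastq", plain_out, "fasta"
)

# Conversion that needs an annotation added first.
genbank_out = StringIO()
genbank_count = convert_records(
    StringIO(">seq1\nACGT\n"), "fasta", genbank_out, "genbank", modify=set_dna
)

print(plain_count, genbank_count)
```

For the multiple-files branch, the same helper can be called inside a loop over paths.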
Related Skills
- read-sequences - Parse sequences for custom conversion logic
- write-sequences - Write converted sequences with modifications
- batch-processing - Convert multiple files at once
- compressed-files - Handle compressed input/output during conversion
- alignment-files - For SAM/BAM/CRAM conversion, use samtools view
Repository
gptomics/bioskills