protein-subcellular-localization-prediction-biot5
SKILL.md
Protein Subcellular Localization Prediction
Predict subcellular localization for proteins from their amino acid sequences using the BioT5 model.
When to Use
- You have a protein FASTA sequence and need to know its cellular location
- You want to identify if a protein is cytoplasmic, nuclear, membrane-bound, or secreted
- You need quick localization insights without experimental data
- You're characterizing novel or unannotated protein sequences
Workflow
from open_biomed.data import Protein, Text
from open_biomed.core.pipeline import InferencePipeline
# Create protein from FASTA sequence
protein = Protein.from_fasta("YOUR_AMINO_ACID_SEQUENCE")
# Create the question for subcellular localization
question = Text.from_str(
"Please provide information about the subcellular localization of this protein."
)
# Load the BioT5 model for protein question answering
pipeline = InferencePipeline(
task="protein_question_answering",
model="biot5",
model_ckpt="./checkpoints/server/protein_question_answering_biot5.ckpt",
device="cuda:0"
)
# Run inference to get localization prediction
outputs = pipeline.run(protein=protein, text=question)
localization = outputs[0][0].str
print(localization)
See examples/basic_example.py for a complete runnable script.
Expected Outputs
The model returns subcellular localization information:
| Output | Description |
|---|---|
| Cytoplasm | Cytoplasmic proteins, soluble enzymes |
| Nucleus | Nuclear proteins, transcription factors |
| Membrane | Membrane-bound proteins, receptors |
| Secreted | Extracellular proteins, secreted factors |
| Mitochondria | Mitochondrial proteins |
| Peroxisome | Peroxisomal enzymes |
| Endoplasmic reticulum | ER-resident proteins |
| Golgi apparatus | Golgi-localized proteins |
Example Output
Cytoplasm
Input Formats
The skill accepts protein sequences in FASTA format (amino acid string):
# From raw sequence string
protein = Protein.from_fasta("MRVGVIRFPGSNCDRDVHHVLELAGAEPEYVWW...")
# From UniProt (get sequence first)
from open_biomed.tools.tool_registry import TOOLS
tool = TOOLS["protein_uniprot_request"]
protein, _ = tool.run(accession="P00533") # Example: EGFR
Error Handling
| Error | Cause | Solution |
|---|---|---|
FileNotFoundError |
Model checkpoint not found | Download checkpoint to ./checkpoints/server/ |
CUDA out of memory |
GPU memory insufficient | Use smaller batch or CPU device |
Sequence too long |
Exceeds 512 amino acid limit | Truncate sequence or use sliding window |
Model Details
- Model: BioT5 (protein-text foundation model)
- Max sequence length: 512 amino acids
- Inference time: ~2-3 seconds per sequence on GPU
- Capabilities: Subcellular localization prediction
Limitations
- Sequences longer than 512 residues are truncated
- Model trained on known proteins; novel sequences may have lower accuracy
- Single localization prediction (does not handle multi-localized proteins well)
Related Skills
protein-function-annotation: For function predictionprotein-mutation-analysis: For mutation effect predictionuniprot-query: For retrieving protein metadata from UniProt
Weekly Installs
2
Repository
pharmolix/openbiomedGitHub Stars
1.0K
First Seen
10 days ago
Security Audits
Installed on
trae-cn2
iflow-cli2
deepagents2
antigravity2
claude-code2
github-copilot2