skills/pharmolix/openbiomed/protein-subcellular-localization-prediction-biot5

protein-subcellular-localization-prediction-biot5

SKILL.md

Protein Subcellular Localization Prediction

Predict subcellular localization for proteins from their amino acid sequences using the BioT5 model.

When to Use

  • You have a protein FASTA sequence and need to know its cellular location
  • You want to identify if a protein is cytoplasmic, nuclear, membrane-bound, or secreted
  • You need quick localization insights without experimental data
  • You're characterizing novel or unannotated protein sequences

Workflow

from open_biomed.data import Protein, Text
from open_biomed.core.pipeline import InferencePipeline

# Create protein from FASTA sequence
protein = Protein.from_fasta("YOUR_AMINO_ACID_SEQUENCE")

# Create the question for subcellular localization
question = Text.from_str(
    "Please provide information about the subcellular localization of this protein."
)

# Load the BioT5 model for protein question answering
pipeline = InferencePipeline(
    task="protein_question_answering",
    model="biot5",
    model_ckpt="./checkpoints/server/protein_question_answering_biot5.ckpt",
    device="cuda:0"
)

# Run inference to get localization prediction
outputs = pipeline.run(protein=protein, text=question)
localization = outputs[0][0].str
print(localization)

See examples/basic_example.py for a complete runnable script.

Expected Outputs

The model returns subcellular localization information:

Output Description
Cytoplasm Cytoplasmic proteins, soluble enzymes
Nucleus Nuclear proteins, transcription factors
Membrane Membrane-bound proteins, receptors
Secreted Extracellular proteins, secreted factors
Mitochondria Mitochondrial proteins
Peroxisome Peroxisomal enzymes
Endoplasmic reticulum ER-resident proteins
Golgi apparatus Golgi-localized proteins

Example Output

Cytoplasm

Input Formats

The skill accepts protein sequences in FASTA format (amino acid string):

# From raw sequence string
protein = Protein.from_fasta("MRVGVIRFPGSNCDRDVHHVLELAGAEPEYVWW...")

# From UniProt (get sequence first)
from open_biomed.tools.tool_registry import TOOLS
tool = TOOLS["protein_uniprot_request"]
protein, _ = tool.run(accession="P00533")  # Example: EGFR

Error Handling

Error Cause Solution
FileNotFoundError Model checkpoint not found Download checkpoint to ./checkpoints/server/
CUDA out of memory GPU memory insufficient Use smaller batch or CPU device
Sequence too long Exceeds 512 amino acid limit Truncate sequence or use sliding window

Model Details

  • Model: BioT5 (protein-text foundation model)
  • Max sequence length: 512 amino acids
  • Inference time: ~2-3 seconds per sequence on GPU
  • Capabilities: Subcellular localization prediction

Limitations

  • Sequences longer than 512 residues are truncated
  • Model trained on known proteins; novel sequences may have lower accuracy
  • Single localization prediction (does not handle multi-localized proteins well)

Related Skills

  • protein-function-annotation: For function prediction
  • protein-mutation-analysis: For mutation effect prediction
  • uniprot-query: For retrieving protein metadata from UniProt
Weekly Installs
2
GitHub Stars
1.0K
First Seen
10 days ago
Installed on
trae-cn2
iflow-cli2
deepagents2
antigravity2
claude-code2
github-copilot2