Protein Function Prediction
Predict functional annotations and properties for proteins from their amino acid sequences using the BioT5 model.
When to Use
- You have a protein FASTA sequence and need to understand its biological role
- You want to identify enzyme function, pathway involvement, or molecular mechanisms
- You need quick functional insights without experimental data
- You're characterizing novel or unannotated protein sequences
Workflow
from open_biomed.data import Protein, Text
from open_biomed.core.pipeline import InferencePipeline
# Create protein from FASTA sequence
protein = Protein.from_fasta("YOUR_AMINO_ACID_SEQUENCE")
# Create the question for functional annotation
question = Text.from_str(
    "Inspect the protein sequence and offer a concise description of its properties."
)
# Load the BioT5 model for protein question answering
pipeline = InferencePipeline(
    task="protein_question_answering",
    model="biot5",
    model_ckpt="./checkpoints/server/protein_question_answering_biot5.ckpt",
    device="cuda:0"
)
# Run inference to get functional annotation
outputs = pipeline.run(protein=protein, text=question)
function_description = outputs[0][0].str
print(function_description)
See examples/basic_example.py for a complete runnable script.
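The workflow above loads the model from a local checkpoint path. As a minimal sketch, a pre-flight check can fail early with a clear message if that file is missing. The helper name `require_checkpoint` is hypothetical and not part of OpenBioMed; it only uses the standard library.

```python
from pathlib import Path


def require_checkpoint(path: str) -> Path:
    """Return the checkpoint path, or raise if no file exists there.

    Hypothetical helper for illustration; not an OpenBioMed API.
    """
    ckpt = Path(path)
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"Model checkpoint not found at {ckpt}; "
            f"download it to {ckpt.parent}/ before creating the pipeline."
        )
    return ckpt
```

One way to use it is to call `require_checkpoint("./checkpoints/server/protein_question_answering_biot5.ckpt")` first and pass `str(...)` of the result as `model_ckpt` when building the `InferencePipeline`.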
Expected Outputs
The model returns a text description that typically includes:
| Output Component | Example |
|---|---|
| Enzyme name | Phosphoribosylformylglycinamidine synthase |
| Biological pathway | Purine biosynthesis pathway |
| Catalytic activity | FGAR to FGAM conversion |
| Complex membership | Part of FGAM synthase complex (PurQ, PurL, PurS) |
| Mechanism details | ATP-dependent, glutamine amidotransferase activity |
Example Output
Part of the phosphoribosylformylglycinamidine synthase complex involved in the purines biosynthetic pathway. Catalyzes the ATP-dependent conversion of formylglycinamide ribonucleotide (FGAR) and glutamine to yield formylglycinamidine ribonucleotide (FGAM) and glutamate.
Input Formats
The skill accepts protein sequences in FASTA format (amino acid string):
# From raw sequence string
protein = Protein.from_fasta("MRVGVIRFPGSNCDRDVHHVLELAGAEPEYVWW...")
# From UniProt (get sequence first)
from open_biomed.tools.tool_registry import TOOLS
tool = TOOLS["protein_uniprot_request"]
protein, _ = tool.run(accession="P00533") # Example: EGFR
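The examples above pass a bare residue string to `Protein.from_fasta`. If your input is a FASTA record with a `>` header line and wrapped sequence lines, one option is to extract the residue string first. The sketch below is a hypothetical helper, not part of OpenBioMed, and the accepted alphabet (20 standard residues plus B, X, Z, U, O) is an assumption you may want to tighten.

```python
# Assumed alphabet: 20 standard amino acids plus ambiguity and rare codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def fasta_to_sequence(fasta_text: str) -> str:
    """Return the first record's sequence as one uppercase string.

    Hypothetical helper for illustration; not an OpenBioMed API.
    """
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header or residues:
                break  # a second record starts here; keep only the first
            seen_header = True
            continue
        residues.append(line.upper())
    sequence = "".join(residues)
    if not sequence:
        raise ValueError("no sequence found in FASTA text")
    unexpected = set(sequence) - VALID_RESIDUES
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return sequence
```

The returned string can then be passed to `Protein.from_fasta` in the same way as the raw sequence string shown above.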
Error Handling
| Error | Cause | Solution |
|---|---|---|
| FileNotFoundError | Model checkpoint not found | Download checkpoint to ./checkpoints/server/ |
| CUDA out of memory | GPU memory insufficient | Use smaller batch or CPU device |
| Sequence too long | Exceeds 512 amino acid limit | Truncate sequence or use sliding window |
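The sliding-window option in the table can be sketched as follows. This is a minimal illustration, not an OpenBioMed utility: the function name `sliding_windows` is hypothetical, and the default overlap of 64 residues is an arbitrary assumption rather than a recommended value.

```python
MAX_LEN = 512  # sequence limit stated in this document


def sliding_windows(sequence: str, size: int = MAX_LEN, overlap: int = 64):
    """Split a sequence into overlapping windows of at most `size` residues.

    Returns a list of (start_index, window) tuples.
    Hypothetical helper for illustration; the overlap default is an assumption.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if len(sequence) <= size:
        return [(0, sequence)]  # short sequences need no splitting
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = start + size
        windows.append((start, sequence[start:end]))
        if end >= len(sequence):
            break  # this window reached the end of the sequence
        start += step
    return windows
```

Each window can be wrapped with `Protein.from_fasta` and run through the pipeline separately. The model then describes each window without seeing the rest of the protein, so how to reconcile the per-window descriptions is left to the caller.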
Model Details
- Model: BioT5 (protein-text foundation model)
- Max sequence length: 512 amino acids
- Inference time: ~2-3 seconds per sequence on GPU
- Capabilities: Function prediction, property description, pathway annotation
Limitations
- Sequences longer than 512 residues are truncated
- Model trained on known proteins; novel folds may have lower accuracy
- Does not predict 3D structure or binding sites (use the protein_folding or protein_binding_site_prediction tools)
Related Skills
- protein-structure-design-boltzgen: For 3D structure prediction
- protein-mutation-analysis: For mutation effect prediction
- uniprot-query: For retrieving protein metadata from UniProt
Repository
pharmolix/openbiomed