mutation-design-gfp
High-Fluorescence GFP Mutant Proposal
This skill performs automated multi-round optimization of Green Fluorescent Protein (GFP) to discover mutants with higher fluorescence intensity while keeping the mutant library diverse.
When to Use This Skill
- Design novel GFP mutants with improved fluorescence intensity.
- Run computational iterative directed evolution.
- Perform fast mutation search guided by an oracle model.
Example prompts:
- “Design GFP mutants with higher fluorescence.”
- “Run multi-round mutation optimization for GFP.”
- “Generate 96 GFP variants with improved fluorescence.”
Prerequisites
- Python 3.9+
- PyTorch
- NumPy / Pandas
- OmegaConf (used to load the oracle configuration file)
- Protein sequence analysis tools
- Protein language model tools (ESM2)
Core Capabilities
This skill can:
- Download initial GFP sequences if the user does not provide them.
- Download and execute an in-silico oracle GFP prediction model.
- Generate mutants with at most 4 point mutations per sequence in each round.
- Use ESM2 embeddings to represent GFP sequences.
- Optimize mutation proposals based on oracle feedback.
- Maintain population diversity using average pairwise Hamming distance.
- Perform multi-round optimization and return the best mutants.
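The diversity measure named above can be sketched as follows. This is a minimal illustration, assuming all sequences in the population have equal length (true when only point mutations are applied); the function names `hamming` and `average_pairwise_hamming` are hypothetical and not part of the skill.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def average_pairwise_hamming(population: list[str]) -> float:
    """Mean Hamming distance over all unordered pairs in the population."""
    pairs = list(combinations(population, 2))
    if not pairs:
        # Fewer than two sequences: no pair to compare.
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```

For 96 sequences this compares 96 × 95 / 2 = 4560 pairs, which is cheap enough to recompute after every selection step.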
Workflow
- Download the initial GFP sequences from https://cloud.tsinghua.edu.cn/f/5e673c1db710466b828f/?dl=1 and use them as the starting pool.
- Download the oracle GFP prediction model from https://cloud.tsinghua.edu.cn/f/f655f79d7bb04a98a0bb/?dl=1 and the configuration file from https://cloud.tsinghua.edu.cn/f/8a894bb4b41f4074b9b0/?dl=1.
- Execute code for oracle loading and scoring:
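import torch
from omegaconf import OmegaConf

# NOTE: BaseCNN is the oracle's model class. It is not defined in this snippet
# and must be imported from the code that accompanies the checkpoint.

# ===== ORACLE MODEL LOADING =====
def load_oracle_model(ckpt_path, cfg_path):
    # OmegaConf.load accepts a file path directly.
    cfg = OmegaConf.load(cfg_path)
    oracle = BaseCNN(**cfg.model.predictor)
    # Load the weights once; map_location="cpu" also works without a GPU.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    oracle.load_state_dict(state_dict)
    oracle.eval()  # inference mode: disables dropout and batch-norm updates
    return oracle

# ===== ORACLE SCORING FUNCTION =====
def score_sequence(oracle, sequence: str) -> float:
    # The sequence is passed to the oracle as-is; if the oracle expects
    # encoded tensors, encode the sequence before this call.
    with torch.no_grad():
        results = oracle(sequence)
    # Assumes one prediction per sequence, returned as a single-element tensor.
    return results.item()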
- Compute ESM2 embeddings for all sequences to represent sequence features.
- Proposal: in each round, propose 96 × 4 = 384 candidate mutants from the current population, using only point mutations with ≤4 mutations per sequence.
- Evaluation: score all candidate sequences with the oracle scoring function. Use the oracle scores from previous rounds to bias mutation proposals toward changes that increase predicted fluorescence (fitness gradient exploitation).
- Selection: rank sequences by predicted fitness and select the top 96 mutants, while maintaining diversity measured by average pairwise Hamming distance.
- Repeat proposal, evaluation, and selection until 10 rounds are completed or the best fitness has not improved for 3 consecutive rounds.
- Collect the best 96 mutants discovered across all rounds, sort them by predicted fluorescence in descending order, and export them as a CSV file in the format described under Output Format.
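The proposal, evaluation, selection, and stopping steps above can be sketched as a single loop. This is an illustrative skeleton under stated assumptions, not the skill's actual implementation: mutations are drawn uniformly at random (the oracle-guided bias is omitted), the ≤4 limit is counted relative to each candidate's parent, selection is plain top-k ranking without the diversity constraint, and `toy_score` is a hypothetical stand-in for the real oracle scoring function. The names `mutate`, `optimize`, and `toy_score` are all invented for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def mutate(parent: str, rng: random.Random, max_mutations: int = 4) -> str:
    """Return a copy of `parent` with 1..max_mutations point substitutions."""
    k = rng.randint(1, min(max_mutations, len(parent)))
    child = list(parent)
    for pos in rng.sample(range(len(parent)), k):
        # Pick a residue different from the current one so the site really changes.
        child[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != parent[pos]])
    return "".join(child)


def toy_score(seq: str) -> float:
    """Hypothetical placeholder oracle: fraction of alanine residues."""
    return seq.count("A") / len(seq)


def optimize(seeds, score_fn, pool_size=96, children_per_parent=4,
             max_rounds=10, patience=3, seed=0):
    """Run the round loop; return the top `pool_size` (sequence, score) pairs."""
    rng = random.Random(seed)
    archive = {s: score_fn(s) for s in seeds}  # every sequence scored so far
    population = sorted(archive, key=archive.get, reverse=True)[:pool_size]
    best, stale = max(archive.values()), 0
    for _round in range(max_rounds):
        # Proposal: each parent yields `children_per_parent` candidates.
        candidates = [mutate(p, rng) for p in population
                      for _child in range(children_per_parent)]
        # Evaluation: score only sequences not seen before.
        for c in candidates:
            if c not in archive:
                archive[c] = score_fn(c)
        # Selection: keep the top `pool_size` distinct parents and candidates.
        pool = list(dict.fromkeys(population + candidates))
        population = sorted(pool, key=archive.get, reverse=True)[:pool_size]
        # Early stopping: count rounds without improvement of the best score.
        round_best = archive[population[0]]
        if round_best > best:
            best, stale = round_best, 0
        else:
            stale += 1
            if stale >= patience:
                break
    ranked = sorted(archive.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:pool_size]
```

With a full starting pool of 96 sequences, the default settings generate 96 × 4 = 384 candidates per round. Calling `optimize(seeds, toy_score)` exercises the loop end to end; in practice `score_fn` would wrap the oracle's scoring function.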
Output Format
The final result must be a CSV file with two columns:
| sequence | fitness |
|---|---|
| GFP_mutant_sequence | predicted_fluorescence |
Requirements:
- Exactly 96 sequences
- Sorted by fitness in descending order
- Sequences must be valid GFP mutants
Example:
sequence,fitness
SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTT...,0.93
SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFIATT...,0.91
SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTIKFICTT...,0.89
...
This CSV represents the final optimized GFP mutant library predicted to exhibit higher fluorescence intensity.
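The output requirements can be checked at export time. The sketch below is one possible way to do this, using only the Python standard library. It makes two assumptions that the requirements do not state explicitly: duplicate sequences are collapsed to one row, and "valid" is approximated as "contains only the 20 canonical amino-acid letters". The function names `build_rows`, `to_csv_text`, and `export_library` are hypothetical.

```python
import csv
import io

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino-acid letters


def build_rows(scored, n=96):
    """Return the top-n (sequence, fitness) rows, sorted by fitness descending."""
    best = {}
    for seq, fitness in scored:
        # Assumption: a valid mutant uses only canonical residue letters.
        if not seq or not set(seq) <= CANONICAL:
            raise ValueError(f"not a plain amino-acid sequence: {seq!r}")
        # Assumption: keep one row per distinct sequence, with its highest score.
        best[seq] = max(float(fitness), best.get(seq, float("-inf")))
    rows = sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:n]
    if len(rows) != n:
        raise ValueError(f"need exactly {n} sequences, got {len(rows)}")
    return rows


def to_csv_text(rows):
    """Serialize rows as CSV text with the header `sequence,fitness`."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sequence", "fitness"])
    writer.writerows(rows)
    return buf.getvalue()


def export_library(scored, path, n=96):
    """Validate, sort, and write the final library to `path`."""
    with open(path, "w", newline="") as fh:
        fh.write(to_csv_text(build_rows(scored, n)))
```

Raising an error when fewer than 96 distinct sequences are available makes a short library fail loudly instead of silently producing an incomplete file.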