amina-de-novo-protein-binder-design
De Novo Protein Binder Design Pipeline
Overview
This pipeline designs novel protein binders for a target protein structure through six phases:
- Target Preparation -- Clean and validate the input PDB
- Epitope Selection -- Identify binding sites via PeSTo + DBSCAN clustering
- Binder Length Optimization -- Calculate optimal binder size from epitope geometry
- Backbone Generation -- Generate candidate scaffolds with RFdiffusion
- Sequence Design -- Design sequences via inverse folding (ProteinMPNN / ESM-IF1)
- Validation -- Multi-metric screening with early termination
- Round Advancement -- Iterate or finalize based on results
Interaction Model
Phases 0--1.5 (preparation): Run autonomously -- clean the target, select epitopes, determine binder length. No user approval needed.
Before Phase 2 (backbone generation): Present a design plan to the user summarizing: target info, selected epitope and hotspot residues, proposed binder length, number of backbones, and any concerns (large target, polar epitope, etc.). Wait for user approval before proceeding.
Phases 2--5 (generation + validation): Run autonomously after approval. Report results at natural checkpoints (after each round).
Design Principles
- Fixed tool pipeline -- The tools and their order are prescribed (see each phase). Parameters and scale are flexible based on context.
- Early termination -- Run cheap checks (solubility) before expensive ones (structure prediction) to save compute.
- Iterative refinement -- Start with high-diversity sequence sampling and progressively lower temperature for convergence.
Phase 0: Target Preparation
Input: Raw PDB file (from database or user upload).
Steps
- Parse PDB for metadata -- Extract experimental method, resolution, chain composition, sequences.
- Clean structure -- Remove waters/heterogens, restore missing heavy atoms, add hydrogens at physiological pH, standardize non-canonical residues, detect disulfide bonds.
- Select chains of interest -- Extract relevant chain(s). Can be user-specified or automatic.
- Quality checks (warnings, not blockers):
- Targets >2000 residues are computationally expensive downstream
- Multi-chain targets add complexity to epitope selection
Output: Cleaned PDB with metadata (sequences, chain IDs, residue counts).
Phase 1: Epitope Selection
Purpose: Identify and rank candidate binding sites on the target surface.
Step 1: Binding Interface Prediction (PeSTo)
Run PeSTo (Protein Structure Transformer, model i_v4_1) to get per-residue protein-protein binding probabilities (0.0--1.0). Classify residues above a threshold (recommended: 0.3) as binding candidates.
Step 2: Glycosylation Filtering (Optional)
Run glycosylation prediction ensemble (EMNgly, LMNglyPred, ISOGlyP) to identify residues at risk of post-translational glycosylation. Penalize high-risk sites during scoring since glycans physically occlude the surface.
Step 3: DBSCAN Spatial Clustering
Cluster binding residues by CA coordinates using DBSCAN to group spatially proximal residues into discrete epitope regions. Discard unclustered noise residues.
Step 4: Hotspot Selection
Rank residues within each cluster by composite score that weighs:
- PeSTo binding probability (primary signal)
- Hydrophobic/aromatic character (favorable for binding interfaces)
- Glycosylation risk (penalty)
Select the best cluster and pick the top-scoring residues (recommended: 3--6) as hotspot residues for RFdiffusion.
Scoring weights and parameter defaults:
references/parameters.md-- load when adjusting epitope selection behavior.
Phase 1.5: Binder Length Optimization
Runs only if the user did not specify a custom binder length.
Approach
- Compute pairwise CA distances between all epitope residues.
- Take the maximum distance as the epitope span.
- Choose a binder length proportional to the span -- larger epitopes need longer binders to cover the interface. General guidelines:
| Epitope Span | Suggested Binder Length |
|---|---|
| < 20 A (compact) | 50--80 residues |
| 20--40 A (medium) | 70--110 residues |
| 40--60 A (large) | 90--140 residues |
| > 60 A (extended) | 120--180 residues |
Binder length should stay within 50--200 residues. When in doubt, err toward longer binders.
Phase 2: Backbone Generation (RFdiffusion)
Purpose: Generate diverse backbone structures complementary to the target epitope.
Inputs
- Cleaned target PDB from Phase 0
- Hotspot residues from Phase 1 (e.g.,
["A42", "A45", "A78"]) - Binder length range from Phase 1.5 (e.g.,
"70-110") - Number of designs (scale based on target difficulty -- typically 50--200 backbones)
Execution
- Each call produces backbone PDB files with backbone atoms (N, CA, C, O) for both binder (chain X) and target chains.
- Parallelization: Each RFdiffusion call should generate at most 10 designs. For larger counts, split into parallel calls (e.g., 100 designs = 10 parallel calls x 10 designs each). This is significantly faster than a single call with
num_designs=100. - Optional beta model variant available (
use_beta_model, default: off).
Output
Independent backbone scaffolds, registered for downstream sequence design.
Design tips for hotspot selection, target truncation, and scale:
references/design-tips.md-- load when planning a design campaign or troubleshooting RFdiffusion.
Phase 3: Sequence Design
Purpose: Design amino acid sequences predicted to fold into each backbone.
Temperature Schedule
Use exponential decay across sequence rounds to shift from exploration to convergence. Recommended starting point:
| Round | Temperature | Behavior |
|---|---|---|
| 1 | ~1.0 | High diversity -- broad exploration |
| 2 | ~0.1 | Convergent -- focused sampling |
| 3 | ~0.01 | Refinement -- near-deterministic |
Adjust the decay rate and number of rounds based on how quickly valid designs emerge. If round 1 already produces many valid designs, fewer rounds are needed.
Tool Selection
ProteinMPNN (default): Graph neural network for inverse folding.
model_type:"soluble"chains_to_design:"X"(binder only; target fixed)omit_AAs:"X"(exclude non-standard)- Confidence: native 0--1 range (higher = better)
ESM-IF1 (alternative): Structure-conditioned language model.
mode:"sample",multichain_backbone: True- Confidence normalization:
1 / (1 + exp(-(log_likelihood + 2.0) * 1.5))
Sequence Filtering
- Over-generate candidates (e.g., 10x the desired count), then keep only the top sequences by confidence score.
- Deduplicate -- skip any sequence already generated in the current or prior rounds.
Full parameter tables:
references/parameters.md
Phase 4: Validation
Purpose: Multi-metric validation with early termination for failing designs.
Validation Pipeline (ordered by cost)
Sequence --> Solubility --> Boltz-2 --> Metrics --> US-align --> DockQ --> Pass/Fail
Step 1: Solubility Screening
Fine-tuned ESM-2 classifier predicts E. coli expression solubility. Binary decision:
- Soluble: proceed to structure prediction
- Insoluble: early termination -- skip all remaining steps
Step 2: Structure Prediction (Boltz-2)
Predict binder-target complex structure from sequences.
- Binder chain X: single-sequence mode (no MSA)
- Target chains: MSAs auto-generated by Boltz-2
Step 3: Extract Confidence Metrics
From Boltz-2 output:
- pLDDT: Per-residue confidence averaged across full structure (0--100)
- iPTM: Interface predicted TM-score across chain interfaces (0--1)
- ipSAE:
avg(d0res_min, d0res_max)for binder-target pairs only (0--1). Skipped if no interface detected.
Step 4: Structural Alignment (US-align)
Compare predicted binder (chain X) vs RFdiffusion backbone (chain X) in monomer mode.
- Output: TM-score (0--1) -- whether the sequence folds into the intended shape
Step 5: Interface Quality (DockQ)
Evaluate predicted complex vs RFdiffusion design. Binder-target pairs only (e.g., [["X", "A"]]).
- Output: DockQ score (0--1) plus iRMSD, LRMSD, fnat components
Step 6: Pass/Fail Decision
A binder passes only if ALL metrics meet their thresholds (AND logic). Check all metrics and report all failures, not just the first.
Recommended thresholds (balanced):
| Metric | Threshold | Direction |
|---|---|---|
| pLDDT | >= 70.0 | Higher is better |
| iPTM | >= 0.65 | Higher is better |
| ipSAE | >= 0.5 | Higher is better |
| TM-score | >= 0.5 | Higher is better |
| DockQ | >= 0.23 | Higher is better |
Adjust thresholds based on campaign goals -- use lenient thresholds for difficult targets to avoid rejecting everything, or strict thresholds when high confidence is required.
Lenient and strict threshold variants:
references/parameters.mdDetailed metric definitions and interpretation:references/validation-metrics.md-- load when interpreting results or adjusting thresholds.
Phase 5: Round Advancement
Purpose: After validating all designs in a round, decide whether to continue or finalize.
Decision Guidelines
After each validation round, decide the next action:
- Finalize -- If enough valid designs have been found (target met), or if the user is satisfied with results so far.
- Advance sequence round -- If the current backbone produced some promising results, try new sequences at a lower temperature to refine around successful folds. Same backbone, narrower sampling.
- Advance backbone -- If the current backbone's sequence rounds are exhausted or producing diminishing returns, move to the next backbone and reset the temperature schedule.
- Finalize with partial results -- If all backbones and rounds are exhausted, report whatever valid designs were found.
Use the valid design rate and per-metric failure patterns to guide the decision. If most failures are on the same metric (e.g., DockQ), the issue is likely the backbone geometry rather than sequence design -- advancing to a new backbone is more productive than more sequence rounds.
Reporting
- Round report: After each round -- iterations tested, valid found, failures by metric, temperature used, duplicates skipped.
- Final report: At completion -- comprehensive summary of entire design campaign.
Reference Files
| File | Contents | Load when... |
|---|---|---|
references/parameters.md |
All parameter tables by phase: epitope scoring, backbone generation, sequence design, validation thresholds (lenient/balanced/strict), Boltz-2 config, iteration limits | Adjusting parameters or troubleshooting thresholds |
references/validation-metrics.md |
Metric definitions: pLDDT, iPTM, ipSAE, TM-score, DockQ, Solubility -- range, source, interpretation, pipeline usage | Interpreting validation results or explaining metrics |
references/design-tips.md |
Practical guidance: hotspot selection heuristics, target truncation for large proteins, design scale recommendations, common failure modes | Planning a design campaign or troubleshooting poor results |
More from aminoanalytica/amina-skills
uniprot-database
Query and retrieve protein sequences, annotations, and functional data from UniProt. Supports text search, ID mapping between databases, batch downloads, and access to Swiss-Prot (reviewed) and TrEMBL (predicted) entries.
29rdkit
Python cheminformatics library for molecular manipulation and analysis. Parse SMILES/SDF/MOL formats, compute descriptors (MW, LogP, TPSA), generate fingerprints (Morgan, MACCS), perform substructure queries with SMARTS, create 2D/3D geometries, calculate similarity, and run chemical reactions.
28pdb-database
Query and retrieve protein/nucleic acid structures from RCSB PDB. Use when you need to search the PDB database for structures or metadata. Supports text, sequence, and structure-based searches, coordinate downloads, and metadata retrieval for structural biology workflows.
28biopython
Python toolkit for computational biology. Use when asked to "parse FASTA", "read GenBank", "query NCBI", "run BLAST", "analyze protein structure", "build phylogenetic tree", or work with biological sequences. Handles sequence I/O, database access, alignments, structure analysis, and phylogenetics.
28alphafold-database
Query and retrieve AI-predicted protein structures from DeepMind's AlphaFold database. Fetch structures via UniProt accession, interpret pLDDT/PAE confidence scores, and access bulk proteome data for structural biology workflows.
28