NYC
skills/letta-ai/skills/protein-assembly

protein-assembly

SKILL.md

Protein Assembly Skill

This skill provides structured guidance for designing fusion protein gBlock sequences that combine multiple protein components (antibody fragments, fluorescent proteins, enzyme domains) into a single optimized DNA construct.

When to Use This Skill

This skill applies to tasks that involve:

  • Designing fusion proteins from multiple sources (PDB, plasmids, protein databases)
  • Creating gBlock sequences with specific linker requirements
  • Codon optimization for GC content constraints
  • Combining fluorescent proteins with specific excitation/emission wavelengths
  • Assembling multi-domain proteins with N-terminal methionine removal

Structured Approach

Phase 1: Information Gathering and Cataloging

Objective: Collect ALL required sequence data before any design work begins.

  1. Inventory input files completely

    • Read ALL input files in their entirety (avoid truncated reads)
    • For GenBank (.gb) files, parse the complete file to extract CDS/protein sequences
    • For FASTA files, extract all sequences with their identifiers
    • For PDB ID lists, note all IDs for batch retrieval
  2. Fetch external sequences systematically

    • Query PDB API for each protein ID to retrieve amino acid sequences
    • Query relevant protein databases (e.g., fpbase for fluorescent proteins)
    • Document each retrieved sequence with its source and identifier
  3. Create a sequence catalog

    • List all available protein sequences with clear labels
    • Note the source of each sequence (PDB ID, plasmid CDS, database)
    • Identify any missing sequences before proceeding

Phase 2: Protein Identification and Selection

Objective: Match proteins to task requirements using specific criteria.

  1. Wavelength matching for fluorescent proteins

    • Search for proteins with exact wavelength matches (not approximate)
    • Verify both excitation AND emission peaks against requirements
    • Document the selected donor and acceptor proteins with rationale
  2. Binding domain identification

    • Identify proteins that bind specific molecules (substrates, ligands)
    • Cross-reference PDB entries with known binding partners
    • Verify binding capability through database annotations
  3. Target protein identification

    • For antibody-related tasks, identify the target antigen
    • Use sequence homology or database lookups as needed
    • Document the identification method and confidence

Phase 3: Sequence Processing

Objective: Prepare individual protein sequences for fusion.

  1. N-terminal methionine handling

    • Remove N-terminal methionines from ALL internal proteins
    • Keep only the first protein's N-terminal methionine (if required)
    • Document which sequences were modified
  2. Sequence validation

    • Verify each sequence is complete and valid
    • Check for unusual amino acids or sequence artifacts
    • Confirm sequences match expected lengths

Phase 4: Fusion Protein Assembly

Objective: Construct the complete fusion protein sequence.

  1. Follow the specified protein order exactly

    • Do not deviate from the required arrangement
    • Document the order: [Protein1]-[Linker]-[Protein2]-[Linker]-...
  2. Design appropriate linkers

    • Use GS (Glycine-Serine) linkers of specified length
    • Common patterns: (GGGGS)n or (GS)n where n provides required length
    • Ensure linkers fall within length constraints (e.g., 5-20 amino acids)
  3. Assemble the complete protein sequence

    • Concatenate proteins with linkers in correct order
    • Verify the assembled sequence is continuous and valid

Phase 5: Codon Optimization and DNA Generation

Objective: Convert protein to optimized DNA sequence.

  1. Initial codon translation

    • Convert each amino acid to a codon
    • Use a standard codon table for the target organism
  2. GC content optimization

    • Calculate GC content in sliding windows (e.g., 50 nucleotides)
    • Identify windows outside acceptable range (e.g., 30-70%)
    • Swap synonymous codons to bring GC content within range
    • Re-verify after each swap
  3. Length verification

    • Confirm DNA sequence meets length constraints (e.g., ≤3000 nt)
    • If too long, review design choices (linker lengths, protein selections)

Phase 6: Output Generation

Objective: Create the required output file(s).

  1. Write output immediately after assembly

    • Do not delay output file creation
    • Write to the exact path specified in requirements
  2. Include appropriate formatting

    • Follow any specified format (plain text, FASTA, etc.)
    • Include headers or metadata if required
  3. Verify output file exists

    • Confirm the file was created successfully
    • Verify file contents match the designed sequence

Verification Checkpoints

After Phase 1:

  • All input files read completely (no truncation)
  • All external sequences retrieved
  • Sequence catalog is complete

After Phase 2:

  • All required proteins identified
  • Wavelength/binding requirements verified
  • Selection rationale documented

After Phase 3:

  • N-terminal methionines handled correctly
  • All sequences validated

After Phase 4:

  • Protein order matches requirements
  • Linkers meet length constraints
  • Complete fusion sequence assembled

After Phase 5:

  • GC content within range in ALL windows
  • DNA length within constraints

After Phase 6:

  • Output file exists at specified path
  • File contents are correct

Common Pitfalls

  1. Incomplete file reading

    • GenBank files may be large; ensure complete parsing
    • Extract CDS translations, not just raw sequences
  2. Approximate wavelength matching

    • Use exact values, not "close enough" matches
    • Verify both excitation AND emission, not just one
  3. Forgetting N-terminal methionines

    • Internal proteins in fusions should have Met removed
    • Only the first protein retains its N-terminal Met
  4. Ignoring GC content windows

    • Check ALL sliding windows, not just overall GC%
    • Optimize problematic regions with synonymous codons
  5. Delayed output generation

    • Create output file as soon as sequence is ready
    • Do not continue gathering information after design is complete
  6. Information gathering loops

    • Set a clear stopping point for research
    • Progress to execution even with incomplete information
    • A partial solution is better than no solution

Output-First Strategy

If time or resources are constrained:

  1. Create the output file early, even with placeholders
  2. Update the file as each component is determined
  3. Ensure a valid (if imperfect) output exists at task end

This ensures the primary deliverable exists, which can be refined with additional information.

Weekly Installs
18
Repository
letta-ai/skills
First Seen
Jan 24, 2026
Installed on
claude-code13
codex12
gemini-cli12
opencode12
antigravity11
cursor9