Protein Engineering with AI
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Protein engineering with AI applies machine learning to the design, optimization, and generation of proteins with desired functional properties. Proteins perform virtually every biological function – enzyme catalysis, structural support, signaling, immune defense – and engineered proteins have enormous therapeutic, industrial, and research value. Traditional protein engineering required years of iterative experimental cycles. AI collapses this timeline: models can suggest variants predicted to improve stability, activity, or binding, dramatically reducing the experimental search space and enabling the design of entirely new proteins that evolution never produced.
</div>
__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Protein engineering''' – The modification or de novo design of proteins to achieve desired properties (stability, activity, specificity).
* '''Directed evolution''' – A laboratory technique using iterative rounds of random mutation and selection to improve proteins; recognized with the 2018 Nobel Prize in Chemistry.
* '''Sequence–function landscape''' – The mapping from protein sequence to function; engineering navigates this landscape.
* '''Fitness landscape''' – The mapping from sequence to some fitness measure (stability, activity); engineering seeks fitness peaks.
* '''Zero-shot variant effect prediction''' – Predicting which protein mutations improve function without any experimental data on that protein.
* '''Protein stability engineering''' – Designing mutations that increase thermostability, solubility, or shelf life.
* '''Enzyme design''' – Engineering or creating new enzymes with desired catalytic activity.
* '''Binding affinity optimization''' – Improving how tightly a protein binds its target; key for antibody and drug engineering.
* '''Protein language model (PLM)''' – A language model pre-trained on protein sequences; examples include ESM-2, ProGen2, and ProtGPT2.
* '''Inverse folding''' – Given a target 3D structure, design a sequence that will fold to it; ProteinMPNN, ESM-IF.
* '''RFDiffusion''' – A diffusion model generating novel protein backbone structures conditioned on binding constraints.
* '''ProteinMPNN''' – A message-passing neural network for inverse folding; sequence design for given protein backbones.
* '''Antibody engineering''' – Designing antibodies with desired binding specificity, affinity, and biophysical properties.
* '''Directed evolution in silico''' – Using ML fitness predictors to simulate directed evolution without wet-lab experiments.
* '''ProGen / ProGen2 / ESM-2''' – Large protein language models enabling sequence generation and variant prediction.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Protein engineering AI accelerates the two core engineering tasks: '''variant optimization''' (improving an existing protein) and '''de novo design''' (creating entirely new proteins).

'''Variant effect prediction''': Given a protein sequence and a mutation, predict whether the mutation improves or worsens a desired property. Zero-shot approaches use protein language model log-likelihoods: a mutation that the PLM assigns higher probability than the wild-type residue is more evolutionarily "accepted" and likely functionally tolerable. ESM-2 log-likelihood ratios correlate well with experimental fitness measurements from deep mutational scanning (DMS). This enables in silico screening of millions of variants.

'''Inverse folding for sequence design''': Given a desired 3D structure (from AlphaFold or experimental data), design sequences that will fold to that structure.
ProteinMPNN (Dauparas et al., 2022) is the standard tool: it takes a backbone structure as input and generates sequences compatible with that backbone, achieving experimental success rates of 50–80% for designed sequences. This replaces computationally expensive Rosetta-based design.

'''RFDiffusion for de novo backbone generation''': RFDiffusion (Watson et al., 2023) generates novel protein backbone structures by running a diffusion process in the space of protein structures. It can generate binders to arbitrary targets, symmetric assemblies, enzyme active sites, and more, with experimental validation. Combined with ProteinMPNN (RFDiffusion → backbone → ProteinMPNN → sequence → experimental validation), this pipeline has created proteins that bind previously undruggable targets.

'''Antibody engineering''': Antibodies are the dominant class of biologic drugs (~$250B market). AI systems such as AbMap, AntiBERTy, and IgLM, along with proprietary tools at Absci, Insilico Medicine, and BigHat Biosciences, design and optimize antibodies by training on billions of known antibody sequences. They predict binding affinity, developability (manufacturability), and immunogenicity.
</div>
<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Zero-shot protein variant prediction with ESM-2:'''
<syntaxhighlight lang="python">
import torch
from transformers import EsmTokenizer, EsmForMaskedLM

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def compute_variant_scores(wild_type_seq: str) -> dict:
    """
    Compute zero-shot fitness scores for all single-site mutants.
    Uses ESM-2 masked marginal log-likelihood.
    """
    encoded = tokenizer(wild_type_seq, return_tensors='pt')
    N = len(wild_type_seq)
    scores = {}
    with torch.no_grad():
        # Mask each position in turn and read out per-residue log-probabilities
        for pos in range(N):
            masked_input = encoded['input_ids'].clone()
            masked_input[0, pos + 1] = tokenizer.mask_token_id  # +1 for [CLS]
            output = model(input_ids=masked_input,
                           attention_mask=encoded['attention_mask'])
            log_probs = torch.log_softmax(output.logits[0, pos + 1], dim=-1)
            wt_aa = wild_type_seq[pos]
            wt_score = log_probs[tokenizer.convert_tokens_to_ids(wt_aa)].item()
            for mut_aa in AMINO_ACIDS:
                if mut_aa == wt_aa:
                    continue
                mut_score = log_probs[tokenizer.convert_tokens_to_ids(mut_aa)].item()
                delta_llr = mut_score - wt_score  # positive = preferred over WT
                scores[f"{wt_aa}{pos + 1}{mut_aa}"] = delta_llr
    return scores

# Rank mutations by predicted fitness improvement
scores = compute_variant_scores("MKTAYIAKQRQISFVK...")
top_mutations = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:20]
print("Top predicted beneficial mutations:")
for variant, score in top_mutations:
    print(f"  {variant}: {score:.3f}")

# Combinatorial design: combine top singles via in silico directed evolution
# 1. Select top 5 single mutants
# 2. Generate all pairwise combinations
# 3. Score each combination
# 4. Prioritize top combinations for experimental validation
</syntaxhighlight>
; Protein engineering AI tools
: '''Variant prediction (zero-shot)''' – ESM-2 (Meta), EVE, ProteinGym benchmark
: '''Inverse folding (sequence design)''' – ProteinMPNN, LigandMPNN, ESM-IF
: '''De novo backbone design''' – RFDiffusion, Chroma (Generate Biomedicines), FrameDiff
: '''Antibody engineering''' – AntiFold, AbMap, IgLM, Absolut
: '''Structure-based design''' – PyRosetta, Boltz-1, AlphaFold3 + design
</div>
<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Protein Engineering AI Methods
! Method !! Design Type !! Experimental Success Rate !! Key Advantage
|-
| Directed evolution (wet lab) || Optimization || High (iterative) || No model needed
|-
| ESM-2 zero-shot || Variant ranking || ~50–70% of top variants better || No data needed
|-
| ProteinMPNN || Sequence for backbone || 50–80% fold correctly || Fast, reliable
|-
| RFDiffusion || De novo binder/design || 10–30% experimental success || Entirely new proteins
|-
| Rosetta + ML || Stability engineering || 30–60% improvement || Interpretable energy terms
|}
'''Failure modes''':
* In silico predictions not matching wet-lab results (correlation ~0.5–0.7, not 1.0).
* Epistatic interactions – individually beneficial mutations that are incompatible in combination.
* Distribution shift – models trained on natural protein diversity may not generalize to extreme engineering targets.
* Fitness function misspecification – optimizing the wrong property (stability alone does not guarantee function).
* False negatives – discarding viable variants due to model uncertainty.
</div>
<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Protein engineering AI is evaluated along four axes:
# '''Spearman correlation on DMS datasets''' – the ProteinGym benchmark provides 250+ deep mutational scanning datasets for rigorous evaluation.
# '''Experimental hit rate''' – what fraction of AI-selected variants show improved function in the lab? Compare against random selection and saturation mutagenesis baselines.
# '''Top-K precision''' – of the top 10 AI-predicted variants, how many actually improve fitness?
# '''Generalization''' – evaluate on proteins with low sequence identity to the training data.
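The first and third metrics can be computed directly once a set of variants has both model scores and measured fitness values. The sketch below is a minimal illustration with invented toy numbers: the helper names <code>spearman_rho</code> and <code>top_k_precision</code> are hypothetical, and a rigorous evaluation would run the ProteinGym harness on full DMS datasets.
<syntaxhighlight lang="python">
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def top_k_precision(predicted, measured, k=10, wt_fitness=0.0):
    """Fraction of the k highest-scored variants whose measured fitness beats wild type."""
    top_idx = np.argsort(predicted)[::-1][:k]  # indices of the k best predictions
    return float(np.mean(np.asarray(measured)[top_idx] > wt_fitness))

# Toy data: model scores and measured fitness (relative to wild type) for six variants
pred = np.array([2.1, 1.5, 0.3, -0.2, -1.0, -2.4])
meas = np.array([0.8, 0.5, -0.1, 0.2, -0.6, -1.1])
rho = spearman_rho(pred, meas)           # strong rank agreement
prec = top_k_precision(pred, meas, k=3)  # 2 of the top-3 picks beat wild type
</syntaxhighlight>
A correlation near 1.0 and a high top-K precision together indicate the model both ranks variants correctly overall and concentrates true improvements at the top of its list, which is what matters when only a few dozen variants can be assayed.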
</div>
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an AI-assisted protein engineering campaign:
# '''Problem definition''' – specify the target property (thermostability, catalytic activity, binding affinity), the measurement assay, and the sequence library size budget.
# '''Zero-shot screen''' – use ESM-2 log-likelihood ratios to rank all possible single mutants; select the top 50 for experimental testing.
# '''Round 1''' – assay the 50 variants; use the results to train a Gaussian process surrogate model.
# '''Bayesian optimization''' – use the surrogate to suggest the next 50 variants, balancing exploration (high uncertainty) and exploitation (high predicted fitness).
# '''Iteration''' – 3–5 rounds are typically sufficient to reach the engineering target.
# '''Combinatorial validation''' – test top singles, combine the best singles, then validate the combined variants.
[[Category:Artificial Intelligence]]
[[Category:Protein Engineering]]
[[Category:Bioinformatics]]
</div>
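The surrogate-plus-acquisition loop at the heart of the Creating campaign (train a Gaussian process on assayed variants, then pick the next batch) can be sketched as follows. This is a minimal illustration under stated assumptions, not a production pipeline: sequences are one-hot encoded, scikit-learn's <code>GaussianProcessRegressor</code> stands in for the surrogate, the assay readouts are invented, and <code>ucb_round</code> is a hypothetical helper using an upper-confidence-bound acquisition rule.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flatten a protein sequence into a 20*L one-hot feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def ucb_round(candidates, tested_seqs, tested_fitness, batch=50, beta=2.0):
    """One Bayesian-optimization round: fit a GP surrogate, pick the next batch by UCB."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0), normalize_y=True)
    gp.fit(np.array([one_hot(s) for s in tested_seqs]), np.array(tested_fitness))
    X = np.array([one_hot(s) for s in candidates])
    mean, std = gp.predict(X, return_std=True)
    ucb = mean + beta * std        # exploitation (mean) + exploration bonus (std)
    order = np.argsort(ucb)[::-1]  # highest acquisition value first
    return [candidates[i] for i in order[:batch]]

# Toy usage: after "assaying" three variants, pick 2 of 4 candidates for the next round
tested = ["ACDE", "ACDG", "AKDE"]
fitness = [0.1, 0.4, 0.9]  # invented assay readouts
pool = ["AKDG", "ACDC", "AKCE", "GCDE"]
next_batch = ucb_round(pool, tested, fitness, batch=2)
</syntaxhighlight>
Raising <code>beta</code> favors exploration of uncertain regions of sequence space; lowering it concentrates the batch on variants the surrogate already predicts to be fit.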