Protein Engineering

From BloomWiki

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Protein engineering with AI applies machine learning to the design, optimization, and generation of proteins with desired functional properties. Proteins perform virtually every biological function — enzyme catalysis, structural support, signaling, immune defense — and engineered proteins have enormous therapeutic, industrial, and research value. Traditional protein engineering required years of iterative experimental cycles. AI collapses this timeline dramatically: models can suggest variants predicted to improve stability, activity, or binding, dramatically reducing the experimental search space and enabling the design of entirely new proteins that evolution never produced.

Remembering[edit]

  • Protein engineering — The modification or de novo design of proteins to achieve desired properties (stability, activity, specificity).
  • Directed evolution — A laboratory technique using iterative rounds of random mutation and selection to improve proteins; Nobel Prize 2018.
  • Sequence-function landscape — The mapping from protein sequence to function; engineering navigates this landscape.
  • Fitness landscape — The mapping from sequence to some fitness measure (stability, activity); engineering seeks fitness peaks.
  • Zero-shot variant effect prediction — Predicting which protein mutations improve function without any experimental data on that protein.
  • Protein stability engineering — Designing mutations that increase thermostability, solubility, or shelf life.
  • Enzyme design — Engineering or creating new enzymes with desired catalytic activity.
  • Binding affinity optimization — Improving how tightly a protein binds its target; key for antibody and drug engineering.
  • Protein language model (PLM) — A language model pre-trained on protein sequences; ESM-2, ProGen2, ProtGPT2.
  • Inverse folding — Given a target 3D structure, design a sequence that will fold to it; ProteinMPNN, ESM-IF.
  • RFDiffusion — A diffusion model generating novel protein backbone structures conditioned on binding constraints.
  • ProteinMPNN — A message-passing neural network for inverse folding; sequence design for given protein backbones.
  • Antibody engineering — Designing antibodies with desired binding specificity, affinity, and biophysical properties.
  • Directed evolution in silico — Using ML fitness predictors to simulate directed evolution without wet lab experiments.
  • ProGen / ProGen2 / ESM-2 — Large protein language models enabling sequence generation and variant effect prediction.

Understanding[edit]

Protein engineering AI accelerates the two core engineering tasks: variant optimization (improving an existing protein) and de novo design (creating entirely new proteins).

Variant effect prediction: Given a protein sequence and a mutation, predict whether the mutation improves or worsens a desired property. Zero-shot approaches use protein language model log-likelihoods: a mutation to which the PLM assigns higher probability than the wild-type residue is more evolutionarily "accepted" and more likely to be functionally tolerated. ESM-2 log-likelihood ratios correlate well with experimental fitness measurements from deep mutational scanning (DMS), which enables in silico screening of millions of variants.
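
In symbols, a substitution to amino acid <math>a</math> at position <math>i</math> is commonly scored by the masked-marginal log-likelihood ratio (the quantity computed in the code example under Applying):

<math>s(i, a) = \log p_\theta\left(x_i = a \mid x_{\setminus i}\right) - \log p_\theta\left(x_i = x_i^{\mathrm{wt}} \mid x_{\setminus i}\right)</math>

A positive score means the language model assigns higher probability to the mutant residue than to the wild-type residue at the masked position.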

Inverse folding for sequence design: Given a desired 3D structure (from AlphaFold or experimental data), design sequences that will fold to that structure. ProteinMPNN (Dauparas et al., 2022) is the standard tool — it takes a backbone structure as input and generates sequences compatible with that backbone, achieving experimental success rates of 50–80% for designed sequences. This replaces computationally expensive Rosetta-based design.
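
As a concrete illustration, the open-source ProteinMPNN repository provides a run script that takes a backbone PDB file and samples sequences for it. The sketch below invokes that script via subprocess; the file paths are hypothetical, and flag names should be checked against the repository's current README.

<syntaxhighlight lang="python">
import subprocess

# Minimal ProteinMPNN invocation (paths are hypothetical; see the repository
# README at github.com/dauparas/ProteinMPNN for the full set of options).
subprocess.run([
    "python", "ProteinMPNN/protein_mpnn_run.py",
    "--pdb_path", "designs/target_backbone.pdb",  # backbone to design sequences for
    "--out_folder", "designs/mpnn_sequences",     # designed sequences are written here as FASTA
    "--num_seq_per_target", "8",                  # sequences sampled per backbone
    "--sampling_temp", "0.1",                     # lower temperature = more conservative designs
    "--seed", "37",
], check=True)
</syntaxhighlight>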

RFDiffusion for de novo backbone generation: RFDiffusion (Watson et al., 2023) generates novel protein backbone structures by running the diffusion process in the space of protein structures. It can generate binders to arbitrary targets, symmetric assemblies, enzyme active sites, and more — with experimental validation. Combined with ProteinMPNN (RFDiffusion → backbone → ProteinMPNN → sequence → validate), this pipeline has created proteins binding previously undruggable targets.
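
Before synthesis, designs from this pipeline are typically triaged with a self-consistency check: fold each designed sequence with a structure predictor and keep only designs whose predicted backbone matches the intended one. The sketch below implements the geometric core of that check, a superposition RMSD between two C-alpha coordinate arrays; the surrounding pipeline steps are summarized as comments, and the thresholds are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) C-alpha coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)               # center both coordinate sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)    # SVD of the 3x3 covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))   # guard against an improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt  # optimal rotation, applied as P @ R
    return float(np.sqrt(((P @ R - Q) ** 2).sum() / len(P)))

# Self-consistency triage of RFDiffusion + ProteinMPNN designs (schematic):
#   1. RFDiffusion proposes backbones satisfying the design constraints.
#   2. ProteinMPNN designs several sequences per backbone.
#   3. A structure predictor (e.g. ESMFold) folds each designed sequence.
#   4. Keep designs where kabsch_rmsd(predicted CA coords, designed backbone CA coords)
#      is low (commonly around 2 Angstroms) and predicted confidence (pLDDT) is high.
</syntaxhighlight>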

Antibody engineering: Antibodies are the dominant class of biologic drugs (~$250B market). AI systems like AbMap, AntiBERTy, IgLM, and proprietary tools at Absci, Insilico Medicine, and BigHat Biosciences design and optimize antibodies by training on billions of known antibody sequences. They predict binding affinity, developability (manufacturability), and immunogenicity.

Applying[edit]

Zero-shot protein variant prediction with ESM-2:

<syntaxhighlight lang="python">
import torch
from transformers import EsmTokenizer, EsmForMaskedLM

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def compute_variant_scores(wild_type_seq: str) -> dict:
    """
    Compute zero-shot fitness scores for all single-site mutants.
    Uses the ESM-2 masked marginal log-likelihood ratio.
    """
    encoded = tokenizer(wild_type_seq, return_tensors="pt")
    N = len(wild_type_seq)
    scores = {}
    with torch.no_grad():
        for pos in range(N):
            # Mask one position at a time and read off the model's
            # distribution over amino acids at that position.
            masked_input = encoded["input_ids"].clone()
            masked_input[0, pos + 1] = tokenizer.mask_token_id  # +1 for the [CLS] token
            output = model(input_ids=masked_input,
                           attention_mask=encoded["attention_mask"])
            log_probs = torch.log_softmax(output.logits[0, pos + 1], dim=-1)
            wt_aa = wild_type_seq[pos]
            wt_score = log_probs[tokenizer.convert_tokens_to_ids(wt_aa)].item()
            for mut_aa in AMINO_ACIDS:
                if mut_aa == wt_aa:
                    continue
                mut_score = log_probs[tokenizer.convert_tokens_to_ids(mut_aa)].item()
                delta_llr = mut_score - wt_score  # positive = preferred over wild type
                scores[f"{wt_aa}{pos + 1}{mut_aa}"] = delta_llr
    return scores

# Rank mutations by predicted fitness improvement
scores = compute_variant_scores("MKTAYIAKQRQISFVK...")  # truncated example sequence
top_mutations = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:20]
print("Top predicted beneficial mutations:")
for variant, score in top_mutations:
    print(f"  {variant}: {score:.3f}")

# Combinatorial design: combine top singles via in silico directed evolution
#   1. Select the top 5 single mutants
#   2. Generate all pairwise combinations
#   3. Score each combination
#   4. Prioritize top combinations for experimental validation
</syntaxhighlight>
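
The combinatorial step outlined in the closing comments can be sketched as a simple additive approximation: score a double mutant as the sum of its single-mutant log-likelihood ratios. This is a minimal sketch rather than a standard API; the additivity assumption ignores epistasis, which is one reason the top combinations still need experimental validation.

<syntaxhighlight lang="python">
from itertools import combinations

def score_pairwise_combinations(single_scores: dict, top_k: int = 5) -> list:
    """Additive approximation: a double mutant scores as the sum of its
    single-mutant delta log-likelihood ratios (ignores epistasis)."""
    top_singles = sorted(single_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    combos = []
    for (v1, s1), (v2, s2) in combinations(top_singles, 2):
        # Skip pairs mutating the same position (the position is the middle of e.g. "A42G")
        if v1[1:-1] == v2[1:-1]:
            continue
        combos.append((f"{v1}+{v2}", s1 + s2))
    return sorted(combos, key=lambda x: x[1], reverse=True)

# Prioritize the highest-scoring double mutants for wet-lab validation
for combo, combined_score in score_pairwise_combinations(scores)[:10]:
    print(f"  {combo}: {combined_score:.3f}")
</syntaxhighlight>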

{| class="wikitable"
|+ Protein engineering AI tools
! Task !! Representative tools
|-
| Variant prediction (zero-shot) || ESM-2 (Meta), EVE, ProteinGym benchmark
|-
| Inverse folding (sequence design) || ProteinMPNN, LigandMPNN, ESM-IF
|-
| De novo backbone design || RFDiffusion, Chroma (Generate Biomedicines), FrameDiff
|-
| Antibody engineering || AntiFold, AbMap, IgLM, Absolut
|-
| Structure-based design || PyRosetta, Boltz-1, AlphaFold3 + design
|}

Analyzing[edit]

{| class="wikitable"
|+ Protein engineering AI methods
! Method !! Design type !! Experimental success rate !! Key advantage
|-
| Directed evolution (wet lab) || Optimization || High (iterative) || No model needed
|-
| ESM-2 zero-shot || Variant ranking || ~50–70% of top variants better || No data needed
|-
| ProteinMPNN || Sequence for backbone || 50–80% fold correctly || Fast, reliable
|-
| RFDiffusion || De novo binder/design || 10–30% experimental success || Entirely new proteins
|-
| Rosetta + ML || Stability engineering || 30–60% improvement || Interpretable energy terms
|}

Failure modes:

  • In silico predictions not matching wet-lab results — rank correlations with experiment are typically ~0.5–0.7, not 1.0.
  • Epistatic interactions — individually beneficial mutations can be incompatible in combination.
  • Distribution shift — models trained on natural protein diversity may not generalize to extreme engineering targets.
  • Fitness function misspecification — optimizing the wrong property (stability alone does not guarantee function).
  • False negatives — viable variants discarded due to model uncertainty.

Evaluating[edit]

Protein engineering AI evaluation:

  1. Spearman correlation on DMS datasets: ProteinGym benchmark provides 250+ deep mutational scanning datasets for rigorous evaluation.
  2. Experimental hit rate: what fraction of AI-selected variants show improved function in the lab? Compare to random selection and saturation mutagenesis baselines.
  3. Top-K precision: of the top 10 AI-predicted variants, how many actually improve fitness? (A minimal sketch of metrics 1 and 3 follows after this list.)
  4. Generalization: evaluate on proteins with low sequence identity to training data.
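
A minimal sketch of metrics 1 and 3, assuming predicted and experimentally measured fitness values are available as parallel arrays (variable names are illustrative):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import spearmanr

def evaluate_predictor(predicted: np.ndarray, measured: np.ndarray,
                       wt_fitness: float, k: int = 10) -> dict:
    """Rank correlation on a DMS dataset plus top-K precision against wild type."""
    rho, _ = spearmanr(predicted, measured)          # metric 1: Spearman correlation
    top_k_idx = np.argsort(predicted)[::-1][:k]      # indices of the top-K predicted variants
    hits = np.sum(measured[top_k_idx] > wt_fitness)  # metric 3: how many actually improve
    return {"spearman": float(rho), "top_k_precision": float(hits) / k}

# Toy example with five variants and wild-type fitness defined as 0.0
pred = np.array([1.2, 0.4, -0.3, 0.9, 0.1])
meas = np.array([1.5, 0.2, -0.5, 1.1, -0.1])
print(evaluate_predictor(pred, meas, wt_fitness=0.0, k=3))
</syntaxhighlight>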

Creating[edit]

Designing an AI-assisted protein engineering campaign:

  1. Problem definition: specify target property (thermostability, catalytic activity, binding affinity), measurement assay, and sequence library size budget.
  2. Zero-shot screen: use ESM-2 log-likelihood ratios to rank all possible single mutants; select top 50 for experimental testing.
  3. Round 1: assay 50 variants; use results to train Gaussian process surrogate model.
  4. Bayesian optimization: use the surrogate to suggest the next 50 variants, balancing exploration (uncertainty) and exploitation (predicted fitness); a minimal sketch follows after this list.
  5. Iteration: 3–5 rounds are typically sufficient to reach the engineering target.
  6. Combinatorial validation: test top singles → combine the best singles → validate the combined variants.
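
Steps 3 and 4 can be sketched with a Gaussian process surrogate over one-hot encoded sequences and an upper-confidence-bound acquisition rule. This is a minimal illustration rather than a production active-learning loop; the encoding, kernel, and exploration weight are assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    """Flatten a sequence into a binary feature vector (20 features per position)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def suggest_next_round(assayed_seqs, assayed_fitness, candidate_seqs,
                       n_suggest=50, beta=2.0):
    """Fit a GP surrogate on assayed variants, then rank candidates by an
    upper confidence bound (predicted mean + beta * predictive std)."""
    X = np.array([one_hot(s) for s in assayed_seqs])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
    gp.fit(X, np.array(assayed_fitness))
    Xc = np.array([one_hot(s) for s in candidate_seqs])
    mean, std = gp.predict(Xc, return_std=True)
    ucb = mean + beta * std                    # exploitation + exploration
    order = np.argsort(ucb)[::-1][:n_suggest]
    return [candidate_seqs[i] for i in order]
</syntaxhighlight>

In round 2, assayed_seqs would be the 50 variants measured in round 1 and candidate_seqs the remaining library members; the same loop repeats in each subsequent round.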