AI for Genomics and Bioinformatics

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
AI for genomics and bioinformatics applies machine learning to the analysis of biological sequence data — DNA, RNA, and proteins — to understand the genetic basis of life, disease, and evolution. The sequencing revolution has generated petabytes of genomic data; extracting biological meaning from this data requires sophisticated computational tools. AI now drives breakthroughs in predicting protein structure (AlphaFold), identifying disease-causing genetic variants, designing new genes and proteins, and understanding gene regulation. Genomics AI is transforming medicine, agriculture, and our fundamental understanding of biology.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Genome''' — The complete set of DNA in an organism, containing all genetic information.
* '''DNA sequencing''' — Determining the order of nucleotide bases (A, T, G, C) in a DNA molecule.
* '''Single Nucleotide Polymorphism (SNP)''' — A single base-pair variation in the genome; millions of common SNPs have been catalogued in human populations.
* '''GWAS (Genome-Wide Association Study)''' — A statistical study associating genetic variants with traits or diseases across many individuals.
* '''Variant calling''' — The computational process of identifying genetic variants from sequencing data.
* '''RNA-seq''' — Sequencing RNA molecules to measure gene expression levels across the genome.
* '''Gene expression''' — The process by which information from a gene is used to synthesize gene products (RNA, proteins).
* '''Protein structure prediction''' — Predicting the 3D shape of a protein from its amino acid sequence.
* '''AlphaFold''' — DeepMind's AI system for protein structure prediction; AlphaFold2 reached near-experimental accuracy on many targets in 2020, a major advance on the 50-year-old problem of predicting structure from sequence.
* '''Sequence alignment''' — Comparing biological sequences to identify regions of similarity; fundamental to all bioinformatics.
* '''k-mer''' — A subsequence of length k; used to represent genomic sequences as features for ML models.
* '''Epigenomics''' — The study of heritable changes in gene expression not caused by DNA sequence changes; includes DNA methylation and histone modification.
* '''Single-cell sequencing''' — Measuring gene expression in individual cells rather than bulk tissue; reveals cellular heterogeneity.
* '''Polygenic risk score (PRS)''' — A score that aggregates many small-effect genetic variants, typically as a weighted sum of risk-allele counts, to estimate disease risk.
* '''CRISPR''' — A gene editing technology; AI assists in designing guide RNAs and predicting off-target effects.
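
To make the k-mer entry above concrete, the following is a minimal sketch that counts overlapping k-mers in a DNA string using only the Python standard library. The helper name <code>kmer_counts</code> and the example sequence are illustrative assumptions, not part of any particular toolkit.

<syntaxhighlight lang="python">
from collections import Counter

def kmer_counts(sequence: str, k: int) -> Counter:
    """Count all overlapping k-mers in a sequence (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Made-up example sequence.
counts = kmer_counts("ATGATGCA", k=3)
print(counts.most_common(2))
</syntaxhighlight>

The resulting counts can serve as a fixed-length feature vector (one entry per possible k-mer) for a downstream classifier; larger values of k capture more context at the cost of a much larger feature space.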
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Genomics is well suited to machine learning: biological sequences are discrete strings over small alphabets (A, T, G, C for DNA; 20 amino acids for proteins), and the relationship between sequence and function is complex, non-linear, and shaped by billions of years of evolutionary history, so it must be learned from data rather than written down as explicit rules.
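
Because DNA is a string over a four-letter alphabet, a common first step before feeding it to a model is one-hot encoding. The sketch below is one simple way to do this in plain Python; the function name <code>one_hot_encode</code> and the choice to map ambiguous bases (such as N) to an all-zero row are assumptions made for illustration.

<syntaxhighlight lang="python">
DNA_ALPHABET = "ACGT"

def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row per base; bases outside ACGT (e.g. N) become all zeros."""
    index = {base: i for i, base in enumerate(DNA_ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(DNA_ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded

matrix = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", matrix):
    print(base, row)
</syntaxhighlight>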

'''The protein folding revolution''': For 50 years, predicting a protein's 3D structure from its amino acid sequence was considered one of biology's grand challenges. DeepMind's AlphaFold2 (2020) reached near-experimental accuracy on many targets by combining multiple sequence alignments, attention-based networks with a geometry-aware structure module, and supervised training on experimentally determined structures, supplemented by self-distillation on its own predictions. The AlphaFold Protein Structure Database now holds roughly 200 million predicted structures, covering nearly all catalogued protein sequences, and has become a widely used resource in drug discovery.

'''Genomic language models''': Models pre-trained on DNA sequences learn rich representations of genomic function. DNABERT and Nucleotide Transformer use BERT-style masked-token objectives, while Evo is an autoregressive model trained on billions of nucleotides of genome sequence. Supervised sequence-to-function models such as Enformer predict gene expression from DNA sequence by learning the regulatory grammar encoded in non-coding regions. Models in this family have reported state-of-the-art results on diverse genomic prediction tasks.

'''Single-cell AI''': scRNA-seq measures gene expression in individual cells, generating sparse, high-dimensional count matrices (cells × genes). Analysis toolkits such as Seurat and Scanpy normalize these matrices and cluster cells by type; foundation models for single-cell data (scGPT, Geneformer) aim to support zero-shot cell type annotation, perturbation prediction, and drug response modeling.
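
As a rough illustration of how a cells × genes count matrix is usually prepared before clustering, the sketch below scales each cell to a common total and applies a log(1 + x) transform. This mirrors the general idea behind library-size normalization in single-cell toolkits, but it is a plain-Python approximation under assumed defaults (a target of 10,000 counts per cell), not any library's exact implementation; the function name and toy matrix are made up.

<syntaxhighlight lang="python">
import math

def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts stay all-zero to avoid division by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

# Toy matrix: 3 cells x 4 genes (made-up counts).
toy_counts = [
    [0, 5, 0, 15],
    [2, 0, 0, 2],
    [0, 0, 0, 0],
]
result = normalize_log1p(toy_counts)
for row in result:
    print([round(v, 3) for v in row])
</syntaxhighlight>

Normalizing by total counts removes differences in sequencing depth between cells, and the log transform dampens the influence of a few very highly expressed genes.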

'''Polygenic Risk Scores (PRS)''': A PRS aggregates thousands of small-effect genetic variants into a single disease risk score. Modern PRS methods apply penalized regression (LASSO) and Bayesian shrinkage to GWAS summary statistics. PRS can help stratify risk for cardiovascular disease, type 2 diabetes, and schizophrenia years before clinical onset, although predictive accuracy at the individual level remains modest.
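
In its simplest form, the aggregation described above is a weighted sum: each variant's risk-allele dosage (0, 1, or 2 copies) is multiplied by its estimated effect size and the products are added. The sketch below shows this basic calculation only; the effect sizes, genotypes, and function name are made up for illustration, and real methods also adjust the weights for linkage disequilibrium.

<syntaxhighlight lang="python">
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages and per-variant effect sizes."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))

# Made-up effect sizes (log odds ratios) for three hypothetical variants.
betas = [0.10, -0.05, 0.20]

# Made-up genotypes: number of risk alleles carried at each variant.
individuals = {"person_a": [2, 0, 1], "person_b": [0, 2, 0]}

prs = {name: polygenic_risk_score(g, betas) for name, g in individuals.items()}
for name, score in prs.items():
    print(name, round(score, 3))
</syntaxhighlight>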
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Protein secondary structure prediction with a sequence transformer:'''
<syntaxhighlight lang="python">
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

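# ESM-2: Meta AI's protein language model (650M-parameter checkpoint).
# NOTE: this base checkpoint is NOT fine-tuned for secondary structure. The
# 3-class token-classification head created below is randomly initialized, so
# the printed labels are meaningless until the model is fine-tuned on
# per-residue Helix/Sheet/Coil annotations. The label order used further down
# is an illustrative assumption that must match the fine-tuning data.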
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForTokenClassification.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    num_labels=3  # H (helix), E (sheet), C (coil)
)

# Protein sequence
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL"
inputs = tokenizer(protein, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)[0][1:-1]  # Remove CLS/EOS

labels = {0: "H", 1: "E", 2: "C"}
structure = "".join([labels[p.item()] for p in predictions])
print(f"Sequence: {protein[:50]}...")
print(f"Structure: {structure[:50]}...")

# For a full 3D structure, AlphaFold2 can be run through ColabFold's
# command-line tool from a shell (assumes ColabFold is installed and
# protein.fasta contains the sequence above):
#   colabfold_batch protein.fasta ./structures/
</syntaxhighlight>

; Genomics AI tools and resources
: '''Protein structure''' → AlphaFold2/3 (ColabFold for easy access), ESMFold, RoseTTAFold
: '''Gene expression''' → Enformer (sequence → expression), Geneformer, scGPT
: '''Variant interpretation''' → DeepVariant (variant calling), PrimateAI (pathogenicity)
: '''CRISPR design''' → CRISPR-ML (off-target prediction), DeepCRISPR
: '''Genomics pipelines''' → Bioconductor (R), Scanpy/Seurat (single-cell), GATK (variant calling)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Genomics AI Application Maturity
! Application !! AI Approach !! Clinical Use !! Key Metric
|-
| Protein structure prediction || AlphaFold (transformer + MSA) || Drug design || TM-score (>0.5 generally indicates the same fold)
|-
| Variant pathogenicity || Ensemble ML, deep learning || Clinical genetics || AUC on ClinVar benchmarks
|-
| Cancer genomics || CNN on somatic mutations || Research/clinical || Driver gene identification accuracy
|-
| PRS disease risk || Penalized regression, Bayesian || Research → clinical || C-statistic (AUC)
|-
| scRNA-seq cell typing || Clustering + transfer learning || Research || F1 on held-out datasets
|-
| CRISPR off-target prediction || CNN, transformers || Research || AUROC on off-target sites
|}
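
The TM-score listed in the table can be sketched as follows. For a target of length L, it averages 1 / (1 + (d_i / d_0)^2) over aligned residue pairs, where d_i is the distance between the i-th pair and d_0 = 1.24 (L - 15)^(1/3) - 1.8 is a length-dependent scale. The full metric searches for the superposition that maximizes this sum; the simplified function below assumes the two structures are already superposed and aligned, and the name <code>tm_score</code> and the length guard are choices made for this illustration.

<syntaxhighlight lang="python">
def tm_score(distances, target_length):
    """TM-score for a fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the reference (target) structure.
    """
    if target_length < 19:
        # Below this length the d0 formula is not positive.
        raise ValueError("this sketch assumes a target of at least 19 residues")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    # Unaligned target residues contribute zero, so divide by the full length.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect match over all 100 residues scores 1.0.
print(tm_score([0.0] * 100, target_length=100))

# A perfect match over only half of the target scores 0.5.
print(tm_score([0.0] * 50, target_length=100))
</syntaxhighlight>

Normalizing by the target length rather than the number of aligned residues penalizes partial matches, which is why the score is comparable across proteins of different sizes.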

'''Failure modes''': Batch effects are systematic technical differences between sequencing batches that can dominate the biological signal. Genetic ancestry bias arises because GWAS and PRS trained predominantly on European-ancestry populations perform poorly for other groups. Data leakage from protein databases inflates reported accuracy: a model evaluated on sequences homologous to its training set can look strong yet perform poorly on truly novel proteins. Interpretation of non-coding variants remains extremely challenging.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Genomics AI evaluation requires domain-specific rigor: (1) '''Held-out proteins/genes''': ensure the test set contains no sequences homologous to the training set (for example, sequence identity below 30%). (2) '''Cross-ancestry validation''': evaluate PRS performance separately in different ancestry groups. (3) '''Wet lab validation''': computational predictions must be validated experimentally, comparing predicted outcomes against measured ones. (4) '''Benchmark databases''': CASP (protein structure), ClinVar (variant pathogenicity), GWAS Catalog (GWAS hits). (5) '''Interpretability''': identify which genomic features the model relies on; these should be consistent with known biology.
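
The cross-ancestry validation step can be sketched as computing a discrimination metric separately per group instead of once over the pooled cohort. The example below computes AUC from its pairwise definition (the probability that a randomly chosen case has a higher score than a randomly chosen control) within each group; all scores, outcomes, and group labels are made up, and the function names are hypothetical.

<syntaxhighlight lang="python">
from collections import defaultdict

def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def auc_by_group(scores, labels, groups):
    """Compute AUC separately within each group label."""
    buckets = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        buckets[g][0].append(s)
        buckets[g][1].append(y)
    return {g: auc(s, y) for g, (s, y) in buckets.items()}

# Made-up risk scores, outcomes (1 = case), and ancestry group labels.
risk_scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.4, 0.5, 0.7]
outcomes = [1, 1, 0, 0, 1, 0, 1, 0]
ancestry = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = auc_by_group(risk_scores, outcomes, ancestry)
print(per_group)
</syntaxhighlight>

In this toy data the score separates cases from controls perfectly in group A but is no better than chance in group B, a gap that a single pooled AUC would hide.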
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a genomics AI pipeline for disease risk prediction: (1) Data: obtain GWAS summary statistics from UK Biobank or a similar population cohort. (2) PRS construction: use PRS-CS or LDpred2 for Bayesian shrinkage estimation across all common variants. (3) Validation: validate the PRS in an independent cohort, including individuals from a different ancestry background. (4) Clinical integration: combine the PRS with clinical risk factors (age, BMI, family history) in a joint logistic regression model. (5) Calibration: recalibrate the model for the target population and generate risk percentile scores. (6) Ethical review: ensure compliance with genomic data privacy regulations (HIPAA, GDPR) and obtain IRB approval for clinical use.
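
Steps (4) and (5) can be sketched as applying a logistic model to a PRS plus clinical covariates and reporting where an individual's PRS falls within a reference cohort. The coefficients, feature names, and reference scores below are invented for illustration; in practice the coefficients would be fitted and recalibrated on real cohort data.

<syntaxhighlight lang="python">
import math
from bisect import bisect_right

def percentile_rank(value, reference_scores):
    """Percentage of reference scores less than or equal to value."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

def predicted_risk(features, coefficients, intercept):
    """Logistic model: sigmoid(intercept + sum of coefficient * feature)."""
    linear = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients (not fitted to any real data).
coefficients = {"prs_z": 0.5, "age_decades": 0.3, "bmi_over_25": 0.05}

# Hypothetical individual: standardized PRS, age in decades, BMI units above 25.
person = {"prs_z": 1.2, "age_decades": 5.5, "bmi_over_25": 3.0}
risk = predicted_risk(person, coefficients, intercept=-4.0)

# Made-up standardized PRS values from a reference cohort.
reference_prs = [-1.5, -0.7, -0.2, 0.0, 0.3, 0.8, 1.1, 1.9]
pct = percentile_rank(person["prs_z"], reference_prs)

print(f"Predicted risk: {risk:.3f}, PRS percentile: {pct:.1f}")
</syntaxhighlight>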

[[Category:Artificial Intelligence]]
[[Category:Genomics]]
[[Category:Bioinformatics]]
</div>