AI for Genomics and Bioinformatics
How to read this page: This article maps the topic from beginner to expert across six levels (Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating). Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
AI for genomics and bioinformatics applies machine learning to the analysis of biological sequence data — DNA, RNA, and proteins — to understand the genetic basis of life, disease, and evolution. The sequencing revolution has generated petabytes of genomic data; extracting biological meaning from this data requires sophisticated computational tools. AI now drives breakthroughs in predicting protein structure (AlphaFold), identifying disease-causing genetic variants, designing new genes and proteins, and understanding gene regulation. Genomics AI is transforming medicine, agriculture, and our fundamental understanding of biology.
Remembering
- Genome — The complete set of DNA in an organism, containing all genetic information.
- DNA sequencing — Determining the order of nucleotide bases (A, T, G, C) in a DNA molecule.
- Single Nucleotide Polymorphism (SNP) — A single base-pair variation in the genome; a typical human genome differs from the reference at roughly 4 to 5 million sites, most of them SNPs.
- GWAS (Genome-Wide Association Study) — A statistical study associating genetic variants with traits or diseases across many individuals.
- Variant calling — The computational process of identifying genetic variants from sequencing data.
- RNA-seq — Sequencing RNA molecules to measure gene expression levels across the genome.
- Gene expression — The process by which information from a gene is used to synthesize gene products (RNA, proteins).
- Protein structure prediction — Predicting the 3D shape of a protein from its amino acid sequence.
- AlphaFold — DeepMind's AI system for protein structure prediction; AlphaFold2 reached near-experimental accuracy on many targets at CASP14 (2020), a major advance on the 50-year protein folding problem.
- Sequence alignment — Comparing biological sequences to identify regions of similarity; fundamental to all bioinformatics.
- k-mer — A subsequence of length k; used to represent genomic sequences as features for ML models.
- Epigenomics — The study of heritable changes in gene expression not caused by DNA sequence changes; includes DNA methylation and histone modification.
- Single-cell sequencing — Measuring gene expression in individual cells rather than bulk tissue; reveals cellular heterogeneity.
- Polygenic risk score (PRS) — A weighted sum of many small-effect genetic variants, with weights derived from GWAS effect sizes, used to estimate disease risk.
- CRISPR — A gene editing technology; AI assists in designing guide RNAs and predicting off-target effects.
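To make the k-mer entry above concrete, here is a minimal sketch of counting overlapping k-mers so that a sequence can be turned into features for an ML model. The function name `kmer_counts` and the toy sequence are illustrative assumptions, not part of any specific library.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:  # ignore ambiguous bases such as N
            counts[kmer] += 1
    return counts


# Toy sequence, for illustration only
counts = kmer_counts("ATGCGATGCA", k=3)
print(counts.most_common(2))
```

The resulting counts can be arranged into a fixed-length vector (one slot per possible k-mer) before being passed to a classifier.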
Understanding
Genomics is inherently a machine learning problem: biological sequences are discrete strings over small alphabets (A, T, G, C for DNA; 20 amino acids for proteins), and the relationship between sequence and function is complex, non-linear, and learned from evolutionary history across billions of years.
- **The protein folding revolution**: For 50 years, predicting a protein's 3D structure from its amino acid sequence was considered one of biology's grand challenges. DeepMind's AlphaFold2 (2020) reached near-experimental accuracy on many targets by combining multiple sequence alignments with attention-based and equivariant network modules, trained on experimentally determined structures from the Protein Data Bank. The AlphaFold Protein Structure Database now holds over 200 million predicted structures, covering nearly all catalogued protein sequences, and has become a widely used resource in drug discovery.
- **Genomic language models**: Transformers and related sequence models pre-trained on DNA learn representations of genomic function. DNABERT and Nucleotide Transformer use BERT-style masked pre-training, while Evo is an autoregressive model trained on large collections of prokaryotic and phage genomes; these models report strong results on diverse genomic prediction tasks. Enformer, a supervised model rather than a pre-trained language model, predicts gene expression from DNA sequence by learning the regulatory grammar encoded in non-coding regions.
- **Single-cell AI**: scRNA-seq measures gene expression in individual cells, generating sparse, high-dimensional count matrices (cells × genes). Toolkits such as Seurat and Scanpy cluster cells into putative types; foundation models for single-cell data (scGPT, Geneformer) are designed to support cell type annotation with little or no task-specific training, perturbation prediction, and drug response modeling.
- **Polygenic Risk Scores (PRS)**: A PRS aggregates thousands of small-effect genetic variants into a single disease risk score. Modern PRS methods apply penalized regression (LASSO) and Bayesian shrinkage to GWAS summary statistics. PRS can stratify risk for cardiovascular disease, type 2 diabetes, and schizophrenia years before clinical onset, although individual-level predictive accuracy remains modest.
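In its simplest form, the PRS described above is a weighted sum of allele dosages. The sketch below assumes per-variant effect sizes (for example, log odds ratios from GWAS summary statistics) are already available. The variant IDs, weights, and function name are hypothetical, and real methods add shrinkage and linkage-disequilibrium handling on top of this.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Return sum(beta_j * dosage_j) over variants present in both inputs.

    dosages: variant ID -> count of effect alleles (0, 1, or 2)
    effect_sizes: variant ID -> per-allele effect size (beta)
    """
    shared = dosages.keys() & effect_sizes.keys()
    return sum(effect_sizes[v] * dosages[v] for v in shared)


# Hypothetical weights and genotypes, for illustration only
effect_sizes = {"var_a": 0.12, "var_b": -0.05, "var_c": 0.30}
dosages = {"var_a": 2, "var_b": 1, "var_c": 0}

score = polygenic_risk_score(dosages, effect_sizes)
print(f"PRS = {score:.2f}")
```

Raw scores like this are usually standardized against a reference cohort before interpretation, since the absolute value depends on which variants are included.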
Applying
Protein secondary structure prediction with a sequence transformer:
<syntaxhighlight lang="python">
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# ESM-2: Meta AI's protein language model.
# NOTE: loading the base checkpoint with num_labels=3 adds a randomly
# initialized classification head. Fine-tune it on labeled secondary
# structure data (Helix/Sheet/Coil per residue) before trusting the output.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForTokenClassification.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    num_labels=3,  # H (helix), E (sheet), C (coil)
)
model.eval()

# Protein sequence
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL"
inputs = tokenizer(protein, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Drop the special tokens the tokenizer adds at both ends (CLS/EOS)
predictions = outputs.logits.argmax(dim=-1)[0][1:-1]

labels = {0: "H", 1: "E", 2: "C"}
structure = "".join(labels[p.item()] for p in predictions)
print(f"Sequence:  {protein[:50]}...")
print(f"Structure: {structure[:50]}...")

# AlphaFold for full 3D structure (use via ColabFold for accessible inference),
# e.g. with the ColabFold command-line tool:
#   colabfold_batch protein.fasta ./structures/
</syntaxhighlight>
Genomics AI tools and resources:
| Task | Tools |
|---|---|
| Protein structure | AlphaFold2/3 (ColabFold for easy access), ESMFold, RoseTTAFold |
| Gene expression | Enformer (sequence → expression), Geneformer, scGPT |
| Variant interpretation | DeepVariant (variant calling), PrimateAI (pathogenicity) |
| CRISPR design | CRISPR-ML (off-target prediction), DeepCRISPR |
| Genomics pipelines | Bioconductor (R), Scanpy/Seurat (single-cell), GATK (variant calling) |
Analyzing
| Application | AI Approach | Clinical Use | Key Metric |
|---|---|---|---|
| Protein structure prediction | AlphaFold (transformer + MSA) | Drug design | TM-score (>0.7 = reliable) |
| Variant pathogenicity | Ensemble ML, deep learning | Clinical genetics | AUC on ClinVar benchmarks |
| Cancer genomics | CNN on somatic mutations | Research/clinical | Driver gene identification accuracy |
| PRS disease risk | Penalized regression, Bayesian | Research → clinical | C-statistic (AUC) |
| scRNA-seq cell typing | Clustering + transfer learning | Research | F1 on held-out datasets |
| CRISPR off-target prediction | CNN, transformers | Research | AUROC on off-target sites |
Failure modes:
- **Batch effects**: systematic technical differences between sequencing batches can dominate the biological signal.
- **Genetic ancestry bias**: GWAS and PRS trained predominantly on European-ancestry populations perform poorly for other groups.
- **Data leakage from protein databases**: when homologs of test proteins appear in the training data, benchmark scores overstate performance, and the model can then perform poorly on truly novel proteins.
- **Non-coding variants**: interpreting variants outside protein-coding regions remains extremely challenging.
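The leakage from homologous sequences mentioned above can be screened for before training. The sketch below is a deliberately crude stand-in: it flags test sequences whose k-mer sets overlap heavily with any training sequence, using Jaccard similarity. The function names, the toy sequences, and the 0.5 threshold are assumptions for illustration; this is not equivalent to a sequence-identity cutoff, and in practice clustering tools such as MMseqs2 or CD-HIT are the usual choice.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def flag_leaky_test_sequences(train, test, threshold=0.5, k=3):
    """Return test sequences that look too similar to any training sequence."""
    return [t for t in test
            if any(jaccard(t, s, k) >= threshold for s in train)]


# Toy sequences, for illustration only
train = ["MKTAYIAKQRQISFVK"]
test = ["MKTAYIAKQRQISFVR", "GGSPLWWDDEENNHHC"]
flagged = flag_leaky_test_sequences(train, test)
print(flagged)
```

Flagged sequences would be moved out of the test set (or their training neighbors removed) before reporting benchmark results.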
Evaluating
Genomics AI evaluation requires domain-specific rigor: (1) **Held-out proteins/genes**: ensure the test set contains no sequences homologous to the training set (for example, sequence identity below 30%). (2) **Cross-ancestry validation**: evaluate PRS performance separately in each ancestry group. (3) **Wet lab validation**: computational predictions must be validated experimentally by comparing predicted outcomes against measured ones. (4) **Benchmark databases**: CASP (protein structure), ClinVar (variant pathogenicity), GWAS Catalog (GWAS hits). (5) **Interpretability**: identify which genomic features the model relies on; they should match known biology.
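Cross-ancestry validation, as listed above, amounts to computing the same metric separately per group rather than pooling everyone. The following is a minimal sketch using a rank-based AUC; the group labels and scores are made-up toy values, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def auc(scores, labels):
    """Rank-based AUC: probability that a random case outscores a random
    control, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """Compute AUC separately within each group label."""
    buckets = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        buckets[g][0].append(s)
        buckets[g][1].append(y)
    return {g: auc(s, y) for g, (s, y) in buckets.items()}


# Toy data, for illustration only
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4, 0.5, 0.6]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["cohort_A"] * 4 + ["cohort_B"] * 4

per_group = auc_by_group(scores, labels, groups)
print(per_group)
```

A large gap between groups, as in this toy data, is the signal that pooled metrics would hide.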
Creating
Designing a genomics AI pipeline for disease risk prediction: (1) Data: obtain GWAS summary statistics from UK Biobank or a similar population cohort. (2) PRS construction: use PRS-CS or LDpred2 for Bayesian shrinkage estimation across all common variants. (3) Validation: validate the PRS in an independent cohort, including one with a different ancestry background. (4) Clinical integration: combine the PRS with clinical risk factors (age, BMI, family history) in a joint logistic regression model. (5) Calibration: recalibrate the model for the target population and generate risk percentile scores. (6) Ethical review: ensure compliance with genomic data privacy regulations (HIPAA, GDPR) and obtain IRB approval for clinical use.
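Steps (4) and (5) can be sketched as follows, assuming a joint logistic model has already been fitted elsewhere. The coefficient values below are hypothetical placeholders rather than estimates from any real cohort, and the function names are illustrative.

```python
import math
from bisect import bisect_right


def logistic_risk(features, coef):
    """Joint logistic model: sigmoid(intercept + sum(coef_j * feature_j))."""
    z = coef["intercept"] + sum(coef[name] * value
                                for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def risk_percentile(risk, reference_risks):
    """Percentage of the reference cohort with risk at or below this value."""
    ref = sorted(reference_risks)
    return 100.0 * bisect_right(ref, risk) / len(ref)


# Hypothetical coefficients (NOT fitted values), for illustration only
coef = {"intercept": -5.0, "prs_z": 0.5, "age": 0.05,
        "bmi": 0.03, "family_history": 0.7}

person = {"prs_z": 1.2, "age": 55, "bmi": 27.0, "family_history": 1}
risk = logistic_risk(person, coef)

# Toy reference cohort risks, for illustration only
reference = [0.05, 0.10, 0.20, 0.40, 0.60]
print(f"risk={risk:.3f}, percentile={risk_percentile(risk, reference):.0f}")
```

Recalibration for a target population would refit at least the intercept on local data before percentiles are reported.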