AI for Drug Discovery
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
AI for drug discovery applies machine learning to accelerate the identification and development of new medicines. Traditional drug discovery takes 12–15 years and costs over $2 billion per approved drug, with a failure rate exceeding 90%. AI is transforming every stage: predicting which molecules will bind to disease targets, forecasting ADMET properties (absorption, distribution, metabolism, excretion, toxicity) of drug candidates, designing entirely new molecular structures, repurposing existing drugs for new diseases, and identifying patient subgroups most likely to benefit from treatments. AlphaFold's protein structure revolution has opened entirely new possibilities for structure-based drug design.
Remembering
- Drug discovery pipeline — Stages from target identification through candidate selection, preclinical testing, clinical trials, and approval.
- Target — The biological molecule (typically a protein such as an enzyme or receptor) whose modulation may treat a disease.
- Lead compound — A molecule showing promising drug activity that will be optimized.
- ADMET — Absorption, Distribution, Metabolism, Excretion, Toxicity: the pharmacokinetic and safety properties a drug candidate must satisfy.
- Molecular docking — Computational prediction of how a small molecule binds to a target protein.
- Virtual screening — Computationally screening large libraries of molecules against a target before expensive laboratory testing.
- QSAR (Quantitative Structure-Activity Relationship) — ML models predicting biological activity from molecular structure.
- Generative chemistry — Using generative models (VAEs, GANs, reinforcement learning, diffusion) to design novel drug-like molecules.
- Drug repurposing — Finding new therapeutic uses for existing approved drugs.
- Graph Neural Network (molecular) — GNNs treating molecules as graphs (atoms=nodes, bonds=edges) for property prediction.
- SMILES — Simplified Molecular Input Line Entry System; a text representation of molecular structure.
- AlphaFold (drug design) — DeepMind's protein structure prediction used to identify binding sites and design drugs targeting previously undruggable proteins.
- Clinical trial AI — ML for trial design, patient recruitment, endpoint selection, and outcome prediction.
- Biomarker discovery — Identifying molecular signatures predicting disease or drug response.
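To make the SMILES entry above concrete, the following is a minimal sketch of how a SMILES string can be split into tokens before it is fed to a text-based model. The regular expression and the helper names `tokenize_smiles` and `count_heavy_atoms` are illustrative assumptions, not part of any particular library, and the pattern covers only common SMILES syntax.

```python
import re

# Illustrative pattern (an assumption, not a library API): bracket atoms,
# two-letter halogens, organic-subset atoms, aromatic atoms, bonds,
# branches, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse input the simple pattern cannot fully cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def count_heavy_atoms(tokens: list) -> int:
    """Count atom tokens; hydrogens are implicit in these examples.

    Simplification: an explicit [H] bracket atom would be miscounted.
    """
    return sum(1 for t in tokens if t[0].isalpha() or t.startswith("["))


paracetamol_tokens = tokenize_smiles("CC(=O)Nc1ccc(O)cc1")
print(paracetamol_tokens)
print(count_heavy_atoms(paracetamol_tokens))  # 11 (8 carbons, 1 nitrogen, 2 oxygens)
```

Lowercase letters mark aromatic atoms and digits mark ring closures, which is why `c1ccc(O)cc1` describes a substituted benzene ring. Production code would normally rely on a cheminformatics toolkit such as RDKit for parsing and validation rather than a hand-written pattern.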
Understanding
Drug discovery is a needle-in-a-haystack problem: chemical space is estimated to contain 10^60 drug-like molecules. Testing even a tiny fraction experimentally is impossible. AI reduces this search space by learning from known active compounds which molecular features predict activity.
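The scale of the problem can be checked with a few lines of arithmetic. The sketch below takes the 10^60 estimate from the text and combines it with two assumed figures that are chosen only for illustration: a screening throughput of one million compounds per day and a virtual library of one billion molecules.

```python
# Estimate quoted in the text.
CHEMICAL_SPACE = 10**60

# Hypothetical figures, chosen only to illustrate the order of magnitude.
ASSUMED_COMPOUNDS_PER_DAY = 10**6   # assumed screening throughput
ASSUMED_LIBRARY_SIZE = 10**9        # assumed size of a large virtual library

# Time needed to test every molecule at the assumed throughput.
years_to_screen_all = CHEMICAL_SPACE / (ASSUMED_COMPOUNDS_PER_DAY * 365)

# Share of chemical space covered by the assumed library.
fraction_covered = ASSUMED_LIBRARY_SIZE / CHEMICAL_SPACE

print(f"Years to screen everything: {years_to_screen_all:.2e}")   # 2.74e+51
print(f"Fraction covered by the library: {fraction_covered:.0e}")  # 1e-51
```

Even under these generous assumptions, exhaustive testing is out of reach by dozens of orders of magnitude, which is why learned models are used to decide which molecules are worth testing at all.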
- **Structure-based drug design with AlphaFold**: Structure-based design requires the 3D structure of the target protein, which traditionally had to be determined by X-ray crystallography or cryo-EM, both slow and expensive. AlphaFold predicts protein structures computationally, opening structure-based design for thousands of targets that previously had no experimental structure. Docking programs, including ML-based ones, then predict how candidate molecules bind to the predicted structure.
- **Molecular GNNs for property prediction**: Molecules are naturally represented as graphs, with atoms as nodes and bonds as edges. GNNs trained on datasets of molecules with measured properties (toxicity, solubility, activity) can predict those properties for new, untested molecules. Message-passing models such as MPNN, and 3D-aware models such as SchNet and DimeNet that also use atomic coordinates, approach chemical accuracy on some quantum-chemical properties (for example on the QM9 benchmark).
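- **Generative molecular design**: Rather than screening an existing library, generative models design novel molecules with desired properties. Approaches include VAEs (encode known molecules into a latent space, then generate new ones by sampling or interpolating), RL (reward generated molecules for having target properties), and diffusion models (denoising diffusion, as in DDPM, applied to molecular graphs). On the protein side, RFdiffusion (from the Baker lab at the University of Washington) generates novel protein backbones designed to bind a chosen target, while AlphaFold 3 (Google DeepMind and Isomorphic Labs) predicts the structures of complexes between proteins and other molecules, including small-molecule ligands.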
- **Drug repurposing with knowledge graphs**: A knowledge graph connects drugs, targets, diseases, genes, and pathways, and GNNs trained on it predict new drug-disease links. Baricitinib, originally approved for rheumatoid arthritis, was identified as a potential COVID-19 treatment through AI-driven knowledge-graph analysis and was later supported by clinical trials.
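The graph view of a molecule described above (atoms as nodes, bonds as edges) can be sketched in plain Python. The example below runs one round of message passing on ethanol (SMILES `CCO`) using sum aggregation and no learned weights, which is a deliberate simplification: real molecular GNNs apply trained transformations at each step. The function names and the two-element one-hot atom encoding are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def build_adjacency(n_atoms: int, bonds: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Turn a bond list into an undirected neighbor map."""
    neighbors: Dict[int, List[int]] = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def message_passing_round(
    features: List[List[float]], neighbors: Dict[int, List[int]]
) -> List[List[float]]:
    """Update each atom as its own features plus the sum of its neighbors' features."""
    updated = []
    for v, h_v in enumerate(features):
        new_h = list(h_v)
        for u in neighbors[v]:
            for k, value in enumerate(features[u]):
                new_h[k] += value
        updated.append(new_h)
    return updated


def readout(features: List[List[float]]) -> List[float]:
    """Sum atom vectors into a single molecule-level vector."""
    return [sum(column) for column in zip(*features)]


# Ethanol heavy atoms: C0 - C1 - O2 (hydrogens left implicit).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
ATOM_TYPES = ["C", "O"]  # one-hot encoding order (illustrative)

features = [[1.0 if a == t else 0.0 for t in ATOM_TYPES] for a in atoms]
neighbors = build_adjacency(len(atoms), bonds)
hidden = message_passing_round(features, neighbors)
graph_vector = readout(hidden)

print(hidden)        # [[2.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
print(graph_vector)  # [5.0, 2.0]
```

After one round, the two carbon atoms already have different vectors because only one of them is bonded to oxygen. Stacking several rounds lets each atom's representation reflect a larger neighborhood, and the molecule-level vector is what a prediction head would consume.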
Applying
Molecular property prediction with DeepChem and a GNN:
<syntaxhighlight lang="python">
import deepchem as dc
import numpy as np
from deepchem.models import AttentiveFPModel

# AttentiveFP uses bond (edge) features, so featurize with use_edges=True.
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)

# HIV dataset: predict HIV inhibition from molecular structure (SMILES)
tasks, datasets, transformers = dc.molnet.load_hiv(
    featurizer=featurizer,   # Graph representation of molecules
    splitter='scaffold'      # Split by molecular scaffold (harder, realistic)
)
train, valid, test = datasets

# Attentive FP (Fingerprint): attention-based molecular graph network
model = AttentiveFPModel(
    n_tasks=1,
    num_layers=4,
    graph_feat_size=200,
    dropout=0.2,
    learning_rate=1e-3,
    mode='classification'
)
model.fit(train, nb_epoch=50)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(f"Train AUC-ROC: {model.evaluate(train, [metric])['roc_auc_score']:.3f}")
print(f"Test AUC-ROC: {model.evaluate(test, [metric])['roc_auc_score']:.3f}")

# Predict properties for new SMILES strings
new_smiles = ["CC(=O)Nc1ccc(O)cc1",                    # Paracetamol (acetaminophen)
              "CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C"]   # Testosterone
X_new = featurizer.featurize(new_smiles)  # Same featurizer as used for training
preds = model.predict_on_batch(X_new)
for smi, pred in zip(new_smiles, preds):
    # The last entry of each prediction is the probability of the active class.
    print(f"{smi[:30]}... → HIV inhibition probability: {np.ravel(pred)[-1]:.3f}")