Bayesian ML
How to read this page: This article maps the topic from beginner to expert across six levels — Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Bayesian machine learning is a probabilistic approach to AI that treats model parameters and predictions as probability distributions rather than fixed point estimates. Instead of asking "what is the best single model?" Bayesian methods ask "what is the distribution over models given the data?" This enables principled uncertainty quantification, systematic incorporation of prior knowledge, and coherent updating of beliefs as new data arrives. Bayesian methods are foundational to Gaussian processes, probabilistic programming, active learning, and Bayesian optimization — and increasingly relevant as AI systems are deployed in high-stakes settings that demand calibrated confidence estimates.
Remembering
- Prior distribution — A probability distribution over model parameters encoding beliefs before seeing any data; P(θ).
- Likelihood — The probability of observing the data given model parameters; P(D|θ).
- Posterior distribution — Updated beliefs about parameters after observing data; P(θ|D) ∝ P(D|θ)P(θ).
- Bayes' theorem — P(θ|D) = P(D|θ)P(θ) / P(D); the foundation of Bayesian inference.
- Marginal likelihood (evidence) — P(D) = ∫ P(D|θ)P(θ)dθ; normalizing constant; used for model selection.
- Posterior predictive — Predictions averaged over the posterior: P(y|x, D) = ∫ P(y|x, θ)P(θ|D)dθ.
- Conjugate prior — A prior whose posterior has the same functional form; enables closed-form updates.
- MCMC (Markov Chain Monte Carlo) — A family of sampling algorithms for approximating intractable posterior distributions.
- Variational inference (VI) — Approximates the posterior with a simpler distribution by minimizing KL divergence; faster than MCMC.
- Gaussian process (GP) — A non-parametric Bayesian model defining a prior over functions; exactly tractable for regression.
- Bayesian neural network (BNN) — A neural network with distributions over weights, enabling uncertainty estimation.
- Monte Carlo Dropout — Approximates Bayesian uncertainty by running inference with dropout enabled at test time.
- Bayesian optimization — Using a probabilistic surrogate model (usually GP) to optimize expensive black-box functions.
- Epistemic uncertainty — Uncertainty due to lack of knowledge (data); can be reduced with more data.
- Aleatoric uncertainty — Irreducible uncertainty from inherent randomness in the data-generating process.
Understanding
The Bayesian framework offers a coherent solution to a fundamental problem: how should we update our beliefs given evidence? The answer is Bayes' theorem: start with a prior, multiply by the likelihood of the observed data, and normalize to get the posterior.
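This update is easiest to see with a conjugate pair, where the posterior has a closed form. Below is a minimal sketch using a hypothetical coin-flip experiment (the prior parameters and observed counts are illustrative): a Beta prior combined with a Bernoulli/Binomial likelihood yields a Beta posterior.

```python
# Beta-Binomial conjugate update (illustrative numbers).
# Prior: Beta(a=2, b=2), a mild belief that the coin is roughly fair.
a, b = 2.0, 2.0

# Observed data: 7 heads in 10 flips.
heads, tails = 7, 3

# Conjugacy: the posterior is Beta(a + heads, b + tails) -- no integration needed.
post_a, post_b = a + heads, b + tails

# Posterior mean sits between the raw frequency (0.7) and the prior mean (0.5).
posterior_mean = post_a / (post_a + post_b)  # 9/14 ≈ 0.643
```

The posterior mean being pulled from 0.7 toward 0.5 is the prior acting as regularization; with more data, the likelihood dominates and the prior's influence shrinks.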
The frequentist vs. Bayesian divide: Frequentist ML (standard deep learning) treats model parameters as fixed unknowns estimated from data. Bayesian ML treats parameters as random variables with distributions — capturing our uncertainty about the true values. A single point estimate (e.g., maximum likelihood) discards this uncertainty information.
Why uncertainty matters: A model that outputs "99% confidence: benign tumor" should actually be correct 99% of the time, not just have the highest logit. In medical AI, an overconfident prediction is dangerous. Bayesian methods provide principled calibration.
Gaussian processes: A GP is a distribution over functions. Given a kernel (covariance function) and training data, GPs provide exact posterior distributions over function values, including uncertainty bounds. GPs are the backbone of Bayesian optimization: fit a GP to evaluated function values, use the posterior to identify where to evaluate next (acquisition function).
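The "exact posterior" claim can be made concrete with the standard GP regression equations, sketched here in NumPy (the kernel, lengthscale, noise level, and data are all illustrative choices):

```python
import numpy as np

def rbf(x1, x2, lengthscale=0.2):
    # Squared-exponential kernel: k(x, x') = exp(-(x - x')^2 / (2 * l^2))
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * lengthscale ** 2))

x_train = np.array([0.1, 0.4, 0.8])
y_train = np.sin(2 * np.pi * x_train)
noise = 1e-4  # observation noise variance

# Closed-form GP posterior: mean = K* K^-1 y, cov = K** - K* K^-1 K*^T
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
x_test = np.linspace(0, 1, 5)
K_star = rbf(x_test, x_train)
mean = K_star @ np.linalg.solve(K, y_train)
cov = rbf(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The standard deviation `std` is smallest near the training inputs and grows in regions with no data, which is exactly the signal an acquisition function exploits when choosing the next evaluation point.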
The computational challenge: Computing exact posteriors requires integrating over all parameters, which is intractable for neural networks. Common workarounds: MCMC (asymptotically exact but slow), variational inference (fast but approximate), the Laplace approximation (a Gaussian fit around the MAP estimate), and Monte Carlo Dropout (a cheap practical approximation).
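Of these, Monte Carlo Dropout is the simplest to try. A minimal PyTorch sketch (the toy architecture and sample count are illustrative): keep dropout stochastic at test time and treat the spread of repeated forward passes as an approximate epistemic-uncertainty estimate.

```python
import torch
import torch.nn as nn

# Illustrative regressor with a dropout layer
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # train mode keeps dropout layers stochastic at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Predictive mean and a crude epistemic-uncertainty estimate
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.linspace(-1, 1, 10).unsqueeze(-1)
mean, std = mc_dropout_predict(model, x)
```

Note the caveat from the Analyzing section below: this estimate tends to understate uncertainty on out-of-distribution inputs, so it is a pragmatic baseline rather than a full Bayesian treatment.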
Applying
Gaussian process regression with uncertainty bounds: <syntaxhighlight lang="python">
import numpy as np
import torch
import gpytorch

# Training data: noisy samples of a sine function
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(train_x * 2 * np.pi) + torch.randn_like(train_x) * 0.1

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Train GP hyperparameters by maximizing the marginal log-likelihood
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Predict with uncertainty
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 200)
    pred = likelihood(model(test_x))
    mean = pred.mean
    lower, upper = pred.confidence_region()  # ±2σ bounds
</syntaxhighlight>
Bayesian ML tool selection:
- Regression with uncertainty → Gaussian processes (GPyTorch, scikit-learn GP)
- Hyperparameter optimization → Bayesian optimization (Optuna, BoTorch, SMAC)
- BNN approximation → Monte Carlo Dropout, Laplace Redux, SWAG
- Probabilistic programming → PyMC, Stan, Pyro, NumPyro
- Production calibration → Temperature scaling (post-hoc), Platt scaling
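On the Bayesian optimization entry above: the surrogate's posterior mean and standard deviation feed an acquisition function. A common choice is expected improvement, sketched here for minimization (the function name and the `xi` exploration parameter are illustrative conventions, not a specific library's API):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far, xi=0.01):
    # EI for minimization: E[max(best - f(x) - xi, 0)] under N(mean, std^2)
    imp = best_so_far - mean - xi
    z = imp / np.maximum(std, 1e-12)
    ei = imp * norm.cdf(z) + std * norm.pdf(z)
    # Zero uncertainty means zero expected improvement
    return np.where(std > 1e-12, ei, 0.0)

# Two candidates with equal uncertainty: the lower predicted mean scores higher
ei = expected_improvement(np.array([0.0, 0.5]), np.array([1.0, 1.0]), best_so_far=1.0)
```

The next evaluation point is simply the candidate maximizing `ei`, which trades off exploiting low predicted values against exploring high-uncertainty regions.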
Analyzing
| Method | Calibration Quality | Computational Cost | Epistemic vs. Aleatoric |
|---|---|---|---|
| Monte Carlo Dropout | Moderate | Low (N forward passes) | Both (mixed) |
| Deep Ensembles | High | High (N models) | Both (separated) |
| Gaussian Process | High | High for large data | Epistemic |
| MCMC BNN | Highest | Very high | Both |
| Temperature scaling | Calibration only | Very low | Neither (post-hoc) |
Failure modes: Misspecified priors can dominate when data is sparse, leading to incorrect posteriors. GP regression scales as O(n³) in data points — intractable at scale without sparse approximations. Monte Carlo Dropout underestimates uncertainty on OOD inputs. Deep ensembles are expensive to train and serve. Overconfident predictions from poorly calibrated frequentist models are the most common real-world failure.
Evaluating
Calibration is the key evaluation criterion:
- Reliability diagrams: bin predictions by confidence, plot mean confidence vs. actual accuracy per bin. Perfect calibration = diagonal line.
- Expected Calibration Error (ECE): weighted average of calibration error across bins.
- Brier score: proper scoring rule combining accuracy and calibration.
- Negative log-likelihood (NLL): proper scoring rule penalizing overconfident wrong predictions.
Expert practitioners evaluate calibration separately on in-distribution and OOD data — models are often well-calibrated in-distribution but dramatically overconfident on OOD inputs.
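A minimal ECE implementation in NumPy (equal-width bins; the bin count and binning scheme are common defaults rather than canonical choices):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    # Bin predictions by confidence, then take the weighted average of
    # |mean confidence - accuracy| over the non-empty bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece

# Toy check: 80% confidence with 80% accuracy is perfectly calibrated
conf = np.full(10, 0.8)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
ece = expected_calibration_error(conf, correct)
```

The same binning is what a reliability diagram plots; ECE just collapses the diagram into a single number, which makes it convenient to track but blind to where on the confidence axis the miscalibration occurs.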
Creating
Designing a calibrated Bayesian ML pipeline:
- For regression with small-to-medium data (<10k points): use a GP with an RBF or Matérn kernel for exact posterior inference and calibrated uncertainty (assuming a reasonably specified kernel).
- For classification at scale: train deep ensemble of 5 models — strong calibration, high cost.
- For single-model practical calibration: train standard model, apply temperature scaling on validation set.
- For hyperparameter tuning: replace grid search with Bayesian optimization (Optuna TPE or BoTorch GP-BO).
- Monitor calibration in production: track reliability diagrams weekly and alert on ECE degradation.
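The temperature-scaling step in the pipeline above can be sketched in a few lines of PyTorch (the validation logits here are synthetic stand-ins; in practice they come from a held-out validation set):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Synthetic stand-ins for overconfident validation logits and their labels
logits = torch.randn(500, 5) * 4.0
labels = torch.randint(0, 5, (500,))

# Learn a single scalar temperature T by minimizing validation NLL.
# Optimizing log T keeps T positive throughout.
log_t = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50,
                              line_search_fn="strong_wolfe")

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(logits / log_t.exp(), labels)
    loss.backward()
    return loss

optimizer.step(closure)
T = log_t.exp().item()

nll_before = F.cross_entropy(logits, labels).item()
nll_after = F.cross_entropy(logits / T, labels).item()
```

Dividing all logits by the same T > 1 softens the softmax without changing the argmax, so accuracy is untouched while confidence is pulled toward the model's true hit rate.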