Bayesian ML
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Bayesian machine learning is a probabilistic approach to AI that treats model parameters and predictions as probability distributions rather than fixed point estimates. Instead of asking "what is the best single model?" Bayesian methods ask "what is the distribution over models given the data?" This enables principled uncertainty quantification, systematic incorporation of prior knowledge, and coherent updating of beliefs as new data arrives. Bayesian methods are foundational to Gaussian processes, probabilistic programming, active learning, and Bayesian optimization — and increasingly relevant as AI systems are deployed in high-stakes settings that demand calibrated confidence estimates.
</div>


__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Prior distribution''' — A probability distribution over model parameters encoding beliefs before seeing any data; P(θ).
* '''Likelihood''' — The probability of observing the data given model parameters; P(D|θ).
* '''Posterior distribution''' — Updated beliefs about parameters after observing data; P(θ|D) ∝ P(D|θ)P(θ).
* '''Bayes' theorem''' — P(θ|D) = P(D|θ)P(θ) / P(D); the foundation of Bayesian inference.
* '''Marginal likelihood (evidence)''' — P(D) = ∫ P(D|θ)P(θ)dθ; the normalizing constant; used for model selection.
* '''Posterior predictive''' — Predictions averaged over the posterior: P(y|x, D) = ∫ P(y|x, θ)P(θ|D)dθ.
* '''Conjugate prior''' — A prior whose posterior has the same functional form; enables closed-form updates.
* '''MCMC (Markov chain Monte Carlo)''' — A family of sampling algorithms for approximating intractable posterior distributions.
* '''Variational inference (VI)''' — Approximates the posterior with a simpler distribution by minimizing KL divergence; faster than MCMC.
* '''Gaussian process (GP)''' — A non-parametric Bayesian model defining a prior over functions; exactly tractable for regression.
* '''Bayesian neural network (BNN)''' — A neural network with distributions over weights, enabling uncertainty estimation.
* '''Monte Carlo Dropout''' — Approximates Bayesian uncertainty by running inference with dropout enabled at test time.
* '''Bayesian optimization''' — Using a probabilistic surrogate model (usually a GP) to optimize expensive black-box functions.
* '''Epistemic uncertainty''' — Uncertainty due to lack of knowledge (data); can be reduced with more data.
* '''Aleatoric uncertainty''' — Irreducible uncertainty from inherent randomness in the data-generating process.
</div>


<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The Bayesian framework offers a coherent solution to a fundamental problem: how should we update our beliefs given evidence? The answer is Bayes' theorem: start with a prior, multiply by the likelihood of the observed data, and normalize to get the posterior.
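As a concrete illustration (not from the article's own code), a conjugate Beta-Bernoulli model makes that recipe a one-liner: the posterior keeps the prior's functional form, so updating is just adding counts. The coin-flip numbers below are invented:
<syntaxhighlight lang="python">
from scipy import stats

# Prior Beta(2, 2): mild belief that a coin is roughly fair.
alpha, beta = 2.0, 2.0
heads, tails = 8, 2  # observed flips (made-up counts)

# Conjugacy gives the posterior in closed form: Beta(alpha + heads, beta + tails).
posterior = stats.beta(alpha + heads, beta + tails)

print(f"posterior mean: {posterior.mean():.3f}")            # ~0.714
print(f"95% credible interval: {posterior.interval(0.95)}")
</syntaxhighlight>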


'''The frequentist vs. Bayesian divide''': Frequentist ML (standard deep learning) treats model parameters as fixed unknowns estimated from data. Bayesian ML treats parameters as random variables with distributions — capturing our uncertainty about the true values. A single point estimate (e.g., maximum likelihood) discards this uncertainty information.

'''Why uncertainty matters''': A model that outputs "99% confidence: benign tumor" should actually be correct 99% of the time, not just have the highest logit. In medical AI, an overconfident prediction is dangerous. Bayesian methods provide principled calibration.

'''Gaussian processes''': A GP is a distribution over functions. Given a kernel (covariance function) and training data, GPs provide exact posterior distributions over function values, including uncertainty bounds. GPs are the backbone of Bayesian optimization: fit a GP to evaluated function values, then use the posterior to decide where to evaluate next (acquisition function).


'''The computational challenge''': Computing exact posteriors requires integrating over all parameters, which is intractable for neural networks. Practical workarounds: MCMC (asymptotically exact but slow), variational inference (fast but approximate), the Laplace approximation (a quadratic approximation around the MAP estimate), and Monte Carlo Dropout (a cheap, practical approximation).
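Of the four, Monte Carlo Dropout is the cheapest to try. A minimal sketch; the toy model, layer sizes, and data below are placeholders rather than anything from this article:
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # train mode keeps dropout stochastic (eval mode would disable it)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and spread

mean, std = mc_dropout_predict(model, torch.randn(5, 10))
</syntaxhighlight>
Note that <code>model.train()</code> also flips layers like batch norm, so real models usually re-enable only the dropout modules at test time.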
</div>


<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Gaussian process regression with uncertainty bounds:'''
<syntaxhighlight lang="python">
import numpy as np
import torch
import gpytorch

# Training data: a noisy sine curve
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(train_x * 2 * np.pi) + torch.randn_like(train_x) * 0.1

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean = gpytorch.means.ConstantMean()
        self.covar = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean(x), self.covar(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Train GP hyperparameters by maximizing the marginal log-likelihood
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Predict with uncertainty
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 200)
    pred = likelihood(model(test_x))
    mean = pred.mean
    lower, upper = pred.confidence_region()  # ±2σ bounds
</syntaxhighlight>

'''Bayesian ML tool selection'''
: '''Regression with uncertainty''' → Gaussian processes (GPyTorch, scikit-learn GP)
: '''Hyperparameter optimization''' → Bayesian optimization (Optuna, BoTorch, SMAC)
: '''BNN approximation''' → Monte Carlo Dropout, Laplace Redux, SWAG
: '''Probabilistic programming''' → PyMC, Stan, Pyro, NumPyro
: '''Production calibration''' → Temperature scaling (post-hoc), Platt scaling
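Temperature scaling (the production-calibration row above) takes only a few lines. A hedged sketch, assuming you already have validation logits and labels as PyTorch tensors; the names <code>val_logits</code> and <code>val_labels</code> are illustrative:
<syntaxhighlight lang="python">
# Post-hoc temperature scaling: learn one scalar T on held-out logits
# so that softmax(logits / T) is calibrated.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# T = fit_temperature(val_logits, val_labels)
# calibrated = F.softmax(test_logits / T, dim=-1)
</syntaxhighlight>
Dividing all logits by a single T never changes the arg-max, so accuracy is untouched; only the confidence values are rescaled.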
</div>


<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Uncertainty Estimation Methods
! Method !! Calibration quality !! Computational cost !! Epistemic vs. aleatoric
|-
| Monte Carlo Dropout || Moderate || Low (N forward passes) || Both (mixed)
|-
| Deep ensembles || High || High (N models) || Both (separated)
|-
| Gaussian process || High || High for large data || Epistemic
|-
| MCMC BNN || Highest || Very high || Both
|-
| Temperature scaling || Calibration only || Very low || Neither (post-hoc)
|}


'''Failure modes''': Misspecified priors can dominate when data is sparse, leading to incorrect posteriors. GP regression scales as O(n³) in data points — intractable at scale without sparse approximations. Monte Carlo Dropout underestimates uncertainty on OOD inputs. Deep ensembles are expensive to train and serve. Overconfident predictions from poorly calibrated frequentist models are the most common real-world failure.
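For contrast with those failure modes, deep ensembles (the high-calibration, high-cost row in the table) are conceptually simple. A hedged sketch with a placeholder architecture and random inputs:
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def make_model():
    # Placeholder classifier; each member gets its own random initialization.
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))

ensemble = [make_model() for _ in range(5)]  # in practice, train each independently

def ensemble_predict(models, x):
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    mean = probs.mean(dim=0)                     # averaged predictive distribution
    disagreement = probs.var(dim=0).sum(dim=-1)  # member disagreement ≈ epistemic signal
    return mean, disagreement

mean, disagreement = ensemble_predict(ensemble, torch.randn(4, 10))
</syntaxhighlight>
Member disagreement is what lets ensembles separate epistemic from aleatoric uncertainty, as the table notes.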
</div>


<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Calibration is the key evaluation criterion:
# '''Reliability diagrams''': bin predictions by confidence, plot mean confidence vs. actual accuracy per bin. Perfect calibration = diagonal line.
# '''Expected Calibration Error (ECE)''': weighted average of calibration error across bins.
# '''Brier score''': proper scoring rule combining accuracy and calibration.
# '''Negative log-likelihood (NLL)''': proper scoring rule penalizing overconfident wrong predictions.
Expert practitioners evaluate calibration separately on in-distribution and OOD data — models are often well calibrated in-distribution but dramatically overconfident on OOD inputs.
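A minimal NumPy sketch of ECE (item 2 above), assuming <code>confidences</code> holds top-class confidences and <code>correct</code> holds 0/1 indicators; both names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |accuracy - confidence| over equal-width confidence bins."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in the bin
    return ece
</syntaxhighlight>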
</div>


<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a calibrated Bayesian ML pipeline:
# For regression with small-to-medium data (<10k points): use a GP with an RBF or Matérn kernel — exact posterior inference and well-calibrated uncertainty.
# For classification at scale: train a deep ensemble of 5 models — strong calibration, high cost.
# For single-model practical calibration: train a standard model, then apply temperature scaling on a validation set.
# For hyperparameter tuning: replace grid search with Bayesian optimization (Optuna TPE or BoTorch GP-BO); see the sketch after this list.
# Monitor calibration in production: track reliability diagrams weekly and alert on ECE degradation.
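A hedged sketch of step 4 with Optuna's TPE sampler; the objective below is a stand-in for a real train-and-validate loop, and the search space is invented for illustration:
<syntaxhighlight lang="python">
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    # Placeholder: return validation loss from your own training run here.
    return (lr - 1e-3) ** 2 + weight_decay

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
</syntaxhighlight>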


[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Bayesian Machine Learning]]
</div>
