Privacy ML


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Privacy-preserving machine learning (PPML) is the study and practice of training and deploying ML models in ways that protect the privacy of the underlying data. As AI systems increasingly train on sensitive personal data — medical records, financial histories, behavioral data — techniques for learning from data without exposing individual records have become essential. PPML encompasses differential privacy (mathematical privacy guarantees), federated learning (training without centralizing data), secure multi-party computation (collaborating without sharing raw data), and homomorphic encryption (computing on encrypted data).

Remembering

  • Differential privacy (DP) — A mathematical guarantee that the inclusion or exclusion of any single record makes little difference to the output, bounded by parameter ε.
  • Epsilon (ε) in DP — The privacy budget: smaller ε = stronger privacy guarantee but more noise added. Typical values: ε=1–10.
  • Noise mechanism — Adding calibrated random noise to protect privacy: Laplace mechanism, Gaussian mechanism.
  • DP-SGD (Differentially Private SGD) — Training neural networks with differential privacy by clipping and noising gradients.
  • Federated learning — Training on data distributed across many devices without centralizing the raw data; only model updates are shared.
  • Secure aggregation — Aggregating federated model updates without the server seeing individual updates (using cryptographic protocols).
  • Homomorphic encryption (HE) — Cryptographic technique allowing computation on encrypted data without decryption.
  • Secure Multi-Party Computation (SMPC) — Multiple parties jointly compute a function on their private inputs without revealing those inputs.
  • Membership inference attack — An attack testing whether a specific record was in the training data; measures privacy leakage.
  • Model inversion attack — Reconstructing training data from a trained model's outputs; a privacy risk.
  • Data minimization — Collecting and using only the minimum data necessary; a GDPR principle.
  • Synthetic data (privacy) — Generating realistic but non-personal data to share instead of real records.
  • k-anonymity — A data protection model where each record is indistinguishable from at least k-1 others.
  • Privacy budget — The total privacy expenditure across multiple DP queries or training steps; must be managed carefully.

Understanding

The core problem: ML models memorize training data. It is well documented that models can reveal individual training examples when queried in the right way. This creates serious privacy risks when training data includes medical records, financial transactions, or personal communications.

Differential Privacy (DP) provides a rigorous mathematical definition of privacy. A mechanism M satisfies (ε, δ)-differential privacy if for any two adjacent datasets D and D' (differing by one record), and any output S: P(M(D) ∈ S) ≤ e^ε · P(M(D') ∈ S) + δ. This means the output distribution is nearly identical whether or not any individual's data was included — their privacy is protected regardless of what the attacker knows.
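To make the definition concrete, here is a minimal sketch of the Laplace mechanism answering a counting query with ε-DP. It is not tied to any particular library, and the function name and toy data are made up for illustration. A count changes by at most 1 when one record is added or removed, so its sensitivity is 1 and Laplace noise with scale 1/ε is enough:

<syntaxhighlight lang="python">
import numpy as np

def laplace_count(records, predicate, epsilon):
    """Answer a counting query with ε-differential privacy.

    A count changes by at most 1 when one record is added or removed,
    so the L1 sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative query: how many people in this (toy) dataset are over 60?
ages = [34, 71, 62, 45, 80, 58]
print(laplace_count(ages, lambda age: age > 60, epsilon=1.0))
</syntaxhighlight>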

DP-SGD is the standard technique for differentially private deep learning (Abadi et al., 2016):

  1. Compute gradient for each sample individually.
  2. Clip each gradient to bounded L2 norm (prevents any single example from having too much influence).
  3. Add Gaussian noise calibrated to the privacy budget.
  4. Average the noisy, clipped gradients and update the model.

The cost: the added noise degrades model utility, especially for complex models and small datasets.
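The four steps can also be written out by hand. The following is an illustrative, unoptimized sketch of a single update, assuming a model, loss function, and mini-batch are already available; the noise_multiplier value and helper name are made up for the example. In practice a library such as Opacus (see Applying below) implements the same idea efficiently:

<syntaxhighlight lang="python">
import torch

def dp_sgd_step(model, loss_fn, X_batch, y_batch, lr=0.05,
                clip_norm=1.0, noise_multiplier=1.1):
    """One differentially private SGD update (illustrative sketch)."""
    per_sample_grads = []
    # Step 1: compute the gradient of each sample individually
    for x, y in zip(X_batch, y_batch):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        # Step 2: clip each per-sample gradient to L2 norm <= clip_norm
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, float(clip_norm / (total_norm + 1e-12)))
        per_sample_grads.append([g * scale for g in grads])

    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            # Step 3: sum the clipped gradients and add calibrated Gaussian noise
            summed = torch.stack([g[i] for g in per_sample_grads]).sum(dim=0)
            noise = torch.randn_like(summed) * noise_multiplier * clip_norm
            # Step 4: average the noisy sum and take the SGD step
            p -= lr * (summed + noise) / len(X_batch)
</syntaxhighlight>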

Federated learning keeps data on-device. Google's Gboard keyboard trains its next-word prediction model on user input directly on phones; only encrypted model updates are sent to a central server, where they are aggregated and used to update the global model. No raw text ever leaves the device.
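One round of this process can be sketched as federated averaging. The sketch below is a simplified single-machine simulation under assumed inputs (clients is a list of per-client batch iterables; all names are illustrative) and omits the secure-aggregation and communication layers a real deployment needs:

<syntaxhighlight lang="python">
import copy
import torch

def local_update(global_model, loss_fn, client_batches, lr=0.01, epochs=1):
    """Train a copy of the global model on one client's private data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for X, y in client_batches:    # the raw data never leaves the client
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
    n_examples = sum(len(y) for _, y in client_batches)
    return model.state_dict(), n_examples

def federated_round(global_model, loss_fn, clients):
    """One federated-averaging round: average client models, weighted by data size."""
    updates = [local_update(global_model, loss_fn, batches) for batches in clients]
    total = sum(n for _, n in updates)
    averaged = {
        key: sum(state[key] * (n / total) for state, n in updates)
        for key in updates[0][0]
    }
    global_model.load_state_dict(averaged)
    return global_model
</syntaxhighlight>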

The tradeoff landscape: Strong privacy → more noise → lower model accuracy. There is a fundamental tension between privacy and utility. The PPML field works to close this gap, but it cannot be eliminated entirely with current techniques.

Applying

Differentially private model training with Opacus:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator
from torch.utils.data import DataLoader

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10)
)

# Validate and fix the model for DP-SGD compatibility
model = ModuleValidator.fix(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Attach the privacy engine (train_dataset is a standard torch Dataset, defined elsewhere)
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=DataLoader(train_dataset, batch_size=256),
    epochs=20,
    target_epsilon=1.0,   # Privacy budget ε (lower = stronger privacy)
    target_delta=1e-5,    # δ: small probability of privacy violation
    max_grad_norm=1.0,    # Gradient clipping norm
)

# Training loop (same as standard; Opacus handles DP automatically)
for epoch in range(20):
    for X, y in train_loader:
        optimizer.zero_grad()
        output = model(X)
        loss = nn.CrossEntropyLoss()(output, y)
        loss.backward()
        optimizer.step()

epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Final privacy budget spent: ε = {epsilon:.2f}")
</syntaxhighlight>
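Roughly speaking, make_private_with_epsilon works backwards from the target ε and the planned number of epochs to choose how much Gaussian noise to add to each update; the final get_epsilon call then reports the privacy budget actually spent after training.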