Active Learning

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

Active learning is a machine learning paradigm where the learning algorithm actively queries an oracle (usually a human expert) to label the most informative examples, rather than learning passively from a pre-labeled dataset. The key insight: not all labeled examples are equally valuable for learning. By strategically selecting which examples to label, active learning can achieve comparable model performance with far fewer labels — often 10–20× fewer — dramatically reducing the annotation cost of building supervised ML systems.

Remembering

  • Active learning — A machine learning approach where the model selects its own training examples to query for labels.
  • Oracle — The entity providing labels in active learning; typically a human expert.
  • Query strategy — The criterion used to select which unlabeled examples to request labels for.
  • Uncertainty sampling — Query the example the model is least certain about (most uncertain prediction).
  • Query by committee — Train multiple models (committee); query examples where they disagree most.
  • Core-set selection — Select examples that best represent the overall data distribution geometrically.
  • Expected model change — Query the example that would cause the largest change in the model if labeled.
  • Expected error reduction — Query the example that would most reduce the model's expected error.
  • Batch active learning — Selecting a batch of K examples at once (more practical than one-at-a-time).
  • Pool-based sampling — Select from a pool of unlabeled examples (most common active learning setting).
  • Stream-based sampling — Examples arrive sequentially; decide whether to query each one.
  • Cold start — The initial problem in active learning: model is untrained, so uncertainty estimates are unreliable.
  • Annotation budget — The total number of labels the user is willing to provide.
  • Label efficiency — The ratio of performance improvement to annotation cost; active learning maximizes this.

Understanding

In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"

Why random sampling is suboptimal: If 98% of images in a dataset are cats and 2% are rare diseases, random sampling wastes most of the annotation budget on cats that the model already handles well. Active learning focuses the budget on examples near the decision boundary or in underrepresented regions.

Uncertainty sampling: The simplest and most widely used strategy. After training on the current labeled set, apply the model to all unlabeled examples. Select the examples where the model is least confident (e.g., predicted probability closest to 0.5 for binary classification). The intuition: these are the examples the model is currently "on the fence" about — labeling them provides the most information.
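
As a toy illustration (the probability values below are made up), least-confidence selection for a binary classifier simply ranks unlabeled examples by how close their predicted probability is to 0.5:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative predicted P(y=1) for five unlabeled examples
p = np.array([0.98, 0.51, 0.07, 0.45, 0.88])
uncertainty = -np.abs(p - 0.5)               # closer to 0.5 = more uncertain
query_order = np.argsort(uncertainty)[::-1]  # most uncertain first
print(query_order[:2])                       # [1 3]: the "on the fence" examples
</syntaxhighlight>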

Core-set selection: Instead of uncertainty, select examples that are geometrically most distant from already-labeled examples in the feature space. This ensures the labeled set covers the full data distribution — addressing the cold-start problem that uncertainty sampling faces (before training, all uncertainty estimates are uninformative).
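
A minimal sketch of greedy k-center selection, assuming feature embeddings have already been extracted; the function name and shapes are illustrative, not from a specific library:

<syntaxhighlight lang="python">
import torch

def coreset_select(features, labeled_idx, k):
    """Greedy k-center: repeatedly pick the point farthest from the labeled/selected set.
    features: (N, D) tensor of embeddings; labeled_idx: already-labeled indices (non-empty)."""
    selected = []
    # Distance from every point to its nearest labeled point
    min_dist = torch.cdist(features, features[list(labeled_idx)]).min(dim=1).values
    for _ in range(k):
        nxt = int(min_dist.argmax())  # farthest point from everything chosen so far
        selected.append(nxt)
        min_dist = torch.minimum(min_dist,
                                 torch.cdist(features, features[nxt:nxt + 1]).squeeze(1))
    return selected
</syntaxhighlight>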

The exploration-exploitation tension: Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.
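
A rough sketch of the BADGE idea, under the simplifying assumption that the last-layer gradient embedding is (p − onehot(ŷ)) ⊗ h(x) and that the batch is chosen by k-means++ seeding over those embeddings; function and variable names are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def badge_select(probs, penult, k, seed=0):
    """k-means++ seeding over BADGE-style gradient embeddings.
    probs: (N, C) softmax outputs; penult: (N, D) penultimate-layer features."""
    rng = np.random.default_rng(seed)
    yhat = probs.argmax(axis=1)
    resid = probs.copy()
    resid[np.arange(len(probs)), yhat] -= 1.0  # p - onehot(predicted label)
    g = (resid[:, :, None] * penult[:, None, :]).reshape(len(probs), -1)  # (N, C*D) embedding
    chosen = [int(rng.integers(len(g)))]       # first center uniformly at random
    d2 = ((g - g[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(rng.choice(len(g), p=d2 / d2.sum()))  # sample proportional to squared distance
        chosen.append(nxt)
        d2 = np.minimum(d2, ((g - g[nxt]) ** 2).sum(axis=1))
    return chosen
</syntaxhighlight>

Uncertain examples have large gradient embeddings, so the distance-proportional sampling tends to favor points that are both informative and mutually diverse.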

Applying

Active learning loop for image classification:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import models  # e.g. pass models.resnet18(num_classes=...) as the model


class ActiveLearner:
    def __init__(self, model, unlabeled_pool, labeled_indices, device='cuda'):
        self.model = model.to(device)
        self.unlabeled_pool = unlabeled_pool          # dataset returning (image, _) pairs
        self.labeled_indices = list(labeled_indices)  # pool indices with known labels
        self.unlabeled_indices = [i for i in range(len(unlabeled_pool))
                                  if i not in set(labeled_indices)]
        self.device = device

    def train(self, labels_dict, epochs=5):
        """Train on currently labeled data; labels_dict maps pool index -> class label."""
        # Batch over pool indices so each image is paired with its own label.
        idx_loader = DataLoader(self.labeled_indices, batch_size=32, shuffle=True)
        opt = torch.optim.Adam(self.model.parameters(), lr=1e-4)
        self.model.train()
        for _ in range(epochs):
            for idx_batch in idx_loader:
                idx_batch = idx_batch.tolist()
                X = torch.stack([self.unlabeled_pool[i][0] for i in idx_batch]).to(self.device)
                y = torch.tensor([labels_dict[i] for i in idx_batch]).to(self.device)
                loss = nn.CrossEntropyLoss()(self.model(X), y)
                opt.zero_grad(); loss.backward(); opt.step()

    def query(self, n_query=10, strategy='uncertainty') -> list:
        """Select n_query pool indices to label next."""
        unlabeled_ds = Subset(self.unlabeled_pool, self.unlabeled_indices)
        loader = DataLoader(unlabeled_ds, batch_size=64)
        self.model.eval()
        all_probs = []
        with torch.no_grad():
            for X, _ in loader:
                probs = torch.softmax(self.model(X.to(self.device)), dim=1).cpu()
                all_probs.append(probs)
        all_probs = torch.cat(all_probs)
        if strategy == 'uncertainty':
            # Maximum predictive entropy (clamp the log to avoid -inf on zero probabilities)
            entropy = -(all_probs * all_probs.log().clamp(-100, 0)).sum(1)
            query_local_idx = entropy.topk(n_query).indices.tolist()
        elif strategy == 'margin':
            # Smallest gap between the top-2 predicted probabilities
            top2 = all_probs.topk(2, dim=1).values
            margin = top2[:, 0] - top2[:, 1]
            query_local_idx = margin.topk(n_query, largest=False).indices.tolist()
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # Map positions within the unlabeled subset back to pool indices
        return [self.unlabeled_indices[i] for i in query_local_idx]

    def label_and_add(self, indices):
        """Move newly labeled pool indices from the unlabeled set to the labeled set."""
        self.labeled_indices.extend(indices)
        self.unlabeled_indices = [i for i in self.unlabeled_indices if i not in set(indices)]
</syntaxhighlight>
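
A sketch of how the loop above might be driven end to end; pool, build_model(), and oracle_label() are hypothetical stand-ins for your dataset, model constructor, and human annotation step:

<syntaxhighlight lang="python">
import random

# `pool`, `build_model`, and `oracle_label` are hypothetical placeholders
labels_dict = {}
seed_idx = random.sample(range(len(pool)), 50)   # bootstrap with random examples
for i in seed_idx:
    labels_dict[i] = oracle_label(pool, i)        # human provides the label

learner = ActiveLearner(build_model(), pool, seed_idx)
for al_round in range(10):
    learner.train(labels_dict, epochs=5)
    queried = learner.query(n_query=20, strategy='uncertainty')
    for i in queried:
        labels_dict[i] = oracle_label(pool, i)
    learner.label_and_add(queried)
</syntaxhighlight>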

Active learning strategy guide

  • Small budget (<100 labels) → Core-set (diversity), random seeding first 10
  • Medium budget → BADGE (diverse + informative), uncertainty sampling
  • Large unlabeled pool → Uncertainty sampling (fast); approximate with embeddings
  • Batch selection → BADGE, k-means++ on uncertain examples
  • NLP tasks → Uncertainty + semantic diversity filtering; avoid near-duplicate queries
  • Medical/scientific → Core-set + expert-in-the-loop revision cycles

Analyzing

Active Learning Strategy Comparison

{| class="wikitable"
! Strategy !! Label efficiency !! Computational cost !! Cold-start performance
|-
| Random sampling || Baseline || None || Good (diverse)
|-
| Uncertainty sampling || High || Low || Poor (overconfident early)
|-
| Core-set selection || High || Medium || Good (covers distribution)
|-
| Query by committee || High || High || Moderate
|-
| BADGE || Very high || Medium || Good
|-
| Expected model change || Very high || Very high || Good
|}

Failure modes: Outlier bias — uncertainty sampling queries outliers (intrinsically ambiguous or mislabeled examples) that don't help generalization. Cold-start overconfidence — early models make confident but wrong predictions, querying non-informative examples. Distribution mismatch — labeled set diverges from test distribution. Oracle disagreement — human annotators disagree on queried examples, providing noisy labels.

Evaluating

Active learning evaluation: (1) Learning curve: plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget. (2) Label efficiency: how many labels are needed to achieve X% of fully-supervised performance? (3) Test on held-out set: evaluation must use examples never selected by active learning to avoid selection bias. (4) Annotation cost: measure actual oracle time per query; some strategies select harder-to-annotate examples. (5) Stability: run 5 trials with different random seeds; active learning can have high variance.
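
A small sketch of the label-efficiency comparison in (1)–(2); the learning-curve numbers below are purely illustrative placeholders, not measured results:

<syntaxhighlight lang="python">
def labels_to_reach(budgets, accs, target):
    """Smallest annotation budget whose (seed-averaged) accuracy meets the target, else None."""
    for budget, acc in zip(budgets, accs):
        if acc >= target:
            return budget
    return None

# Illustrative learning curves: mean accuracy over 5 seeds at each labeling budget
budgets    = [100, 200, 400, 800, 1600]
acc_active = [0.62, 0.74, 0.83, 0.88, 0.90]
acc_random = [0.55, 0.64, 0.73, 0.82, 0.88]
target = 0.95 * 0.92   # 95% of an assumed fully-supervised accuracy of 0.92
print(labels_to_reach(budgets, acc_active, target))   # 800
print(labels_to_reach(budgets, acc_random, target))   # 1600
</syntaxhighlight>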

Creating

Designing a production active learning pipeline: (1) Start random: label 50–100 random examples to bootstrap. (2) Train initial model: establish a baseline. (3) Query: use BADGE or uncertainty sampling to select next batch (10–50 examples). (4) Annotate: route selected examples to human annotators via labeling tool (Label Studio, Scale AI). (5) Retrain: retrain model on expanded labeled set. (6) Evaluate: measure performance on held-out test set; continue if below target. (7) Stopping rule: stop when performance gain per label drops below threshold — the marginal return on annotation investment.
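
A minimal sketch of the stopping rule in step (7), assuming you record (total labels, test accuracy) after every round; the threshold is an assumption to tune per project:

<syntaxhighlight lang="python">
def should_stop(history, min_gain_per_label=1e-4):
    """history: list of (total_labels, test_accuracy) after each retraining round.
    Stop once the marginal accuracy gain per additional label falls below the threshold."""
    if len(history) < 2:
        return False
    (n_prev, acc_prev), (n_cur, acc_cur) = history[-2], history[-1]
    gain_per_label = (acc_cur - acc_prev) / max(n_cur - n_prev, 1)
    return gain_per_label < min_gain_per_label
</syntaxhighlight>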