Active Learning
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Active learning is a machine learning paradigm where the learning algorithm actively queries an oracle (usually a human expert) to label the most informative examples, rather than learning passively from a pre-labeled dataset. The key insight: not all labeled examples are equally valuable for learning. By strategically selecting which examples to label, active learning can achieve comparable model performance with far fewer labels (often 10–20× fewer), substantially reducing the annotation cost of building supervised ML systems.
Remembering
- Active learning — A machine learning approach where the model selects its own training examples to query for labels.
- Oracle — The entity providing labels in active learning; typically a human expert.
- Query strategy — The criterion used to select which unlabeled examples to request labels for.
- Uncertainty sampling — Query the example the model is least certain about (most uncertain prediction).
- Query by committee — Train multiple models (committee); query examples where they disagree most.
- Core-set selection — Select examples that best represent the overall data distribution geometrically.
- Expected model change — Query the example that would cause the largest change in the model if labeled.
- Expected error reduction — Query the example that would most reduce the model's expected error.
- Batch active learning — Selecting a batch of K examples at once (more practical than one-at-a-time).
- Pool-based sampling — Select from a pool of unlabeled examples (most common active learning setting).
- Stream-based sampling — Examples arrive sequentially; decide whether to query each one.
- Cold start — The initial problem in active learning: model is untrained, so uncertainty estimates are unreliable.
- Annotation budget — The total number of labels the user is willing to provide.
- Label efficiency — The ratio of performance improvement to annotation cost; active learning maximizes this.
Understanding
In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"
Why random sampling is suboptimal: If 98% of images in a dataset are cats and 2% are rare diseases, random sampling wastes most of the annotation budget on cats that the model already handles well. Active learning focuses the budget on examples near the decision boundary or in underrepresented regions.
Uncertainty sampling: The simplest and most widely used strategy. After training on the current labeled set, apply the model to all unlabeled examples. Select the examples where the model is least confident (e.g., predicted probability closest to 0.5 for binary classification). The intuition: these are the examples the model is currently "on the fence" about — labeling them provides the most information.
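As a concrete sketch (plain NumPy, with made-up probabilities), the three common uncertainty scores can be computed like this:

```python
import numpy as np

# Hypothetical predicted class probabilities for 4 unlabeled examples (3 classes).
probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> low query priority
    [0.40, 0.35, 0.25],   # near-uniform -> high entropy, low confidence
    [0.51, 0.48, 0.01],   # tiny top-2 gap -> smallest margin
    [0.80, 0.15, 0.05],
])

least_confidence = 1.0 - probs.max(axis=1)              # high = uncertain
top2 = np.sort(probs, axis=1)[:, ::-1][:, :2]
margin = top2[:, 0] - top2[:, 1]                        # small = uncertain
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # high = uncertain

# Entropy sampling would query example 1; margin sampling would query example 2.
print(int(entropy.argmax()), int(margin.argmin()))  # 1 2
```

Note that the three scores can disagree, as they do here: entropy favors the near-uniform row, while margin favors the row where only the top two classes compete.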
Core-set selection: Instead of uncertainty, select examples that are geometrically most distant from already-labeled examples in the feature space. This ensures the labeled set covers the full data distribution — addressing the cold-start problem that uncertainty sampling faces (before training, all uncertainty estimates are uninformative).
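The standard heuristic for this is greedy k-center selection. A minimal sketch (plain NumPy; the feature matrix and seed indices below are made up for illustration):

```python
import numpy as np

def greedy_k_center(features, labeled_idx, n_query):
    """Greedily pick the point farthest from the current labeled set, n_query times."""
    labeled = list(labeled_idx)
    # Distance from every point to its nearest labeled point.
    dists = np.linalg.norm(features - features[labeled[0]], axis=1)
    for i in labeled[1:]:
        dists = np.minimum(dists, np.linalg.norm(features - features[i], axis=1))
    picked = []
    for _ in range(n_query):
        idx = int(dists.argmax())  # farthest point from everything selected so far
        picked.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return picked

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # stand-in for model embeddings of the unlabeled pool
queries = greedy_k_center(X, labeled_idx=[0, 1], n_query=5)
```

In practice `features` would be embeddings from the current model (or a pretrained one for cold start), which is what makes this strategy usable before any labels exist.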
The exploration-exploitation tension: Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.
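A simplified NumPy sketch of the BADGE idea: build last-layer gradient embeddings (large norm ≈ informative), then use k-means++-style seeding over them for diversity. The features, logits, and batch size are made up; a real implementation would take them from the trained network.

```python
import numpy as np

def badge_select(hidden, probs, n_query, rng):
    """BADGE sketch: gradient embeddings + k-means++ seeding."""
    n, k = probs.shape
    # Gradient of the loss w.r.t. the last layer, using the model's own
    # prediction as a pseudo-label: (p - one_hot(argmax p)) outer hidden features.
    residual = probs.copy()
    residual[np.arange(n), probs.argmax(1)] -= 1.0
    g = (residual[:, :, None] * hidden[:, None, :]).reshape(n, -1)
    # k-means++ seeding: start from the largest-norm embedding (exploitation),
    # then sample proportional to squared distance to the nearest chosen point
    # (exploration/diversity).
    chosen = [int(np.linalg.norm(g, axis=1).argmax())]
    d2 = ((g - g[chosen[0]]) ** 2).sum(1)
    for _ in range(n_query - 1):
        nxt = int(rng.choice(n, p=d2 / d2.sum()))
        chosen.append(nxt)
        d2 = np.minimum(d2, ((g - g[nxt]) ** 2).sum(1))
    return chosen

rng = np.random.default_rng(0)
hidden = rng.normal(size=(100, 16))  # made-up penultimate-layer features
logits = rng.normal(size=(100, 4))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
batch = badge_select(hidden, probs, n_query=8, rng=rng)
```

Already-chosen points have squared distance zero, so they cannot be sampled again, which is what guarantees a batch of distinct examples.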
Applying
Active learning loop for image classification: <syntaxhighlight lang="python">
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import models
class ActiveLearner:
    def __init__(self, model, unlabeled_pool, labeled_indices, device='cuda'):
        self.model = model.to(device)
        self.unlabeled_pool = unlabeled_pool
        self.labeled_indices = list(labeled_indices)
        self.unlabeled_indices = [i for i in range(len(unlabeled_pool))
                                  if i not in set(labeled_indices)]
        self.device = device

    def train(self, labels_dict, epochs=5):
        """Train on currently labeled data, pairing each index with its oracle label."""
        # Iterate over the labeled indices themselves so every image in a batch
        # is matched with its own label from labels_dict.
        loader = DataLoader(self.labeled_indices, batch_size=32, shuffle=True)
        opt = torch.optim.Adam(self.model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()
        self.model.train()
        for _ in range(epochs):
            for idx_batch in loader:
                idx_batch = idx_batch.tolist()
                X = torch.stack([self.unlabeled_pool[i][0] for i in idx_batch]).to(self.device)
                y = torch.tensor([labels_dict[i] for i in idx_batch]).to(self.device)
                loss = criterion(self.model(X), y)
                opt.zero_grad(); loss.backward(); opt.step()

    def query(self, n_query=10, strategy='uncertainty') -> list:
        """Select n_query examples to label next."""
        unlabeled_ds = Subset(self.unlabeled_pool, self.unlabeled_indices)
        loader = DataLoader(unlabeled_ds, batch_size=64)
        self.model.eval()
        all_probs = []
        with torch.no_grad():
            for X, _ in loader:
                probs = torch.softmax(self.model(X.to(self.device)), dim=1).cpu()
                all_probs.append(probs)
        all_probs = torch.cat(all_probs)
        if strategy == 'uncertainty':
            # Entropy sampling: query the highest-entropy predictions.
            entropy = -(all_probs * all_probs.log().clamp(min=-100)).sum(1)
            query_local_idx = entropy.topk(n_query).indices.tolist()
        elif strategy == 'margin':
            # Margin sampling: smallest gap between the top-2 predicted classes.
            top2 = all_probs.topk(2, dim=1).values
            margin = top2[:, 0] - top2[:, 1]
            query_local_idx = margin.topk(n_query, largest=False).indices.tolist()
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        return [self.unlabeled_indices[i] for i in query_local_idx]

    def label_and_add(self, indices):
        """Move newly labeled examples from the unlabeled pool to the labeled set."""
        self.labeled_indices.extend(indices)
        self.unlabeled_indices = [i for i in self.unlabeled_indices
                                  if i not in set(indices)]
</syntaxhighlight>
Active learning strategy guide:
- Small budget (<100 labels) → Core-set (diversity), random seeding first 10
- Medium budget → BADGE (diverse + informative), uncertainty sampling
- Large unlabeled pool → Uncertainty sampling (fast); approximate with embeddings
- Batch selection → BADGE, k-means++ on uncertain examples
- NLP tasks → Uncertainty + semantic diversity filtering; avoid near-duplicate queries
- Medical/scientific → Core-set + expert-in-the-loop revision cycles
Analyzing
| Strategy | Label Efficiency | Computational Cost | Cold Start Performance |
|---|---|---|---|
| Random sampling | Baseline | None | Good (diverse) |
| Uncertainty sampling | High | Low | Poor (overconfident early) |
| Core-set selection | High | Medium | Good (covers distribution) |
| Query by committee | High | High | Moderate |
| BADGE | Very high | Medium | Good |
| Expected model change | Very high | Very high | Good |
Failure modes:
- Outlier bias — uncertainty sampling queries outliers (intrinsically ambiguous or mislabeled examples) that don't help generalization.
- Cold-start overconfidence — early models make confident but wrong predictions, querying non-informative examples.
- Distribution mismatch — the labeled set diverges from the test distribution.
- Oracle disagreement — human annotators disagree on queried examples, providing noisy labels.
Evaluating
Active learning evaluation: (1) Learning curve: plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget. (2) Label efficiency: how many labels are needed to achieve X% of fully-supervised performance? (3) Test on held-out set: evaluation must use examples never selected by active learning to avoid selection bias. (4) Annotation cost: measure actual oracle time per query; some strategies select harder-to-annotate examples. (5) Stability: run 5 trials with different random seeds; active learning can have high variance.
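The label-efficiency question in (2) can be answered mechanically from the two learning curves. A sketch with made-up accuracy numbers:

```python
def labels_to_reach(budgets, accuracies, target):
    """Smallest budget at which a learning curve reaches `target` accuracy."""
    for b, a in zip(budgets, accuracies):
        if a >= target:
            return b
    return None  # target not reached within the measured budgets

budgets      = [100, 200, 400, 800, 1600]
random_curve = [0.61, 0.68, 0.74, 0.80, 0.85]  # hypothetical accuracies
active_curve = [0.63, 0.74, 0.81, 0.85, 0.87]
target = 0.81  # 90% of a hypothetical 0.90 fully-supervised accuracy

print(labels_to_reach(budgets, random_curve, target))  # 1600
print(labels_to_reach(budgets, active_curve, target))  # 400
```

On these invented numbers, active learning reaches the target with 4× fewer labels; reporting that ratio at a fixed target is exactly the label-efficiency metric.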
Creating
Designing a production active learning pipeline: (1) Start random: label 50–100 random examples to bootstrap. (2) Train initial model: establish a baseline. (3) Query: use BADGE or uncertainty sampling to select next batch (10–50 examples). (4) Annotate: route selected examples to human annotators via labeling tool (Label Studio, Scale AI). (5) Retrain: retrain model on expanded labeled set. (6) Evaluate: measure performance on held-out test set; continue if below target. (7) Stopping rule: stop when performance gain per label drops below threshold — the marginal return on annotation investment.
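The steps above can be exercised end to end on a toy problem. A self-contained sketch, with NumPy logistic regression standing in for the real model and the oracle simulated by known labels (in production, step 4 would go to human annotators instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable binary task (all numbers made up).
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(int)

def train_logreg(X, y, lr=0.5, steps=200):
    """Plain gradient descent on logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

labeled = list(rng.choice(1000, size=20, replace=False))     # (1) random seed set
for _ in range(5):                                           # (2)-(6) the loop
    w = train_logreg(X[labeled], y[labeled])                 # retrain
    pool = [i for i in range(1000) if i not in set(labeled)]
    p = 1 / (1 + np.exp(-(X[pool] @ w)))
    uncertainty = -np.abs(p - 0.5)                           # closest to 0.5 first
    query = [pool[i] for i in np.argsort(uncertainty)[-10:]] # (3) query 10
    labeled.extend(query)                                    # (4) "annotate"

w = train_logreg(X[labeled], y[labeled])
acc = ((X @ w > 0).astype(int) == y).mean()                  # (6) evaluate
```

A real pipeline would add the stopping rule from step (7): track `acc` across rounds and halt when the per-round improvement falls below a chosen threshold.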