Active Learning - Revision history

Wordpad: BloomWiki: Active Learning

2026-04-25T01:46:26Z

BloomWiki: Active Learning

← Older revision		Revision as of 01:46, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Active learning is a machine learning paradigm where the learning algorithm actively queries an oracle (usually a human expert) to label the most informative examples, rather than learning passively from a pre-labeled dataset. The key insight: not all labeled examples are equally valuable for learning. By strategically selecting which examples to label, active learning can achieve comparable model performance with dramatically fewer labels — often 10–20× fewer — dramatically reducing the annotation cost for building supervised ML systems.		Active learning is a machine learning paradigm where the learning algorithm actively queries an oracle (usually a human expert) to label the most informative examples, rather than learning passively from a pre-labeled dataset. The key insight: not all labeled examples are equally valuable for learning. By strategically selecting which examples to label, active learning can achieve comparable model performance with dramatically fewer labels — often 10–20× fewer — dramatically reducing the annotation cost for building supervised ML systems.
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''Active learning''' — A machine learning approach where the model selects its own training examples to query for labels.		* '''Active learning''' — A machine learning approach where the model selects its own training examples to query for labels.
	* '''Oracle''' — The entity providing labels in active learning; typically a human expert.		* '''Oracle''' — The entity providing labels in active learning; typically a human expert.
Line 17:		Line 22:
	* '''Annotation budget''' — The total number of labels the user is willing to provide.		* '''Annotation budget''' — The total number of labels the user is willing to provide.
	* '''Label efficiency''' — The ratio of performance improvement to annotation cost; active learning maximizes this.		* '''Label efficiency''' — The ratio of performance improvement to annotation cost; active learning maximizes this.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"		In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"

Line 28:		Line 35:

	'''The exploration-exploitation tension''': Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.		'''The exploration-exploitation tension''': Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''Active learning loop for image classification:'''		'''Active learning loop for image classification:'''
	<syntaxhighlight lang="python">		<syntaxhighlight lang="python">
Line 98:		Line 107:
	: '''NLP tasks''' → Uncertainty + semantic diversity filtering; avoid near-duplicate queries		: '''NLP tasks''' → Uncertainty + semantic diversity filtering; avoid near-duplicate queries
	: '''Medical/scientific''' → Core-set + expert-in-the-loop revision cycles		: '''Medical/scientific''' → Core-set + expert-in-the-loop revision cycles
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{\| class="wikitable"		{\| class="wikitable"
	\|+ Active Learning Strategy Comparison		\|+ Active Learning Strategy Comparison
Line 118:		Line 129:

	'''Failure modes''': Outlier bias — uncertainty sampling queries outliers (intrinsically ambiguous or mislabeled examples) that don't help generalization. Cold-start overconfidence — early models make confident but wrong predictions, querying non-informative examples. Distribution mismatch — labeled set diverges from test distribution. Oracle disagreement — human annotators disagree on queried examples, providing noisy labels.		'''Failure modes''': Outlier bias — uncertainty sampling queries outliers (intrinsically ambiguous or mislabeled examples) that don't help generalization. Cold-start overconfidence — early models make confident but wrong predictions, querying non-informative examples. Distribution mismatch — labeled set diverges from test distribution. Oracle disagreement — human annotators disagree on queried examples, providing noisy labels.
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Active learning evaluation:		Active learning evaluation:
	# '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget.		# '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget.
Line 126:		Line 139:
	# '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples.		# '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples.
	# '''Stability''': run 5 trials with different random seeds; active learning can have high variance.		# '''Stability''': run 5 trials with different random seeds; active learning can have high variance.
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Designing a production active learning pipeline:		Designing a production active learning pipeline:
	# '''Start random''': label 50–100 random examples to bootstrap.		# '''Start random''': label 50–100 random examples to bootstrap.
Line 140:		Line 155:
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Active Learning]]		[[Category:Active Learning]]
			</div>

Wordpad: BloomWiki: Active Learning

2026-04-23T14:36:08Z

← Older revision		Revision as of 14:36, 23 April 2026
Line 120:		Line 120:

	== Evaluating ==		== Evaluating ==
	Active learning evaluation: ~~(1)~~ '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget. ~~(2)~~ '''Label efficiency''': how many labels are needed to achieve X% of fully-supervised performance? ~~(3)~~ '''Test on held-out set''': evaluation must use examples never selected by active learning to avoid selection bias. ~~(4)~~ '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples. ~~(5)~~ '''Stability''': run 5 trials with different random seeds; active learning can have high variance.		Active learning evaluation:
			# '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget.
			# '''Label efficiency''': how many labels are needed to achieve X% of fully-supervised performance?
			# '''Test on held-out set''': evaluation must use examples never selected by active learning to avoid selection bias.
			# '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples.
			# '''Stability''': run 5 trials with different random seeds; active learning can have high variance.

	== Creating ==		== Creating ==
	Designing a production active learning pipeline: ~~(1)~~ '''Start random''': label 50–100 random examples to bootstrap. ~~(2)~~ '''Train initial model''': establish a baseline. ~~(3)~~ '''Query''': use BADGE or uncertainty sampling to select next batch (10–50 examples). ~~(4)~~ '''Annotate''': route selected examples to human annotators via labeling tool (Label Studio, Scale AI). ~~(5)~~ '''Retrain''': retrain model on expanded labeled set. ~~(6)~~ '''Evaluate''': measure performance on held-out test set; continue if below target. ~~(7)~~ '''Stopping rule''': stop when performance gain per label drops below threshold — the marginal return on annotation investment.		Designing a production active learning pipeline:
			# '''Start random''': label 50–100 random examples to bootstrap.
			# '''Train initial model''': establish a baseline.
			# '''Query''': use BADGE or uncertainty sampling to select next batch (10–50 examples).
			# '''Annotate''': route selected examples to human annotators via labeling tool (Label Studio, Scale AI).
			# '''Retrain''': retrain model on expanded labeled set.
			# '''Evaluate''': measure performance on held-out test set; continue if below target.
			# '''Stopping rule''': stop when performance gain per label drops below threshold — the marginal return on annotation investment.

	[[Category:Artificial Intelligence]]		[[Category:Artificial Intelligence]]
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Active Learning]]		[[Category:Active Learning]]

Wordpad: BloomWiki: Active Learning

2026-04-23T14:20:48Z

← Older revision		Revision as of 14:20, 23 April 2026
Line 21:		Line 21:
	In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"		In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"

	Why random sampling is suboptimal: If 98% of images in a dataset are cats and 2% are rare diseases, random sampling wastes most of the annotation budget on cats that the model already handles well. Active learning focuses the budget on examples near the decision boundary or in underrepresented regions.		'''Why random sampling is suboptimal''': If 98% of images in a dataset are cats and 2% are rare diseases, random sampling wastes most of the annotation budget on cats that the model already handles well. Active learning focuses the budget on examples near the decision boundary or in underrepresented regions.

	Uncertainty sampling: The simplest and most widely used strategy. After training on the current labeled set, apply the model to all unlabeled examples. Select the examples where the model is least confident (e.g., predicted probability closest to 0.5 for binary classification). The intuition: these are the examples the model is currently "on the fence" about — labeling them provides the most information.		'''Uncertainty sampling''': The simplest and most widely used strategy. After training on the current labeled set, apply the model to all unlabeled examples. Select the examples where the model is least confident (e.g., predicted probability closest to 0.5 for binary classification). The intuition: these are the examples the model is currently "on the fence" about — labeling them provides the most information.

	Core-set selection: Instead of uncertainty, select examples that are geographically most distant from already-labeled examples in the feature space. This ensures the labeled set covers the full data distribution — addressing the cold-start problem that uncertainty sampling faces (before training, all uncertainty estimates are uninformative).		'''Core-set selection''': Instead of uncertainty, select examples that are geographically most distant from already-labeled examples in the feature space. This ensures the labeled set covers the full data distribution — addressing the cold-start problem that uncertainty sampling faces (before training, all uncertainty estimates are uninformative).

	The exploration-exploitation tension: Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.		'''The exploration-exploitation tension''': Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a diverse, high-gradient batch.

	== Applying ==		== Applying ==
Line 120:		Line 120:

	== Evaluating ==		== Evaluating ==
	Active learning evaluation: (1) Learning curve: plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget. (2) Label efficiency: how many labels are needed to achieve X% of fully-supervised performance? (3) Test on held-out set: evaluation must use examples never selected by active learning to avoid selection bias. (4) Annotation cost: measure actual oracle time per query; some strategies select harder-to-annotate examples. (5) Stability: run 5 trials with different random seeds; active learning can have high variance.		Active learning evaluation: (1) '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget. (2) '''Label efficiency''': how many labels are needed to achieve X% of fully-supervised performance? (3) '''Test on held-out set''': evaluation must use examples never selected by active learning to avoid selection bias. (4) '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples. (5) '''Stability''': run 5 trials with different random seeds; active learning can have high variance.

	== Creating ==		== Creating ==
	Designing a production active learning pipeline: (1) Start random: label 50–100 random examples to bootstrap. (2) Train initial model: establish a baseline. (3) Query: use BADGE or uncertainty sampling to select next batch (10–50 examples). (4) Annotate: route selected examples to human annotators via labeling tool (Label Studio, Scale AI). (5) Retrain: retrain model on expanded labeled set. (6) Evaluate: measure performance on held-out test set; continue if below target. (7) Stopping rule: stop when performance gain per label drops below threshold — the marginal return on annotation investment.		Designing a production active learning pipeline: (1) '''Start random''': label 50–100 random examples to bootstrap. (2) '''Train initial model''': establish a baseline. (3) '''Query''': use BADGE or uncertainty sampling to select next batch (10–50 examples). (4) '''Annotate''': route selected examples to human annotators via labeling tool (Label Studio, Scale AI). (5) '''Retrain''': retrain model on expanded labeled set. (6) '''Evaluate''': measure performance on held-out test set; continue if below target. (7) '''Stopping rule''': stop when performance gain per label drops below threshold — the marginal return on annotation investment.

	[[Category:Artificial Intelligence]]		[[Category:Artificial Intelligence]]
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Active Learning]]		[[Category:Active Learning]]

Wordpad: BloomWiki: Active Learning

2026-04-23T12:20:47Z

New page

{{BloomIntro}}
Active learning is a machine learning paradigm in which the learning algorithm queries an oracle (usually a human expert) to label the most informative examples, rather than learning passively from a pre-labeled dataset. The key insight is that not all labeled examples are equally valuable for learning. By choosing which examples to label, active learning can reach comparable model performance with far fewer labels (in favorable cases 10–20× fewer), which substantially reduces the annotation cost of building supervised ML systems.

== Remembering ==
* '''Active learning''' — A machine learning approach where the model selects its own training examples to query for labels.
* '''Oracle''' — The entity providing labels in active learning; typically a human expert.
* '''Query strategy''' — The criterion used to select which unlabeled examples to request labels for.
* '''Uncertainty sampling''' — Query the example the model is least certain about (most uncertain prediction).
* '''Query by committee''' — Train multiple models (committee); query examples where they disagree most.
* '''Core-set selection''' — Select examples that best represent the overall data distribution geometrically.
* '''Expected model change''' — Query the example that would cause the largest change in the model if labeled.
* '''Expected error reduction''' — Query the example that would most reduce the model's expected error.
* '''Batch active learning''' — Selecting a batch of K examples at once (more practical than one-at-a-time).
* '''Pool-based sampling''' — Select from a pool of unlabeled examples (most common active learning setting).
* '''Stream-based sampling''' — Examples arrive sequentially; decide whether to query each one.
* '''Cold start''' — The initial problem in active learning: model is untrained, so uncertainty estimates are unreliable.
* '''Annotation budget''' — The total number of labels the user is willing to provide.
* '''Label efficiency''' — The ratio of performance improvement to annotation cost; active learning maximizes this.
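
As an illustration of the '''query by committee''' entry above, the following is a minimal sketch that scores disagreement with vote entropy. The function name <code>vote_entropy</code> and the toy vote matrix are assumptions made for this example; other disagreement measures exist.

```python
import numpy as np

def vote_entropy(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Vote entropy per example.

    votes has shape (n_members, n_examples); votes[m, i] is the class that
    committee member m predicts for example i.
    """
    n_members = votes.shape[0]
    # counts[c, i] = number of members voting for class c on example i
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    frac = counts / n_members
    # Treat 0 * log(0) as 0 by only taking the log where frac > 0
    log_frac = np.log(frac, out=np.zeros_like(frac), where=frac > 0)
    return -(frac * log_frac).sum(axis=0)

# Toy committee of 3 members voting on 3 examples (illustrative values only)
votes = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [0, 0, 0]])
scores = vote_entropy(votes, n_classes=3)
query_order = np.argsort(-scores)  # most disagreement first
```

Examples on which every member agrees score zero and are queried last.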

== Understanding ==
In most ML applications, labeling data is the bottleneck — not the model or the compute. A radiologist might take 5 minutes to annotate a CT scan; annotating 100,000 scans would require years of expert time. Active learning addresses this directly by asking: "which 1000 scans should we label to get the most accurate model?"

'''Why random sampling is suboptimal''': If 98% of the images in a dataset show a common, easy class and 2% show a rare one (for example, a rare disease), random sampling spends most of the annotation budget on common examples that the model already handles well. Active learning focuses the budget on examples near the decision boundary or in underrepresented regions.

'''Uncertainty sampling''': The simplest and most widely used strategy. After training on the current labeled set, apply the model to all unlabeled examples. Select the examples where the model is least confident (e.g., predicted probability closest to 0.5 for binary classification). The intuition: these are the examples the model is currently "on the fence" about, so labeling them is expected to provide the most information.
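
The three most common ways to turn predicted probabilities into an uncertainty score can be sketched as follows. This is an illustrative NumPy example; the function name <code>uncertainty_scores</code> and the toy probabilities are assumptions, not part of any library.

```python
import numpy as np

def uncertainty_scores(probs: np.ndarray) -> dict:
    """probs: (n_examples, n_classes), rows sum to 1. Higher score = more uncertain."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]     # descending per row
    least_confidence = 1.0 - sorted_p[:, 0]        # 1 - max probability
    margin = -(sorted_p[:, 0] - sorted_p[:, 1])    # negated top-2 gap
    entropy = -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    return {"least_confidence": least_confidence,
            "margin": margin,
            "entropy": entropy}

# Toy binary predictions (illustrative values only)
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.70, 0.30]])
scores = uncertainty_scores(probs)
most_uncertain = int(np.argmax(scores["entropy"]))
```

For binary classification the three scores produce the same ranking; they can differ once there are three or more classes.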

'''Core-set selection''': Instead of using uncertainty, select the examples that are geometrically most distant in feature space from the already-labeled examples. This makes the labeled set cover the full data distribution and addresses the cold-start problem that uncertainty sampling faces (before training, all uncertainty estimates are uninformative).
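
One common way to approximate core-set selection is the greedy k-center heuristic: repeatedly pick the unlabeled point farthest from everything selected so far. The sketch below assumes feature vectors are already available (for example from a pretrained encoder); the function name <code>k_center_greedy</code> is hypothetical.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, n_query):
    """Greedy k-center: pick the point farthest from the current labeled set."""
    features = np.asarray(features, dtype=float)
    if len(labeled_idx) == 0:
        raise ValueError("need at least one labeled point to measure distances from")
    # min_dist[i] = distance from point i to its nearest labeled point
    diffs = features[:, None, :] - features[labeled_idx][None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    selected = []
    for _ in range(n_query):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # The new point now also counts as covered territory
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

# Toy 1-D features (illustrative values only); index 0 is already labeled
features = [[0.0], [1.0], [2.0], [10.0], [11.0]]
picked = k_center_greedy(features, labeled_idx=[0], n_query=2)
```

The initial pairwise distance computation uses memory proportional to pool size times labeled-set size, so large pools would need chunking.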

'''The exploration-exploitation tension''': Uncertainty sampling exploits model knowledge to label informative examples, but can get stuck labeling outliers or noise (uncertain examples are sometimes uncertain because they're anomalies, not informative boundary cases). Core-set selection ensures exploration of the data distribution. BADGE (Batch Active Learning by Diverse Gradient Embeddings) combines both by selecting a batch whose gradient embeddings are both large in magnitude and diverse.
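
The BADGE idea can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the reference implementation: <code>badge_select</code> is a hypothetical name, <code>probs</code> and <code>hidden</code> are assumed to be precomputed softmax outputs and penultimate-layer features, and seeding from the largest-norm embedding is one choice (standard k-means++ starts from a uniformly random point).

```python
import numpy as np

def badge_select(probs, hidden, n_query, seed=0):
    """BADGE-style batch selection sketch.

    probs:  (n, C) softmax outputs for the unlabeled pool
    hidden: (n, d) penultimate-layer features for the same examples
    """
    rng = np.random.default_rng(seed)
    n, n_classes = probs.shape
    # Gradient of cross-entropy w.r.t. the last-layer weights, using the
    # predicted class as a stand-in label: (p - onehot(argmax p)) outer h
    onehot = np.eye(n_classes)[probs.argmax(axis=1)]
    emb = ((probs - onehot)[:, :, None] * hidden[:, None, :]).reshape(n, -1)

    # k-means++ style seeding: large-norm (uncertain) points are likely to be
    # picked, and each pick lowers the chance of picking its neighbours.
    chosen = [int(np.argmax((emb ** 2).sum(axis=1)))]
    d2 = ((emb - emb[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < min(n_query, n):
        if d2.sum() == 0:  # every remaining point coincides with a chosen one
            break
        nxt = int(rng.choice(n, p=d2 / d2.sum()))
        chosen.append(nxt)
        d2 = np.minimum(d2, ((emb - emb[nxt]) ** 2).sum(axis=1))
    return chosen
```

A confidently predicted example has a gradient embedding near zero, so it is rarely selected unless little else remains.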

== Applying ==
'''Active learning loop for image classification:'''
<syntaxhighlight lang="python">
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import models


class ActiveLearner:
    def __init__(self, model, unlabeled_pool, labeled_indices, device='cuda'):
        self.model = model.to(device)
        # Dataset yielding (image_tensor, placeholder) pairs; the placeholder is ignored.
        self.unlabeled_pool = unlabeled_pool
        self.labeled_indices = list(labeled_indices)
        labeled = set(self.labeled_indices)  # build the set once, not per element
        self.unlabeled_indices = [i for i in range(len(unlabeled_pool))
                                  if i not in labeled]
        self.device = device

    def train(self, labels_dict, epochs=5, batch_size=32):
        """Train on currently labeled data; labels_dict maps pool index -> class id."""
        opt = torch.optim.Adam(self.model.parameters(), lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()
        self.model.train()
        for _ in range(epochs):
            # Shuffle pool indices ourselves so each batch stays aligned with its labels.
            order = np.random.permutation(self.labeled_indices).tolist()
            for start in range(0, len(order), batch_size):
                batch = order[start:start + batch_size]
                X = torch.stack([self.unlabeled_pool[i][0] for i in batch]).to(self.device)
                y = torch.tensor([labels_dict[i] for i in batch], device=self.device)
                loss = loss_fn(self.model(X), y)
                opt.zero_grad()
                loss.backward()
                opt.step()

    def query(self, n_query=10, strategy='uncertainty') -> list:
        """Select n_query examples to label next."""
        unlabeled_ds = Subset(self.unlabeled_pool, self.unlabeled_indices)
        # No shuffling: row k of all_probs must correspond to unlabeled_indices[k].
        loader = DataLoader(unlabeled_ds, batch_size=64)
        self.model.eval()
        all_probs = []
        with torch.no_grad():
            for X, _ in loader:
                probs = torch.softmax(self.model(X.to(self.device)), dim=1).cpu()
                all_probs.append(probs)
        all_probs = torch.cat(all_probs)
        n_query = min(n_query, len(self.unlabeled_indices))

        if strategy == 'uncertainty':
            # Maximum entropy; the clamp keeps log(0) finite so 0 * log(0) evaluates to 0
            entropy = -(all_probs * all_probs.log().clamp(-100, 0)).sum(1)
            query_local_idx = entropy.topk(n_query).indices.tolist()
        elif strategy == 'margin':
            # Smallest gap between top-2 predictions
            top2 = all_probs.topk(2, dim=1).values
            margin = top2[:, 0] - top2[:, 1]
            query_local_idx = margin.topk(n_query, largest=False).indices.tolist()
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")

        # Map positions in the unlabeled subset back to pool indices
        return [self.unlabeled_indices[i] for i in query_local_idx]

    def label_and_add(self, indices, labels_dict):
        """Add newly labeled examples to the labeled set.

        labels_dict is owned by the caller and passed to train(); it is not stored here.
        """
        self.labeled_indices.extend(indices)
        added = set(indices)
        self.unlabeled_indices = [i for i in self.unlabeled_indices if i not in added]
</syntaxhighlight>

; Active learning strategy guide
: '''Small budget (<100 labels)''' → Core-set (diversity), random seeding first 10
: '''Medium budget''' → BADGE (diverse + informative), uncertainty sampling
: '''Large unlabeled pool''' → Uncertainty sampling (fast); approximate with embeddings
: '''Batch selection''' → BADGE, k-means++ on uncertain examples
: '''NLP tasks''' → Uncertainty + semantic diversity filtering; avoid near-duplicate queries
: '''Medical/scientific''' → Core-set + expert-in-the-loop revision cycles

== Analyzing ==
{| class="wikitable"
|+ Active Learning Strategy Comparison
! Strategy !! Label Efficiency !! Computational Cost !! Cold Start Performance
|-
| Random sampling || Baseline || None || Good (diverse)
|-
| Uncertainty sampling || High || Low || Poor (overconfident early)
|-
| Core-set selection || High || Medium || Good (covers distribution)
|-
| Query by committee || High || High || Moderate
|-
| BADGE || Very high || Medium || Good
|-
| Expected model change || Very high || Very high || Good
|}

'''Failure modes''': Outlier bias — uncertainty sampling queries outliers (intrinsically ambiguous or mislabeled examples) that don't help generalization. Cold-start overconfidence — early models make confident but wrong predictions, querying non-informative examples. Distribution mismatch — labeled set diverges from test distribution. Oracle disagreement — human annotators disagree on queried examples, providing noisy labels.

== Evaluating ==
Active learning evaluation:
# '''Learning curve''': plot accuracy vs. number of labeled examples; compare to random sampling baseline at each budget.
# '''Label efficiency''': how many labels are needed to achieve X% of fully-supervised performance?
# '''Test on held-out set''': evaluation must use examples never selected by active learning to avoid selection bias.
# '''Annotation cost''': measure actual oracle time per query; some strategies select harder-to-annotate examples.
# '''Stability''': run 5 trials with different random seeds; active learning can have high variance.
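
The learning-curve, label-efficiency, and stability checks above can be computed from per-seed results as in the sketch below. All accuracy numbers are made-up placeholders for illustration, and <code>labels_to_reach</code> is a hypothetical helper name.

```python
import numpy as np

def labels_to_reach(budgets, accuracies, target):
    """Smallest budget whose accuracy reaches the target, or None if never reached."""
    for b, a in zip(budgets, accuracies):
        if a >= target:
            return b
    return None

budgets = [100, 200, 400, 800]
# Placeholder accuracies, shape (n_seeds, n_budgets). NOT real results.
random_runs = np.array([[0.60, 0.70, 0.78, 0.84],
                        [0.62, 0.72, 0.80, 0.86]])
active_runs = np.array([[0.64, 0.80, 0.85, 0.88],
                        [0.66, 0.84, 0.87, 0.90]])
fully_supervised_acc = 0.90           # placeholder reference value
target = 0.9 * fully_supervised_acc   # "90% of fully-supervised performance"

# Learning curves: mean over seeds; the std shows run-to-run variance
random_mean, random_std = random_runs.mean(axis=0), random_runs.std(axis=0)
active_mean, active_std = active_runs.mean(axis=0), active_runs.std(axis=0)

random_needed = labels_to_reach(budgets, random_mean, target)
active_needed = labels_to_reach(budgets, active_mean, target)
```

Comparing <code>active_needed</code> with <code>random_needed</code> gives the label saving at that target; both curves must be measured on the same held-out test set.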

== Creating ==
Designing a production active learning pipeline:
# '''Start random''': label 50–100 random examples to bootstrap.
# '''Train initial model''': establish a baseline.
# '''Query''': use BADGE or uncertainty sampling to select next batch (10–50 examples).
# '''Annotate''': route selected examples to human annotators via labeling tool (Label Studio, Scale AI).
# '''Retrain''': retrain model on expanded labeled set.
# '''Evaluate''': measure performance on held-out test set; continue if below target.
# '''Stopping rule''': stop when the performance gain per label drops below a threshold, that is, when the marginal return on annotation investment becomes too small.
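
The control flow of the steps above, including the stopping rule, can be sketched as a loop around three caller-supplied callbacks. The names <code>select_batch</code>, <code>annotate</code>, and <code>train_and_evaluate</code> are hypothetical placeholders for the query strategy, the labeling tool integration, and the retrain-plus-test step; none of them refers to a real API.

```python
def run_active_learning(initial_labeled, select_batch, annotate, train_and_evaluate,
                        target_acc, min_gain_per_label, max_rounds=100):
    """Run query/annotate/retrain rounds until a stopping condition is met.

    initial_labeled:    dict mapping example index -> label (the random bootstrap set)
    select_batch:       callable(labeled) -> list of indices to label next
    annotate:           callable(indices) -> dict mapping index -> label
    train_and_evaluate: callable(labeled) -> held-out accuracy after retraining
    """
    labeled = dict(initial_labeled)
    acc = train_and_evaluate(labeled)           # initial baseline model
    history = [(len(labeled), acc)]
    for _ in range(max_rounds):
        if acc >= target_acc:                   # target reached
            break
        batch = select_batch(labeled)
        if not batch:                           # unlabeled pool exhausted
            break
        labeled.update(annotate(batch))
        new_acc = train_and_evaluate(labeled)
        gain_per_label = (new_acc - acc) / len(batch)
        acc = new_acc
        history.append((len(labeled), acc))
        if gain_per_label < min_gain_per_label: # marginal return too low
            break
    return labeled, history
```

A single noisy round can trigger the gain-based stop prematurely; averaging the gain over the last few rounds is a possible refinement.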

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Active Learning]]