Semi-Supervised Learning
Latest revision as of 01:57, 25 April 2026
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Semi-supervised learning sits between supervised learning (which requires labels for all training data) and unsupervised learning (which uses no labels). It leverages a small amount of labeled data alongside a large amount of unlabeled data to train better models than either approach alone. Since labeled data is expensive and time-consuming to acquire while unlabeled data is often abundantly available, semi-supervised learning is highly practical. Modern variants include pseudo-labeling, consistency regularization, and graph-based methods.
Remembering
- Semi-supervised learning — Learning using a small labeled dataset and a large unlabeled dataset simultaneously.
- Pseudo-labeling — Using a model's predictions on unlabeled data as provisional labels, then retraining on those labels.
- Consistency regularization — Enforcing that model predictions remain consistent under perturbations of unlabeled inputs.
- Mean Teacher — A semi-supervised method where a student model is trained, and the teacher model is an exponential moving average of student weights; teacher provides pseudo-labels.
- FixMatch — A state-of-the-art semi-supervised image classification method using confidence thresholding and weak/strong augmentation consistency.
- MixMatch — A holistic semi-supervised approach combining pseudo-labeling, consistency regularization, and MixUp data augmentation.
- Self-training — Train on labeled data, predict labels for unlabeled data, retrain on the combination; repeat iteratively.
- Co-training — Train two models on different feature views; each provides pseudo-labels for the other.
- Graph-based methods — Propagate labels through a graph where edges represent similarity between examples (label propagation).
- Label propagation — Semi-supervised algorithm that spreads labels from labeled to unlabeled examples through a similarity graph.
- Manifold assumption — The assumption that data lies on a low-dimensional manifold; points on the same manifold should have the same label.
- Smoothness assumption — If two points are close in input space, they should have similar labels.
- Cluster assumption — Decision boundaries should lie in low-density regions between clusters.
- Confidence threshold — In pseudo-labeling, only use predictions where model confidence exceeds a threshold; avoids noisy pseudo-labels.
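To make the graph-based terms above concrete, here is a minimal sketch of label propagation using scikit-learn's LabelSpreading, which treats samples labeled -1 as unlabeled and diffuses labels through a k-nearest-neighbor similarity graph. The dataset, label split, and neighbor count are illustrative choices, not from the article:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two-moons data: 200 points, only 10 carry labels; -1 marks unlabeled.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)
y_train = np.full(200, -1)
labeled_idx = np.arange(0, 200, 20)  # 10 labeled points, 5 per class
y_train[labeled_idx] = y_true[labeled_idx]

# Spread labels from the 10 labeled points through a k-NN graph.
model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y_train)

# transduction_ holds the inferred label for every training point.
acc = (model.transduction_ == y_true).mean()
print(f"transductive accuracy from 10 labels: {acc:.2f}")
```

Because the two moons are dense clusters separated by a low-density gap, the propagated labels recover the true classes almost perfectly, illustrating the cluster assumption directly.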
Understanding
Semi-supervised learning works by exploiting the structure of the unlabeled data distribution to constrain the label function. The key assumptions:
- Smoothness: Nearby points → similar labels. If two images of dogs are close in feature space, they should both be labeled "dog."
- Cluster: Classes form clusters. The decision boundary should pass through low-density regions between clusters, not through high-density regions.
- Manifold: Data lies on lower-dimensional manifolds. Using unlabeled data to learn the manifold structure helps place decision boundaries correctly.

Self-training process: (1) Train on labeled data. (2) Predict labels for unlabeled data. (3) Add high-confidence predictions to the training set. (4) Retrain. (5) Repeat. Risk: confident errors propagate (confirmation bias). Mitigated by strict confidence thresholds.

FixMatch: The state-of-the-art simple baseline. For each unlabeled image: (1) Apply weak augmentation (horizontal flip, crop). (2) If prediction confidence > 0.95, use the prediction as a pseudo-label. (3) Apply strong augmentation (RandAugment). (4) Train the student to predict the pseudo-label on the strongly augmented view. This enforces consistency across augmentation strengths while only training on confident pseudo-labels.

When does semi-supervised help most? When labeled data is very scarce (<1000 examples) and unlabeled data shares the same distribution as labeled data. When distributions differ (domain shift between labeled and unlabeled), semi-supervised can hurt — a form of negative transfer.
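The five-step self-training process above can be sketched as a short loop. This is an illustrative toy, assuming a synthetic dataset and a logistic-regression base model; the 30-label split, round count, and 0.95 threshold are example choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy setup: 1000 points, only the first 30 are labeled.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_lab, y_lab = X[:30], y[:30]
X_unlab = X[30:]

threshold = 0.95
for _ in range(5):
    # (1)/(4) Train (or retrain) on the current labeled pool.
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    # (2) Predict labels for the unlabeled data.
    probs = clf.predict_proba(X_unlab)
    confident = probs.max(axis=1) >= threshold
    if not confident.any():
        break
    # (3) Add only high-confidence predictions as pseudo-labels.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]  # (5) repeat on what remains

print("final labeled-pool size:", len(X_lab))
```

The strict threshold is the only guard against confirmation bias here; lowering it grows the pseudo-labeled pool faster but admits noisier labels.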
Applying
FixMatch implementation:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fixmatch_loss(model, labeled_x, labels, unlabeled_x_weak, unlabeled_x_strong,
                  threshold=0.95, lambda_u=1.0):
    # Supervised loss on labeled data
    logits_labeled = model(labeled_x)
    loss_supervised = F.cross_entropy(logits_labeled, labels)

    # Pseudo-label on weakly augmented unlabeled data
    with torch.no_grad():
        logits_weak = model(unlabeled_x_weak)
        probs_weak = F.softmax(logits_weak, dim=-1)
        max_probs, pseudo_labels = probs_weak.max(dim=-1)

    # Mask: only use predictions above the confidence threshold
    mask = (max_probs >= threshold).float()

    # Consistency loss: predict the pseudo-label on the strongly augmented version
    logits_strong = model(unlabeled_x_strong)
    loss_unsupervised = (F.cross_entropy(logits_strong, pseudo_labels,
                                         reduction='none') * mask).mean()

    return loss_supervised + lambda_u * loss_unsupervised
</syntaxhighlight>
- Semi-supervised method selection
- Image classification → FixMatch, FlexMatch, FreeMatch (confidence threshold scheduling)
- NLP → UDA (Unsupervised Data Augmentation), pre-train then fine-tune (BERT approach)
- Graph data → Label propagation, Graph Convolutional Networks (GCN)
- Small labeled set (<100 samples) → Mean Teacher, MixMatch
- Production setting → Self-training with pseudo-labels (simple, scalable)
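For the Mean Teacher option above, the teacher is not trained by gradient descent; its weights are an exponential moving average (EMA) of the student's. A minimal sketch of that update, using tiny linear layers as stand-ins for real networks (shapes and decay are illustrative):

```python
import torch

def ema_update(teacher, student, decay=0.99):
    # Teacher parameters track an exponential moving average of student parameters:
    # t <- decay * t + (1 - decay) * s
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Hypothetical usage: initialize teacher as a copy of the student.
student = torch.nn.Linear(4, 2)
teacher = torch.nn.Linear(4, 2)
teacher.load_state_dict(student.state_dict())
ema_update(teacher, student)
```

Because the teacher averages over many student checkpoints, its pseudo-labels are smoother and less prone to the student's momentary errors.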
Analyzing
| Method | Labeled Data Needed | Key Idea | Best Domain |
|---|---|---|---|
| Self-training | ~10-20% | Confidence filtering | Any |
| FixMatch | <1% | Consistency + threshold | Vision |
| Mean Teacher | <5% | EMA teacher labels | Vision |
| Label Propagation | ~5% | Graph diffusion | Low-dim, graph |
| BERT fine-tuning | <1% (semi) | Large pre-training | NLP |
Failure modes: Pseudo-label noise — incorrect confident predictions pollute training. Distribution mismatch — unlabeled data from different distribution hurts performance. Over-fitting on pseudo-labels — model memorizes spurious patterns in pseudo-labels. Confirmation bias — model fails to correct its own early confident errors.
Evaluating
Evaluation must match the practical setting: hold out a labeled test set; train only on the (small labeled) + (large unlabeled) split. Report accuracy as a function of labeled data fraction (1%, 5%, 10%) to show the semi-supervised benefit curve. Compare against: (1) supervised-only (small labeled set), (2) fully supervised (all labeled), and (3) self-supervised pre-training + fine-tuning as competing baselines.
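The benefit-curve protocol can be sketched end to end with scikit-learn's SelfTrainingClassifier standing in for the semi-supervised method. The synthetic dataset, split sizes, and base model are illustrative assumptions, not from the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # held-out labeled test set

results = {}
for frac in (0.01, 0.05, 0.10):
    n_lab = max(int(frac * len(X_train)), 10)
    # Baseline (1): supervised-only on the small labeled set.
    base = LogisticRegression(max_iter=1000).fit(X_train[:n_lab], y_train[:n_lab])
    sup_acc = base.score(X_test, y_test)
    # Semi-supervised: same labels, plus the rest of the train split
    # marked unlabeled (-1) for self-training.
    y_semi = np.full(len(y_train), -1)
    y_semi[:n_lab] = y_train[:n_lab]
    semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                  threshold=0.95)
    semi.fit(X_train, y_semi)
    results[frac] = (sup_acc, semi.score(X_test, y_test))

for frac, (sup, sem) in results.items():
    print(f"{frac:.0%} labeled: supervised={sup:.3f}  self-training={sem:.3f}")
```

The same loop extends naturally to baselines (2) and (3): fit on the full labeled train split for the fully supervised ceiling, and swap in a pre-trained encoder plus fine-tuning for the self-supervised comparison.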
Creating
Designing a semi-supervised pipeline: (1) Start with self-training — simple, effective, easy to implement. (2) Set high confidence threshold (0.95+) to avoid noisy pseudo-labels. (3) Apply curriculum: increase unlabeled data usage as model improves (FlexMatch adaptive threshold). (4) For vision: use FixMatch with RandAugment strong augmentation. (5) For NLP: leverage domain-adaptive pre-training on unlabeled data, then fine-tune on labels. (6) Monitor pseudo-label quality: compute accuracy of pseudo-labels on held-out labeled data as a proxy for noise level.
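Step (6) above can be sketched as a small monitoring helper: apply the same confidence threshold on held-out labeled data and report both the accuracy of the would-be pseudo-labels and their coverage. The helper name and toy setup are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def pseudo_label_quality(model, X_holdout, y_holdout, threshold=0.95):
    # On held-out labeled data, keep only predictions above the confidence
    # threshold, then report their accuracy (a proxy for pseudo-label noise)
    # and coverage (the fraction of points that clear the threshold).
    probs = model.predict_proba(X_holdout)
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    mask = conf >= threshold
    coverage = mask.mean()
    accuracy = (preds[mask] == y_holdout[mask]).mean() if mask.any() else float('nan')
    return accuracy, coverage

# Hypothetical usage on toy data:
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
acc, cov = pseudo_label_quality(clf, X[400:], y[400:])
print(f"pseudo-label accuracy={acc:.2f} at coverage={cov:.2f}")
```

Tracking both numbers matters: raising the threshold usually improves pseudo-label accuracy but shrinks coverage, so the curriculum in step (3) is a trade-off along this curve.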