Semi Supervised
Latest revision as of 01:57, 25 April 2026
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Semi-supervised learning sits between supervised learning (which requires labels for all training data) and unsupervised learning (which uses no labels). It leverages a small amount of labeled data alongside a large amount of unlabeled data to train better models than either approach alone. Since labeled data is expensive and time-consuming to acquire while unlabeled data is often abundantly available, semi-supervised learning is highly practical. Modern variants include pseudo-labeling, consistency regularization, and graph-based methods.
Remembering
- Semi-supervised learning — Learning using a small labeled dataset and a large unlabeled dataset simultaneously.
- Pseudo-labeling — Using a model's predictions on unlabeled data as provisional labels, then retraining on those labels.
- Consistency regularization — Enforcing that model predictions remain consistent under perturbations of unlabeled inputs.
- Mean Teacher — A semi-supervised method in which a student model is trained by gradient descent while a teacher model, maintained as an exponential moving average of the student's weights, supplies the prediction targets for unlabeled data.
- FixMatch — A state-of-the-art semi-supervised image classification method using confidence thresholding and weak/strong augmentation consistency.
- MixMatch — A holistic semi-supervised approach combining pseudo-labeling, consistency regularization, and MixUp data augmentation.
- Self-training — Train on labeled data, predict labels for unlabeled data, retrain on the combination; repeat iteratively.
- Co-training — Train two models on different feature views; each provides pseudo-labels for the other.
- Graph-based methods — Propagate labels through a graph where edges represent similarity between examples (label propagation).
- Label propagation — Semi-supervised algorithm that spreads labels from labeled to unlabeled examples through a similarity graph.
- Manifold assumption — The assumption that data lies on a low-dimensional manifold; points on the same manifold should have the same label.
- Smoothness assumption — If two points are close in input space, they should have similar labels.
- Cluster assumption — Decision boundaries should lie in low-density regions between clusters.
- Confidence threshold — In pseudo-labeling, only use predictions where model confidence exceeds a threshold; avoids noisy pseudo-labels.
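Label propagation, mentioned above, fits in a few lines. Below is a minimal NumPy sketch on a hypothetical five-point similarity graph; the graph weights, iteration count, and toy labels are illustrative, not from any standard benchmark:

```python
import numpy as np

def label_propagation(W, y, n_iter=50):
    """Spread labels through a similarity graph.

    W: (n, n) symmetric similarity matrix.
    y: (n, k) one-hot rows for labeled points, zero rows for unlabeled.
    Labeled rows are clamped back to their known labels each iteration.
    """
    labeled = y.sum(axis=1) > 0           # which rows carry a known label
    P = W / W.sum(axis=1, keepdims=True)  # row-normalize into transition probs
    F = y.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                         # diffuse label mass along edges
        F[labeled] = y[labeled]           # clamp the labeled points
    return F.argmax(axis=1)

# Toy graph: points 0-1-2 strongly connected, 3-4 strongly connected,
# with only a weak (0.1) edge between the two groups. Points 0 and 4 labeled.
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 0.1, 0],
              [0, 0, 0.1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.zeros((5, 2))
y[0, 0] = 1  # point 0 is class 0
y[4, 1] = 1  # point 4 is class 1
print(label_propagation(W, y))  # [0 0 0 1 1]
```

The weak cross-group edge is what keeps the inferred boundary in the low-density region, matching the cluster assumption.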
Understanding
Semi-supervised learning works by exploiting the structure of the unlabeled data distribution to constrain the label function. The key assumptions:
Smoothness: Nearby points → similar labels. If two images of dogs are close in feature space, they should both be labeled "dog."
Cluster: Classes form clusters. The decision boundary should pass through low-density regions between clusters, not through high-density regions.
Manifold: Data lies on lower-dimensional manifolds. Using unlabeled data to learn the manifold structure helps place decision boundaries correctly.
Self-training process:
- Train on labeled data.
- Predict labels for unlabeled data.
- Add high-confidence predictions to training set.
- Retrain.
- Repeat.
Risk: confident errors propagate (confirmation bias). This is mitigated by strict confidence thresholds.
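The five steps above can be sketched end to end with a deliberately simple stand-in model. The nearest-centroid classifier, the threshold value, and the toy data below are all illustrative choices, not part of any standard algorithm:

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Nearest-centroid "model": one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    # Softmax over negative squared distances: a crude confidence score.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def self_train(X_lab, y_lab, X_unlab, n_classes=2, threshold=0.9, rounds=5):
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        centroids = fit_centroids(X, y, n_classes)   # 1. train on labeled data
        if len(X_unlab) == 0:
            break
        probs = predict_proba(centroids, X_unlab)    # 2. predict on unlabeled
        conf, pred = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= threshold  # 3. strict threshold limits confirmation bias
        if not keep.any():
            break
        X = np.vstack([X, X_unlab[keep]])            # add confident predictions
        y = np.concatenate([y, pred[keep]])
        X_unlab = X_unlab[~keep]                     # 4.-5. retrain next round
    return fit_centroids(X, y, n_classes)

# Two labeled points, plus unlabeled points clustered around each of them.
X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[0.3, 0.2], [0.1, 0.4], [3.8, 4.1], [4.2, 3.7]])
centroids = self_train(X_lab, y_lab, X_unlab)
print(predict_proba(centroids, np.array([[0.2, 0.1]])).argmax())  # 0
```

After one round the confident unlabeled points are absorbed, and the final centroids sit at the means of the two clusters rather than at the two original labeled points.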
FixMatch: The state-of-the-art simple baseline. For each unlabeled image:
- Apply weak augmentation (horizontal flip, crop).
- If prediction confidence > 0.95, use as pseudo-label.
- Apply strong augmentation (RandAugment).
- Train the student to predict the pseudo-label on the strongly augmented view.
This enforces consistency across augmentation strengths while only training on confident pseudo-labels.
When does semi-supervised help most? When labeled data is very scarce (<1000 examples) and unlabeled data shares the same distribution as labeled data. When distributions differ (domain shift between labeled and unlabeled), semi-supervised can hurt — a form of negative transfer.
Applying
FixMatch implementation:
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fixmatch_loss(model, labeled_x, labels, unlabeled_x_weak, unlabeled_x_strong,
                  threshold=0.95, lambda_u=1.0):
    # Supervised loss on labeled data
    logits_labeled = model(labeled_x)
    loss_supervised = F.cross_entropy(logits_labeled, labels)

    # Pseudo-label on weakly augmented unlabeled data
    with torch.no_grad():
        logits_weak = model(unlabeled_x_weak)
        probs_weak = F.softmax(logits_weak, dim=-1)
        max_probs, pseudo_labels = probs_weak.max(dim=-1)

    # Mask: only use predictions above confidence threshold
    mask = (max_probs >= threshold).float()

    # Consistency loss: predict pseudo-label on strongly augmented version
    logits_strong = model(unlabeled_x_strong)
    loss_unsupervised = (F.cross_entropy(logits_strong, pseudo_labels,
                                         reduction='none') * mask).mean()

    return loss_supervised + lambda_u * loss_unsupervised
</syntaxhighlight>
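A minimal end-to-end usage sketch of the FixMatch loss follows. The tiny linear model, feature dimension, and random tensors are placeholders (in practice the two unlabeled views would be weak and strong augmentations of the same images); the loss function is repeated so the snippet runs standalone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fixmatch_loss(model, labeled_x, labels, unlabeled_x_weak, unlabeled_x_strong,
                  threshold=0.95, lambda_u=1.0):
    # Same loss as above, inlined so this sketch is self-contained.
    loss_supervised = F.cross_entropy(model(labeled_x), labels)
    with torch.no_grad():
        probs_weak = F.softmax(model(unlabeled_x_weak), dim=-1)
        max_probs, pseudo_labels = probs_weak.max(dim=-1)
    mask = (max_probs >= threshold).float()
    logits_strong = model(unlabeled_x_strong)
    loss_unsupervised = (F.cross_entropy(logits_strong, pseudo_labels,
                                         reduction='none') * mask).mean()
    return loss_supervised + lambda_u * loss_unsupervised

# Placeholder model and batches: 8-dim features, 3 classes.
torch.manual_seed(0)
model = nn.Linear(8, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

labeled_x, labels = torch.randn(4, 8), torch.randint(0, 3, (4,))
weak = torch.randn(16, 8)
strong = weak + 0.1 * torch.randn(16, 8)  # stand-in for strong augmentation

loss = fixmatch_loss(model, labeled_x, labels, weak, strong)
loss.backward()
opt.step()
```

With an untrained model almost no unlabeled prediction clears the 0.95 threshold, so early in training the unsupervised term is near zero and the supervised term dominates; the mask opens up as the model grows confident.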
Semi-supervised method selection:
- Image classification → FixMatch, FlexMatch, FreeMatch (confidence threshold scheduling)
- NLP → UDA (Unsupervised Data Augmentation), pre-train then fine-tune (BERT approach)
- Graph data → Label propagation, Graph Convolutional Networks (GCN)
- Small labeled set (<100 samples) → Mean Teacher, MixMatch
- Production setting → Self-training with pseudo-labels (simple, scalable)
Analyzing
| Method | Labeled Data Needed | Key Idea | Best Domain |
|---|---|---|---|
| Self-training | ~10-20% | Confidence filtering | Any |
| FixMatch | <1% | Consistency + threshold | Vision |
| Mean Teacher | <5% | EMA teacher labels | Vision |
| Label Propagation | ~5% | Graph diffusion | Low-dim, graph |
| BERT fine-tuning | <1% (semi) | Large pre-training | NLP |
Failure modes: Pseudo-label noise — incorrect confident predictions pollute training. Distribution mismatch — unlabeled data from different distribution hurts performance. Over-fitting on pseudo-labels — model memorizes spurious patterns in pseudo-labels. Confirmation bias — model fails to correct its own early confident errors.
Evaluating
Evaluation must match the practical setting: hold out a labeled test set; train only on the (small labeled) + (large unlabeled) split. Report accuracy as a function of labeled data fraction (1%, 5%, 10%) to show the semi-supervised benefit curve. Compare against:
- supervised-only (small labeled set),
- fully supervised (all labeled), and
- self-supervised pre-training + fine-tuning as competing baselines.
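The protocol above can be wired up as a small harness. Everything below is a schematic stand-in (a trivial majority-class "model" and random data) meant to show the split bookkeeping for the benefit curve, not a real benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_baseline(y_train):
    # Trivial stand-in model: always predict the most common training label.
    vals, counts = np.unique(y_train, return_counts=True)
    return vals[counts.argmax()]

def benefit_curve(X, y, fractions=(0.01, 0.05, 0.10), test_size=0.2):
    n = len(y)
    idx = rng.permutation(n)
    test, pool = idx[:int(test_size * n)], idx[int(test_size * n):]
    results = {}
    for frac in fractions:
        k = max(1, int(frac * len(pool)))
        lab = pool[:k]                    # small labeled split
        # (the remaining pool indices would serve as the unlabeled split
        #  for a real semi-supervised method)
        pred = majority_baseline(y[lab])  # supervised-only baseline
        results[frac] = float((y[test] == pred).mean())
    return results

y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 4))
print(benefit_curve(X, y))
```

The same harness would be run three times, swapping in the supervised-only, semi-supervised, and fully supervised models, so all points on the curve share one held-out test set.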
Creating
Designing a semi-supervised pipeline:
- Start with self-training — simple, effective, easy to implement.
- Set high confidence threshold (0.95+) to avoid noisy pseudo-labels.
- Apply curriculum: increase unlabeled data usage as model improves (FlexMatch adaptive threshold).
- For vision: use FixMatch with RandAugment strong augmentation.
- For NLP: leverage domain-adaptive pre-training on unlabeled data, then fine-tune on labels.
- Monitor pseudo-label quality: compute accuracy of pseudo-labels on held-out labeled data as a proxy for noise level.
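The monitoring step can be implemented directly: score the model's confident predictions against a held-out labeled set, tracking both accuracy and coverage. The names, shapes, and toy numbers below are assumptions for illustration:

```python
import numpy as np

def pseudo_label_quality(probs, y_true, threshold=0.95):
    """Estimate pseudo-label noise on a held-out labeled set.

    probs: (n, k) predicted class probabilities on held-out examples.
    y_true: (n,) true labels for the same examples.
    Returns (accuracy of confident predictions, fraction of examples kept).
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    keep = conf >= threshold
    coverage = keep.mean()
    accuracy = (pred[keep] == y_true[keep]).mean() if keep.any() else float("nan")
    return accuracy, coverage

# Toy held-out batch: four examples, two classes, threshold lowered to 0.9.
probs = np.array([[0.99, 0.01],
                  [0.60, 0.40],
                  [0.02, 0.98],
                  [0.97, 0.03]])
y_true = np.array([0, 1, 1, 1])
acc, cov = pseudo_label_quality(probs, y_true, threshold=0.9)
print(acc, cov)  # 2 of 3 confident predictions correct; 3 of 4 kept
```

Tracking both numbers matters: raising the threshold usually improves accuracy but shrinks coverage, and a collapse in either one is an early warning of the failure modes listed under Analyzing.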