Semi Supervised
[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Semi-Supervised Learning]]

Revision as of 14:35, 23 April 2026

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Semi-supervised learning sits between supervised learning (which requires labels for all training data) and unsupervised learning (which uses no labels). It leverages a small amount of labeled data alongside a large amount of unlabeled data to train better models than either approach alone. Since labeled data is expensive and time-consuming to acquire while unlabeled data is often abundantly available, semi-supervised learning is highly practical. Modern variants include pseudo-labeling, consistency regularization, and graph-based methods.

Remembering

  • Semi-supervised learning — Learning using a small labeled dataset and a large unlabeled dataset simultaneously.
  • Pseudo-labeling — Using a model's predictions on unlabeled data as provisional labels, then retraining on those labels.
  • Consistency regularization — Enforcing that model predictions remain consistent under perturbations of unlabeled inputs.
  • Mean Teacher — A semi-supervised method where a student model is trained, and the teacher model is an exponential moving average of student weights; teacher provides pseudo-labels.
  • FixMatch — A state-of-the-art semi-supervised image classification method using confidence thresholding and weak/strong augmentation consistency.
  • MixMatch — A holistic semi-supervised approach combining pseudo-labeling, consistency regularization, and MixUp data augmentation.
  • Self-training — Train on labeled data, predict labels for unlabeled data, retrain on the combination; repeat iteratively.
  • Co-training — Train two models on different feature views; each provides pseudo-labels for the other.
  • Graph-based methods — Propagate labels through a graph where edges represent similarity between examples (label propagation).
  • Label propagation — Semi-supervised algorithm that spreads labels from labeled to unlabeled examples through a similarity graph.
  • Manifold assumption — The assumption that data lies on a low-dimensional manifold; points on the same manifold should have the same label.
  • Smoothness assumption — If two points are close in input space, they should have similar labels.
  • Cluster assumption — Decision boundaries should lie in low-density regions between clusters.
  • Confidence threshold — In pseudo-labeling, only use predictions where model confidence exceeds a threshold; avoids noisy pseudo-labels.
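The exponential moving average behind Mean Teacher can be sketched in a few lines. This is a hedged illustration over plain Python floats, not a reference implementation; in practice the update runs over every parameter tensor of the teacher model after each optimizer step:

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Mean Teacher weight update: teacher <- decay * teacher + (1 - decay) * student.

    Sketch over plain floats; a real implementation applies this
    element-wise to each parameter tensor of the teacher network.
    """
    return [decay * t + (1 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```

With decay close to 1, the teacher changes slowly, which smooths out noisy student updates and yields more stable pseudo-labels.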

Understanding

Semi-supervised learning works by exploiting the structure of the unlabeled data distribution to constrain the label function. The key assumptions:

Smoothness: Nearby points → similar labels. If two images of dogs are close in feature space, they should both be labeled "dog."

Cluster: Classes form clusters. The decision boundary should pass through low-density regions between clusters, not through high-density regions.

Manifold: Data lies on lower-dimensional manifolds. Using unlabeled data to learn the manifold structure helps place decision boundaries correctly.

Self-training process:

  1. Train on labeled data.
  2. Predict labels for unlabeled data.
  3. Add high-confidence predictions to training set.
  4. Retrain.
  5. Repeat.

Risk: confident errors propagate (confirmation bias). Mitigated by strict confidence thresholds.
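The five steps above can be sketched end to end. A minimal illustration, not a reference implementation: the "model" here is a nearest-centroid classifier whose softmax over negative distances stands in for a real model's predicted class probabilities.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=5):
    """Minimal self-training loop with a nearest-centroid 'model'.

    Confidence is a softmax over negative distances to class centroids,
    a stand-in for a real model's predicted probability.
    """
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # Steps 1-2: "train" (compute centroids), predict on the unlabeled pool
        classes = np.unique(y)
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(pool[:, None, :] - centroids[None], axis=-1)
        probs = np.exp(-dists) / np.exp(-dists).sum(axis=1, keepdims=True)
        conf, pred = probs.max(axis=1), classes[probs.argmax(axis=1)]
        # Step 3: keep only high-confidence pseudo-labels
        keep = conf >= threshold
        if not keep.any():
            break
        # Steps 4-5: add them to the training set and repeat
        X = np.concatenate([X, pool[keep]])
        y = np.concatenate([y, pred[keep]])
        pool = pool[~keep]
    return X, y
```

On two well-separated clusters, the loop absorbs confidently labeled points into the training set and stops once nothing clears the threshold.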

FixMatch: The state-of-the-art simple baseline. For each unlabeled image:

  1. Apply weak augmentation (horizontal flip, crop).
  2. If prediction confidence > 0.95, use as pseudo-label.
  3. Apply strong augmentation (RandAugment).
  4. Train the student to predict the pseudo-label on the strongly augmented view.

This enforces consistency across augmentation strengths while only training on confident pseudo-labels.

When does semi-supervised help most? When labeled data is very scarce (<1000 examples) and unlabeled data shares the same distribution as labeled data. When distributions differ (domain shift between labeled and unlabeled), semi-supervised can hurt — a form of negative transfer.

Applying

FixMatch implementation: <syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fixmatch_loss(model, labeled_x, labels, unlabeled_x_weak, unlabeled_x_strong,
                  threshold=0.95, lambda_u=1.0):
    # Supervised loss on labeled data
    logits_labeled = model(labeled_x)
    loss_supervised = F.cross_entropy(logits_labeled, labels)

    # Pseudo-label on weakly augmented unlabeled data
    with torch.no_grad():
        logits_weak = model(unlabeled_x_weak)
        probs_weak = F.softmax(logits_weak, dim=-1)
        max_probs, pseudo_labels = probs_weak.max(dim=-1)
        # Mask: only use predictions above the confidence threshold
        mask = (max_probs >= threshold).float()

    # Consistency loss: predict the pseudo-label on the strongly augmented view
    logits_strong = model(unlabeled_x_strong)
    loss_unsupervised = (F.cross_entropy(logits_strong, pseudo_labels,
                                         reduction='none') * mask).mean()
    return loss_supervised + lambda_u * loss_unsupervised
</syntaxhighlight>

Semi-supervised method selection:

  • Image classification → FixMatch, FlexMatch, FreeMatch (confidence threshold scheduling)
  • NLP → UDA (Unsupervised Data Augmentation), pre-train then fine-tune (BERT approach)
  • Graph data → label propagation, Graph Convolutional Networks (GCN)
  • Small labeled set (<100 samples) → Mean Teacher, MixMatch
  • Production setting → self-training with pseudo-labels (simple, scalable)

Analyzing

Semi-Supervised Methods Comparison

  Method            | Labeled Data Needed | Key Idea                | Best Domain
  ------------------|---------------------|-------------------------|----------------
  Self-training     | ~10-20%             | Confidence filtering    | Any
  FixMatch          | <1%                 | Consistency + threshold | Vision
  Mean Teacher      | <5%                 | EMA teacher labels      | Vision
  Label propagation | ~5%                 | Graph diffusion         | Low-dim, graph
  BERT fine-tuning  | <1% (semi)          | Large pre-training      | NLP

Failure modes:

  • Pseudo-label noise — incorrect confident predictions pollute training.
  • Distribution mismatch — unlabeled data from a different distribution hurts performance.
  • Over-fitting on pseudo-labels — the model memorizes spurious patterns in pseudo-labels.
  • Confirmation bias — the model fails to correct its own early confident errors.

Evaluating

Evaluation must match the practical setting: hold out a labeled test set; train only on the (small labeled) + (large unlabeled) split. Report accuracy as a function of labeled data fraction (1%, 5%, 10%) to show the semi-supervised benefit curve. Compare against:

  1. supervised-only (small labeled set),
  2. fully supervised (all labeled), and
  3. self-supervised pre-training + fine-tuning as competing baselines.
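The protocol above can be written as a small harness. A hedged sketch: `fit_eval` is a placeholder for whatever training routine is being compared (a supervised-only baseline simply ignores the unlabeled split), and the fractions follow the 1%/5%/10% convention mentioned above.

```python
import numpy as np

def benefit_curve(X_train, y_train, X_test, y_test, fit_eval,
                  fractions=(0.01, 0.05, 0.10), seed=0):
    """Accuracy vs. labeled fraction, against a fixed held-out test set.

    fit_eval(X_lab, y_lab, X_unlab, X_test, y_test) is any training
    routine returning test accuracy; supervised-only baselines simply
    ignore X_unlab.
    """
    rng = np.random.default_rng(seed)
    curve = {}
    for frac in fractions:
        n_lab = max(1, int(frac * len(X_train)))
        # Split the training set into labeled and unlabeled parts
        idx = rng.permutation(len(X_train))
        lab, unlab = idx[:n_lab], idx[n_lab:]
        curve[frac] = fit_eval(X_train[lab], y_train[lab],
                               X_train[unlab], X_test, y_test)
    return curve
```

Running the same harness with the supervised-only, fully supervised, and self-supervised baselines gives directly comparable curves.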

Creating

Designing a semi-supervised pipeline:

  1. Start with self-training — simple, effective, easy to implement.
  2. Set high confidence threshold (0.95+) to avoid noisy pseudo-labels.
  3. Apply curriculum: increase unlabeled data usage as model improves (FlexMatch adaptive threshold).
  4. For vision: use FixMatch with RandAugment strong augmentation.
  5. For NLP: leverage domain-adaptive pre-training on unlabeled data, then fine-tune on labels.
  6. Monitor pseudo-label quality: compute accuracy of pseudo-labels on held-out labeled data as a proxy for noise level.
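Step 6 of the pipeline above can be implemented directly. A minimal sketch, assuming the model's class probabilities on a held-out labeled set are already available as an array:

```python
import numpy as np

def pseudo_label_quality(probs, true_labels, threshold=0.95):
    """Proxy for pseudo-label noise: on a held-out *labeled* set, measure
    how many predictions clear the confidence threshold (coverage) and
    how often those confident predictions are correct (accuracy).

    probs is an (n, n_classes) array of predicted class probabilities.
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    keep = conf >= threshold
    coverage = float(keep.mean())
    accuracy = (float((pred[keep] == true_labels[keep]).mean())
                if keep.any() else float("nan"))
    return {"coverage": coverage, "pseudo_label_accuracy": accuracy}
```

Low coverage means the threshold is starving the unlabeled loss; low accuracy means pseudo-label noise is polluting training, and the threshold should be raised.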