Self-Supervised Learning

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Self-supervised learning (SSL) is a machine learning paradigm in which models learn representations from unlabeled data by solving pretext tasks — automatically generated training signals derived from the data itself. Instead of requiring expensive human annotations, self-supervised learning extracts supervision from the structure inherent in data: predicting masked words, predicting image patches, or ensuring that different views of the same image produce similar representations. SSL is the engine behind modern large language models, BERT, CLIP, and vision foundation models — making it one of the most impactful ideas in contemporary AI.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Self-supervised learning''' — A form of unsupervised learning where supervision signals are generated automatically from the data, without human annotation.
* '''Pretext task''' — An artificially constructed task whose labels are derived from the data itself, designed to force the model to learn useful representations.
* '''Masked language modeling (MLM)''' — A pretext task where random tokens in a sequence are masked and the model must predict them. Used to train BERT.
* '''Next sentence prediction''' — A pretext task where the model predicts whether two sentences are consecutive or randomly paired.
* '''Contrastive learning''' — An SSL approach where the model learns by contrasting similar (positive) and dissimilar (negative) pairs of examples.
* '''Positive pair''' — Two views or augmentations of the same data point that should be represented similarly.
* '''Negative pair''' — Two different data points whose representations should be pushed apart.
* '''Augmentation''' — Transformations applied to data (cropping, color jitter, masking) to create different views of the same underlying content.
* '''Representation''' — A dense vector capturing the semantically meaningful content of a data point, learned without supervision.
* '''Downstream task''' — The actual task of interest (classification, detection, etc.) for which the self-supervised representation is subsequently used.
* '''Linear probing''' — Evaluating SSL representations by training only a linear classifier on frozen features; measures representation quality.
* '''SimCLR''' — A simple contrastive learning framework for visual representations (Google, 2020).
* '''BYOL (Bootstrap Your Own Latent)''' — A non-contrastive SSL method that uses no negative samples; an online network is trained to predict the output of a momentum (moving-average) target encoder.
* '''MAE (Masked Autoencoder)''' — An SSL approach for vision that masks large portions of image patches and reconstructs them; analogous to BERT for images.
* '''DINO''' — A self-supervised vision transformer method using self-distillation without labels.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The key insight of self-supervised learning is that '''data contains its own supervision signal''' if you know how to extract it. Human language is full of structure: words predict their neighbors, sentences follow each other coherently. Images have spatial structure: patches are consistent with their surroundings. Audio has temporal structure: frames predict nearby frames.

By designing tasks that exploit these structures, we can train models on billions of unlabeled examples — far more than could ever be labeled by humans. The result is representations that capture rich, generalizable features of the data.

'''Contrastive learning''' is the dominant paradigm for vision SSL. The idea: create two augmented views of the same image (positive pair) and train the model to map them to similar representations, while pushing representations of different images (negative pairs) apart. The model cannot cheat by mapping everything to the same point (called collapse) because it must distinguish different images.

'''Masked modeling''' is the dominant paradigm for NLP and increasingly vision. BERT selects 15% of the tokens in each sequence as prediction targets, replaces most of them with a [MASK] token, and trains the model to recover the originals. This forces the model to use context and semantics: a masked word cannot be predicted without understanding the rest of the sentence. MAE extends this to images, masking 75% of patches and reconstructing them.
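
The masking step can be sketched in a few lines of PyTorch. This is a minimal illustration, not BERT's actual preprocessing code: the function name <code>mask_tokens</code> and the token ids used in the demo are hypothetical, and special tokens such as padding are not excluded from masking here. It follows the 80/10/10 rule from the BERT paper (of the selected positions, 80% become [MASK], 10% become a random token, 10% stay unchanged).

<syntaxhighlight lang="python">
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Return (corrupted_ids, labels) for masked language modeling.

    labels is -100 at positions that are not prediction targets; -100 is the
    default ignore_index of torch.nn.functional.cross_entropy.
    """
    device = input_ids.device
    labels = input_ids.clone()
    # Choose which positions the model must predict
    selected = torch.rand(input_ids.shape, device=device) < mask_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape, device=device)
    # 80% of the selected positions are replaced by the mask token
    corrupted[selected & (roll < 0.8)] = mask_token_id
    # 10% of the selected positions are replaced by a random token
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    random_ids = torch.randint(0, vocab_size, input_ids.shape, device=device)
    corrupted[use_random] = random_ids[use_random]
    # The remaining 10% keep their original token but are still predicted
    return corrupted, labels

# Demo with made-up ids: vocabulary of 100 tokens, id 4 plays the role of [MASK]
torch.manual_seed(0)
input_ids = torch.randint(5, 100, (2, 12))
corrupted, labels = mask_tokens(input_ids, mask_token_id=4, vocab_size=100)
print(corrupted)
print(labels)
</syntaxhighlight>

The model is then trained with cross-entropy between its per-position predictions on <code>corrupted</code> and <code>labels</code>, so only the selected positions contribute to the loss.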

'''Why SSL beats supervised pretraining in many settings''': Supervised pretraining is limited to the labels available (for example, the 1000 ImageNet classes). SSL trains on the full diversity of the data without label constraints, which tends to produce more general representations that transfer better to diverse downstream tasks.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Contrastive self-supervised pre-training with SimCLR:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet50

class SimCLR(nn.Module):
    def __init__(self, projection_dim=128, temperature=0.5):
        super().__init__()
        self.temperature = temperature
        # Backbone: ResNet-50 without final FC
        backbone = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Projection head: 2-layer MLP
        self.projector = nn.Sequential(
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, projection_dim)
        )

    def forward(self, x1, x2):
        # flatten(1) keeps the batch dimension even for a batch of one image
        # (a bare squeeze() would drop it)
        h1 = torch.flatten(self.encoder(x1), 1)
        h2 = torch.flatten(self.encoder(x2), 1)
        z1 = F.normalize(self.projector(h1), dim=1)
        z2 = F.normalize(self.projector(h2), dim=1)
        return self.nt_xent_loss(z1, z2)

    def nt_xent_loss(self, z1, z2):
        """NT-Xent (Normalized Temperature-scaled Cross Entropy) loss."""
        N = z1.size(0)
        z = torch.cat([z1, z2], dim=0)  # 2N x D
        sim = torch.mm(z, z.T) / self.temperature  # 2N x 2N
        # Mask self-similarity; the mask must live on the same device as z
        mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float('-inf'))
        # Positive pairs are at offsets [i, i+N] and [i+N, i]
        idx = torch.arange(N, device=z.device)
        labels = torch.cat([idx + N, idx])
        return F.cross_entropy(sim, labels)

# Augmentation pipeline for SSL (expects PIL images)
ssl_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    # SimCLR applies color jitter and blur stochastically rather than always
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
</syntaxhighlight>
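
The model above takes two views <code>x1</code> and <code>x2</code> of each image, so the data loader has to apply the augmentation twice per sample. One way to do this is sketched below. The class name <code>TwoViewDataset</code> and the noise-based <code>toy_augment</code> are hypothetical and exist only to keep the example self-contained; in practice the transform would be an image pipeline such as <code>ssl_transform</code>.

<syntaxhighlight lang="python">
import torch
from torch.utils.data import DataLoader, Dataset

class TwoViewDataset(Dataset):
    """Wrap an unlabeled dataset and return two independently augmented views."""

    def __init__(self, base, transform):
        self.base = base
        self.transform = transform

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x = self.base[idx]
        # The transform is random, so the two calls yield different views
        return self.transform(x), self.transform(x)

def toy_augment(x):
    # Stand-in augmentation (assumption): additive Gaussian noise on a tensor
    return x + 0.1 * torch.randn_like(x)

# Ten fake "images" stored as tensors
base = [torch.randn(3, 8, 8) for _ in range(10)]
loader = DataLoader(TwoViewDataset(base, toy_augment), batch_size=4, shuffle=True)

x1, x2 = next(iter(loader))
print(x1.shape, x2.shape)
</syntaxhighlight>

Each batch then feeds one training step of the form <code>loss = model(x1, x2)</code>, followed by the usual <code>backward()</code> and optimizer step.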

; SSL method selection guide
: '''NLP pretraining''' → MLM (BERT-style); causal LM (GPT-style)
: '''Vision: contrastive''' → SimCLR, MoCo v3
: '''Vision: non-contrastive''' → BYOL (no negatives)
: '''Vision: masked reconstruction''' → MAE, BEiT, SimMIM
: '''Vision: self-distillation''' → DINO, DINOv2
: '''Audio''' → wav2vec 2.0, HuBERT (masked acoustic modeling)
: '''Multimodal''' → CLIP (image-text contrastive), FLAVA
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ SSL Paradigm Comparison
! Paradigm !! Requires Negatives !! Large Batch Needed !! Best Domain
|-
| Contrastive (SimCLR) || Yes || Yes (4096+) || Vision
|-
| Non-contrastive (BYOL) || No || No || Vision
|-
| Masked modeling (BERT) || No || No || NLP, Vision
|-
| Distillation (DINO) || No || No || Vision
|-
| Generative (GPT) || No || No || NLP
|}

'''Failure modes''': Representation collapse (all embeddings map to same point) — mitigated by batch normalization, stop-gradient tricks, or negative samples. Augmentation sensitivity — the SSL signal depends entirely on the augmentation strategy; wrong augmentations teach wrong invariances (e.g., color jitter teaches color invariance, bad for tasks requiring color discrimination).
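
The stop-gradient and momentum-encoder ideas can be illustrated with a BYOL-style training step. This is a rough sketch under simplifying assumptions, not the reference BYOL implementation: the tiny MLPs stand in for a real encoder and projector, the inputs are random vectors with noise as the "augmentation", and all names (<code>make_mlp</code>, <code>byol_loss</code>, <code>ema_update</code>) and hyperparameters are illustrative. The last lines print a simple collapse diagnostic: if the per-dimension standard deviation of the normalized embeddings approaches zero, all inputs are being mapped to nearly the same point.

<syntaxhighlight lang="python">
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

torch.manual_seed(0)
online = make_mlp(32, 64, 16)     # toy stand-in for encoder + projector
predictor = make_mlp(16, 64, 16)  # predictor exists only on the online branch
target = copy.deepcopy(online)    # target network, updated by EMA only
for p in target.parameters():
    p.requires_grad_(False)

def byol_loss(x1, x2):
    p1 = F.normalize(predictor(online(x1)), dim=1)
    p2 = F.normalize(predictor(online(x2)), dim=1)
    with torch.no_grad():  # stop-gradient: no gradients flow into the target
        t1 = F.normalize(target(x1), dim=1)
        t2 = F.normalize(target(x2), dim=1)
    # Symmetrized loss; each term equals 2 - 2 * cosine similarity
    return (2 - 2 * (p1 * t2).sum(dim=1)).mean() + (2 - 2 * (p2 * t1).sum(dim=1)).mean()

@torch.no_grad()
def ema_update(momentum=0.99):
    # target <- momentum * target + (1 - momentum) * online
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1 - momentum)

opt = torch.optim.SGD(list(online.parameters()) + list(predictor.parameters()), lr=0.05)

x = torch.randn(64, 32)
x1 = x + 0.1 * torch.randn_like(x)  # two noisy "views" of the same inputs
x2 = x + 0.1 * torch.randn_like(x)

loss = byol_loss(x1, x2)
opt.zero_grad()
loss.backward()
opt.step()
ema_update()

# Collapse diagnostic: spread of the normalized embeddings across the batch
with torch.no_grad():
    emb = F.normalize(online(x), dim=1)
    print("loss:", loss.item(), "mean per-dimension std:", emb.std(dim=0).mean().item())
</syntaxhighlight>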
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
'''Linear probing accuracy''' is the standard: freeze pre-trained encoder, train a linear classifier on its features. Higher accuracy = better representation quality. Compare SSL methods on ImageNet linear probe or NLP GLUE benchmarks. '''Few-shot evaluation''': evaluate how well the SSL representation transfers with only 1–10 labeled examples per class. Expert practitioners also measure semantic alignment — do nearest neighbors in the representation space make semantic sense?
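
A linear probe can be sketched as follows. This is an illustrative toy setup rather than a standard benchmark protocol: the function name <code>linear_probe</code>, the randomly initialized stand-in encoder, the synthetic two-class data, and the optimizer settings are all assumptions made for the example.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, feat_dim, num_classes,
                 x_train, y_train, x_test, y_test, epochs=100, lr=0.1):
    """Train only a linear classifier on frozen encoder features; return test accuracy."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)  # the encoder stays frozen
    with torch.no_grad():
        f_train = encoder(x_train)  # features are computed once and reused
        f_test = encoder(x_test)

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(clf(f_train), y_train)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return (clf(f_test).argmax(dim=1) == y_test).float().mean().item()

# Toy data: two Gaussian blobs in 2-D, one per class
torch.manual_seed(0)
y = torch.randint(0, 2, (400,))
x = torch.randn(400, 2) + 3.0 * (2 * y.float().unsqueeze(1) - 1)

# Stand-in for a pre-trained SSL encoder (here just randomly initialized)
encoder = nn.Sequential(nn.Linear(2, 16), nn.ReLU())
acc = linear_probe(encoder, 16, 2, x[:300], y[:300], x[300:], y[300:])
print(f"linear-probe accuracy: {acc:.3f}")
</syntaxhighlight>

Because only the linear layer is trained, differences in accuracy between two encoders reflect how linearly separable their features are, not how well the encoders can be adapted.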
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an SSL pipeline for a new domain:
# Choose augmentations that preserve semantic content but vary surface features.
# Select a paradigm: contrastive for vision or audio, masked modeling for text or structured data.
# Pre-train on all available unlabeled data.
# Run a linear-probe evaluation at regular checkpoints.
# Fine-tune on labeled data with a lower learning rate than would be used when training from scratch.
# Expect the largest gains over supervised baselines when labeled data is scarce (roughly fewer than 10k examples).
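
The fine-tuning step can be sketched with optimizer parameter groups, which let the pre-trained encoder use a smaller learning rate than the newly initialized task head. The layer sizes, learning rates, and weight decay below are placeholder assumptions, not recommended values, and the encoder is a toy stand-in for a real pre-trained model.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # stand-in for the pre-trained SSL encoder
head = nn.Linear(64, 10)                                # new, randomly initialized task head

optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},  # small steps preserve pre-trained features
        {"params": head.parameters(), "lr": 1e-3},     # the new head can move faster
    ],
    weight_decay=0.01,
)

# One fine-tuning step on a fake labeled batch
x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))
loss = F.cross_entropy(head(encoder(x)), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("fine-tuning loss:", loss.item())
</syntaxhighlight>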

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Self-Supervised Learning]]
</div>