
== <span style="color: #FFFFFF;">Understanding</span> ==
The key insight of self-supervised learning is that '''data contains its own supervision signal''' if you know how to extract it. Human language is full of structure: words predict their neighbors, sentences follow each other coherently. Images have spatial structure: patches are consistent with their surroundings. Audio has temporal structure: frames predict nearby frames.

By designing tasks that exploit these structures, we can train models on billions of unlabeled examples — far more than could ever be labeled by humans. The result is representations that capture rich, generalizable features of the data.

'''Contrastive learning''' is a leading paradigm for vision SSL. The idea: create two augmented views of the same image (a positive pair) and train the model to map them to similar representations, while pushing apart the representations of different images (negative pairs). The negatives are what rule out the trivial solution of mapping every input to the same point (called collapse): a collapsed model cannot tell different images apart, so it incurs a high loss.
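
The contrastive objective described above can be sketched with an InfoNCE-style loss. This is a minimal NumPy illustration, not the loss of any particular method: the function name <code>info_nce_loss</code>, the <code>temperature</code> value, and the use of only the other rows of the second view as negatives are all assumptions made for the example.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss (illustrative sketch).

    Row i of z1 and row i of z2 are embeddings of two views of the same
    image (a positive pair); every other row of z2 acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = z1 @ z2.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the "correct class" being the matching row.
    return float(-np.mean(np.diag(log_prob)))

# Distinct, well-aligned embeddings versus fully collapsed embeddings.
aligned = info_nce_loss(np.eye(4), np.eye(4))
collapsed = info_nce_loss(np.ones((4, 3)), np.ones((4, 3)))
print(aligned, collapsed)
```

With collapsed embeddings every similarity is identical, so the loss equals log(N) for a batch of N images (here log 4), while well-separated embeddings drive it toward zero. That gap is the penalty that discourages collapse.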

'''Masked modeling''' is the dominant paradigm for NLP and increasingly for vision. BERT selects 15% of the input tokens for masking and trains the model to predict the originals from the surrounding context. This forces the model to learn context and semantics, since a masked word cannot be predicted reliably without understanding the rest of the sentence. MAE (masked autoencoder) extends the idea to images, masking 75% of patches and reconstructing them.
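
The input side of masked modeling can be sketched as follows. This is a simplified illustration under stated assumptions: <code>MASK_ID</code>, <code>IGNORE</code>, and <code>mask_tokens</code> are hypothetical names, and every selected token is replaced by the mask id, whereas the full BERT recipe also swaps some selected tokens for random ones or leaves them unchanged.

```python
import numpy as np

MASK_ID = 0    # hypothetical id reserved for the [MASK] token
IGNORE = -100  # hypothetical label for positions excluded from the loss

def mask_tokens(token_ids, mask_ratio=0.15, seed=0):
    """Return (inputs, labels) for a masked-prediction task on a 1-D sequence."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)

    # Choose which positions to hide (at least one).
    n_mask = max(1, round(mask_ratio * token_ids.size))
    positions = rng.choice(token_ids.size, size=n_mask, replace=False)

    # Labels keep the original token only at masked positions.
    labels = np.full_like(token_ids, IGNORE)
    labels[positions] = token_ids[positions]

    # Inputs have the chosen positions overwritten with the mask id.
    inputs = token_ids.copy()
    inputs[positions] = MASK_ID
    return inputs, labels

ids = np.arange(1, 21)  # a toy sequence of 20 token ids
inputs, labels = mask_tokens(ids)
```

The model would receive <code>inputs</code> and be scored only where <code>labels</code> differs from <code>IGNORE</code>. An MAE-style image variant would apply the same selection to patch indices with <code>mask_ratio=0.75</code>.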

'''Why SSL beats supervised pretraining in many settings''': Supervised pretraining is limited to the labels available (for example, the 1,000 classes of ImageNet-1k), so the model has little incentive to encode anything those labels do not distinguish. SSL trains on the full diversity of the data without label constraints, which tends to produce more general representations that transfer better to diverse downstream tasks.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">