Self-Supervised Learning (SSL) is a family of techniques for converting an unsupervised learning problem into a supervised one by creating surrogate labels from the unlabeled dataset.
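To make the surrogate-label idea concrete, here is a minimal, hypothetical sketch (not from any specific paper or library) using rotation prediction as the pretext task: the "labels" are simply the rotation applied to each unlabeled image. The `encoder` and the random tensor standing in for a data batch are illustrative placeholders.

```python
# Minimal sketch of the "surrogate label" idea via a rotation-prediction pretext task.
# `encoder` and the random batch below are placeholders, not part of any library.
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """Create surrogate labels by rotating each image by 0/90/180/270 degrees."""
    rotations = torch.randint(0, 4, (images.size(0),))            # pseudo-labels in {0,1,2,3}
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, rotations)])
    return rotated, rotations

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(32, 4))                          # 4-way rotation head
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)                                 # stand-in for unlabeled data
rotated, pseudo_labels = make_rotation_batch(images)
loss = criterion(encoder(rotated), pseudo_labels)                  # ordinary supervised loss on surrogate labels
```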


Core Paradigms of Self-Supervised Learning

Three primary paradigms dominate the SSL landscape: Joint Embedding, Masked Image Modeling, and a hybrid approach that combines elements of both.

Joint Embedding vs. MIM:

Joint Embedding
  Pros:
    ✓ Produces highly semantic features, great for classification.
    ✓ Architecture agnostic.
    ✓ Achieves competitive results in linear probing evaluations.
  Cons:
    ✗ May require very large batch sizes (e.g., SimCLR).
    ✗ Requires careful tuning of data augmentations.
    ✗ Requires special mechanisms to handle negative samples or avoid collapse.
    ✗ Not well-suited for low-level tasks.

Masked Image Modeling (MIM)
  Pros:
    ✓ Conceptually simple, with no need for positive/negative pairs.
    ✓ Masking reduces pre-training time.
    ✓ Achieves competitive results with fine-tuning.
    ✓ Stronger fit for low-level tasks (e.g., denoising, super-resolution).
  Cons:
    ✗ Requires a Vision Transformer (ViT) backbone.
    ✗ Weaker performance on abstract, high-level tasks like classification.

1. Joint Embedding Architectures

Central idea: Enforce invariance to data augmentations. A Siamese network architecture with shared parameters processes two differently augmented “views” of the same image, and the model is trained to produce similar or identical embeddings for both.

Challenge: model collapse, where the network learns a trivial solution by mapping all inputs to the same constant vector.
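As a minimal sketch of the Siamese setup (with `encoder` and `augment` as placeholders, not a full training recipe): the two views pass through one weight-shared encoder, and the loss simply pulls their embeddings together. This invariance term alone admits exactly the collapsed solution described above; each method family listed below adds a mechanism to rule it out.

```python
import torch.nn.functional as F

def invariance_loss(encoder, augment, images):
    """Pull the embeddings of two augmented views of the same images together."""
    z1 = F.normalize(encoder(augment(images)), dim=1)   # view 1, shared weights
    z2 = F.normalize(encoder(augment(images)), dim=1)   # view 2, same encoder
    # Minimizing this alone can collapse to a constant embedding; the families
    # below (contrastive, clustering, distillation, regularization) prevent that.
    return -(z1 * z2).sum(dim=1).mean()                 # negative cosine similarity
```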

  • Contrastive
    Core idea: Pull positive pairs (views of the same image) close in the embedding space while pushing negative pairs (views of different images) apart.
    Key characteristics: Requires negative samples, which can necessitate large batch sizes. Good for multimodal data.
    Examples: SimCLR, MoCo (a loss sketch follows this list).
  • Clustering
    Core idea: Learn embeddings by grouping similar samples into clusters without using explicit negative pairs.
    Key characteristics: Jointly learns feature representations and cluster assignments.
    Examples: SwAV, DeepCluster.
  • Distillation
    Core idea: A “student” network is trained to match the output distribution of a “teacher” network on different augmented views.
    Key characteristics: Avoids collapse via an asymmetric architecture (student vs. teacher); the teacher is often updated as an Exponential Moving Average (EMA) of the student’s weights. Does not require negative samples.
    Examples: BYOL, DINO (an EMA sketch follows this list).
  • Regularization
    Core idea: Avoids collapse by imposing regularization terms on the embeddings, such as decorrelating feature dimensions.
    Key characteristics: Maximizes the information content of the embeddings by penalizing redundancy. No negative samples required.
    Examples: Barlow Twins, VICReg.
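Two of these mechanisms are easy to sketch. Below is a hypothetical NT-Xent (InfoNCE) loss in the spirit of the contrastive family, plus an EMA teacher update in the spirit of the distillation family; both are illustrative sketches, not the official SimCLR or BYOL/DINO implementations.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive (SimCLR-style) sketch: each image's two views form the positive
    pair; every other sample in the batch acts as a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)               # (2N, D)
    sim = z @ z.t() / temperature                                     # cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    # the positive for view i is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Distillation (BYOL/DINO-style) sketch: the teacher's weights track an
    exponential moving average of the student's weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)
```

In practice, `nt_xent` would be applied to projection-head outputs of the two augmented views, and `ema_update` would be called once per optimization step.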

2. Masked Image Modeling (MIM)

Central idea: Reconstruction. The input image is split into patches, a significant portion of which (often ~75%) is masked. The model is then trained to predict the content of the masked patches from the visible ones.
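A minimal sketch of this recipe, assuming a placeholder `model` that maps patches plus a mask to predictions for every patch position (e.g., an MAE-style encoder-decoder; this is not a specific library API): patchify, mask roughly 75% of the patches, and compute the reconstruction loss only at the masked positions.

```python
import torch
import torch.nn.functional as F

def mim_step(model, images, patch_size=16, mask_ratio=0.75):
    B, C, H, W = images.shape
    # flatten non-overlapping patches into tokens: (B, N, C * patch_size**2)
    patches = images.unfold(2, patch_size, patch_size) \
                    .unfold(3, patch_size, patch_size) \
                    .reshape(B, C, -1, patch_size, patch_size) \
                    .permute(0, 2, 1, 3, 4).flatten(2)
    N = patches.size(1)
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # True = masked (~75%)
    pred = model(patches, mask)                                   # placeholder: predicts all patches
    return F.mse_loss(pred[mask], patches[mask])                  # loss on masked patches only
```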

Prediction Targets: The model can be trained to predict various targets for the masked regions:

  • Pixel Reconstruction: Reconstructing the raw pixel values (e.g., MAE, SimMIM).
  • Feature Regression: Predicting abstract feature representations (e.g., MaskFeat).
  • Token Prediction: Predicting discrete visual tokens (e.g., BEiT).

3. Hybrid Architectures

Central idea: Combine the principles of masking and joint embedding, e.g., by masking part of the input but predicting latent representations of the masked regions rather than raw pixels (a conceptual sketch follows the example below).

Examples:

  • Image-based Joint Embedding Predictive Architecture (I-JEPA)
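As a conceptual sketch in the spirit of I-JEPA (all module names here are placeholders, not the official code): a context encoder sees only the visible patches, a predictor estimates the embeddings of the masked patches, and the regression targets come from a target encoder that sees the full image.

```python
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, patches, mask):
    """Hybrid sketch: predict the *embeddings* of masked patches, not their pixels."""
    with torch.no_grad():
        targets = target_encoder(patches)              # (B, N, D) embeddings of all patches
    context = context_encoder(patches, visible=~mask)  # encode only the visible patches
    pred = predictor(context, mask)                    # predicted embeddings at masked positions
    return F.mse_loss(pred[mask], targets[mask])       # regression in embedding space

# The target encoder is typically kept as an EMA of the context encoder
# (see ema_update above), so no pixel-level decoder is needed.
```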
