Self-Supervised Learning (SSL) is a family of techniques for converting an unsupervised learning problem into a supervised one by creating surrogate labels from the unlabeled dataset.
Core Paradigms of Self-Supervised Learning
Three primary paradigms dominate the SSL landscape: Joint Embedding, Masked Image Modeling, and a hybrid approach that combines elements of both.
Joint Embedding vs. MIM:
| Joint Embedding | Masked Image Modeling (MIM) |
|---|---|
| Pros: | Pros: |
| ✓ Produces highly semantic features, great for classification. | ✓ Conceptually simple, with no need for positive/negative pairs. |
| ✓ Architecture agnostic. | ✓ Masking reduces pre-training time. |
| ✓ Achieves competitive results in linear probing evaluations. | ✓ Achieves competitive results with fine-tuning. |
| | ✓ Stronger fit for low-level tasks (e.g., denoising, super-resolution). |
| Cons: | Cons: |
| ✗ May require very large batch sizes (e.g., SimCLR). | ✗ Typically requires a Vision Transformer (ViT) backbone. |
| ✗ Requires careful tuning of data augmentations. | ✗ Weaker on abstract, high-level tasks such as classification, particularly under linear probing. |
| ✗ Requires special mechanisms to handle negative samples or avoid collapse. | |
| ✗ Not well-suited for low-level tasks. | |
1. Joint Embedding Architectures
Central idea: Enforce invariance to data augmentations.
A Siamese network with shared parameters processes two differently augmented “views” of the same image, and the model is trained to produce similar (ideally identical) embeddings for both.
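To make this concrete, here is a minimal PyTorch sketch of the two-view, shared-encoder forward pass. The augmentation pipeline and the `encoder` module are illustrative placeholders, not a specific method's recipe.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Illustrative augmentation pipeline producing a random "view" of one image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def invariance_loss(encoder, image_pil):
    """Shared-weight (Siamese) forward pass over two augmented views."""
    v1 = augment(image_pil).unsqueeze(0)   # view 1, shape (1, 3, 224, 224)
    v2 = augment(image_pil).unsqueeze(0)   # view 2 of the same image
    z1 = F.normalize(encoder(v1), dim=-1)  # same encoder, shared parameters
    z2 = F.normalize(encoder(v2), dim=-1)
    # Maximize cosine similarity between the two embeddings.
    # On its own this objective admits a collapsed solution (constant output);
    # the contrastive / clustering / distillation / regularization mechanisms
    # in the table below exist precisely to prevent that.
    return -(z1 * z2).sum(dim=-1).mean()
```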
Challenge: Every joint-embedding method must avoid representation collapse, the trivial solution in which the encoder maps every input to the same constant embedding. The main method families differ chiefly in how they prevent it:
| Method Category | Core Idea | Key Characteristics | Examples |
|---|---|---|---|
| Contrastive | Pull positive pairs (views of the same image) close in the embedding space while pushing negative pairs (views of different images) apart (see the loss sketch after this table). | Requires negative samples, which can necessitate large batch sizes. Also well suited to multimodal data. | SimCLR, MoCo |
| Clustering | Learn embeddings by grouping similar samples into clusters without using explicit negative pairs. | Jointly learns feature representations and cluster assignments. | SwAV, Deep Cluster |
| Distillation | A “student” network is trained to match the output distribution of a “teacher” network on different augmented views. | Avoids collapse via an asymmetric architecture (student vs. teacher). The teacher is often updated via an Exponential Moving Average (EMA) of the student’s weights. Does not require negative samples. | BYOL, DINO |
| Regularization | Avoids collapse by imposing regularization terms on the embeddings, such as decorrelating feature dimensions. | Maximizes the information content of the embeddings by penalizing redundancy. No negative samples required. | Barlow Twins, VICReg |
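To illustrate the contrastive entry, below is a minimal sketch of an NT-Xent/InfoNCE-style loss in the spirit of SimCLR; the function name and default temperature are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive (NT-Xent) loss for a batch of positive pairs.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Every other sample in the 2N-view batch serves as a negative.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2N, D)
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    n = z1.size(0)
    # Mask out self-similarity so a view is never its own negative.
    sim.fill_diagonal_(float("-inf"))
    # The positive for sample i is its other view: index i + n (or i - n).
    targets = torch.cat([torch.arange(n, device=z.device) + n,
                         torch.arange(n, device=z.device)])
    return F.cross_entropy(sim, targets)
```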
2. Masked Image Modeling (MIM)
Central idea: Reconstruction. The input image is split into patches, a significant portion of which (often ~75%) are masked. The model is then trained to predict the content of the masked patches based on the visible ones.
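As a rough sketch of the masking step and a pixel-reconstruction loss, loosely following MAE: the function names, tensor shapes, and the omitted encoder/decoder are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def random_masking(patches, mask_ratio=0.75):
    """Randomly hide a fraction of patch tokens (MAE-style).

    patches: (B, N, D) sequence of flattened image patches.
    Returns the visible subset plus the index tensors needed downstream.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]                # patches the encoder sees
    ids_masked = ids_shuffle[:, n_keep:]              # patches to reconstruct
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep, ids_masked

def masked_pixel_loss(pred, target, ids_masked):
    """MSE between predicted and true pixels, computed only on masked patches.

    pred, target: (B, N, D) full-sequence predictions and ground-truth patches.
    """
    D = target.size(-1)
    idx = ids_masked.unsqueeze(-1).expand(-1, -1, D)
    return F.mse_loss(torch.gather(pred, 1, idx), torch.gather(target, 1, idx))
```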
Prediction Targets: The model can be trained to predict various targets for the masked regions:
- Pixel Reconstruction: Reconstructing the raw pixel values (e.g., MAE, SimMIM).
- Feature Regression: Predicting abstract feature representations (e.g., MaskFeat).
- Token Prediction: Predicting discrete visual tokens (e.g., BEiT).
3. Hybrid Architectures
Central idea: Combine the principles of masking and joint embedding, typically by predicting the representations of masked regions rather than their raw pixels.
Examples:
- Image-based Joint Embedding Predictive Architecture (I-JEPA): masks out target regions of an image and, from a visible context block, predicts their representations in embedding space rather than their pixels (see the sketch below).
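Below is a rough sketch of the JEPA-style objective under stated assumptions: the `context_enc`, `target_enc`, and `predictor` modules and their call signatures are hypothetical placeholders. The key point is that targets come from an EMA-updated target encoder and the loss is computed in representation space, not pixel space.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum=0.996):
    """Keep the target encoder as an exponential moving average of the context encoder."""
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.data.mul_(momentum).add_(p_c.data, alpha=1 - momentum)

def jepa_step(context_enc, target_enc, predictor, patches, ctx_ids, tgt_ids):
    """One JEPA-style training step (all modules and shapes are illustrative).

    patches: (B, N, D) patch tokens; ctx_ids / tgt_ids: (B, K) index tensors
    selecting the visible context block and the masked target block.
    """
    D = patches.size(-1)
    ctx = torch.gather(patches, 1, ctx_ids.unsqueeze(-1).expand(-1, -1, D))
    with torch.no_grad():                          # targets come from the EMA encoder
        tgt_repr = target_enc(patches)             # encode the full image
        idx = tgt_ids.unsqueeze(-1).expand(-1, -1, tgt_repr.size(-1))
        tgt_repr = torch.gather(tgt_repr, 1, idx)  # keep only target-block tokens
    ctx_repr = context_enc(ctx)                    # encode only the visible context
    pred = predictor(ctx_repr, tgt_ids)            # predict target-block representations
    # Reconstruction happens in embedding space, not pixel space.
    return F.mse_loss(pred, tgt_repr)
```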