MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

🏷️ Model Name

MoCo – Momentum Contrast for Unsupervised Visual Representation Learning

🧠 Core Idea

MoCo = instance discrimination + momentum encoder + dynamic queue of negatives

It stabilizes and scales contrastive learning by maintaining a dynamic dictionary with momentum-based updates, becoming a cornerstone for modern SSL methods.


🖼️ Architecture

                      ┌─────────────────────────────┐
                      │      Original Image x       │
                      └─────────────────────────────┘
                                     │
                           ┌─────────┴─────────┐
                           │                   │
                  ┌─────────────────┐ ┌─────────────────┐
                  │  Augmentation 1 │ │  Augmentation 2 │
                  └─────────────────┘ └─────────────────┘
                           │                   │
                           ▼                   ▼
                     ┌────────────┐      ┌────────────┐
                     │ Encoder f_q│      │ Encoder f_k│
                     │ (query net)│      │ (key net)  │
                     └────────────┘      └────────────┘
                           │                   │
                           ▼                   ▼
                     ┌────────────┐      ┌────────────┐
                     │   q (NxC)  │      │   k (NxC)  │
                     └────────────┘      └────────────┘
                           │                   │
                           │        ┌──────────┘
                           │        │
                           ▼        ▼
              ┌────────────────────────────────────┐
              │    Positive & Negative Logits      │
              ├────────────────────────────────────┤
              │  l_pos = q·k⁺        (positive)    │
              │  l_neg = q·Queue(k⁻) (negatives)   │
              └────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────────────────┐
              │      Contrastive Loss (InfoNCE)    │
              │    L = CrossEntropy(logits / τ)    │
              └────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────────────────┐
              │     Backprop on f_q (SGD update)   │
              └────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────────────────┐
              │  Momentum Update of f_k Parameters │
              │   θ_k ← m·θ_k + (1−m)·θ_q          │
              └────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────────────────┐
              │  Update Dynamic Queue (Dictionary) │
              │  enqueue(k_new), dequeue(k_oldest) │
              └────────────────────────────────────┘

The pseudocode of MoCo in a PyTorch-like style (Algorithm 1 in the paper):

# f_q, f_k: encoder networks for query and key
# queue: dictionary as a queue of K keys (CxK)
# m: momentum
# t: temperature
f_k.params = f_q.params # initialize
for x in loader: # load a minibatch x with N samples
    x_q = aug(x) # a randomly augmented version
    x_k = aug(x) # another randomly augmented version
    q = f_q.forward(x_q) # queries: NxC
    k = f_k.forward(x_k) # keys: NxC
    k = k.detach() # no gradient to keys
    # positive logits: Nx1
    l_pos = bmm(q.view(N,1,C), k.view(N,C,1))
    # negative logits: NxK
    l_neg = mm(q.view(N,C), queue.view(C,K))
    # logits: Nx(1+K)
    logits = cat([l_pos, l_neg], dim=1)
    # contrastive loss, Eqn.(1)
    labels = zeros(N) # positives are the 0-th
    loss = CrossEntropyLoss(logits/t, labels)
    # SGD update: query network
    loss.backward()
    update(f_q.params)
    # momentum update: key network
    f_k.params = m*f_k.params+(1-m)*f_q.params
    # update dictionary
    enqueue(queue, k) # enqueue the current minibatch
    dequeue(queue) # dequeue the earliest minibatch

Pretext Task: MoCo utilizes a simple instance discrimination task: a query ($q$) and a key ($k$) form a positive pair if they are encoded views (different crops/augmentations) of the same image. The data augmentation involves random cropping, color distortions, and horizontal flipping.
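
A minimal sketch of such a two-view augmentation pipeline with torchvision; the specific crop scale and jitter strengths below are illustrative assumptions, not necessarily the paper's exact recipe:

# Illustrative MoCo-style augmentation; parameter values are assumptions.
from torchvision import transforms

aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),  # random cropping
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),           # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Two independent draws from the same image form a positive pair:
# x_q, x_k = aug(img), aug(img)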

1️⃣ Dual Encoder Structure ($f_q$ and $f_k$)

MoCo utilizes two main networks to encode the input data:

  • Query Encoder ($f_q$): Encodes the input query sample ($x_q$) into the query representation $q = f_q(x_q)$.
  • Momentum Key Encoder ($f_k$): Encodes the key sample ($x_k$) into the key representation $k = f_k(x_k)$.

Both encoders are typically standard convolutional networks, such as ResNet, and their output feature vectors (e.g., 128-D) are $\ell_2$-normalized before calculating the loss.
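
A minimal sketch of this setup using torchvision's ResNet-50 with its classifier replaced by a 128-D output layer; the `encode` helper is a hypothetical name for illustration:

import copy
import torch.nn.functional as F
from torchvision.models import resnet50

dim = 128                        # output feature dimension C

f_q = resnet50(num_classes=dim)  # query encoder (trained by backprop)
f_k = copy.deepcopy(f_q)         # key encoder starts as an exact copy
for p in f_k.parameters():
    p.requires_grad = False      # keys receive no gradient

def encode(f, x):
    # l2-normalize features so that dot products are cosine similarities
    return F.normalize(f(x), dim=1)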

🔥 Shuffling Batch Normalization (BN)

To prevent the model from “cheating” the pretext task by exploiting information leakage between the query and key within a mini-batch (which can occur due to Batch Normalization statistics in distributed training), MoCo implements Shuffling BN.

  • For the key encoder ($f_k$), the sample order of the mini-batch is shuffled before being distributed among GPUs for BN calculation, ensuring the batch statistics used for the query and its positive key come from different subsets.
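
The index bookkeeping can be sketched in a single process as below (`key_forward_shuffled` is a hypothetical helper). Note that with ordinary single-device BN the permutation alone does not change the statistics; the real effect comes from computing BN per GPU after shuffling across devices:

import torch

def key_forward_shuffled(f_k, x_k):
    # Shuffle sample order before the key forward pass, so per-GPU BN
    # statistics are computed on a different composition of samples.
    idx_shuffle = torch.randperm(x_k.size(0))
    idx_unshuffle = torch.argsort(idx_shuffle)  # inverse permutation
    k = f_k(x_k[idx_shuffle])
    return k[idx_unshuffle]  # restore order so k[i] matches q[i]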

2️⃣ Contrastive Loss ($L_q$)

Loss Function: The learning objective is formulated using the InfoNCE loss (a form of contrastive loss):

$$L_q = -\log \frac{\exp(q \cdot k^+/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$$

where $k^+$ is the positive key, $\{k_i\}_{i=0}^{K}$ contains $k^+$ and the $K$ negative samples from the queue, and $\tau$ is a temperature hyper-parameter.
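
This is exactly a $(1+K)$-way softmax cross-entropy in which the positive occupies index 0. A minimal sketch, assuming $\ell_2$-normalized q, k, and queue, with the paper's temperature $\tau = 0.07$:

import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, t=0.07):
    # q, k_pos: NxC (l2-normalized); queue: CxK; t: temperature
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)       # Nx1, q·k+
    l_neg = q @ queue                                  # NxK, q·k_i
    logits = torch.cat([l_pos, l_neg], dim=1) / t      # Nx(1+K)
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, labels)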

3️⃣ Momentum Update for Consistency

To ensure the large dictionary remains consistent despite its contents being encoded by different versions of the key encoder across multiple mini-batches, MoCo employs a momentum update mechanism:

  • Gradient Flow: Only the query encoder ($f_q$) is updated by standard back-propagation from the contrastive loss.
  • Smooth Key Update: The key encoder parameters ($\theta_k$) are updated as a moving average of the query encoder parameters ($\theta_q$) using a momentum coefficient $m \in [0, 1)$ (see the sketch after this list): $$\theta_k \leftarrow m\theta_k + (1-m)\theta_q.$$
  • Consistency: A relatively large momentum (e.g., $m=0.999$) is used, which ensures that the key encoder evolves slowly and smoothly. This slow progression is crucial for building a consistent dictionary, improving performance significantly over a key encoder that is copied or updated rapidly.
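
In code, the one-line update is applied parameter by parameter, with the paper's default $m = 0.999$; a minimal sketch:

import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m*theta_k + (1-m)*theta_q, per parameter tensor
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)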

4️⃣ The Dynamic Dictionary (Queue)

MoCo maintains the dictionary as a queue of encoded key representations.

  • Decoupling Size: The queue mechanism decouples the dictionary size ($K$) from the mini-batch size ($N$), enabling the dictionary to be much larger than what GPU memory would typically allow for an end-to-end backpropagation setup.
  • Update Process: At every training iteration, the encoded keys of the current mini-batch are enqueued into the dictionary and the oldest mini-batch is dequeued (removed), so the dictionary contents are progressively refreshed (see the ring-buffer sketch after this list).
  • Negative Samples: The key representations $\{k_0, k_1, k_2, \dots\}$ stored in this queue serve as the negative samples for the contrastive loss.
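
Concretely, the queue can be a fixed $C \times K$ buffer with a moving write pointer, which is what decouples $K$ from $N$. A sketch with the paper's $K = 65536$, assuming $K$ is a multiple of the batch size (as the official implementation does):

import torch

C, K = 128, 65536          # feature dim and dictionary size (K >> N)
queue = torch.randn(C, K)  # placeholder contents; real code stores normalized keys
ptr = 0                    # column where the next batch is written

@torch.no_grad()
def dequeue_and_enqueue(keys):      # keys: NxC from the current iteration
    global ptr
    n = keys.size(0)                # assumes K % n == 0
    queue[:, ptr:ptr + n] = keys.T  # enqueue: overwrite the oldest columns
    ptr = (ptr + n) % K             # dequeue happens implicitly as ptr wraps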

🎯 Downstream Tasks

  • Image Classification (Linear Evaluation Protocol; see the sketch after this list)
  • Transfer Learning to Detection and Segmentation Tasks
    • Object Detection
    • Instance Segmentation
    • Keypoint Detection and Dense Pose Estimation
    • Semantic Segmentation
  • Fine-Grained Classification
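
For the linear evaluation protocol, the pretrained backbone is frozen and only a new linear classifier is trained on labeled data; a minimal sketch, reusing the hypothetical f_q built earlier:

import torch.nn as nn

backbone = f_q                       # pretrained query encoder
for p in backbone.parameters():
    p.requires_grad = False          # freeze all pretrained weights
backbone.fc = nn.Linear(2048, 1000)  # fresh, trainable 1000-way classifier
# Train only backbone.fc with supervised cross-entropy on labeled images.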

💡 Strengths

  • Memory Efficiency: MoCo works with ordinary mini-batch sizes because the negatives come from the queue rather than from the batch itself.
  • Large and Consistent Dictionary: MoCo is designed to build dictionaries that are both large and consistent as they evolve during training.
  • Architectural Flexibility: MoCo uses a standard ResNet-50 and requires no specific architecture designs (such as patchified inputs or tailored receptive fields). This non-customized architecture makes it easier to transfer the features to a variety of visual tasks.
  • Scalability to Uncurated Data: MoCo can work well in large-scale, relatively uncurated scenarios, such as when pre-trained on the billion-image Instagram-1B (IG-1B) dataset.

⚠️ Limitations

  • Augmentation-Dependent Positives: Contrastive pairs still need careful design; the quality of the learned features hinges on the augmentations used to generate the two views.
  • Sensitivity to Momentum Hyperparameter: The smooth evolution of the key encoder is essential. If the momentum coefficient ($m$) is too small (e.g., 0.9) or is set to zero (no momentum), the accuracy drops considerably or the training fails to converge, indicating a sensitivity to this core hyperparameter.
  • Computational Overhead for Dictionary Maintenance: Compared to end-to-end methods which only use the current mini-batch, MoCo requires extra computation to maintain the dynamic dictionary (queue).

📚 References

  • He et al., 2019 [Momentum Contrast for Unsupervised Visual Representation Learning] 🔗 arXiv:1911.05722
  • Chen et al., 2020 [Improved Baselines with Momentum Contrastive Learning] 🔗 arXiv:2003.04297
  • GitHub: facebookresearch/moco
