MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

🏷️ Model Name

MoCo – Momentum Contrast for Unsupervised Visual Representation Learning

🧠 Core Idea

MoCo = contrastive learning (instance discrimination) + momentum key encoder + queue-based dictionary of negatives

It stabilizes and scales contrastive learning by maintaining a large dynamic dictionary whose keys are encoded by a slowly progressing momentum encoder, and it became a cornerstone of modern self-supervised learning (SSL) methods.

[Figure: MoCo architecture]

🖼️ Architecture

                      ┌─────────────────────────────┐
                      │      Original Image x       │
                      └─────────────────────────────┘
                                     │
                           ┌─────────┴─────────┐
                           │                   │
                  ┌─────────────────┐ ┌─────────────────┐
                  │  Augmentation 1 │ │  Augmentation 2 │
                  └─────────────────┘ └─────────────────┘
                           │                   │
                           ▼                   ▼
                    ┌─────────────┐     ┌─────────────┐
                    │ Encoder f_q │     │ Encoder f_k │
                    │ (query net) │     │ (key net)   │
                    └─────────────┘     └─────────────┘
                           │                   │
                           ▼                   ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   q (NxC)   │     │   k (NxC)   │
                    └─────────────┘     └─────────────┘
                           │                   │
                           │        ┌──────────┘
                           │        │
                           ▼        ▼
               ┌────────────────────────────────────┐
               │    Positive & Negative Logits      │
               │------------------------------------│
               │  l_pos = q·k⁺       (positive)     │
               │  l_neg = q·Queue(k⁻) (negatives)   │
               └────────────────────────────────────┘
                                 │
                                 ▼
               ┌────────────────────────────────────┐
               │     Contrastive Loss (InfoNCE)     │
               │   L = CrossEntropy(logits / τ)     │
               └────────────────────────────────────┘
                                 │
                                 ▼
               ┌────────────────────────────────────┐
               │    Backprop on f_q (SGD update)    │
               └────────────────────────────────────┘
                                 │
                                 ▼
               ┌────────────────────────────────────┐
               │ Momentum Update of f_k Parameters  │
               │  f_k = m·f_k + (1-m)·f_q           │
               └────────────────────────────────────┘
                                 │
                                 ▼
               ┌────────────────────────────────────┐
               │  Update Dynamic Queue (Dictionary) │
               │  enqueue(k_new), dequeue(k_oldest) │
               └────────────────────────────────────┘

The pseudocode of MoCo in a PyTorch-like style (from the paper):

# f_q, f_k: encoder networks for query and key
# queue: dictionary as a queue of K keys (CxK)
# m: momentum
# t: temperature
f_k.params = f_q.params # initialize
for x in loader: # load a minibatch x with N samples
    x_q = aug(x) # a randomly augmented version
    x_k = aug(x) # another randomly augmented version
    q = f_q.forward(x_q) # queries: NxC
    k = f_k.forward(x_k) # keys: NxC
    k = k.detach() # no gradient to keys
    # positive logits: Nx1
    l_pos = bmm(q.view(N,1,C), k.view(N,C,1))
    # negative logits: NxK
    l_neg = mm(q.view(N,C), queue.view(C,K))
    # logits: Nx(1+K)
    logits = cat([l_pos, l_neg], dim=1)
    # contrastive loss, Eqn.(1)
    labels = zeros(N) # positives are the 0-th
    loss = CrossEntropyLoss(logits/t, labels)
    # SGD update: query network
    loss.backward()
    update(f_q.params)
    # momentum update: key network
    f_k.params = m*f_k.params+(1-m)*f_q.params
    # update dictionary
    enqueue(queue, k) # enqueue the current minibatch
    dequeue(queue) # dequeue the earliest minibatch

Pretext Task: MoCo uses a simple instance discrimination task: a query ($q$) and a key ($k$) form a positive pair if they are encoded views (different crops/augmentations) of the same image. The data augmentation involves random resized cropping, color distortion, random grayscale conversion, and horizontal flipping; a minimal sketch of such a pipeline follows.
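
A torchvision sketch of a two-view augmentation wrapper (the crop scale, jitter strengths, and the `TwoCrops` helper name are illustrative assumptions, not the paper's exact recipe):

import torchvision.transforms as T

# one randomly sampled view; applied twice per image to obtain (x_q, x_k)
moco_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.ColorJitter(0.4, 0.4, 0.4, 0.4),   # random color distortion
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class TwoCrops:
    """Wrap a transform so a dataset returns two independent views of each image."""
    def __init__(self, transform):
        self.transform = transform
    def __call__(self, img):
        return self.transform(img), self.transform(img)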

1️⃣ Dual Encoder Structure ($f_q$ and $f_k$)

MoCo utilizes two main networks to encode the input data:

  • Query Encoder ($f_q$): Encodes the input query sample ($x_q$) into the query representation $q = f_q(x_q)$.
  • Momentum Key Encoder ($f_k$): Encodes the key sample ($x_k$) into the key representation $k = f_k(x_k)$.

Both encoders are typically standard convolutional networks, such as ResNet, and their output feature vectors (e.g., 128-D) are $\ell_2$-normalized before calculating the loss.
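
A minimal sketch of this setup, assuming a torchvision ResNet-50 whose final fc layer is replaced by a 128-D head (the `build_encoder` helper is ours, for illustration):

import copy
import torch
import torch.nn as nn
import torchvision.models as models

def build_encoder(dim=128):
    net = models.resnet50()                      # standard ResNet-50 backbone
    net.fc = nn.Linear(net.fc.in_features, dim)  # replace classifier with a 128-D head
    return net

f_q = build_encoder()                            # query encoder, trained by backprop
f_k = copy.deepcopy(f_q)                         # key encoder starts as an exact copy
for p in f_k.parameters():
    p.requires_grad = False                      # keys never receive gradients

x_q = torch.randn(4, 3, 224, 224)                # dummy mini-batch of query views
q = nn.functional.normalize(f_q(x_q), dim=1)     # l2-normalized 4x128 query features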

🔥 Shuffling Batch Normalization (BN)

To prevent the model from “cheating” the pretext task by exploiting information leakage between the query and key within a mini-batch (which can occur due to Batch Normalization statistics in distributed training), MoCo implements Shuffling BN.

  • For the key encoder ($f_k$), the sample order of the mini-batch is shuffled before being distributed among GPUs for BN calculation, ensuring the batch statistics used for the query and its positive key come from different subsets.
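
The official implementation performs this shuffling across GPUs with distributed broadcast/gather ops; the single-process sketch below only shows the index bookkeeping the idea rests on (shuffle the batch before the key forward pass, then un-shuffle the outputs):

import torch

def forward_keys_with_shuffled_bn(f_k, x_k):
    """Encode keys with the mini-batch order shuffled, then restore the original order."""
    n = x_k.size(0)
    shuffle = torch.randperm(n)         # random permutation of the sample order
    unshuffle = torch.argsort(shuffle)  # indices that undo the permutation
    k = f_k(x_k[shuffle])               # BN statistics are computed on the shuffled order
    return k[unshuffle]                 # keys returned in the original order

On a single device with global BN this is statistically a no-op; the benefit appears when BN statistics are computed per GPU (or per sub-batch), as in the paper's multi-GPU setting.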

2️⃣ Contrastive Loss ($L_q$)

Loss Function: The learning objective is formulated using the InfoNCE loss (a form of contrastive loss):

$$L_q = -\log \frac{\exp(q \cdot k^+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$

where $k^+$ is the positive key, $\{k_i\}$ includes $k^+$ and the $K$ negative samples from the queue, and $\tau$ is a temperature hyper-parameter.
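
Because the features are l2-normalized, the dot products are cosine similarities and $L_q$ reduces to a $(1+K)$-way cross-entropy with the positive at index 0, exactly as in the pseudocode above. A sketch under those assumptions (τ = 0.07 as in the paper):

import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.07):
    """q: NxC queries, k_pos: NxC positive keys, queue: CxK negative keys (all l2-normalized)."""
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)  # Nx1 positive logits
    l_neg = torch.einsum("nc,ck->nk", q, queue)               # NxK negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # Nx(1+K), temperature-scaled
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is class 0
    return F.cross_entropy(logits, labels)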

3️⃣ Momentum Update for Consistency

To ensure the large dictionary remains consistent despite its contents being encoded by different versions of the key encoder across multiple mini-batches, MoCo employs a momentum update mechanism:

  • Gradient Flow: Only the query encoder ($f_q$) is updated by standard back-propagation from the contrastive loss.
  • Smooth Key Update: The key encoder parameters ($\theta_k$) are updated as a moving average of the query encoder parameters ($\theta_q$) using a momentum coefficient $m \in [0, 1)$ (a code sketch follows this list): $$\theta_k \leftarrow m\theta_k + (1-m)\theta_q.$$
  • Consistency: A relatively large momentum (e.g., $m=0.999$) is used, which ensures that the key encoder evolves slowly and smoothly. This slow progression is crucial for building a consistent dictionary, improving performance significantly over a key encoder that is copied or updated rapidly.
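
In code, this is a parameter-wise exponential moving average applied once per iteration after the SGD step on $f_q$ (a sketch, reusing the `f_q`/`f_k` modules from above):

import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied to every parameter tensor."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)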

4️⃣ The Dynamic Dictionary (Queue)

MoCo maintains the dictionary as a queue of encoded key representations.

  • Decoupling Size: The queue mechanism decouples the dictionary size ($K$) from the mini-batch size ($N$), enabling the dictionary to be much larger than what GPU memory would typically allow for an end-to-end backpropagation setup.
  • Update Process: At every training iteration, the encoded keys of the current mini-batch are enqueued into the dictionary and the oldest mini-batch is dequeued (removed), so the dictionary always holds the most recently encoded keys (a minimal sketch of this queue follows the list).
  • Negative Samples: The key representations $\{k_0, k_1, k_2, \dots\}$ stored in this queue serve as the negative samples for the contrastive loss.
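
A minimal sketch of the queue as a fixed CxK buffer with a circular write pointer, assuming (as the official code does) that K is divisible by the batch size; the class name is ours:

import torch
import torch.nn.functional as F

class KeyQueue:
    """FIFO dictionary of encoded keys, stored column-wise as a CxK tensor."""
    def __init__(self, dim=128, K=65536):
        self.K = K
        self.keys = F.normalize(torch.randn(dim, K), dim=0)  # random l2-normalized init
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, new_keys):                      # new_keys: NxC key batch
        n = new_keys.size(0)
        assert self.K % n == 0                                # simplifying assumption
        self.keys[:, self.ptr:self.ptr + n] = new_keys.t()    # overwrite the oldest keys
        self.ptr = (self.ptr + n) % self.K                    # advance the circular pointer

K = 65536 is the dictionary size the paper uses for its ImageNet experiments.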

🎯 Downstream Tasks

  • Image Classification (Linear Evaluation Protocol; a minimal sketch follows this list)
  • Transfer Learning to Detection and Segmentation Tasks
    • Object Detection
    • Instance Segmentation
    • Keypoint Detection and Dense Pose Estimation
    • Semantic Segmentation
  • Fine-Grained Classification
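
For the linear evaluation protocol, the pretrained backbone is frozen and only a linear classifier is trained on its pooled features. A minimal sketch (the data loader is assumed to exist; the learning rate of 30 follows the paper's linear-probe recipe):

import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50()            # in practice, load MoCo-pretrained weights here
backbone.fc = nn.Identity()             # expose the 2048-D pooled features
for p in backbone.parameters():
    p.requires_grad = False             # freeze every pretrained weight
backbone.eval()

classifier = nn.Linear(2048, 1000)      # linear head for the 1000 ImageNet classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for images, targets in train_loader:    # train_loader: assumed ImageNet loader
    with torch.no_grad():
        features = backbone(images)     # frozen feature extraction
    loss = criterion(classifier(features), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()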

💡 Strengths

  • Memory-efficient: works with ordinary mini-batch sizes because the dictionary size is decoupled from the batch size
  • Large and Consistent Dictionary: MoCo is designed to build dictionaries that are both large and consistent as they evolve during training.
  • Architectural Flexibility: MoCo uses a standard ResNet-50 and requires no specific architecture designs (such as patchified inputs or tailored receptive fields). This non-customized architecture makes it easier to transfer the features to a variety of visual tasks.
  • Scalability to Uncurated Data: MoCo can work well in large-scale, relatively uncurated scenarios, such as when pre-trained on the billion-image Instagram-1B (IG-1B) dataset.

⚠️ Limitations

  • Contrastive pairs still need careful design: performance depends heavily on the augmentation policy that defines positive pairs
  • Sensitivity to Momentum Hyperparameter: The smooth evolution of the key encoder is essential. If the momentum coefficient ($m$) is too small (e.g., 0.9) or is set to zero (no momentum), the accuracy drops considerably or the training fails to converge, indicating a sensitivity to this core hyperparameter.
  • Overhead for Dictionary Maintenance: Compared to end-to-end methods that use only the current mini-batch, MoCo keeps a second (momentum) key encoder and a queue of $K$ encoded keys, which adds memory and bookkeeping overhead.

📚 References

  • He et al., 2019 [Momentum Contrast for Unsupervised Visual Representation Learning] 🔗 arXiv:1911.05722
  • Chen et al., 2020 [Improved Baselines with Momentum Contrastive Learning] 🔗 arXiv:2003.04297
  • GitHub: moco

See also