MaskFeat: Masked Feature Prediction for Self-Supervised Visual Pre-Training

🏷️ Model Name

MaskFeat – Masked Feature Prediction for Self-Supervised Visual Pre-Training

🧠 Core Idea

Predict handcrafted features (e.g., HOG) of masked regions instead of raw pixels.

🖼️ Architecture

(Figure: MaskFeat architecture)

Pseudocode of MaskFeat:

# simplified MaskFeat pseudocode
for batch in loader:               # video clips (or images treated as single-frame videos)
    tokens = patchify(batch)       # (B, T, H, W, C) -> sequence of space-time cube tokens
    mask = sample_mask(tokens, ratio=0.4, blockwise=True)   # cube/block-wise masking
    masked_tokens = replace_with_mask_token(tokens, mask)   # shared learnable [MASK] embedding

    # forward pass over the masked token sequence
    outputs = vision_transformer(masked_tokens)   # outputs for all token positions

    # compute targets from the original (unmasked) input
    target_features = compute_target_features(batch)   # e.g., one HOG descriptor per patch

    # regress the target features only at the masked positions
    pred = project_to_target_dim(outputs[mask])   # linear prediction head
    loss = l2_loss(pred, target_features[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

1️⃣ Input Preparation and Tokenization

The input is a video (or an image, which is treated as a video with a single frame).

  • The video is initially divided into regular space-time cubes.
  • These cubes are then projected (e.g., via convolution) to form a sequence of tokens.
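
As a concrete illustration, below is a minimal space-time cube embedding sketch in PyTorch; the module name and the 2×16×16 cube size are illustrative assumptions, and the actual cube size and projection follow the ViT/MViT backbone configuration used in the paper.

# illustrative space-time cube embedding (not the official implementation)
import torch
import torch.nn as nn

class CubeEmbed(nn.Module):
    """Project non-overlapping space-time cubes of a clip into a token sequence."""
    def __init__(self, in_chans=3, embed_dim=768, cube_size=(2, 16, 16)):
        super().__init__()
        # a 3-D convolution with stride == kernel size splits the clip into cubes
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=cube_size, stride=cube_size)

    def forward(self, x):                    # x: (B, C, T, H, W) video clip
        x = self.proj(x)                     # (B, D, T', H', W') cube features
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence, N = T'*H'*W'

# example: an 8-frame 224x224 clip -> 4 * 14 * 14 = 784 tokens of dimension 768
tokens = CubeEmbed()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # torch.Size([1, 784, 768])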

2️⃣ Masking

  1. A portion of these input tokens (representing space-time cubes) is randomly masked out.
  2. The masked tokens are replaced with a [MASK] token, which is a shared, learnable embedding indicating the masked regions.
    • For spatiotemporal video data, cube masking is used: random space-time cubes, spanning both spatial and temporal extents, are sampled and masked until the target ratio is reached. MaskFeat typically uses a 40% masking ratio for video; see the sketch below.
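
A minimal sketch of this step is shown below. For brevity it samples the mask independently per token; MaskFeat instead samples contiguous space-time blocks (cubes) until the target ratio is reached.

# illustrative masking sketch (per-token sampling; MaskFeat uses block/cube sampling)
import torch
import torch.nn as nn

def apply_mask(tokens, mask_token, ratio=0.4):
    """Replace a random fraction of tokens with the shared learnable [MASK] embedding."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < ratio         # True = masked position
    masked_tokens = torch.where(mask.unsqueeze(-1), mask_token, tokens)
    return masked_tokens, mask

mask_token = nn.Parameter(torch.zeros(768))      # shared, learnable [MASK] embedding
tokens = torch.randn(2, 784, 768)                # token sequence from the embedding step
masked_tokens, mask = apply_mask(tokens, mask_token, ratio=0.4)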

3️⃣ Encoder Processing (Transformer Backbone)

The sequence of tokens, including the encoded visible patches and the [MASK] tokens, is input to a Vision Transformer (ViT) backbone.

  • Positional embeddings are added to the token sequence before processing by the Transformer.
  • The Transformer backbone processes this masked space-time input.
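
A minimal sketch of this step, using torch.nn.TransformerEncoder as a stand-in for the actual ViT/MViT backbone (the layer count, head count, and learnable positional embedding shape are illustrative assumptions):

# illustrative encoder sketch (stand-in for the ViT/MViT backbone)
import torch
import torch.nn as nn

num_tokens, embed_dim = 784, 768
masked_tokens = torch.randn(2, num_tokens, embed_dim)             # output of the masking step
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))   # learnable positional embedding

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                               dim_feedforward=4 * embed_dim, batch_first=True),
    num_layers=12,
)

x = masked_tokens + pos_embed   # positions are added before the Transformer blocks
encoded = encoder(x)            # (B, N, D): an output for every token position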

4️⃣ Prediction Head

The output tokens from the Transformer backbone corresponding to the masked cubes are passed through a linear layer.

  • This linear layer projects the tokens into the prediction space, where the number of output channels is adjusted to match the dimensionality of the specific target feature being predicted.
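
For example, assuming a 108-dimensional HOG target per patch (one plausible configuration: 9 orientation bins × 2×2 cells × 3 color channels; the exact dimensionality depends on the HOG settings), the head is a single linear layer applied at the masked positions:

# illustrative prediction head (a single linear projection)
import torch
import torch.nn as nn

embed_dim, target_dim = 768, 108            # target_dim matches the chosen HOG descriptor length
pred_head = nn.Linear(embed_dim, target_dim)

encoded = torch.randn(2, 784, embed_dim)    # Transformer outputs for all token positions
mask = torch.rand(2, 784) < 0.4             # True at masked positions
pred = pred_head(encoded[mask])             # (num_masked_tokens, target_dim) predicted features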

5️⃣ Target Feature Generation

The prediction target is a feature extracted from the corresponding region of the original, intact sample. This feature serves as the supervision.

  • MaskFeat typically uses Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, as the effective target feature.
  • For video input, the prediction target for a masked space-time cube is typically the HOG feature of the 2-D spatial patch temporally centered within that cube.
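
A rough sketch of per-patch HOG targets using scikit-image is given below; skimage.feature.hog is used here only as a convenient stand-in, and the parameters (16×16 patches, 8×8 cells, 9 orientation bins, per-channel descriptors) are illustrative rather than the paper's exact implementation.

# illustrative per-patch HOG targets (parameters are assumptions, not the paper's exact setup)
import numpy as np
from skimage.feature import hog

def hog_targets(frame, patch_size=16):
    """Compute one HOG descriptor per non-overlapping patch of an (H, W, 3) frame."""
    H, W, _ = frame.shape
    feats = []
    for y in range(0, H, patch_size):
        for x in range(0, W, patch_size):
            patch = frame[y:y + patch_size, x:x + patch_size]
            per_channel = [
                hog(patch[..., c], orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(2, 2), block_norm='L2')   # local contrast normalization
                for c in range(3)                              # HOG per color channel, then concatenated
            ]
            feats.append(np.concatenate(per_channel))          # e.g., 3 * 36 = 108 dims per patch
    return np.stack(feats)                                     # (num_patches, feat_dim)

targets = hog_targets(np.random.rand(224, 224, 3).astype(np.float32))
print(targets.shape)  # (196, 108)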

6️⃣ Loss Calculation and Optimization

The model is trained to minimize the difference between the predicted feature and the target feature.

  • The loss function (a regression loss, e.g., mean squared error) is computed only on the masked cubes.
  • The model’s weights (the Transformer backbone) are optimized to minimize this loss.
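
The masked-only regression objective then reduces to a few lines (a sketch with random tensors standing in for the quantities above):

# illustrative masked regression loss
import torch
import torch.nn.functional as F

B, N, target_dim = 2, 784, 108
pred = torch.randn(B, N, target_dim)      # predicted features at every token position
targets = torch.randn(B, N, target_dim)   # HOG descriptors computed from the intact input
mask = torch.rand(B, N) < 0.4             # True at masked positions

# L2 regression loss averaged over masked tokens only; visible tokens contribute nothing
loss = F.mse_loss(pred[mask], targets[mask])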

🎉 Post-training Usage

After pre-training is complete, the Transformer (encoder) is fine-tuned on downstream tasks (such as video classification or action detection).
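
Schematically, fine-tuning discards the prediction head and attaches a new task head to the pre-trained encoder; in the sketch below, pretrained_encoder, embed_dim, and num_classes are placeholders, and mean pooling is just one simple readout choice.

# illustrative fine-tuning wrapper around the pre-trained encoder
import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    def __init__(self, pretrained_encoder, embed_dim=768, num_classes=400):
        super().__init__()
        self.encoder = pretrained_encoder              # weights from MaskFeat pre-training
        self.head = nn.Linear(embed_dim, num_classes)  # new task-specific head

    def forward(self, tokens):          # tokens: (B, N, D), no [MASK] tokens at fine-tuning time
        x = self.encoder(tokens)        # (B, N, D)
        x = x.mean(dim=1)               # pool over tokens
        return self.head(x)             # (B, num_classes) logits, trained end-to-end with labels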

🎯 Downstream Tasks

  • Video Recognition/Classification (Kinetics Datasets)
  • Action Detection: This task involves spatio-temporal localization of human actions
  • Human-Object Interaction Classification: This task requires fine-grained motion distinctions and temporal modeling
  • Image Recognition/Classification: MaskFeat can be generalized to the image domain, treating an image as a video with a single frame

💡 Strengths

  • Architectural Simplicity and Efficiency
    • The approach is conceptually and practically simple.
    • It typically uses a **single network with a single view** of each sample, unlike contrastive methods which require a Siamese structure and multiple views.
    • When using Histograms of Oriented Gradients (HOG) as the target feature, the method is efficient and does not rely on an external model. This avoids the need for an extra pre-training stage (such as dVAE tokenization used by BEiT) and the resulting computational overhead.
    • HOG calculation itself is cheap and introduces negligible computational overhead.
  • Performance and Scalability
    • MaskFeat learns abundant visual knowledge.
    • It is scalable to large models for both video and image domains. Larger models benefit significantly more from longer pre-training schedules, indicating the scalability of the MaskFeat task to high-capacity models (e.g., MViT-L vs. MViT-S).
    • For video understanding, MaskFeat pre-trained only on unlabeled videos closes the large performance gap (over 5% accuracy) that previously separated supervised pre-training on massive image datasets (e.g., IN-21K) from models trained solely on videos.
    • It achieves unprecedented results on video benchmarks such as Kinetics, AVA (Action detection), and SSv2 (Human-object interaction classification).
    • The strong performance on localization-sensitive tasks like AVA suggests a clear advantage of masked modeling on video over supervised image pre-training.
    • It successfully uses continuous feature regression (like HOG) as a target, demonstrating that discretization (tokenization) of visual signals is not necessary for effective masked visual prediction.
    • The method is effective when coupled with the end-to-end fine-tuning protocol, maximizing final model performance for downstream tasks.
  • Robustness and Flexibility
    • MaskFeat generally works fairly well with minimal augmentation, and stronger augmentations (like RandAugment or color jittering) sometimes degrade results.
    • In the video domain, MaskFeat is robust across a wide range of masking ratios (from 40% up to 80%).
    • HOG is a particularly effective target feature because its implementation uses local contrast normalization, which is essential for good results and provides invariance to photometric changes.

⚠️ Limitations

  • Feature Properties (Linear Separability)
    • The representations learned by MaskFeat tend to be less linearly separable.
    • Linear probing accuracy, a common metric for evaluating self-supervised methods, lags significantly behind that of contrastive methods such as MoCo v3.
    • In other words, MaskFeat learns rich visual knowledge that is revealed by fine-tuning, but it does not learn strongly linearly separable features.
  • Architectural Constraints and Performance Trade-offs
    • Masked Image Modeling (MIM) approaches generally require a ViT backbone.
    • MIM approaches are often deemed weaker for abstract (higher-level) tasks such as classification compared to Joint Embedding methods.
    • In the image domain, performance is sensitive to the masking ratio: accuracy degrades when too large a fraction of patches is masked (e.g., above 40% with block-wise masking in ViT-B image pre-training).
  • Dependence on Target Feature Choice
    • The performance critically depends on selecting an appropriate prediction target (e.g., HOG). Prediction targets utilizing human annotations, such as supervised features (from ResNet or ViT) or pseudo-labels, yield poor results and lead to overfitting.
    • When attempting multi-task learning by combining targets (e.g., pixel color and HOG), the objectives may conflict, resulting in lower performance compared to using HOG alone. Pixel targets specifically introduce ambiguities related to color and texture.

📚 Reference

  • Wei et al., 2022. "Masked Feature Prediction for Self-Supervised Visual Pre-Training." 🔗 arXiv:2112.09133
