Self‑Supervised Learning: Contrastive Methods and the MoCo Series (V1‑V3)
This article introduces the four types of machine learning, explains self‑supervised learning, details generative and contrastive approaches, and provides an in‑depth overview of the MoCo series (V1‑V3), including their architectures, training strategies, and experimental results on document image classification and text‑line detection tasks.
Machine learning algorithms can be divided into supervised learning, unsupervised learning, self‑supervised learning, and reinforcement learning.
Supervised learning: performance depends on the amount of labeled training data, which is costly to obtain.
Unsupervised learning: currently limited in the range of problems it can solve, as is reinforcement learning.
Self-supervised learning: does not require manually labeled data and can be fine-tuned for almost any downstream task, making it a hot research area.
What is self‑supervised learning? It is a machine‑learning paradigm that learns from the intrinsic structure of the data itself without relying on human‑annotated labels, aiming to acquire a universal feature representation that can be transferred to downstream tasks.
Self‑supervised methods are broadly classified into two categories: generative (predictive) methods and contrastive methods.
Generative (predictive) methods encode an input and then decode it to reproduce the original. Examples include BERT’s masked language modeling (MLM) in NLP, where random tokens are masked and the model predicts them, and Masked AutoEncoders (MAE) in computer vision, which mask image patches and reconstruct them pixel‑wise.
Contrastive methods construct positive and negative sample pairs, learning an encoder that pulls positive pairs together in the representation space while pushing negative pairs apart. This principle is often implemented with an InfoNCE loss, treating each batch as a (K+1)‑class classification problem.
To optimize the encoder, a softmax-based cross-entropy loss (InfoNCE) is minimized: L_q = −log( exp(q·k⁺/τ) / Σᵢ₌₀ᴷ exp(q·kᵢ/τ) ), where q is the query, k⁺ is its positive key, the remaining kᵢ are negative keys, and τ is a temperature hyperparameter.
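The (K+1)-way classification view above can be sketched directly in PyTorch. This is a minimal illustration, not the official MoCo code; the temperature default of 0.07 follows the MoCo paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE as a (K+1)-way classification problem.

    q:      (N, C) query embeddings, L2-normalized
    k_pos:  (N, C) positive key embeddings, L2-normalized
    queue:  (C, K) dictionary of negative keys
    """
    # Positive logits, one per query: shape (N, 1)
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # Negative logits against every key in the queue: shape (N, K)
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key sits at index 0 for every query
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Cross-entropy over these logits pulls each query toward its positive key and away from all K queued negatives at once.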
In the original MoCo V1 paper, the authors propose Momentum Contrast (MoCo) to learn visual representations. The key ideas are:
Positive pairs are two random crops from the same image; negative pairs are crops from different images.
A large dictionary (queue) stores key vectors, allowing many negative samples beyond the current batch.
The query encoder is updated by backpropagation, while the key encoder is a slowly moving momentum (exponential moving average) copy of it, which keeps the keys in the dictionary consistent over time.
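The two mechanisms above can be sketched in a few lines. This is an illustrative simplification of MoCo's update step (the momentum value 0.999 is the paper's default; the queue layout and divisibility assumption are mine):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder = exponential moving average of the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    """Overwrite the oldest columns of the (C, K) queue with the newest keys."""
    K = queue.size(1)
    n = keys.size(0)
    queue[:, ptr:ptr + n] = keys.T  # assumes K is divisible by the batch size
    return (ptr + n) % K
```

Because gradients never flow into the key encoder, the dictionary can be far larger than any single batch while its entries stay mutually consistent.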
MoCo V2 builds on V1 by incorporating ideas from SimCLR, such as larger batch sizes, stronger data augmentations (e.g., Gaussian blur), and an additional MLP projection head after the encoder. These changes improve performance on ImageNet.
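The MLP projection head V2 adds is a small two-layer network between the encoder output and the contrastive loss. A minimal sketch, with dimensions chosen to match a ResNet-18 backbone (512-d features) rather than the paper's ResNet-50 configuration:

```python
import torch
import torch.nn as nn

def projection_head(in_dim=512, hidden_dim=512, out_dim=128):
    """2-layer MLP head in the MoCo V2 style; dims here are illustrative."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )
```

The head is used only during pre-training; for downstream fine-tuning it is discarded and the encoder features are used directly.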
MoCo V3 adapts the contrastive framework to Vision Transformers (ViT). Instead of a memory queue, V3 relies on a very large batch size to provide sufficient negative samples, which raises hardware requirements. Experiments show that MoCo V3 improves ViT performance on several datasets compared with random initialization and supervised pre‑training.
Practical usage
Two downstream tasks were evaluated using MoCo‑pre‑trained ResNet‑18 backbones:
Document image classification: a three-class model (invoice, customs declaration, generic document) was trained with either a ResNet-18 pretrained on ImageNet or a ResNet-18 pretrained with MoCo (the "15w" and "45w" variants, where "w" is the Chinese unit 万 = 10,000, i.e. 150,000 and 450,000). The MoCo-pretrained backbone achieved 3.7% higher average accuracy.
Text-line detection (OCR component): with the same two backbones, the MoCo-pretrained model (45w) reached an F1 score of 80.86%, only 0.24 percentage points below the supervised ImageNet backbone (81.10%).
Training details for the self-supervised pre-training + fine-tuning pipeline: SGD with momentum 0.9, initial learning rate 0.03, MultiStepLR milestones at [160, 280, 640], batch size 512, and a queue size of 65,536. The model was pre-trained for 388 epochs on a 1.4M document-image dataset.
Results for the text‑line detection task are summarized in the table below:
| Backbone             | Recall | Precision | F1     |
|----------------------|--------|-----------|--------|
| ResNet18 (ImageNet)  | 0.7864 | 0.8373    | 0.8110 |
| ResNet18 (MoCo 15w)  | 0.7806 | 0.8286    | 0.8039 |
| ResNet18 (MoCo 45w)  | 0.7854 | 0.8332    | 0.8086 |
The references cited include the original BERT paper, MAE, MoCo (V1‑V2), SimCLR, and the MoCo‑ViT study.
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.