
Contrastive Learning and Its Applications in Weibo Content Representation

This article explains the fundamentals of contrastive learning, reviews typical models such as SimCLR, MoCo, SwAV, BYOL, SimSiam and Barlow Twins, and demonstrates how these methods are applied to Weibo text and multimodal (text‑image) representation tasks like hashtag generation and image‑text matching.

DataFunSummit

Contrastive learning is a self‑supervised paradigm that leverages large amounts of unlabeled data by constructing positive and negative pairs; it can be viewed as a self‑supervised version of metric learning.

An abstract contrastive learning system first creates positive and negative samples, encodes them with an encoder, and projects the embeddings onto a unit hypersphere. The optimization goal, often realized with the InfoNCE loss, pulls positive pairs together while pushing negative pairs apart.
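The pull-together / push-apart objective above can be made concrete with a minimal pure-Python sketch of the InfoNCE loss for a single anchor. The similarity values and temperature here are illustrative, not taken from any particular model in the article:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax over [positive, negatives].

    sim_pos:  similarity between the anchor and its positive sample.
    sim_negs: similarities between the anchor and each negative sample.
    Lower loss means the positive is well separated from the negatives.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# A well-separated anchor (positive close, negatives far) yields a low loss;
# a confused anchor (a negative closer than the positive) yields a high loss.
good = info_nce(0.9, [0.1, 0.0, -0.2])
bad = info_nce(0.2, [0.8, 0.7, 0.6])
```

The temperature divides all similarities before the softmax; smaller values sharpen the distribution and penalize hard negatives more strongly.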

Key design questions include how to construct positives, how to design the mapping function f (encoder + projector), and how to choose or design the loss function.

Typical image‑based contrastive models:

SimCLR – uses batch‑wise negatives and a dual‑tower encoder + projector architecture.

MoCo (and MoCo V2/V3) – maintains a large queue of negatives and employs momentum updates for the encoder.

SwAV – replaces explicit negatives with online clustering, assigning features to a set of learned prototypes (via a Sinkhorn‑Knopp procedure) to avoid collapse.

BYOL – removes negatives entirely by using an asymmetric online/target architecture: a momentum‑updated target network plus a predictor head on the online branch.

SimSiam – similar to BYOL but without a momentum encoder; it relies on a stop‑gradient on one branch together with a predictor to prevent collapse.

Barlow Twins – introduces a redundancy‑reduction loss that works without negatives.
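MoCo's key encoder and BYOL's target network are both maintained with the same exponential‑moving‑average rule rather than by backpropagation. A minimal sketch, with illustrative parameter values (real encoders have millions of parameters):

```python
def momentum_update(online_params, target_params, m=0.999):
    """EMA update of the target (key) encoder from the online (query) encoder,
    as used in MoCo and BYOL: target = m * target + (1 - m) * online."""
    return [m * t + (1.0 - m) * o for o, t in zip(online_params, target_params)]

# Toy 2-parameter "encoders": the target slowly drifts toward the online one.
online = [1.0, 2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = momentum_update(online, target, m=0.99)
```

The large momentum coefficient keeps the target encoder slowly varying, which stabilizes the targets (BYOL) or keeps the negative queue consistent (MoCo).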

Experiments reported in the talk show that BYOL, SimCSE (an NLP counterpart of SimCLR that uses dropout as its augmentation), and SwAV achieve comparable performance and often outperform earlier methods.

Application to Weibo:

1. **Hashtag generation** – The CD‑TOM model extends SimCLR to a multimodal setting, treating user‑provided hashtags as positives for the post text and learning embeddings for both posts and tags with a BERT encoder. After training, tag embeddings are stored in a Faiss index; for a new post, the top‑3 most similar tags are retrieved as suggestions.

2. **Multimodal text‑image matching** – The W‑CLIP model adopts a dual‑tower architecture (BERT for text, ResNet for images) and optimizes the similarity of matching text‑image pairs using InfoNCE. It enables tasks such as recommending images for a given post or finding posts that match a given image.
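The tag‑retrieval step of CD‑TOM described above can be sketched with brute‑force cosine similarity standing in for the Faiss index. The tag names and 2‑d embeddings below are hypothetical toys; real embeddings come from the trained BERT encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_tags(post_emb, tag_index, k=3):
    """Return the k tags whose embeddings are most similar to the post.
    tag_index: dict of tag -> embedding (brute-force stand-in for Faiss)."""
    ranked = sorted(tag_index, key=lambda t: cosine(post_emb, tag_index[t]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 2-d tag embeddings for illustration only.
tags = {"#travel": [0.9, 0.1], "#food": [0.1, 0.9],
        "#sunset": [0.8, 0.3], "#code": [-0.5, 0.2]}
suggestions = top_k_tags([1.0, 0.2], tags, k=3)  # ["#travel", "#sunset", "#food"]
```

In production one would use an approximate‑nearest‑neighbor index such as Faiss, since exhaustive scoring does not scale to millions of tags.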

Both models demonstrate robustness to noisy data and achieve strong empirical results in the experiments reported in the talk.
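W‑CLIP's dual‑tower objective applies InfoNCE symmetrically over a batch: matching text‑image pairs sit on the diagonal of a similarity matrix, and cross‑entropy is taken over both rows (text→image) and columns (image→text). A simplified pure‑Python sketch with illustrative similarity values:

```python
import math

def clip_style_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a batch similarity matrix,
    where sim[i][j] = similarity(text_i, image_j) and matching
    pairs lie on the diagonal. Averages text->image and
    image->text cross-entropy, as in CLIP-style dual-tower training."""
    n = len(sim)

    def xent(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # stabilize the log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the diagonal entry
        return total / n

    cols = [[sim[j][i] for j in range(n)] for i in range(n)]  # transpose
    return 0.5 * (xent(sim) + xent(cols))

# Aligned batch (diagonal dominant) vs. a mismatched one.
aligned = clip_style_loss([[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_style_loss([[0.1, 0.9], [0.9, 0.1]])
```

Once trained, either tower can be used alone: encode a post to retrieve candidate images, or encode an image to retrieve matching posts.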

The article concludes with a Q&A session addressing topics such as the role of negatives in contrastive learning, the use of clustering in SwAV, and the applicability of contrastive methods to supervised NLP tasks.

Overall, the talk illustrates how contrastive learning can be adapted from vision to NLP and multimodal domains, providing practical solutions for large‑scale social media content representation.

Tags: contrastive learning, multimodal, NLP, self-supervised learning, representation learning, Weibo
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
