
Contrastive Learning for Text Generation: Motivation, Methodology, Experiments, and Discussion (CoNT Framework)

This article reviews the integration of contrastive learning into text generation, explains why it helps mitigate exposure bias, introduces the CoNT framework with its three key improvements, presents extensive experiments on translation, summarization, code comment generation, and data-to-text tasks, and discusses practical deployment considerations.

DataFunTalk

Speaker : An Chen-Xin, Master's student at Fudan University; Editor : Hu Ying, Guizhou University; Produced by : DataFunTalk.

Motivation : Contrastive learning, widely successful in computer vision, can provide better representations for text generation tasks such as machine translation, summarization, and data‑to‑text. It helps alleviate exposure bias caused by the mismatch between training (teacher‑forcing) and inference (autoregressive decoding). Existing methods either rely on handcrafted negative samples or reinforcement‑learning‑style objectives, which are unstable or hard to implement.

How Contrastive Learning Addresses Exposure Bias : By exposing the decoder to both correct (positive) and erroneous (negative) samples during training, the model learns to distinguish high‑quality outputs from low‑quality ones without the instability of GANs or RL.

Simple Contrastive Scheme :

Adopt a SimCLR‑style approach: the ground‑truth target sentence is the positive sample, while other sentences in the same batch serve as negatives. The anchor is the source sequence representation.
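This in-batch scheme can be sketched in a few lines of PyTorch. This is an illustration, not CoNT's actual code: the function name, the temperature value, and the already-pooled vectors are all assumptions.

```python
# Sketch of SimCLR-style in-batch contrastive learning for generation:
# row i of tgt_vecs is the positive for row i of src_vecs; all other
# rows in the batch serve as negatives. Names are illustrative.
import torch
import torch.nn.functional as F

def info_nce(src_vecs: torch.Tensor, tgt_vecs: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """src_vecs, tgt_vecs: (batch, dim) pooled sequence representations."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    logits = src @ tgt.t() / temperature          # (batch, batch) cosine sims
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)        # diagonal entries = positives

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```

Note that every non-diagonal entry of the similarity matrix is treated identically as a negative, which is exactly the weakness discussed below.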

Limitations of Random Negative Sampling :

Random in-batch negatives are often too easy to distinguish from the ground truth, leading to weak representation learning; even large batch sizes only modestly increase the chance of sampling genuinely challenging negatives.

Recent Improvements :

SSMBA : Add discrete perturbations (random masking) and use a masked language model to reconstruct masked tokens, generating new positives.

Dropout (SimCSE‑style) : Pass the ground‑truth through a decoder with dropout twice; the two outputs form a positive pair.

CLAPS : Perturb the embedding of the ground‑truth and use the magnitude of semantic change to define positives and negatives.
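The SimCSE-style dropout trick above can be sketched compactly: the same input is passed through a dropout-enabled network twice, and the two stochastic outputs form a positive pair. The toy one-layer encoder and dropout rate here are illustrative assumptions.

```python
# SimCSE-style positive pair via dropout: two forward passes over the
# SAME input, with dropout active, give two slightly different views.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.1))
encoder.train()                        # keep dropout active on purpose

x = torch.randn(4, 512)                # pooled ground-truth embeddings (toy)
z1, z2 = encoder(x), encoder(x)        # two stochastic "views" of the same input
pos_sim = F.cosine_similarity(z1, z2, dim=-1)  # positives should stay close
```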

Remaining Bottlenecks :

Key challenges are (1) constructing meaningful positive/negative pairs, (2) choosing an appropriate contrastive loss (InfoNCE ignores inter‑negative relations), and (3) mismatch between training loss and decoding objective.

Proposed CoNT Framework :

Improvement 1 : Use model‑generated hypotheses (e.g., top‑k beam outputs) as contrastive samples.

Improvement 2 : Replace InfoNCE with a pair-wise margin ranking loss. Hypotheses (together with the gold reference) are ranked by their quality; for each pair, the higher-ranked sequence acts as the positive and the lower-ranked one as the negative, with a margin that grows with the rank gap. Unlike InfoNCE, this exploits the relations among negatives themselves.

Improvement 3 : Combine a sequence‑similarity score with the standard likelihood during decoding, using a balance factor (typically 0.5).

The overall training objective can be expressed as: Loss = NLL + λ · MarginRankingLoss, where λ balances likelihood training against the contrastive term.
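The combined objective above can be sketched as follows. This is a hedged illustration, not CoNT's exact implementation: it assumes hypotheses are pre-sorted from best to worst (e.g., by BLEU against the reference), and the rank-scaled base margin and all names are invented for the example.

```python
# Sketch: NLL plus a pair-wise margin ranking term over ranked hypotheses.
# hyp_vecs must be sorted best-first; the margin grows with the rank gap.
import torch
import torch.nn.functional as F

def ranking_loss(src_vec, hyp_vecs, base_margin=0.01):
    """src_vec: (dim,) anchor (pooled source); hyp_vecs: (k, dim), best first."""
    sims = F.cosine_similarity(src_vec.unsqueeze(0), hyp_vecs, dim=-1)  # (k,)
    loss = src_vec.new_zeros(())
    k = sims.size(0)
    for i in range(k):
        for j in range(i + 1, k):
            margin = base_margin * (j - i)              # larger gap, larger margin
            loss = loss + F.relu(margin - (sims[i] - sims[j]))
    return loss / (k * (k - 1) / 2)                     # average over pairs

def combined_loss(nll, src_vec, hyp_vecs, lam=0.5):
    return nll + lam * ranking_loss(src_vec, hyp_vecs)

demo = combined_loss(torch.tensor(2.3), torch.randn(512), torch.randn(4, 512))
```

Since each hinge term is non-negative, the contrastive part only adds pressure when a worse-ranked hypothesis scores too close to a better one.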

Experiments :

Machine translation on IWSLT14 (De‑En), WMT16 (Ro‑En), and WMT14 (En‑De) shows CoNT outperforms pure MLE and NCE baselines, especially when better positive/negative construction is used.

Summarization on XSum and Multi‑News: CoNT gains over 3 ROUGE points compared with MLE and beats the previous best (CLAPS) by roughly 2 points. Similar gains are observed with PEGASUS.

Code comment generation (Python and Java) and structured data‑to‑text (WikiBio, ToTTo) both achieve new state‑of‑the‑art results, often matching larger models while using only a base T5.

CommonGen (knowledge‑grounded generation) shows a substantial margin over previous baselines.

Discussion :

Visualization reveals clearer decision boundaries for CoNT compared with vanilla MLE, indicating more discriminative representations.

Studying the impact of the balance factor α (weighting sequence similarity against likelihood) shows that a balanced combination yields the best performance; setting α to 0 or 1 degrades results.

Practical Integration :

To add CoNT to an existing MLE‑trained model, load the checkpoint, run inference to obtain hidden‑state vectors for each beam, compute pair‑wise cosine similarities, and combine them with the log‑probability using a balance factor.
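The re-ranking step described above can be sketched as follows. Tensor shapes and names are assumptions for illustration; `alpha` is the balance factor from the talk, and the mean pooling matches the similarity computation described in the Q&A below.

```python
# Sketch of contrastive beam re-ranking: combine each beam's
# length-normalized log-probability with the cosine similarity between
# pooled source and hypothesis hidden states.
import torch
import torch.nn.functional as F

def rerank_beams(src_hidden, beam_hidden, beam_logprobs, beam_lens, alpha=0.5):
    """src_hidden: (src_len, dim) encoder states.
    beam_hidden: (k, tgt_len, dim) decoder states per beam.
    beam_logprobs: (k,) total log-probability of each beam.
    beam_lens: (k,) token counts, for length normalization."""
    src_vec = src_hidden.mean(dim=0, keepdim=True)        # (1, dim) mean pool
    hyp_vecs = beam_hidden.mean(dim=1)                    # (k, dim) mean pool
    sim = F.cosine_similarity(src_vec, hyp_vecs, dim=-1)  # (k,)
    score = alpha * sim + (1 - alpha) * beam_logprobs / beam_lens
    return score.argmax().item()                          # index of best beam

best = rerank_beams(torch.randn(10, 512), torch.randn(4, 7, 512),
                    torch.tensor([-5.0, -6.0, -4.5, -7.0]),
                    torch.tensor([7.0, 7.0, 7.0, 7.0]))
```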

Pros and Cons :

Negligible inference overhead (the only addition is a cheap similarity computation over the final beams), making deployment easy.

Training is slower because (1) a warm‑up phase with pure NLL is required, (2) beam search during training is sequential and non‑parallel, and (3) computing similarity scores for many pairs is costly, especially on CPUs.

Trade‑off Strategies :

Reduce the proportion of model‑generated samples in each batch and correspondingly increase the number of ground‑truth in‑batch samples.

Early‑stop contrastive training after the loss curve steeply declines (e.g., around 10k steps).

Assisted Decoding :

Current pipelines apply contrastive re‑ranking after beam search; future work could integrate similarity scoring every few decoding steps to guide search more effectively.

Q&A Highlights :

Sequence similarity is computed by pooling encoder outputs (source) and decoder hidden states (hypotheses) into fixed‑size vectors and measuring cosine similarity.

CoNT has not been evaluated on dialogue tasks due to mismatch between single‑turn training and multi‑turn inference.

Warm‑up should be run to convergence before adding contrastive loss to avoid excessive training time.

BLEU scores can be used as soft margins in the contrastive loss, but direct BLEU optimization is unstable.

CommonGen and CommonsenseQA are typical benchmarks for factual/knowledge consistency; evaluation metrics include CIDEr, SPICE, and FactCC for summarization.

Thank you for attending.

Tags : AI, contrastive learning, natural language processing, text generation, machine translation, CoNT, summarization
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.