
PC²: Pseudo‑Classification Based Pseudo‑Captioning for Noisy Correspondence Learning in Cross‑Modal Retrieval

The paper introduces PC², a framework that combines pseudo-classification and pseudo-captioning to mitigate noisy correspondence in cross-modal retrieval. It also presents Noise of Web (NoW), a large-scale dataset of web-page images paired with meta-descriptions, and reports significant performance gains on Flickr30K, MS-COCO, and the newly released NoW.

AntTech

Cross‑modal retrieval models aim to bridge different media modalities (image, video, text, audio) for efficient matching and search. However, noisy correspondence—where paired data have mismatched semantics—poses a major challenge, especially in industrial settings with large‑scale, automatically collected data.

To address this, researchers from Ant Security Tian-suan Lab and Nanjing University propose PC² (Pseudo-Classification based Pseudo-Captioning), a framework that makes cross-modal retrieval more robust in noisy correspondence learning (NCL). PC² consists of three modules: (1) a pseudo-classification task that treats generated image captions as class labels and trains a classifier with cross-entropy loss, (2) a pseudo-captioning step that assigns informative pseudo-captions to noisy image-text pairs based on the similarity of their pseudo-predictions, and (3) a prediction-oscillation based correspondence rectification module that uses the stability of pseudo-predictions across epochs to adjust triplet loss margins.
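To make modules (2) and (3) more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the cosine-similarity matching rule, and the linear margin-shrinking rule are all assumptions chosen for illustration, and the released code should be consulted for the actual formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_pseudo_caption(image_probs, clean_caption_probs):
    """Pick the clean caption whose pseudo-prediction (class-probability
    vector) is most similar to the noisy image's pseudo-prediction.
    Assumption: similarity is cosine similarity."""
    return max(range(len(clean_caption_probs)),
               key=lambda i: cosine(image_probs, clean_caption_probs[i]))

def oscillation(pred_history):
    """Fraction of consecutive epochs in which the predicted class changed."""
    if len(pred_history) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(pred_history, pred_history[1:]))
    return flips / (len(pred_history) - 1)

def rectified_margin(base_margin, pred_history):
    """Shrink the triplet margin for pairs with unstable pseudo-predictions.
    Assumption: a simple linear rule, margin = base * (1 - oscillation)."""
    return base_margin * (1.0 - oscillation(pred_history))

def triplet_hinge(sim_pos, sim_neg, margin):
    """Standard hinge-based triplet loss on similarity scores."""
    return max(0.0, margin - sim_pos + sim_neg)

# Hypothetical pseudo-predictions over three pseudo-classes.
clean = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
print(assign_pseudo_caption([0.2, 0.7, 0.1], clean))  # index of the closest caption: 1

# A pair whose predicted class never changes keeps the full margin;
# a pair whose prediction flips every epoch gets a zero margin.
print(rectified_margin(0.2, [3, 3, 3, 3]))
print(rectified_margin(0.2, [3, 1, 3, 2]))
```

The intuition behind the sketch is that a pair whose pseudo-prediction keeps changing between epochs is more likely to be mismatched, so it is pushed apart less aggressively in the triplet loss.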

The authors also construct a new benchmark dataset, Noise of Web (NoW), containing 100K pairs of web-page images and meta-descriptions (98K for training, 1K each for validation and test). NoW is automatically collected, exhibits a realistic noise ratio of about 70%, and provides image features pre-extracted with APT, a detector trained on mobile UI data, to ensure fair comparison.

Experimental results on several NCL datasets (Flickr30K, MS‑COCO) and the proposed NoW benchmark show that PC² consistently outperforms existing margin‑based methods, achieving notable improvements in retrieval metrics (e.g., +11.2 points on NoW). Ablation studies confirm the contribution of each component: pseudo‑classification (P‑Cls), pseudo‑captioning (P‑Cap), and correspondence rectification (CR).
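The summary does not name the retrieval metrics. Recall@K is the usual choice in image-text retrieval, so the sketch below assumes it. It also assumes a simplified one-to-one pairing in which query i matches candidate i, which is not how Flickr30K and MS-COCO are structured (they provide several captions per image). The function name and the toy similarity matrix are hypothetical.

```python
def recall_at_k(sim, k):
    """Recall@K in percent for a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j;
    the true match of query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scored strictly higher.
        rank = sum(score > row[i] for score in row)
        hits += rank < k
    return 100.0 * hits / len(sim)

# Toy example: queries 0 and 2 rank their true match first,
# query 1 ranks it second.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.5],
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))  # 100.0
```

Papers in this area often report Recall@1, Recall@5, and Recall@10 in both retrieval directions (image-to-text and text-to-image) and sometimes their sum, so a gain quoted in "points" typically refers to one of these percentages or their total.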

The paper concludes that PC² effectively enhances cross‑modal retrieval under noisy conditions and that the released NoW dataset provides a valuable resource for the NCL community. Future work includes extending PC² to other multimodal tasks and integrating it into large‑scale image‑text foundation models for downstream applications such as e‑commerce product search and risk monitoring.

Paper link: https://arxiv.org/pdf/2408.01349
Code link: https://github.com/alipay/PC2-NoiseofWeb
Dataset link: https://huggingface.co/datasets/NJUyued/NoW

multimodal learning, dataset, cross-modal retrieval, noisy correspondence, PC2, pseudo-captioning