Contrastive Image Representation Learning with Debiasing for CTR Prediction
The article proposes a three-stage contrastive learning framework (pre-training, fine-tuning, and debiasing) that generates unbiased, fine-grained image embeddings for mobile Taobao CTR prediction, delivering higher accuracy and fairness as well as a 4-5% CTR lift in large-scale offline and online evaluations.
Users can search for products on mobile Taobao by uploading images, which convey richer intent than text. However, the complexity and diversity of visual content make accurate image understanding and product retrieval challenging. The "LiJingTuZhi" project aims to incorporate image semantic information into click‑through‑rate (CTR) prediction models.
Existing image representation models rely either on coarse category-level supervision or on user-behavior-driven contrastive learning; both suffer from coarse granularity and sample-selection bias, which limits downstream CTR ranking across the full product space.
To address these issues, a three‑stage framework is proposed: (1) S1 – Pre‑training using self‑supervised contrastive learning to obtain unbiased image embeddings; (2) S2 – Fine‑tuning with supervised contrastive learning on user click data to capture fine‑grained visual features; (3) S3 – Debiasing where a debias network removes residual sample‑selection bias by aligning head and tail item representations. The image encoder outputs semantic vectors that are fed into a CTR model together with user, query and context features.
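To make the data flow concrete, the sketch below assembles a CTR model input from the image embedding plus user, query, and context features. All dimensions, variable names, and the one-hidden-layer network are assumptions for illustration; the article does not specify the CTR model's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; the real model's sizes are not given.
img_emb = rng.normal(size=(2, 16))   # image encoder output (semantic vector)
user_f  = rng.normal(size=(2, 8))    # user profile features
query_f = rng.normal(size=(2, 8))    # query features
ctx_f   = rng.normal(size=(2, 4))    # context features

# The CTR model consumes the concatenation of all feature groups.
x = np.concatenate([img_emb, user_f, query_f, ctx_f], axis=1)  # (2, 36)

# A single hidden layer stands in for the (unspecified) CTR network.
W1 = rng.normal(size=(36, 32))
W2 = rng.normal(size=(32, 1))
h = np.maximum(x @ W1, 0.0)               # ReLU hidden layer
ctr = 1.0 / (1.0 + np.exp(-(h @ W2)))     # predicted click probability in (0, 1)
```

In practice the image encoder is trained through the three stages first, and its output is then treated as one more feature group alongside the traditional CTR inputs.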
In the S1 stage, each image in a batch serves as a positive sample for its augmented view and as a negative for other images; the loss minimizes 1 – cosine similarity between positive pairs. In S2, positive samples are constructed from items clicked by the user, while negatives are drawn from a class‑aware pool to avoid “false negatives”. The S2 loss is a standard contrastive objective on the fine‑tuned embeddings.
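The two objectives above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the article gives the S1 loss as 1 minus cosine similarity on positive pairs, while for S2 an InfoNCE-style formulation with a temperature `tau` is assumed here, since the exact S2 formula is not given; function names and the negative-pool shape are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def s1_loss(z, z_aug):
    # S1: pull each image toward its augmented view; loss = 1 - cos(z, z').
    return np.mean(1.0 - cosine(z, z_aug))

def s2_loss(anchor, positive, negatives, tau=0.07):
    # S2 (assumed InfoNCE form): the positive is a clicked item's embedding,
    # negatives are drawn from a class-aware pool to avoid false negatives.
    pos = cosine(anchor, positive) / tau                       # (B,)
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    n = negatives / np.linalg.norm(negatives, axis=2, keepdims=True)
    neg = np.einsum('bd,bkd->bk', a, n) / tau                  # (B, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)       # (B, 1 + K)
    log_prob = pos - np.log(np.exp(logits).sum(axis=1))
    return -np.mean(log_prob)
```

Note that `s1_loss` is zero when an embedding and its augmented view coincide, and `s2_loss` is always non-negative, shrinking as the anchor moves toward the positive and away from the negatives.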
The S3 debias stage builds triplet samples using the unbiased S1 embeddings as a reference, and a gating network fuses the original and debiased features. The overall loss combines the debiasing term with the CTR cross‑entropy loss.
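A sketch of the S3 stage under stated assumptions: a standard margin-based triplet loss anchored on the S1 embedding, a sigmoid gate fusing the original and debiased features, and a binary cross-entropy CTR term combined with a weight `lam`. The exact loss forms, gate parameterization, and weighting in the article are not given, so these are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(e_orig, e_debias, W, b):
    # Gating network (assumed form): a per-dimension sigmoid gate decides how
    # much of the debiased representation to mix into the original one.
    gate = sigmoid(np.concatenate([e_orig, e_debias], axis=1) @ W + b)
    return gate * e_orig + (1.0 - gate) * e_debias

def triplet_debias_loss(anchor, pos, neg, margin=0.2):
    # Triplets use the unbiased S1 embedding as anchor: the matching item's
    # fine-tuned embedding should be closer than a non-matching item's.
    d_pos = np.linalg.norm(anchor - pos, axis=1)
    d_neg = np.linalg.norm(anchor - neg, axis=1)
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))

def total_loss(ctr_logits, labels, anchor, pos, neg, lam=0.1):
    # Overall objective: CTR cross-entropy plus a weighted debiasing term.
    p = sigmoid(ctr_logits)
    bce = -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))
    return bce + lam * triplet_debias_loss(anchor, pos, neg)
```

The gate lets the CTR model fall back on the original fine-grained features for head items while leaning on the debiased representation for long-tail items, which is consistent with the stated goal of aligning head and tail item representations.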
Experiments on a manually annotated retrieval dataset show that the proposed framework (S1+S2+D) outperforms baselines (ResNet‑C, S1, S2) on Hit Ratio, Long‑tail Recall and Category Recall, demonstrating both higher accuracy and fairness. Large‑scale online CTR prediction tests on a 1‑billion‑sample real‑world dataset reveal that S1+S2+D achieves the best AUC, and an A/B test reports a 4‑5% lift in CTR and a 1‑2% increase in RPM compared to the S2 baseline.
In conclusion, the contrastive pre‑train‑fine‑tune‑debias pipeline effectively mitigates sample‑selection bias, improves fine‑grained image semantics, and enhances CTR prediction performance and fairness. Future work will explore multimodal interest modeling, semantic‑traditional feature fusion, and joint training of representation and prediction models.
Alimama Tech