Contrastive Image Representation Learning with Debiasing for CTR Prediction
The article proposes a three-stage contrastive learning framework (pre-training, fine-tuning, and debiasing) that generates unbiased, fine-grained image embeddings for mobile Taobao CTR prediction, delivering higher accuracy and fairness as well as a 4-5% CTR lift in large-scale offline and online evaluations.
Users can search for products on mobile Taobao by uploading images, which convey richer intent than text. However, the complexity and diversity of visual content make accurate image understanding and product retrieval challenging. The "LiJingTuZhi" project aims to incorporate image semantic information into click‑through‑rate (CTR) prediction models.
Existing image representation models rely either on coarse category-level supervision or on user-behavior-driven contrastive learning; both suffer from coarse granularity and sample-selection bias, which limits downstream CTR ranking across the full product space.
To address these issues, a three‑stage framework is proposed: (1) S1 – Pre‑training using self‑supervised contrastive learning to obtain unbiased image embeddings; (2) S2 – Fine‑tuning with supervised contrastive learning on user click data to capture fine‑grained visual features; (3) S3 – Debiasing where a debias network removes residual sample‑selection bias by aligning head and tail item representations. The image encoder outputs semantic vectors that are fed into a CTR model together with user, query and context features.
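To make the data flow concrete, the sketch below assembles a CTR model input from the image embedding plus user, query, and context features. All dimensions, variable names, and the one-hidden-layer network are assumptions for illustration; the article does not specify the CTR model's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; the real model's sizes are not given.
img_emb = rng.normal(size=(2, 16))   # image encoder output (semantic vector)
user_f  = rng.normal(size=(2, 8))    # user profile features
query_f = rng.normal(size=(2, 8))    # query features
ctx_f   = rng.normal(size=(2, 4))    # context features

# The CTR model consumes the concatenation of all feature groups.
x = np.concatenate([img_emb, user_f, query_f, ctx_f], axis=1)  # (2, 36)

# A single hidden layer stands in for the (unspecified) CTR network.
W1 = rng.normal(size=(36, 32))
W2 = rng.normal(size=(32, 1))
h = np.maximum(x @ W1, 0.0)               # ReLU hidden layer
ctr = 1.0 / (1.0 + np.exp(-(h @ W2)))     # predicted click probability in (0, 1)
```

In practice the image encoder is trained through the three stages first, and its output is then treated as one more feature group alongside the traditional CTR inputs.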
In the S1 stage, each image in a batch serves as a positive sample for its augmented view and as a negative for other images; the loss minimizes 1 – cosine similarity between positive pairs. In S2, positive samples are constructed from items clicked by the user, while negatives are drawn from a class‑aware pool to avoid “false negatives”. The S2 loss is a standard contrastive objective on the fine‑tuned embeddings.
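The two objectives above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the article gives the S1 loss as 1 minus cosine similarity on positive pairs, while for S2 an InfoNCE-style formulation with a temperature `tau` is assumed here, since the exact S2 formula is not given; function names and the negative-pool shape are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def s1_loss(z, z_aug):
    # S1: pull each image toward its augmented view; loss = 1 - cos(z, z').
    return np.mean(1.0 - cosine(z, z_aug))

def s2_loss(anchor, positive, negatives, tau=0.07):
    # S2 (assumed InfoNCE form): the positive is a clicked item's embedding,
    # negatives are drawn from a class-aware pool to avoid false negatives.
    pos = cosine(anchor, positive) / tau                       # (B,)
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    n = negatives / np.linalg.norm(negatives, axis=2, keepdims=True)
    neg = np.einsum('bd,bkd->bk', a, n) / tau                  # (B, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)       # (B, 1 + K)
    log_prob = pos - np.log(np.exp(logits).sum(axis=1))
    return -np.mean(log_prob)
```

Note that `s1_loss` is zero when an embedding and its augmented view coincide, and `s2_loss` is always non-negative, shrinking as the anchor moves toward the positive and away from the negatives.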
The S3 debias stage builds triplet samples using the unbiased S1 embeddings as a reference, and a gating network fuses the original and debiased features. The overall loss combines the debiasing term with the CTR cross‑entropy loss.
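A sketch of the S3 stage under stated assumptions: a standard margin-based triplet loss anchored on the S1 embedding, a sigmoid gate fusing the original and debiased features, and a binary cross-entropy CTR term combined with a weight `lam`. The exact loss forms, gate parameterization, and weighting in the article are not given, so these are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(e_orig, e_debias, W, b):
    # Gating network (assumed form): a per-dimension sigmoid gate decides how
    # much of the debiased representation to mix into the original one.
    gate = sigmoid(np.concatenate([e_orig, e_debias], axis=1) @ W + b)
    return gate * e_orig + (1.0 - gate) * e_debias

def triplet_debias_loss(anchor, pos, neg, margin=0.2):
    # Triplets use the unbiased S1 embedding as anchor: the matching item's
    # fine-tuned embedding should be closer than a non-matching item's.
    d_pos = np.linalg.norm(anchor - pos, axis=1)
    d_neg = np.linalg.norm(anchor - neg, axis=1)
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))

def total_loss(ctr_logits, labels, anchor, pos, neg, lam=0.1):
    # Overall objective: CTR cross-entropy plus a weighted debiasing term.
    p = sigmoid(ctr_logits)
    bce = -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))
    return bce + lam * triplet_debias_loss(anchor, pos, neg)
```

The gate lets the CTR model fall back on the original fine-grained features for head items while leaning on the debiased representation for long-tail items, which is consistent with the stated goal of aligning head and tail item representations.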
Experiments on a manually annotated retrieval dataset show that the proposed framework (S1+S2+D) outperforms baselines (ResNet‑C, S1, S2) on Hit Ratio, Long‑tail Recall and Category Recall, demonstrating both higher accuracy and fairness. Large‑scale online CTR prediction tests on a 1‑billion‑sample real‑world dataset reveal that S1+S2+D achieves the best AUC, and an A/B test reports a 4‑5% lift in CTR and a 1‑2% increase in RPM compared to the S2 baseline.
In conclusion, the contrastive pre‑train‑fine‑tune‑debias pipeline effectively mitigates sample‑selection bias, improves fine‑grained image semantics, and enhances CTR prediction performance and fairness. Future work will explore multimodal interest modeling, semantic‑traditional feature fusion, and joint training of representation and prediction models.
Alimama Tech