Artificial Intelligence

Multimodal Representations Boost Taobao Display Advertising CTR

Alibaba’s advertising team introduces semantic‑aware contrastive learning to pre‑train multimodal image‑text embeddings and integrates them into ID‑based CTR models via SimTier and MAKE, achieving up to a 6.9% lift in Taobao display‑ad click‑through rate and improved performance on long‑tail items.

Alimama Tech

This paper presents the latest advances of Alibaba’s advertising team in integrating multimodal content (images and text) into click‑through‑rate (CTR) estimation models for Taobao display ads.

Traditional large‑scale recommendation models rely on sparse ID features combined with MLPs, which cannot capture the semantic information of items. The authors identify two key challenges: (1) how multimodal data can improve model performance and how to design pre‑training tasks to obtain effective multimodal representations; (2) how to incorporate these representations into ID‑based estimation models.

To address these challenges, a semantic‑aware contrastive learning (SCL) pre‑training method is proposed. Positive pairs are constructed from user search‑to‑purchase behavior chains, ensuring that paired images or texts share true semantic similarity in the e‑commerce context. Negative samples are drawn from a large MoCo memory bank, and InfoNCE loss is used to pull together positive pairs while pushing apart negatives.
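The contrastive objective described above can be sketched in a few lines. This is a minimal NumPy illustration of InfoNCE with memory‑bank negatives, not the authors' implementation; the function name, the temperature value, and the use of raw NumPy (rather than a deep‑learning framework with a momentum encoder) are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(query, positive, memory_bank, temperature=0.07):
    """InfoNCE loss for one positive pair against memory-bank negatives (sketch).

    query, positive: (d,) L2-normalised embeddings of a semantically matched pair
    memory_bank:     (K, d) L2-normalised negative embeddings (MoCo-style queue)
    temperature:     softmax temperature (0.07 is a common choice, assumed here)
    """
    # Similarity logits: one positive score followed by K negative scores.
    l_pos = query @ positive              # scalar: query-positive similarity
    l_neg = memory_bank @ query           # (K,): query-negative similarities
    logits = np.concatenate([[l_pos], l_neg]) / temperature

    # Cross-entropy with the positive at index 0: pulls the pair together,
    # pushes the query away from all memory-bank negatives.
    logits -= logits.max()                # subtract max for numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())
```

A perfectly aligned pair (query equals positive) yields a strictly lower loss than a random pairing against the same negatives, which is the pressure that shapes the embedding space during pre‑training.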

Two algorithms, SimTier and MAKE, are introduced to apply the learned multimodal embeddings in sequence modeling. SimTier discretizes the similarity scores between the target item and historical behavior items into a fixed‑size distribution vector, simplifying the modeling of semantic similarity. MAKE decouples the optimization of multimodal parameters from ID‑based parameters by pre‑training multimodal components over multiple epochs before joint training, thus mitigating the “one‑epoch” over‑fitting issue of ID‑only models.
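The SimTier idea can be made concrete with a short sketch: compute cosine similarities between the target item and each behavior item, then bucket them into a fixed‑size histogram. This is an illustrative NumPy rendering under stated assumptions — the tier count, the uniform bin edges over [-1, 1], and the function signature are not specified by the paper summary and are assumptions here.

```python
import numpy as np

def simtier(target_emb, behavior_embs, num_tiers=10):
    """SimTier (sketch): discretise target-vs-history similarity scores
    into a fixed-size distribution vector.

    target_emb:    (d,) multimodal embedding of the candidate item
    behavior_embs: (n, d) embeddings of items in the user's behavior sequence
    Returns a (num_tiers,) count vector over similarity tiers in [-1, 1],
    consumed downstream as a dense feature by the CTR model's MLP.
    """
    # Cosine similarity between the target and every historical item.
    t = target_emb / np.linalg.norm(target_emb)
    b = behavior_embs / np.linalg.norm(behavior_embs, axis=1, keepdims=True)
    sims = b @ t                                    # (n,) cosine similarities

    # Bucket the scores into equal-width tiers; counts form the feature vector.
    edges = np.linspace(-1.0, 1.0, num_tiers + 1)   # tier boundaries (assumed)
    counts, _ = np.histogram(sims, bins=edges)
    return counts.astype(np.float32)
```

Because the output size is fixed regardless of sequence length, the semantic relationship between the target and an arbitrarily long behavior history collapses into one small vector, which is what keeps the modeling simple.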

Extensive experiments on CTR prediction show that SCL outperforms generic CLIP‑based pre‑training and other baselines. SimTier and MAKE each bring significant GAUC improvements over ID‑only baselines, and their combination yields an additional +1.25% GAUC and +0.75% AUC gain. The methods also enhance performance on long‑tail items.

In online deployment, a real‑time multimodal encoder service generates embeddings for newly created items within seconds, achieving >99% feature coverage and substantially reducing cold‑start latency. Since mid‑2023, multimodal features have been fully rolled out in Alibaba’s ad ranking pipeline, delivering up to +6.9% CTR lift for newly created ads.

The work demonstrates that multimodal representations can effectively complement ID features, providing a new growth curve for e‑commerce advertising models and opening avenues for future research such as multimodal long‑sequence modeling, integration with large‑language models, and generative recommendation.

e-commerce, contrastive learning, CTR prediction, recommendation systems, multimodal learning
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
