Multimodal Representations Boost Taobao Display Advertising CTR
Alibaba’s advertising team introduces semantic‑aware contrastive learning to pre‑train multimodal image‑text embeddings and integrates them into ID‑based CTR models via SimTier and MAKE, achieving up to a 6.9% lift in Taobao display‑ad click‑through rate and improving long‑tail item performance.
This paper presents the latest advances of Alibaba’s advertising team in integrating multimodal content (images and text) into click‑through‑rate (CTR) estimation models for Taobao display ads.
Traditional large‑scale recommendation models rely on sparse ID features fed into MLPs, which cannot capture the semantic content of items. The authors identify two key challenges: (1) how to design pre‑training tasks so that multimodal data yields representations that actually improve model performance; and (2) how to incorporate those representations into ID‑based estimation models.
To address these challenges, a semantic‑aware contrastive learning (SCL) pre‑training method is proposed. Positive pairs are constructed from user search‑to‑purchase behavior chains, ensuring that paired images or texts share true semantic similarity in the e‑commerce context. Negative samples are drawn from a large MoCo memory bank, and InfoNCE loss is used to pull together positive pairs while pushing apart negatives.
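To make the contrastive objective concrete, here is a minimal NumPy sketch of an InfoNCE loss for a single query against one positive and a set of memory‑bank negatives. The function name, temperature value, and the exact pairing scheme are illustrative assumptions; the paper's SCL setup pairs items via search‑to‑purchase behavior chains and uses a MoCo‑style bank at much larger scale.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE for one query embedding: pull the positive close, push negatives away.

    query:     (d,)   embedding of one modality (e.g. an item image)
    positive:  (d,)   embedding of its semantically matched pair
    negatives: (K, d) embeddings sampled from a MoCo-style memory bank
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, p, n = l2norm(query), l2norm(positive), l2norm(negatives)
    # Cosine similarities, scaled by temperature; positive sits at index 0.
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    logits -= logits.max()  # numerical stability before softmax
    # Cross-entropy with the positive as the target class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

With a well‑aligned positive and dissimilar negatives the loss approaches zero; if a negative is closer to the query than the positive, the loss grows sharply, which is what drives the embeddings apart.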
Two algorithms, SimTier and MAKE, are introduced to apply the learned multimodal embeddings in sequence modeling. SimTier discretizes the similarity scores between the target item and historical behavior items into a fixed‑size distribution vector, simplifying the modeling of semantic similarity. MAKE decouples the optimization of multimodal parameters from ID‑based parameters by pre‑training multimodal components over multiple epochs before joint training, thus mitigating the “one‑epoch” over‑fitting issue of ID‑only models.
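The SimTier idea, as described, can be sketched in a few lines: compute cosine similarities between the target item's multimodal embedding and each historical behavior item's embedding, then bucket them into a fixed‑length tier histogram that is fed to the CTR MLP alongside the ID features. The function name, tier count, and normalization below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def simtier(target_emb, behavior_embs, num_tiers=10):
    """Hypothetical SimTier sketch: summarize target-vs-history semantic
    similarity as a fixed-size distribution over similarity tiers.

    target_emb:    (d,)   multimodal embedding of the candidate ad
    behavior_embs: (T, d) embeddings of the user's historical behavior items
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    sims = l2norm(behavior_embs) @ l2norm(target_emb)   # (T,) cosines in [-1, 1]
    edges = np.linspace(-1.0, 1.0, num_tiers + 1)
    # Assign each similarity to one of num_tiers equal-width buckets.
    tiers = np.clip(np.digitize(sims, edges[1:-1]), 0, num_tiers - 1)
    hist = np.bincount(tiers, minlength=num_tiers).astype(float)
    return hist / max(len(sims), 1)                     # normalized distribution
```

The appeal of this representation is that it is fixed‑size regardless of sequence length and hands the downstream MLP a pre‑digested similarity signal instead of raw high‑dimensional embeddings.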
Extensive experiments on CTR prediction show that SCL outperforms generic CLIP‑style pre‑training and other baselines. SimTier and MAKE each bring significant GAUC improvements over ID‑only baselines, and their combination yields a further +1.25% GAUC and +0.75% AUC. The methods also improve performance on long‑tail items.
In online deployment, a real‑time multimodal encoder service generates embeddings for newly created items within seconds, achieving >99% feature coverage and substantially reducing cold‑start latency. Since mid‑2023, multimodal features have been fully rolled out in Alibaba’s ad ranking pipeline, delivering up to +6.9% CTR lift for newly created ads.
The work demonstrates that multimodal representations can effectively complement ID features, providing a new growth curve for e‑commerce advertising models and opening avenues for future research such as multimodal long‑sequence modeling, integration with large‑language models, and generative recommendation.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.