
CTR-Driven Advertising Image Generation Using Multimodal Large Language Models

The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.

JD Retail Technology

This work, accepted to WWW2025, investigates the generation of advertising images for e‑commerce platforms by optimizing click‑through rate (CTR) as the primary objective. The authors explore the use of multimodal large language models (MLLMs) and introduce a novel reward model combined with a product‑centric preference optimization strategy, achieving state‑of‑the‑art performance on both online and offline metrics.

Background and Motivation: Existing ad‑image generation methods focus on aesthetic quality rather than online performance, leading to a gap between generated images and actual user preferences. Inspired by recent RLHF approaches, the authors propose training a reward model (RM) and fine‑tuning the generation model via reinforcement learning (RL) to reflect user click preferences.

Overall Solution (CAIG): The proposed CTR‑driven Advertising Image Generation (CAIG) pipeline consists of (1) pre‑training a multimodal LLM on a large e‑commerce dataset (≈1.2M samples) to inject domain knowledge, (2) training a reward model on paired ad images with CTR labels, and (3) applying a product‑centric preference optimization (PCPO) stage that uses the RM to fine‑tune a prompt model (PM) and generate ad images with Stable Diffusion + ControlNet.

E‑commerce Knowledge Pre‑training: Three pre‑training tasks are defined: image understanding, multimodal content understanding, and prompt generation. Together they teach the MLLM to comprehend product images and textual attributes, and to generate or rewrite prompts for background creation.
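As an illustration, the three tasks can be framed as instruction-tuning records. The field names and example content below are hypothetical, not the paper's actual data schema:

```python
# Hypothetical instruction-tuning records for the three pre-training tasks.
# All field names and example values are illustrative, not the paper's schema.
pretrain_samples = [
    {   # Task 1: image understanding -- describe the product image alone
        "task": "image_understanding",
        "image": "product_001.jpg",
        "instruction": "Describe the product shown in the image.",
        "response": "A stainless-steel electric kettle with a black handle.",
    },
    {   # Task 2: multimodal content understanding -- fuse image and attributes
        "task": "multimodal_understanding",
        "image": "product_001.jpg",
        "attributes": {"category": "kitchen appliance",
                       "material": "stainless steel"},
        "instruction": "Summarize the product from the image and its attributes.",
        "response": "A stainless-steel electric kettle for home kitchens.",
    },
    {   # Task 3: prompt generation -- write a background prompt for the
        # diffusion model that will render the advertising image
        "task": "prompt_generation",
        "image": "product_001.jpg",
        "instruction": "Write a background prompt for an advertising image.",
        "response": "A bright modern kitchen countertop in soft morning light.",
    },
]
```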

Reward Model Based on MLLM: CTR prediction is reformulated as a relative comparison between image pairs. The RM receives multimodal inputs (visual + textual) and outputs both a binary preference and a point‑wise CTR estimate; the training loss combines binary cross‑entropy on the pairwise comparison with a point‑level regression term.
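A minimal sketch of such a combined objective, assuming a Bradley–Terry-style pairwise term plus squared-error regression (the paper's exact weighting and point-wise formulation may differ):

```python
import math

def reward_model_loss(score_a, score_b, ctr_a, ctr_b, alpha=0.5):
    """Combined pairwise + point-wise loss for a CTR reward model (a sketch;
    alpha and the point-wise form are assumptions, not the paper's values)."""
    # Pairwise term: binary cross-entropy on which image won the CTR comparison.
    label = 1.0 if ctr_a > ctr_b else 0.0                  # 1 if image A preferred
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))       # Bradley-Terry prob.
    eps = 1e-12
    pairwise = -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
    # Point-wise term: regress each RM score toward its observed CTR.
    pointwise = (score_a - ctr_a) ** 2 + (score_b - ctr_b) ** 2
    return pairwise + alpha * pointwise
```

Scoring the preferred image higher yields a smaller pairwise term, so gradient descent on this loss pushes the RM to rank pairs the way real CTRs did while staying calibrated to absolute CTR values.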

CTR‑Driven Optimization: The task is cast as a preference selection problem. Image pairs are generated, the RM ranks them, and the generation model is fine‑tuned using Direct Preference Optimization (DPO). To avoid over‑optimizing CTR at the expense of product‑background relevance, the authors introduce Product‑Centric Preference Optimization (PCPO), which treats product information as the sole variable and constructs additional preference pairs to enforce alignment.
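For reference, the standard DPO objective on one RM-ranked pair looks as follows; this is the generic formulation, not the paper's PCPO variant, and the argument names are illustrative:

```python
import math

def dpo_loss(logp_winner, logp_loser, ref_logp_winner, ref_logp_loser, beta=0.1):
    """Standard DPO loss on a single preference pair (a generic sketch).

    logp_*: log-probabilities of the RM-preferred / rejected output under the
    policy being fine-tuned; ref_logp_*: the same under the frozen reference
    model. beta controls the implicit KL trade-off.
    """
    # Implicit reward margin between the preferred and rejected generation.
    margin = beta * ((logp_winner - ref_logp_winner)
                     - (logp_loser - ref_logp_loser))
    # Negative log-sigmoid of the margin: shrinks as the policy favors the winner.
    return math.log(1.0 + math.exp(-margin))
```

PCPO reuses this machinery but, per the paper, holds everything except the product information fixed when constructing the extra preference pairs, so the optimization cannot drift toward backgrounds that ignore the product.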

Experimental Results:

Reward Model Performance: The proposed method outperforms both closed‑source (GLM4V, Claude 3.5 Sonnet, GPT‑4o, GPT‑4V) and open‑source (VAM, CG4CTR) baselines on commercial and public datasets.

Product‑Background Relevance: PCPO maintains higher matching rates across training epochs compared to standard DPO, demonstrating better preservation of product relevance.

Online Experiments: A week‑long live test on JD.com covering 44 product categories shows significant CTR lifts over baseline MLLM generation, confirming the practical impact of the CAIG approach.

Paper: https://arxiv.org/pdf/2502.06823

Code: https://github.com/Chenguoz/CAIG

Tags: e-commerce, AI, CTR, reinforcement learning, multimodal LLM, ad image generation
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
