AI-Generated E-commerce Advertising Images: Relationship-Aware Diffusion Models for Layout, Background, and Poster Generation
This article analyzes the challenges of manual e‑commerce ad image creation and presents JD's innovative AI solutions—including a relationship‑aware diffusion model for poster layout, a category‑common and personalized background generator, and an end‑to‑end planning‑and‑rendering framework—that achieve high‑quality automatic ad creative generation and boost advertising revenue.
E-commerce advertising images must capture consumer attention, convey brand values, and build emotional connections, yet most existing ad images still rely on labor-intensive manual design, which limits efficiency and drives up cost. Recent advances in AIGC have not fully solved issues such as missing selling-point information, poor scalability, and difficulty in personalized presentation.
To address these industry challenges, JD's advertising department introduced a series of innovative methods in 2023. First, a relationship‑aware diffusion model overlays selling‑point information onto manually created product images. Second, a background generation model that fuses category‑common and personalized styles enables large‑scale, personalized image generation. Finally, a planning‑and‑rendering poster generation model produces end‑to‑end creative images, resulting in high‑quality automatic ad creation and increased platform ad revenue.
2. Relationship‑Aware Diffusion Model for Poster Layout Generation
Poster layout generation aims to predict the positions and categories of visual elements on an image, which is crucial for aesthetic appeal and information transmission. Traditional methods focus only on geometric relationships and ignore visual content, leading to suboptimal results. JD proposes a diffusion‑based approach that treats layout generation as a noise‑to‑layout process, gradually denoising to produce the final layout.
During each sampling step, a set of Gaussian‑sampled boxes or the estimated boxes from the previous step are fed into the model. An image encoder extracts RoI features, which are then processed by a Visual‑Text Relation Awareness Module (VTRAM) that aligns visual and textual features, and a Geometric Relation Awareness Module (GRAM) that enhances RoI features based on relative positions. These modules enable users to control layout generation via predefined layouts or text changes.
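The noise-to-layout sampling loop described above can be illustrated with a toy NumPy sketch. The denoiser here is a hypothetical stand-in (a function that nudges boxes toward a fixed target layout); in the real model it would be the network with VTRAM and GRAM conditioning:

```python
import numpy as np

def sample_layout(denoise_fn, steps, n_boxes, rng):
    """Reverse-process sketch: start from Gaussian-sampled boxes and feed the
    current estimate back into the denoiser at each step (simplified)."""
    x = rng.standard_normal((n_boxes, 4))   # boxes as (cx, cy, w, h)
    for t in reversed(range(steps)):
        x = denoise_fn(x, t)                # model predicts refined boxes
    return x

# hypothetical stand-in denoiser: pulls boxes toward a fixed target layout
target = np.array([[0.5, 0.2, 0.6, 0.1],
                   [0.5, 0.8, 0.4, 0.15]])
denoise = lambda x, t: x + 0.2 * (target - x)

rng = np.random.default_rng(0)
boxes = sample_layout(denoise, steps=50, n_boxes=2, rng=rng)
```

With this stand-in denoiser the boxes converge to the target layout; the actual model instead learns the per-step refinement from data.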
2.2 Diffusion‑Based Poster Layout Generation
The diffusion model uses a Markov chain to convert noise into data samples. The process consists of a forward diffusion step that progressively adds Gaussian noise to a ground-truth layout and a reverse denoising step that removes the noise step by step to obtain the final layout.
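A minimal sketch of the forward (noising) step on layout coordinates, using the standard closed form \(q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I)\); the box values and beta schedule are illustrative:

```python
import numpy as np

def forward_diffuse(layout, t, betas, rng):
    """Add Gaussian noise to a clean layout in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = rng.standard_normal(layout.shape)
    return np.sqrt(alpha_bar) * layout + np.sqrt(1.0 - alpha_bar) * noise

# toy layout: two boxes as (cx, cy, w, h), normalized to [0, 1]
rng = np.random.default_rng(0)
layout = np.array([[0.5, 0.2, 0.6, 0.1],
                   [0.5, 0.8, 0.4, 0.15]])
betas = np.linspace(1e-4, 0.02, 1000)       # common linear schedule
noisy = forward_diffuse(layout, t=999, betas=betas, rng=rng)
```

At the final timestep the layout signal is almost entirely replaced by noise, which is the starting point of the reverse denoising process.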
2.3 Visual‑Text Relation Awareness (VTRAM)
Instead of simply concatenating visual and textual features, VTRAM aligns them via cross‑attention. For each RoI feature \(V_i\) and language feature \(L\), positional embeddings are concatenated to form a visual‑position feature, which serves as the query while the language feature acts as key and value, producing a multimodal feature \(M_i\).
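The alignment described above can be sketched as single-head cross-attention in NumPy; all shapes and the projection matrix `W_q` are hypothetical choices for illustration:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention: Q attends over K/V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
N, T, d, d_pos = 4, 6, 16, 8                 # hypothetical sizes
V_roi = rng.standard_normal((N, d))          # RoI visual features V_i
P = rng.standard_normal((N, d_pos))          # positional embeddings
L = rng.standard_normal((T, d))              # language features

# concatenate visual and positional features, project to form the query
W_q = rng.standard_normal((d + d_pos, d)) / np.sqrt(d + d_pos)
Q = np.concatenate([V_roi, P], axis=-1) @ W_q
M = cross_attention(Q, L, L)                 # multimodal features M_i
```

Each row of `M` is a convex combination of the language features, weighted by how strongly that RoI attends to each text token.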
2.4 Geometric Relation Awareness (GRAM)
GRAM computes relative position features \(R_{ij}\) between pairs of RoIs and encodes them with sinusoidal embeddings to obtain geometric weight coefficients \(R_{p_{ij}}\). These weights are normalized via softmax and combined with visual embeddings to produce final geometric features \(T\).
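A toy sketch of this pipeline, assuming a common choice of relative-position features (center offsets plus log size ratios) and a hypothetical projection `W` that maps each pair's sinusoidal embedding to a scalar weight:

```python
import numpy as np

def sinusoidal_embed(x, dim):
    """Map scalars to sinusoidal embeddings of size dim (dim even)."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = x[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def gram_features(boxes, V, W):
    """Pairwise geometric weights, softmax-normalized, applied to visual
    embeddings to produce geometric features T."""
    cx, cy, w, h = boxes.T
    dx = cx[:, None] - cx[None, :]            # center offsets
    dy = cy[:, None] - cy[None, :]
    lw = np.log(w[:, None] / w[None, :])      # log size ratios
    lh = np.log(h[:, None] / h[None, :])
    R = np.stack([dx, dy, lw, lh], axis=-1)                 # (N, N, 4)
    emb = sinusoidal_embed(R, 8).reshape(len(boxes), len(boxes), -1)
    Rp = emb @ W                              # scalar weight per RoI pair
    A = np.exp(Rp - Rp.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)        # softmax over neighbours
    return A @ V                              # geometric features T

rng = np.random.default_rng(0)
boxes = rng.uniform(0.1, 0.9, size=(5, 4))    # (cx, cy, w, h) per RoI
V = rng.standard_normal((5, 16))              # RoI visual embeddings
W = rng.standard_normal(32)                   # hypothetical projection
T = gram_features(boxes, V, W)
```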
3. Fusion of Category‑Common and Personalized Styles for Product Background Generation
Background generation aims to create natural, realistic backgrounds for product cut‑out images, improving click‑through rates. Existing methods fall into "text‑to‑image" (e.g., Stable Diffusion, ControlNet) and "image‑to‑image" approaches, each with limitations such as cumbersome prompt design or loss of fine‑grained layout information.
JD proposes a reference‑image‑based method that, given a product cut‑out, its category, and any other product ad image as reference, generates a background matching the reference's layout, composition, color, and style. The framework comprises three modules: a pretrained Stable Diffusion model, a Category‑Common Generator (CG) that extracts category information from the cut‑out, and a Personalized Generator (PG) that extracts style information from the reference image. CG and PG features are merged into the SD decoder to produce the final background.
3.1 Category‑Common Generation (CG)
CG receives the product cut‑out, a product prompt "A photo of C", and a background prompt "in the background of D" (where D encodes the category code). CG replaces the standard attention module with a mask‑aware attention that uses the product mask \(M\) to focus on product regions.
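One common way to realize mask-aware attention is to bias the attention scores so that only product-region keys are attended to; the sketch below assumes this masking scheme and toy shapes (a 3x3 feature map flattened into 9 tokens):

```python
import numpy as np

def mask_aware_attention(Q, K, V, mask):
    """Attention restricted by a binary product mask: keys at positions
    where mask == 0 receive a large negative bias and are ignored."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask[None, :] > 0, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 9, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
M = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1])   # product occupies lower-right
out = mask_aware_attention(Q, K, V, M)
```

Every output token is then a mixture of product-region values only, which is how the mask keeps attention focused on the product.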
3.2 Personalized Style Generation (PG)
PG takes a reference image and its product mask, using a ControlNet‑like architecture without any textual prompt. PG outputs multi‑scale feature maps that are masked by the product mask to ensure style information only influences the background region.
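The masking of multi-scale features can be sketched as follows; the feature shapes, the number of scales, and nearest-neighbour mask downsampling are illustrative assumptions:

```python
import numpy as np

def downsample_mask(mask, factor):
    """Nearest-neighbour downsampling of a binary (H, W) mask."""
    return mask[::factor, ::factor]

def masked_style_features(feats, product_mask):
    """Zero out product regions so reference style only influences the
    background. feats: list of (C, H, W) maps at strides 1, 2, 4, ..."""
    out = []
    for i, f in enumerate(feats):
        m = downsample_mask(product_mask, 2 ** i)
        out.append(f * (1 - m)[None, :, :])   # keep background, drop product
    return out

rng = np.random.default_rng(0)
H = W = 8
mask = np.zeros((H, W))
mask[2:6, 2:6] = 1                            # product in the centre
feats = [rng.standard_normal((4, H // 2 ** i, W // 2 ** i)) for i in range(3)]
styled = masked_style_features(feats, mask)
```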
4. End‑to‑End Product Poster Generation via Planning and Rendering
Poster generation requires coherent layout and harmonious background. JD introduces a two‑stage framework inspired by human designers: a planning stage (PlanNet) that predicts layout positions, and a rendering stage (RenderNet) that generates the final image.
4.1 Planning Network (PlanNet)
PlanNet encodes product images and textual content, then uses a Layout Decoder (two fully‑connected layers and N transformer blocks) to iteratively denoise a random layout into a refined layout. Each transformer block incorporates adaptive layer normalization, self‑attention, and cross‑attention with visual and language features.
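A single decoder block of this kind can be sketched as below; the learned projection matrices are omitted and the conditioning scale/shift are random stand-ins, so this shows only the data flow (adaptive LN, then self-attention, then cross-attention to visual and language features):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ V

def plannet_block(x, cond, visual, language):
    """One transformer block: adaptive layer norm (scale/shift from a
    conditioning embedding), self-attention over layout tokens, then
    cross-attention with visual and language features."""
    scale, shift = cond
    h = layer_norm(x) * (1 + scale) + shift   # adaptive layer norm
    h = h + attention(h, h, h)                # self-attention
    h = h + attention(h, visual, visual)      # cross-attn with image feats
    h = h + attention(h, language, language)  # cross-attn with text feats
    return h

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((5, d))               # 5 noisy layout tokens
cond = (rng.standard_normal(d) * 0.1, rng.standard_normal(d) * 0.1)
visual = rng.standard_normal((10, d))
language = rng.standard_normal((7, d))
y = plannet_block(x, cond, visual, language)
```

Stacking N such blocks and iterating over denoising timesteps yields the refined layout.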
4.2 Rendering Network (RenderNet)
RenderNet receives the layout mask \(L_m\) and product image, encoding them via a three‑layer convolutional network and a six‑layer visual encoder. A spatial‑fusion module merges layout and visual features, which are then fed into ControlNet to guide Stable Diffusion for final poster synthesis.
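The source does not specify how the spatial-fusion module merges the two feature maps; one plausible sketch is a gated elementwise fusion where the layout map decides where visual features dominate (purely an illustrative assumption):

```python
import numpy as np

def spatial_fusion(layout_feat, visual_feat):
    """Hypothetical gated fusion of same-sized feature maps: a sigmoid gate
    derived from the layout map blends layout and visual features."""
    gate = 1.0 / (1.0 + np.exp(-layout_feat))     # sigmoid gate
    return gate * visual_feat + (1 - gate) * layout_feat

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
layout_feat = rng.standard_normal((C, H, W))      # from the conv layout encoder
visual_feat = rng.standard_normal((C, H, W))      # from the visual encoder
fused = spatial_fusion(layout_feat, visual_feat)  # passed to ControlNet
```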
5. Summary & Outlook
JD's advertising department tackled the lack of selling‑point information, scalability, and personalization in AIGC for ads by (1) building a relationship‑aware diffusion model with VTRAM and GRAM for layout generation, (2) integrating category‑common and personalized style generators into diffusion models, and (3) proposing a Planning‑and‑Rendering (P&R) framework that jointly optimizes layout and background.
Future research directions include improving controllability of generated content, enhancing multimodal integration (text, image, video), and advancing personalization based on user data and behavior.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.