AI-Generated E-commerce Advertising Images: Relationship-Aware Diffusion Models for Layout, Background, and Poster Generation
This article analyzes the challenges of manual e‑commerce ad image creation and presents JD's innovative AI solutions—including a relationship‑aware diffusion model for poster layout, a category‑common and personalized background generator, and an end‑to‑end planning‑and‑rendering framework—that achieve high‑quality automatic ad creative generation and boost advertising revenue.
E-commerce advertising images must capture consumer attention, convey brand values, and build emotional connections, yet most existing ad images still rely on labor-intensive manual design, which limits efficiency and drives up cost. Recent advances in AIGC have not fully solved issues such as missing selling-point information, poor scalability, and difficulty in personalized presentation.
To address these industry challenges, JD's advertising department introduced a series of innovative methods in 2023. First, a relationship‑aware diffusion model overlays selling‑point information onto manually created product images. Second, a background generation model that fuses category‑common and personalized styles enables large‑scale, personalized image generation. Finally, a planning‑and‑rendering poster generation model produces end‑to‑end creative images, resulting in high‑quality automatic ad creation and increased platform ad revenue.
2. Relationship‑Aware Diffusion Model for Poster Layout Generation
Poster layout generation aims to predict the positions and categories of visual elements on an image, which is crucial for aesthetic appeal and information transmission. Traditional methods focus only on geometric relationships and ignore visual content, leading to suboptimal results. JD proposes a diffusion‑based approach that treats layout generation as a noise‑to‑layout process, gradually denoising to produce the final layout.
During each sampling step, a set of Gaussian‑sampled boxes or the estimated boxes from the previous step are fed into the model. An image encoder extracts RoI features, which are then processed by a Visual‑Text Relation Awareness Module (VTRAM) that aligns visual and textual features, and a Geometric Relation Awareness Module (GRAM) that enhances RoI features based on relative positions. These modules enable users to control layout generation via predefined layouts or text changes.
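The noise-to-layout sampling loop described above can be illustrated with a toy NumPy sketch. The denoiser here is a hypothetical stand-in (a function that nudges boxes toward a fixed target layout); in the real model it would be the network with VTRAM and GRAM conditioning:

```python
import numpy as np

def sample_layout(denoise_fn, steps, n_boxes, rng):
    """Reverse-process sketch: start from Gaussian-sampled boxes and feed the
    current estimate back into the denoiser at each step (simplified)."""
    x = rng.standard_normal((n_boxes, 4))   # boxes as (cx, cy, w, h)
    for t in reversed(range(steps)):
        x = denoise_fn(x, t)                # model predicts refined boxes
    return x

# hypothetical stand-in denoiser: pulls boxes toward a fixed target layout
target = np.array([[0.5, 0.2, 0.6, 0.1],
                   [0.5, 0.8, 0.4, 0.15]])
denoise = lambda x, t: x + 0.2 * (target - x)

rng = np.random.default_rng(0)
boxes = sample_layout(denoise, steps=50, n_boxes=2, rng=rng)
```

With this stand-in denoiser the boxes converge to the target layout; the actual model instead learns the per-step refinement from data.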
2.2 Diffusion‑Based Poster Layout Generation
The diffusion model uses a Markov chain to convert noise into data samples. The process consists of a forward diffusion step that progressively adds Gaussian noise to a ground-truth layout and a reverse denoising step that removes the noise step by step to obtain the final layout.
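A minimal sketch of the forward (noising) step on layout coordinates, using the standard closed form \(q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I)\); the box values and beta schedule are illustrative:

```python
import numpy as np

def forward_diffuse(layout, t, betas, rng):
    """Add Gaussian noise to a clean layout in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = rng.standard_normal(layout.shape)
    return np.sqrt(alpha_bar) * layout + np.sqrt(1.0 - alpha_bar) * noise

# toy layout: two boxes as (cx, cy, w, h), normalized to [0, 1]
rng = np.random.default_rng(0)
layout = np.array([[0.5, 0.2, 0.6, 0.1],
                   [0.5, 0.8, 0.4, 0.15]])
betas = np.linspace(1e-4, 0.02, 1000)       # common linear schedule
noisy = forward_diffuse(layout, t=999, betas=betas, rng=rng)
```

At the final timestep the layout signal is almost entirely replaced by noise, which is the starting point of the reverse denoising process.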
2.3 Visual‑Text Relation Awareness (VTRAM)
Instead of simply concatenating visual and textual features, VTRAM aligns them via cross‑attention. For each RoI feature \(V_i\) and language feature \(L\), positional embeddings are concatenated to form a visual‑position feature, which serves as the query while the language feature acts as key and value, producing a multimodal feature \(M_i\).
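The alignment described above can be sketched as single-head cross-attention in NumPy; all shapes and the projection matrix `W_q` are hypothetical choices for illustration:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention: Q attends over K/V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
N, T, d, d_pos = 4, 6, 16, 8                 # hypothetical sizes
V_roi = rng.standard_normal((N, d))          # RoI visual features V_i
P = rng.standard_normal((N, d_pos))          # positional embeddings
L = rng.standard_normal((T, d))              # language features

# concatenate visual and positional features, project to form the query
W_q = rng.standard_normal((d + d_pos, d)) / np.sqrt(d + d_pos)
Q = np.concatenate([V_roi, P], axis=-1) @ W_q
M = cross_attention(Q, L, L)                 # multimodal features M_i
```

Each row of `M` is a convex combination of the language features, weighted by how strongly that RoI attends to each text token.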
2.4 Geometric Relation Awareness (GRAM)
GRAM computes relative position features \(R_{ij}\) between pairs of RoIs and encodes them with sinusoidal embeddings to obtain geometric weight coefficients \(R_{p_{ij}}\). These weights are normalized via softmax and combined with visual embeddings to produce final geometric features \(T\).
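A toy sketch of this pipeline, assuming a common choice of relative-position features (center offsets plus log size ratios) and a hypothetical projection `W` that maps each pair's sinusoidal embedding to a scalar weight:

```python
import numpy as np

def sinusoidal_embed(x, dim):
    """Map scalars to sinusoidal embeddings of size dim (dim even)."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = x[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def gram_features(boxes, V, W):
    """Pairwise geometric weights, softmax-normalized, applied to visual
    embeddings to produce geometric features T."""
    cx, cy, w, h = boxes.T
    dx = cx[:, None] - cx[None, :]            # center offsets
    dy = cy[:, None] - cy[None, :]
    lw = np.log(w[:, None] / w[None, :])      # log size ratios
    lh = np.log(h[:, None] / h[None, :])
    R = np.stack([dx, dy, lw, lh], axis=-1)                 # (N, N, 4)
    emb = sinusoidal_embed(R, 8).reshape(len(boxes), len(boxes), -1)
    Rp = emb @ W                              # scalar weight per RoI pair
    A = np.exp(Rp - Rp.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)        # softmax over neighbours
    return A @ V                              # geometric features T

rng = np.random.default_rng(0)
boxes = rng.uniform(0.1, 0.9, size=(5, 4))    # (cx, cy, w, h) per RoI
V = rng.standard_normal((5, 16))              # RoI visual embeddings
W = rng.standard_normal(32)                   # hypothetical projection
T = gram_features(boxes, V, W)
```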
3. Fusion of Category‑Common and Personalized Styles for Product Background Generation
Background generation aims to create natural, realistic backgrounds for product cut‑out images, improving click‑through rates. Existing methods fall into "text‑to‑image" (e.g., Stable Diffusion, ControlNet) and "image‑to‑image" approaches, each with limitations such as cumbersome prompt design or loss of fine‑grained layout information.
JD proposes a reference‑image‑based method that, given a product cut‑out, its category, and any other product ad image as reference, generates a background matching the reference's layout, composition, color, and style. The framework comprises three modules: a pretrained Stable Diffusion model, a Category‑Common Generator (CG) that extracts category information from the cut‑out, and a Personalized Generator (PG) that extracts style information from the reference image. CG and PG features are merged into the SD decoder to produce the final background.
3.1 Category‑Common Generation (CG)
CG receives the product cut‑out, a product prompt "A photo of C", and a background prompt "in the background of D" (where D encodes the category code). CG replaces the standard attention module with a mask‑aware attention that uses the product mask \(M\) to focus on product regions.
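One common way to realize mask-aware attention is to bias the attention scores so that only product-region keys are attended to; the sketch below assumes this masking scheme and toy shapes (a 3x3 feature map flattened into 9 tokens):

```python
import numpy as np

def mask_aware_attention(Q, K, V, mask):
    """Attention restricted by a binary product mask: keys at positions
    where mask == 0 receive a large negative bias and are ignored."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask[None, :] > 0, scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 9, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
M = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1])   # product occupies lower-right
out = mask_aware_attention(Q, K, V, M)
```

Every output token is then a mixture of product-region values only, which is how the mask keeps attention focused on the product.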
3.2 Personalized Style Generation (PG)
PG takes a reference image and its product mask, using a ControlNet‑like architecture without any textual prompt. PG outputs multi‑scale feature maps that are masked by the product mask to ensure style information only influences the background region.
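The masking of multi-scale features can be sketched as follows; the feature shapes, the number of scales, and nearest-neighbour mask downsampling are illustrative assumptions:

```python
import numpy as np

def downsample_mask(mask, factor):
    """Nearest-neighbour downsampling of a binary (H, W) mask."""
    return mask[::factor, ::factor]

def masked_style_features(feats, product_mask):
    """Zero out product regions so reference style only influences the
    background. feats: list of (C, H, W) maps at strides 1, 2, 4, ..."""
    out = []
    for i, f in enumerate(feats):
        m = downsample_mask(product_mask, 2 ** i)
        out.append(f * (1 - m)[None, :, :])   # keep background, drop product
    return out

rng = np.random.default_rng(0)
H = W = 8
mask = np.zeros((H, W))
mask[2:6, 2:6] = 1                            # product in the centre
feats = [rng.standard_normal((4, H // 2 ** i, W // 2 ** i)) for i in range(3)]
styled = masked_style_features(feats, mask)
```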
4. End‑to‑End Product Poster Generation via Planning and Rendering
Poster generation requires coherent layout and harmonious background. JD introduces a two‑stage framework inspired by human designers: a planning stage (PlanNet) that predicts layout positions, and a rendering stage (RenderNet) that generates the final image.
4.1 Planning Network (PlanNet)
PlanNet encodes product images and textual content, then uses a Layout Decoder (two fully‑connected layers and N transformer blocks) to iteratively denoise a random layout into a refined layout. Each transformer block incorporates adaptive layer normalization, self‑attention, and cross‑attention with visual and language features.
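A single decoder block of this kind can be sketched as below; the learned projection matrices are omitted and the conditioning scale/shift are random stand-ins, so this shows only the data flow (adaptive LN, then self-attention, then cross-attention to visual and language features):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ V

def plannet_block(x, cond, visual, language):
    """One transformer block: adaptive layer norm (scale/shift from a
    conditioning embedding), self-attention over layout tokens, then
    cross-attention with visual and language features."""
    scale, shift = cond
    h = layer_norm(x) * (1 + scale) + shift   # adaptive layer norm
    h = h + attention(h, h, h)                # self-attention
    h = h + attention(h, visual, visual)      # cross-attn with image feats
    h = h + attention(h, language, language)  # cross-attn with text feats
    return h

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((5, d))               # 5 noisy layout tokens
cond = (rng.standard_normal(d) * 0.1, rng.standard_normal(d) * 0.1)
visual = rng.standard_normal((10, d))
language = rng.standard_normal((7, d))
y = plannet_block(x, cond, visual, language)
```

Stacking N such blocks and iterating over denoising timesteps yields the refined layout.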
4.2 Rendering Network (RenderNet)
RenderNet receives the layout mask \(L_m\) and product image, encoding them via a three‑layer convolutional network and a six‑layer visual encoder. A spatial‑fusion module merges layout and visual features, which are then fed into ControlNet to guide Stable Diffusion for final poster synthesis.
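The source does not specify how the spatial-fusion module merges the two feature maps; one plausible sketch is a gated elementwise fusion where the layout map decides where visual features dominate (purely an illustrative assumption):

```python
import numpy as np

def spatial_fusion(layout_feat, visual_feat):
    """Hypothetical gated fusion of same-sized feature maps: a sigmoid gate
    derived from the layout map blends layout and visual features."""
    gate = 1.0 / (1.0 + np.exp(-layout_feat))     # sigmoid gate
    return gate * visual_feat + (1 - gate) * layout_feat

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
layout_feat = rng.standard_normal((C, H, W))      # from the conv layout encoder
visual_feat = rng.standard_normal((C, H, W))      # from the visual encoder
fused = spatial_fusion(layout_feat, visual_feat)  # passed to ControlNet
```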
5. Summary & Outlook
JD's advertising department tackled the lack of selling‑point information, scalability, and personalization in AIGC for ads by (1) building a relationship‑aware diffusion model with VTRAM and GRAM for layout generation, (2) integrating category‑common and personalized style generators into diffusion models, and (3) proposing a Planning‑and‑Rendering (P&R) framework that jointly optimizes layout and background.
Future research directions include improving controllability of generated content, enhancing multimodal integration (text, image, video), and advancing personalization based on user data and behavior.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.