PosterMaker: High-Quality Product Poster Generation with Accurate Text Rendering
PosterMaker leverages a ControlNet‑based TextRenderNet with character‑level visual features and a reward‑driven foreground‑extension detector to generate high‑quality product posters that accurately render Chinese text (over 90% sentence accuracy) while preserving product fidelity, and is already deployed in Alibaba’s AI creative tool.
1. Introduction
Product image‑text posters are crucial for e‑commerce platforms. Creating high‑quality posters requires placing a product in a suitable background, selecting appropriate fonts and colors, and ensuring the text is readable and well‑aligned with the scene. This process is labor‑intensive for small merchants. Recent advances in text‑to‑image (T2I) diffusion models have motivated research on automatic poster generation. This work focuses on the task of generating a product poster given a background prompt, a foreground product image, and desired textual content with placement specifications.
Direct pipelines that first synthesize a text‑free scene and then render text suffer from a lack of training data for font, color, and style attributes. Instead, we adopt a single‑stage pixel‑wise generation approach, which learns the distribution of professionally designed posters. The main challenge is achieving precise text rendering, especially for Chinese characters, which are numerous and structurally complex. We propose to use robust character‑level visual representations as control signals and introduce TextRenderNet, a ControlNet‑based module built on Stable Diffusion 3 (SD3) that renders visual text with high accuracy.
Preserving the fidelity of the foreground product is another key issue. Inpainting‑based background generation keeps the foreground unchanged, but can produce “foreground extension” artifacts (e.g., extra shoe heels). We design a detector for such artifacts and employ it as a reward model to improve foreground fidelity.
Our PosterMaker model combines these techniques and achieves over 90% sentence‑level accuracy for Chinese text rendering.
2. Method
2.1 Task Definition
The task is to generate a poster image I from a product foreground image P, its mask M, a textual description T, and a background prompt. The generated image must contain the product at the specified location and render the provided text accurately.
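The inputs above can be grouped into a single illustrative container. This is only a sketch of the task interface; the field names and types are assumptions, not the paper's released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for one poster-generation request. Bounding boxes are
# assumed to be normalized (x0, y0, x1, y1) coordinates; the paper's exact
# placement format may differ.
BBox = Tuple[float, float, float, float]

@dataclass
class PosterTask:
    product_image: list           # P: RGB foreground product image
    product_mask: list            # M: binary mask locating the product
    texts: List[Tuple[str, BBox]] # T: target strings with placement boxes
    prompt: str                   # background description
```

A request then bundles, e.g., a product cutout, its mask, the string "SALE" with a box in the upper band, and a prompt such as "minimalist studio background".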
2.2 Overall Model Architecture
PosterMaker is built on SD3 and consists of two ControlNet branches: TextRenderNet for visual text rendering and SceneGenNet for background inpainting and product placement. Both branches are composed of cascaded MM‑DiT blocks whose outputs are added to the corresponding SD3 backbone blocks.
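The dual-branch layout can be sketched as follows. The block internals here are toy scalar maps standing in for MM‑DiT layers; only the wiring (each branch's per-block output added to the matching backbone block's output) reflects the description above.

```python
# Minimal sketch of the dual-ControlNet wiring: TextRenderNet and SceneGenNet
# each mirror the backbone's block stack, and their per-block outputs are
# added residually to the SD3 backbone's block outputs. Real blocks are
# MM-DiT layers operating on token sequences, not scalars.

def make_blocks(n, scale):
    # Toy "blocks": affine maps with a per-depth scale (stand-ins for MM-DiT).
    return [lambda h, s=scale * (i + 1): h * s for i in range(n)]

def forward(h, backbone, text_branch, scene_branch):
    for blk, txt, scn in zip(backbone, text_branch, scene_branch):
        # Each control branch's output is summed into the backbone block's output.
        h = blk(h) + txt(h) + scn(h)
    return h

backbone = make_blocks(3, 1.0)
text_branch = make_blocks(3, 0.1)    # stands in for TextRenderNet
scene_branch = make_blocks(3, 0.05)  # stands in for SceneGenNet
out = forward(1.0, backbone, text_branch, scene_branch)
```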
2.3 Character‑Level Visual Features
We extract a visual feature vector for each character using a pretrained OCR encoder, then average‑pool to obtain a compact representation. Position encodings (order and Fourier‑encoded bounding boxes) are concatenated with these vectors, passed through an adapter, and fed to TextRenderNet as control signals. This fine‑grained representation enables the model to capture stroke structures and achieve high‑precision text synthesis.
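The per-character control token described above can be sketched as below. The OCR encoder output, the pooling granularity, and the Fourier frequency count are all assumptions for illustration; the adapter (a learned projection) is omitted.

```python
import math

# Hedged sketch of building one character-level control token: average-pool
# the OCR encoder's per-patch features for the character, then concatenate
# Fourier encodings of its reading order and bounding-box coordinates.
# Dimensions and the order normalizer (100) are illustrative assumptions.

def fourier_encode(x, n_freqs=4):
    # Fourier position encoding of a scalar assumed to lie in [0, 1].
    out = []
    for k in range(n_freqs):
        f = (2 ** k) * math.pi * x
        out += [math.sin(f), math.cos(f)]
    return out

def char_token(ocr_patch_feats, char_index, bbox, n_freqs=4):
    # ocr_patch_feats: per-patch feature vectors for one character.
    dim = len(ocr_patch_feats[0])
    pooled = [sum(v[d] for v in ocr_patch_feats) / len(ocr_patch_feats)
              for d in range(dim)]                        # average pooling
    order = fourier_encode(char_index / 100.0, n_freqs)   # reading order
    box = [c for coord in bbox for c in fourier_encode(coord, n_freqs)]
    return pooled + order + box  # concatenated vector fed to the adapter

feats = [[0.1, 0.2], [0.3, 0.4]]  # toy 2-D OCR features over 2 patches
tok = char_token(feats, char_index=3, bbox=(0.1, 0.2, 0.4, 0.5))
# length: 2 (pooled) + 8 (order) + 32 (bbox, 4 coords x 8) = 42
```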
2.4 Improving Foreground Fidelity
2.4.1 Foreground‑Extension Detector
Based on HQ‑SAM, the detector receives the generated image, the foreground mask, and a bounding box, and predicts whether an extension artifact exists. The detector’s architecture is illustrated in Figure 2.
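The full detector builds on HQ‑SAM and is beyond a short sketch, but the decision it learns can be illustrated geometrically: if the product's segmented region in the generated image spills outside the supplied foreground mask by more than a tolerance, an extension artifact is flagged. The mask representation and threshold below are illustrative assumptions, not the trained model.

```python
# Toy stand-in for the foreground-extension check: compare the product's
# region in the generated image against the input foreground mask, and flag
# an artifact when the spill-over fraction exceeds a tolerance. Masks are
# same-size 2-D lists of 0/1; tol=0.02 is an arbitrary illustrative value.

def has_extension(input_mask, generated_mask, tol=0.02):
    total = sum(sum(row) for row in generated_mask) or 1
    # Foreground pixels present in the generated image but absent from the
    # supplied product mask (e.g., an extra shoe heel).
    spill = sum(bool(g) and not m
                for mr, gr in zip(input_mask, generated_mask)
                for m, g in zip(mr, gr))
    return spill / total > tol
```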
2.4.2 Reward‑Based Feedback Learning
The trained detector is used as a reward model. During fine‑tuning, a reward loss proportional to the extension score is added to the diffusion denoising loss, encouraging the generator to suppress foreground extensions while preserving other metrics.
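The combined objective can be written as a one-line sketch: the standard denoising loss plus a penalty proportional to the detector's extension score on the (approximately) denoised sample. The weight `lambda_r` is an assumption, not the paper's reported value.

```python
# Hedged sketch of the reward-augmented objective. extension_score is the
# detector's confidence in [0, 1] that the foreground was extended; adding
# it to the loss pushes the generator to suppress such artifacts.

def total_loss(denoise_loss, extension_score, lambda_r=0.1):
    # lambda_r trades off foreground fidelity against the denoising objective;
    # its value here is illustrative.
    return denoise_loss + lambda_r * extension_score
```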
2.5 Training Strategy
We adopt a two‑stage training scheme. Stage 1 freezes SceneGenNet and trains TextRenderNet on local text‑editing tasks. Stage 2 freezes TextRenderNet and trains SceneGenNet for full poster generation, allowing each module to specialize.
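The alternating freeze schedule can be sketched with trainability flags. Module and parameter names below are illustrative, not the released training code.

```python
# Sketch of the two-stage schedule: each stage freezes one branch and trains
# the other, so TextRenderNet specializes in text rendering and SceneGenNet
# in background inpainting. Parameter names are hypothetical.

trainable = {}

def set_trainable(param_names, flag):
    for name in param_names:
        trainable[name] = flag

text_render_params = ["textrender.block0", "textrender.block1"]
scene_gen_params = ["scenegen.block0", "scenegen.block1"]

# Stage 1: train TextRenderNet on local text editing; SceneGenNet frozen.
set_trainable(text_render_params, True)
set_trainable(scene_gen_params, False)

# Stage 2: freeze TextRenderNet; train SceneGenNet for full poster generation.
set_trainable(text_render_params, False)
set_trainable(scene_gen_params, True)
```

In a real framework the flags would be `requires_grad` toggles on each branch's parameters; the point is only that the two branches are never trained jointly.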
3. Experiments
We collected 160k product posters from Taobao for training and evaluation. Metrics include Sentence Accuracy (Sen. Acc), Normalized Edit Distance (NED) for text, FID for image quality, CLIP‑T for text‑image consistency, and a manually measured Foreground‑Extension Ratio.
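The two text metrics can be made concrete with a short sketch: Sentence Accuracy counts exact matches between OCR-recognized and target strings, and NED is taken here as 1 minus the length-normalized Levenshtein distance (a common convention; the paper's exact normalization may differ).

```python
# Hedged sketch of the text metrics. Predictions would come from an OCR
# system run on generated posters; here they are plain strings.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sentence_accuracy(preds, targets):
    # Fraction of strings reproduced exactly.
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def ned(preds, targets):
    # Mean of 1 - (edit distance / max length); higher is better.
    return sum(1 - edit_distance(p, t) / max(len(p), len(t), 1)
               for p, t in zip(preds, targets)) / len(targets)
```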
3.1 Quantitative Results
PosterMaker outperforms all baselines on every metric, achieving >90% sentence accuracy for Chinese text.
3.2 Qualitative Visualization
Visual comparisons show that our method renders smaller characters more precisely than competing approaches.
3.3 Ablation Studies
Ablations confirm that character‑level visual features are essential for high‑quality text rendering, and that the reward‑based feedback significantly reduces foreground extensions without harming other scores.
4. Applications
The upgraded model is deployed in Alimama’s “万相营造” AI creative tool, enabling merchants to generate product posters automatically. A batch‑production pipeline has been built and has demonstrated positive advertising performance in live experiments.
5. Conclusion
PosterMaker introduces a robust character‑level visual control signal and a reward‑driven foreground fidelity mechanism, achieving state‑of‑the‑art performance on Chinese product poster generation. The technology is already integrated into commercial tools and advertising workflows.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.