PosterMaker: High-Quality Product Poster Generation with Accurate Text Rendering
PosterMaker leverages a ControlNet‑based TextRenderNet with character‑level visual features and a reward‑driven foreground‑extension detector to generate high‑quality product posters that accurately render Chinese text (over 90% sentence accuracy) while preserving product fidelity, and is already deployed in Alibaba’s AI creative tool.
1. Introduction
Product image‑text posters are crucial for e‑commerce platforms. Creating high‑quality posters requires placing a product in a suitable background, selecting appropriate fonts and colors, and ensuring the text is readable and well‑aligned with the scene. This process is labor‑intensive for small merchants. Recent advances in text‑to‑image (T2I) diffusion models have motivated research on automatic poster generation. This work focuses on the task of generating a product poster given a background prompt, a foreground product image, and desired textual content with placement specifications.
Direct pipelines that first synthesize a text‑free scene and then render text suffer from a lack of training data for font, color, and style attributes. Instead, we adopt a single‑stage pixel‑wise generation approach, which learns the distribution of professionally designed posters. The main challenge is achieving precise text rendering, especially for Chinese characters, which are numerous and structurally complex. We propose to use robust character‑level visual representations as control signals and introduce TextRenderNet, a ControlNet‑based module built on Stable Diffusion 3 (SD3) that renders visual text with high accuracy.
Preserving the fidelity of the foreground product is another key issue. Inpainting‑based background generation keeps the foreground unchanged, but can produce “foreground extension” artifacts (e.g., extra shoe heels). We design a detector for such artifacts and employ it as a reward model to improve foreground fidelity.
Our PosterMaker model combines these techniques and achieves over 90% sentence‑level accuracy for Chinese text rendering.
2. Method
2.1 Task Definition
The task is to generate a poster image I from a product foreground image P, its mask M, a textual description T, and a background prompt. The generated image must contain the product at the specified location and render the provided text accurately.
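The inputs above can be grouped into a single illustrative container. This is only a sketch of the task interface; the field names and types are assumptions, not the paper's released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for one poster-generation request. Bounding boxes are
# assumed to be normalized (x0, y0, x1, y1) coordinates; the paper's exact
# placement format may differ.
BBox = Tuple[float, float, float, float]

@dataclass
class PosterTask:
    product_image: list           # P: RGB foreground product image
    product_mask: list            # M: binary mask locating the product
    texts: List[Tuple[str, BBox]] # T: target strings with placement boxes
    prompt: str                   # background description
```

A request then bundles, e.g., a product cutout, its mask, the string "SALE" with a box in the upper band, and a prompt such as "minimalist studio background".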
2.2 Overall Model Architecture
PosterMaker is built on SD3 and consists of two ControlNet branches: TextRenderNet for visual text rendering and SceneGenNet for background inpainting and product placement. Both branches are composed of cascaded MM‑DiT blocks whose outputs are added to the corresponding SD3 backbone blocks.
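The dual-branch layout can be sketched as follows. The block internals here are toy scalar maps standing in for MM‑DiT layers; only the wiring (each branch's per-block output added to the matching backbone block's output) reflects the description above.

```python
# Minimal sketch of the dual-ControlNet wiring: TextRenderNet and SceneGenNet
# each mirror the backbone's block stack, and their per-block outputs are
# added residually to the SD3 backbone's block outputs. Real blocks are
# MM-DiT layers operating on token sequences, not scalars.

def make_blocks(n, scale):
    # Toy "blocks": affine maps with a per-depth scale (stand-ins for MM-DiT).
    return [lambda h, s=scale * (i + 1): h * s for i in range(n)]

def forward(h, backbone, text_branch, scene_branch):
    for blk, txt, scn in zip(backbone, text_branch, scene_branch):
        # Each control branch's output is summed into the backbone block's output.
        h = blk(h) + txt(h) + scn(h)
    return h

backbone = make_blocks(3, 1.0)
text_branch = make_blocks(3, 0.1)    # stands in for TextRenderNet
scene_branch = make_blocks(3, 0.05)  # stands in for SceneGenNet
out = forward(1.0, backbone, text_branch, scene_branch)
```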
2.3 Character‑Level Visual Features
We extract a visual feature vector for each character using a pretrained OCR encoder, then average‑pool to obtain a compact representation. Position encodings (order and Fourier‑encoded bounding boxes) are concatenated with these vectors, passed through an adapter, and fed to TextRenderNet as control signals. This fine‑grained representation enables the model to capture stroke structures and achieve high‑precision text synthesis.
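The per-character control token described above can be sketched as below. The OCR encoder output, the pooling granularity, and the Fourier frequency count are all assumptions for illustration; the adapter (a learned projection) is omitted.

```python
import math

# Hedged sketch of building one character-level control token: average-pool
# the OCR encoder's per-patch features for the character, then concatenate
# Fourier encodings of its reading order and bounding-box coordinates.
# Dimensions and the order normalizer (100) are illustrative assumptions.

def fourier_encode(x, n_freqs=4):
    # Fourier position encoding of a scalar assumed to lie in [0, 1].
    out = []
    for k in range(n_freqs):
        f = (2 ** k) * math.pi * x
        out += [math.sin(f), math.cos(f)]
    return out

def char_token(ocr_patch_feats, char_index, bbox, n_freqs=4):
    # ocr_patch_feats: per-patch feature vectors for one character.
    dim = len(ocr_patch_feats[0])
    pooled = [sum(v[d] for v in ocr_patch_feats) / len(ocr_patch_feats)
              for d in range(dim)]                        # average pooling
    order = fourier_encode(char_index / 100.0, n_freqs)   # reading order
    box = [c for coord in bbox for c in fourier_encode(coord, n_freqs)]
    return pooled + order + box  # concatenated vector fed to the adapter

feats = [[0.1, 0.2], [0.3, 0.4]]  # toy 2-D OCR features over 2 patches
tok = char_token(feats, char_index=3, bbox=(0.1, 0.2, 0.4, 0.5))
# length: 2 (pooled) + 8 (order) + 32 (bbox, 4 coords x 8) = 42
```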
2.4 Improving Foreground Fidelity
2.4.1 Foreground‑Extension Detector
Based on HQ‑SAM, the detector receives the generated image, the foreground mask, and a bounding box, and predicts whether an extension artifact exists. The detector’s architecture is illustrated in Figure 2.
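The full detector builds on HQ‑SAM and is beyond a short sketch, but the decision it learns can be illustrated geometrically: if the product's segmented region in the generated image spills outside the supplied foreground mask by more than a tolerance, an extension artifact is flagged. The mask representation and threshold below are illustrative assumptions, not the trained model.

```python
# Toy stand-in for the foreground-extension check: compare the product's
# region in the generated image against the input foreground mask, and flag
# an artifact when the spill-over fraction exceeds a tolerance. Masks are
# same-size 2-D lists of 0/1; tol=0.02 is an arbitrary illustrative value.

def has_extension(input_mask, generated_mask, tol=0.02):
    total = sum(sum(row) for row in generated_mask) or 1
    # Foreground pixels present in the generated image but absent from the
    # supplied product mask (e.g., an extra shoe heel).
    spill = sum(bool(g) and not m
                for mr, gr in zip(input_mask, generated_mask)
                for m, g in zip(mr, gr))
    return spill / total > tol
```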
2.4.2 Reward‑Based Feedback Learning
The trained detector is used as a reward model. During fine‑tuning, a reward loss proportional to the extension score is added to the diffusion denoising loss, encouraging the generator to suppress foreground extensions while preserving other metrics.
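The combined objective can be written as a one-line sketch: the standard denoising loss plus a penalty proportional to the detector's extension score on the (approximately) denoised sample. The weight `lambda_r` is an assumption, not the paper's reported value.

```python
# Hedged sketch of the reward-augmented objective. extension_score is the
# detector's confidence in [0, 1] that the foreground was extended; adding
# it to the loss pushes the generator to suppress such artifacts.

def total_loss(denoise_loss, extension_score, lambda_r=0.1):
    # lambda_r trades off foreground fidelity against the denoising objective;
    # its value here is illustrative.
    return denoise_loss + lambda_r * extension_score
```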
2.5 Training Strategy
We adopt a two‑stage training scheme. Stage 1 freezes SceneGenNet and trains TextRenderNet on local text‑editing tasks. Stage 2 freezes TextRenderNet and trains SceneGenNet for full poster generation, allowing each module to specialize.
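The alternating freeze schedule can be sketched with trainability flags. Module and parameter names below are illustrative, not the released training code.

```python
# Sketch of the two-stage schedule: each stage freezes one branch and trains
# the other, so TextRenderNet specializes in text rendering and SceneGenNet
# in background inpainting. Parameter names are hypothetical.

trainable = {}

def set_trainable(param_names, flag):
    for name in param_names:
        trainable[name] = flag

text_render_params = ["textrender.block0", "textrender.block1"]
scene_gen_params = ["scenegen.block0", "scenegen.block1"]

# Stage 1: train TextRenderNet on local text editing; SceneGenNet frozen.
set_trainable(text_render_params, True)
set_trainable(scene_gen_params, False)

# Stage 2: freeze TextRenderNet; train SceneGenNet for full poster generation.
set_trainable(text_render_params, False)
set_trainable(scene_gen_params, True)
```

In a real framework the flags would be `requires_grad` toggles on each branch's parameters; the point is only that the two branches are never trained jointly.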
3. Experiments
We collected 160k product posters from Taobao for training and evaluation. Metrics include Sentence Accuracy (Sen. Acc), Normalized Edit Distance (NED) for text, FID for image quality, CLIP‑T for text‑image consistency, and a manually measured Foreground‑Extension Ratio.
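The two text metrics can be made concrete with a short sketch: Sentence Accuracy counts exact matches between OCR-recognized and target strings, and NED is taken here as 1 minus the length-normalized Levenshtein distance (a common convention; the paper's exact normalization may differ).

```python
# Hedged sketch of the text metrics. Predictions would come from an OCR
# system run on generated posters; here they are plain strings.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sentence_accuracy(preds, targets):
    # Fraction of strings reproduced exactly.
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def ned(preds, targets):
    # Mean of 1 - (edit distance / max length); higher is better.
    return sum(1 - edit_distance(p, t) / max(len(p), len(t), 1)
               for p, t in zip(preds, targets)) / len(targets)
```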
3.1 Quantitative Results
PosterMaker outperforms all baselines on every metric, achieving >90% sentence accuracy for Chinese text.
3.2 Qualitative Visualization
Visual comparisons show that our method renders smaller characters more precisely than competing approaches.
3.3 Ablation Studies
Ablations confirm that character‑level visual features are essential for high‑quality text rendering, and that the reward‑based feedback significantly reduces foreground extensions without harming other scores.
4. Applications
The upgraded model is deployed in Alimama’s “万相营造” AI creative tool, enabling merchants to generate product posters automatically. A batch‑production pipeline has been built and has demonstrated positive advertising performance in live experiments.
5. Conclusion
PosterMaker introduces a robust character‑level visual control signal and a reward‑driven foreground fidelity mechanism, achieving state‑of‑the‑art performance on Chinese product poster generation. The technology is already integrated into commercial tools and advertising workflows.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.