Fully Automatic Template‑Free Image‑Text Creative Generation System
Alibaba Alimama’s fully automatic, template‑free image‑text creative generation system uses deep‑learning models across material mining, layout synthesis, on‑image copy generation, and visual attribute rendering to produce personalized ad creatives directly from product images and metadata, achieving roughly 19 % CTR lift over prior template‑based methods.
This article introduces a next‑generation, fully automatic image‑text creative generation system developed by Alibaba Alimama. Unlike traditional template‑based or programmatic stitching approaches, the system creates diverse and personalized ad creatives directly from raw product images and textual information without any designer‑crafted templates.
System Overview : The pipeline is divided into four key stages – (1) material mining & background generation, (2) image layout generation, (3) on‑image copy generation, and (4) visual attribute estimation & rendering. Each stage is powered by deep learning models that learn from massive designer‑created ad data.
Material Mining & Generation : A classification model first filters out unsuitable images (e.g., pure‑text or collage images). A detector then locates elements that were added in post‑production (logo, text, decoration, background), and an inpainting model removes them to recover a clean product image. For targets with extreme aspect ratios, an out‑painting GAN extends the image content while preserving semantic continuity.
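To make the out‑painting step concrete, the sketch below computes how many pixels would have to be synthesized on each side of a cleaned image to reach a target aspect ratio. It is only an illustration: the source does not describe how the canvas is sized, and the function name `outpaint_padding`, the centering policy, and the integer ratio arguments are all assumptions made here.

```python
def outpaint_padding(width, height, target_w, target_h):
    """Return (left, right, top, bottom) padding in pixels.

    Hypothetical helper: the padded canvas has aspect ratio
    target_w:target_h (rounded up to whole pixels) and the original
    image stays centered. Only one axis is ever extended.
    """
    if width * target_h < height * target_w:
        # Image is too narrow for the target: extend horizontally.
        new_width = (height * target_w + target_h - 1) // target_h  # integer ceil
        extra = new_width - width
        return (extra // 2, extra - extra // 2, 0, 0)
    # Image is too wide (or already matches): extend vertically.
    new_height = (width * target_h + target_w - 1) // target_w  # integer ceil
    extra = new_height - height
    return (0, 0, extra // 2, extra - extra // 2)


# A square 800x800 product image placed on a 2:1 banner canvas.
print(outpaint_padding(800, 800, 2, 1))
```

The regions described by the returned padding are what a generative out‑painting model would be asked to fill; the larger the padding relative to the original image, the harder it is to keep the extension semantically continuous.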
Image Layout Generation : Two complementary models are explored. A GAN‑based layout generator uses a domain‑alignment module and a content‑aware cross‑attention mechanism to predict element categories and positions. An autoregressive Transformer with a VAE latent space predicts layouts as sequences, which enables diverse and controllable layouts; it also uses geometry‑aligned attention so that generated elements avoid covering the main product region.
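One way to check whether a generated layout respects the main product region is to measure how much of each element falls inside the product's bounding box. The sketch below is a simple evaluation helper written for this summary, not the attention mechanism used by the models; the box format `(x0, y0, x1, y1)` and the names `overlap_fraction` and `occlusion_score` are assumptions.

```python
def overlap_fraction(box, product):
    """Fraction of `box` area that lies inside `product`.

    Boxes are (x0, y0, x1, y1) in pixels with x0 < x1 and y0 < y1.
    """
    inter_w = max(0, min(box[2], product[2]) - max(box[0], product[0]))
    inter_h = max(0, min(box[3], product[3]) - max(box[1], product[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (inter_w * inter_h) / area


def occlusion_score(layout, product):
    """Mean overlap over all layout elements; 0.0 means nothing covers the product."""
    return sum(overlap_fraction(box, product) for _, box in layout) / len(layout)


# Hypothetical layout: a headline above the product and a logo clipping its corner.
product_box = (100, 100, 300, 300)
layout = [
    ("text", (0, 0, 100, 20)),
    ("logo", (80, 80, 120, 120)),
]
print(occlusion_score(layout, product_box))
```

A score like this could be used to rank or reject candidate layouts sampled from either generator, with lower values indicating less occlusion of the product.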
On‑Image Copy Generation : A multimodal transformer takes product metadata, image features, and layout information as inputs and autoregressively generates copy that matches the spatial context (e.g., long copy, short copy, selling points). The model is trained on 400 M product images, achieving ~90 % human‑review pass rate.
Attribute Estimation & Rendering : Self‑supervised font‑style models and color‑extraction pipelines predict font, color, gradient, and stroke attributes. An encoder‑decoder network refines these attributes using K‑means quantized image features and focal loss to handle long‑tail distributions. The final renderer composites text, logos, and backgrounds into a polished creative.
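Focal loss addresses long‑tail attribute distributions by shrinking the loss on examples the model already classifies confidently, so that rare fonts or colors contribute relatively more to training. The sketch below assumes the standard formulation, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t); the source does not give the exact variant or hyperparameters, so `gamma=2.0` and `alpha=1.0` are placeholder defaults.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    probs  : predicted class probabilities (must sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    alpha  : constant weighting factor (assumed, not from the source)
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy example (true class predicted at 0.9) versus a hard one (0.1).
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.1, 0.9], target=0)
print(easy, hard)
```

With `gamma=2.0`, the easy example's cross‑entropy is scaled by (1 - 0.9)^2 = 0.01 while the hard example's is scaled by (1 - 0.1)^2 = 0.81, which is the down‑weighting effect that helps with long‑tail classes.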
Business Impact : Large‑scale A/B tests on Alibaba’s ad platforms show CTR lifts of +19.26 % on homepage focus images and +18.94 % in the feed, compared with the previous template‑based dynamic description baseline. Visual quality assessments confirm more natural layouts, richer color schemes, and better text‑image harmony.
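CTR lift figures of this kind are normally relative changes against the control group. The sketch below shows that arithmetic under this assumption; the click and impression counts are made up for illustration and are not taken from the experiments described here.

```python
def ctr(clicks, impressions):
    """Click-through rate as a fraction."""
    return clicks / impressions


def relative_lift(ctr_treatment, ctr_control):
    """Relative CTR change of the treatment over the control."""
    return (ctr_treatment - ctr_control) / ctr_control


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
control = ctr(4_000, 100_000)    # template-based baseline
treatment = ctr(5_000, 100_000)  # template-free creatives
print(f"{relative_lift(treatment, control):+.2%}")  # prints +25.00%
```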
Conclusion : The system achieves “template‑free” generation, offering end‑to‑end automation, interpretability, and interactive editing capabilities. It demonstrates how AI can replace manual design pipelines while preserving or improving advertising effectiveness.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.