Mask‑Guided Diffusion for Precise Product Image Generation
Mask‑Guided Diffusion combines instance‑mask training, Masked Canny ControlNet, and Mask‑guided Attribute Binding to preserve product details and bind attributes correctly, and adds targeted fixes for hand distortion and uniform‑color backgrounds, enabling merchants to quickly create high‑quality, controllable product images with Stable Diffusion.
With the rapid advancement of AIGC technologies such as Stable Diffusion, generating product images from textual prompts has become feasible. This motivates a system that automatically replaces product backgrounds and adjusts model appearances.
The authors built an AI creative production tool, Wanxiang Lab, which integrates Stable Diffusion with control models (e.g., ControlNet) to let merchants generate diverse background scenes for a single product within minutes.
Key challenges include inaccurate preservation of product features, a trade‑off between foreground detail and background blur, attribute‑binding failures, hand distortion, and difficulty generating uniform‑color backgrounds.
To control the product and other foreground elements, two methods are proposed: (1) instance‑mask training, in which high‑quality Taobao product images are segmented into instance masks for training the inpainting model, reducing over‑completion (the model redrawing or extending the product beyond its true boundary); and (2) Masked Canny ControlNet inference, a training‑free strategy that dilates the foreground mask, multiplies it with the ControlNet output, and feeds the result to the U‑Net decoder, preserving product edges while keeping background edges from interfering (see the sketch below).
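A minimal sketch of the inference‑time masking, assuming the ControlNet residuals are available as a list of multi‑resolution tensors (as in a diffusers‑style denoising loop); the function names, dilation kernel size, and tensor layout are illustrative, not the post's:

```python
import cv2
import numpy as np
import torch.nn.functional as F

def expand_mask(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Dilate a binary {0, 1} uint8 foreground mask so thin product
    edges survive downsampling to the U-Net's feature resolutions."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)

def mask_controlnet_residuals(residuals, mask):
    """Zero out ControlNet guidance outside the (expanded) product mask.

    residuals: list of tensors [B, C, H_i, W_i] from the ControlNet blocks.
    mask:      tensor [B, 1, H, W] with 1 on the product, 0 elsewhere.
    """
    masked = []
    for r in residuals:
        m = F.interpolate(mask, size=r.shape[-2:], mode="nearest")
        masked.append(r * m)  # background positions receive no edge guidance
    return masked
```

In a diffusers‑style loop, the masked list would stand in for the down‑block residuals (and likewise the mid‑block sample) passed to the U‑Net, so only the product region is constrained by Canny edges.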
For model attribute control, the paper introduces Mask‑guided Attribute Binding (MGAB). Object masks are extracted for the objects named in the prompt, and a language‑guided loss aligns the attention maps of attribute tokens with those of their object tokens, so that specified attributes (e.g., color) bind to the intended objects even under visual control conditions.
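The post does not give the MGAB loss in closed form; a plausible minimal sketch, assuming access to cross‑attention maps and a region mask per object, penalizes attribute attention that leaks outside its object's mask (the attention layout, pairing scheme, and loss form are all assumptions):

```python
import torch

def attribute_binding_loss(attn, pairs, obj_masks, eps=1e-8):
    """Illustrative language-guided binding loss for one sample.

    attn:      [heads, H*W, T] cross-attention probabilities.
    pairs:     (attr_tok, obj_tok) token-index pairs parsed from the
               prompt, e.g. ("red", "dress") -> (2, 3).
    obj_masks: {obj_tok: [H*W] binary mask} for each object's region.
    """
    loss = attn.new_zeros(())
    for attr_tok, obj_tok in pairs:
        a = attn[:, :, attr_tok].mean(0)          # attribute attention, [H*W]
        m = obj_masks[obj_tok].float()
        inside = (a * m).sum() / (a.sum() + eps)  # fraction inside the object
        loss = loss + (1.0 - inside)              # penalize leakage outside it
    return loss
```

Such a loss can be backpropagated to the latent between denoising steps (in the style of Attend‑and‑Excite), nudging each attribute's attention into its object's region.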
Hand distortion is mitigated by reconstructing a 3‑D hand model from the distorted image, rendering depth and Canny maps from it, and using ControlNet to locally repaint the hand region, markedly improving hand realism.
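A sketch of the local repaint step with diffusers, assuming the depth and Canny maps have already been rendered from the reconstructed 3‑D hand; the checkpoint IDs and file names are examples, not the ones used in production:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from PIL import Image

# Inputs: the original shot, a mask over the distorted hand, and depth/Canny
# maps rendered from the reconstructed 3-D hand (file names are placeholders).
image = Image.open("model_shot.png")
hand_mask = Image.open("hand_mask.png")     # only this region is repainted
hand_depth = Image.open("hand_depth.png")
hand_canny = Image.open("hand_canny.png")

controlnets = [
    ControlNetModel.from_pretrained(
        "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(
    prompt="a realistic hand holding the product, studio photo",
    image=image,
    mask_image=hand_mask,
    control_image=[hand_depth, hand_canny],  # depth + Canny from the 3-D hand
    num_inference_steps=30,
).images[0]
result.save("hand_fixed.png")
```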
Pure‑color background generation combines Shuffle ControlNet with a local mask and a LoRA fine‑tuned on high‑quality white‑background images. A post‑processing color matcher then maps the white background to any target color, yielding stable, uniform backgrounds.
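The color matcher itself is not described; one simple post‑processing choice, assuming the foreground mask from the earlier segmentation step, is a multiplicative transfer in which white maps to the target color and soft shadows map to darker shades of it (a sketch, not the deployed matcher):

```python
import numpy as np
from PIL import Image

def recolor_background(img: Image.Image, fg_mask: np.ndarray,
                       target_rgb=(230, 240, 255)) -> Image.Image:
    """Map a white background to target_rgb while preserving shading.

    fg_mask: [H, W] float in [0, 1], 1 on the product foreground.
    """
    arr = np.asarray(img.convert("RGB")).astype(np.float32) / 255.0
    target = np.asarray(target_rgb, dtype=np.float32) / 255.0
    tinted = arr * target                     # white -> target, shadow -> darker
    mask = fg_mask[..., None]
    out = mask * arr + (1.0 - mask) * tinted  # keep the product untouched
    return Image.fromarray((out * 255.0).clip(0, 255).astype(np.uint8))
```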
The system has been deployed in Wanxiang Lab, serving many merchants. Future work includes accelerating diffusion inference, improving foreground‑background lighting fusion, and further enhancing control precision.