
EcomXL: Optimizing SDXL for Large‑Scale E‑commerce Image Generation

EcomXL adapts SDXL to large-scale e-commerce image generation. It combines tens of millions of curated images, a two-stage fine-tuning scheme (a denoising-step-weighted distillation loss followed by layer-wise model fusion), specialized ControlNets for inpainting and soft-edge consistency, and the SLAM inference accelerator, achieving sub-second generation while improving visual quality and online adoption metrics.

Alimama Tech

With the rise of generative AI, Stable Diffusion combined with ControlNet has been widely applied in e-commerce scenarios. Creating high-quality product main images is crucial for click-through and conversion, yet it incurs significant time and cost. Alimama's Wanxiang Lab aims to reduce these costs by leveraging AIGC.

In July 2023, the SDXL text-to-image model was released, offering better semantic understanding and aesthetics than SD 1.5, but its larger parameter count makes training and inference more challenging. Wanxiang Lab optimized the model's effectiveness and inference speed from multiple angles and deployed the result (EcomXL) online.

Problem definition: SDXL still struggles with e-commerce-specific requirements such as realistic human portraits, e-commerce-style backgrounds, seamless product-background fusion, and varying service-latency constraints. These gaps motivate the development of the EcomXL series.

Model optimization:

Collected tens of millions of high‑quality human and background images from public and internal sources; applied multimodal large‑model tagging and a two‑stage data‑screening strategy.

Introduced a two‑stage fine‑tuning method: full‑parameter fine‑tuning with a denoising‑step‑weighted distillation loss, followed by layer‑wise model fusion to preserve SDXL’s semantic strength while injecting e‑commerce‑specific improvements.

The distillation loss weights the denoising loss early in training (emphasizing semantic alignment) and gradually shifts to the original loss later (focusing on fine details). Layer‑wise fusion uses a weighted combination of original and fine‑tuned weights, targeting layers that most affect facial quality.
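The weighting schedule and layer-wise fusion described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' actual code: the function names (`distill_weight`, `combined_loss`, `fuse_layerwise`), the linear decay schedule, and the per-layer `alphas` dictionary are all assumptions.

```python
def distill_weight(step: int, total_steps: int) -> float:
    """Weight on the distillation (semantic-alignment) term.

    Assumed schedule: near 1.0 early in training, decaying linearly to 0.0,
    which shifts emphasis to the original denoising loss (fine details).
    """
    return max(0.0, 1.0 - step / total_steps)


def combined_loss(distill_loss: float, denoise_loss: float,
                  step: int, total_steps: int) -> float:
    """Blend the two loss terms according to the training-step schedule."""
    w = distill_weight(step, total_steps)
    return w * distill_loss + (1.0 - w) * denoise_loss


def fuse_layerwise(original: dict, finetuned: dict, alphas: dict) -> dict:
    """Per-layer weighted fusion of original and fine-tuned weights.

    Layers that most affect facial quality would get a higher alpha
    (more fine-tuned weight); a missing alpha keeps the original SDXL layer.
    """
    return {
        name: alphas.get(name, 0.0) * finetuned[name]
              + (1.0 - alphas.get(name, 0.0)) * original[name]
        for name in original
    }
```

In practice the weights would be tensors rather than floats, but the blending arithmetic is the same per element.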

EcomXL‑ControlNet extends the base text‑to‑image model with two specialized ControlNets:

Inpainting ControlNet trained first on generic random masks, then fine‑tuned on e‑commerce instance masks to preserve foreground product details while generating realistic backgrounds and human limbs.
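A rough sketch of the two mask regimes and the foreground-preserving composite. The compositing step is standard inpainting practice rather than anything EcomXL-specific, and all names here (`random_rect_mask`, `composite`) are illustrative assumptions.

```python
import numpy as np


def random_rect_mask(h: int, w: int, rng: np.random.Generator) -> np.ndarray:
    """Stage 1: a generic random rectangular mask (1 = region to inpaint)."""
    mask = np.zeros((h, w), dtype=np.float32)
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    y1, x1 = rng.integers(h // 2, h), rng.integers(w // 2, w)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def composite(generated: np.ndarray, original: np.ndarray,
              inpaint_mask: np.ndarray) -> np.ndarray:
    """Keep original pixels where mask == 0 (the product foreground),
    take generated pixels where mask == 1 (the new background)."""
    m = inpaint_mask[..., None]  # broadcast mask over the channel axis
    return m * generated + (1.0 - m) * original
```

In stage 2, the random rectangles would be replaced by instance masks segmented around the product, so the model learns to leave product pixels untouched.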

Soft‑edge ControlNet trained on millions of high‑beauty images to enforce edge consistency for product outlines and compositional elements, using a mixture of hed, pidinet, and pidisafe edge extractors.
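One plausible way to mix several edge extractors during training is to pick one at random per sample, so the ControlNet does not overfit to a single edge style. The extractor names follow the article (hed, pidinet, pidisafe); the selection logic and the placeholder lambdas are assumptions standing in for the real detector calls.

```python
import random

# Placeholder extractors; in a real pipeline these would be the actual
# HED / PiDiNet / safe-mode PiDiNet edge detectors.
EDGE_EXTRACTORS = {
    "hed": lambda img: f"hed_edges({img})",
    "pidinet": lambda img: f"pidi_edges({img})",
    "pidisafe": lambda img: f"pidisafe_edges({img})",
}


def extract_soft_edge(image, rng: random.Random):
    """Pick one extractor uniformly at random for each training sample."""
    name = rng.choice(sorted(EDGE_EXTRACTORS))
    return name, EDGE_EXTRACTORS[name](image)
```

Randomizing the extractor per sample is a common augmentation trick for condition-image robustness; the paper may weight or schedule the mixture differently.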

Inference acceleration (SLAM): To meet sub‑second generation requirements, the Sub‑Path Linear Approximation Model (SLAM) reduces inference steps from 25 to 4 while achieving quality comparable to LCM at double the steps. SLAM builds linear sub‑paths between diffusion timesteps and samples via random linear interpolation, lowering cumulative mapping error.
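The sub-path idea can be sketched in a few lines: partition the full timestep range into a handful of linear segments, then sample a training point on a segment by random linear interpolation between its endpoint latents. This is a hedged illustration of the construction only, not the SLAM training code; the function names and the uniform interpolation coefficient are assumptions.

```python
import random


def subpath_boundaries(total_t: int, num_steps: int) -> list:
    """Endpoints of the linear sub-paths, e.g. 4 segments over T = 1000."""
    return [round(i * total_t / num_steps) for i in range(num_steps + 1)]


def sample_on_subpath(x_start: list, x_end: list, rng: random.Random):
    """Random linear interpolation between sub-path endpoint latents."""
    lam = rng.random()  # assumed uniform in [0, 1)
    point = [(1 - lam) * a + lam * b for a, b in zip(x_start, x_end)]
    return point, lam
```

Because each sub-path is a straight line between nearby timesteps, the student only has to approximate short segments of the trajectory, which is how the cumulative mapping error stays low at 4 steps.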

Business impact: Compared with the previous Ecom1.5 model, EcomXL improves visual usability (+5 pts), 1‑vs‑1 win rate (+2.8 pts), and online adoption (+2 pts). The solution is now the primary model in Wanxiang Lab, supporting 3‑second generation and an “inspiration recommendation” feature that shortens end‑to‑end latency to under 5 seconds.

All code and models are released on Hugging Face (EcomXL‑ControlNet and SLAM), and the work is documented in an arXiv paper (https://arxiv.org/abs/2404.13903).

Written by Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.