Ele.me Vertical Business AIGC Image Model: Architecture, Training Pipeline, and Evaluation
Ele.me built a domain-specific AIGC image model from scratch on its own data. It combines a DiT backbone; a three-stage training pipeline (transformer pre-training, prompt alignment, aesthetic fine-tuning); a text encoder pairing a pretrained T5 with a custom E-CLIP encoder, plus an E-CLIP visual encoder; and ControlNet for layout control. Evaluated with FID, CLIP alignment scores, and a human rubric, the model powers automated dish-image generation and UI asset creation for Ele.me's vertical business.
Introduction – This article describes the development of an AIGC (AI‑generated content) image model for Ele.me’s vertical business scenarios. The model is trained from scratch on Ele.me’s own data using the latest DiT architecture and natively supports “one image beats a thousand words” image prompts. It has been applied across various domains such as intelligent UI assets for search‑push, merchant‑side dish‑image generation tools, and automated visual material production.
1. Background & Pain Points
Since the release of DALL·E (2021) and Stable Diffusion 1.5 (2022), text-to-image generation has become a hot research area, and visual AIGC models have dramatically changed how visual content is produced. Within Ele.me, visual content is needed in many scenarios (merchant side, search-push, marketing, etc.), especially for dish images, which dominate this vertical domain. The sheer volume and long-tail distribution of dish categories are the key challenges for AIGC deployment.
2. Self‑Developed AIGC Model
2.1 Training Process Overview – The training pipeline follows a progressive three‑stage approach:
Stage 1: Transformer Pre‑train – learns basic pixel distribution and semantic relations of food categories with low cost.
Stage 2: Prompt Condition Alignment – aligns text prompts and image prompts.
Stage 3: Aesthetic Finetune – uses high‑quality image data to improve visual aesthetics.
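The three stages above form a sequential curriculum, each stage initializing from the previous checkpoint. A minimal sketch of that orchestration is shown below; the stage names, data labels, resolutions, and learning rates are illustrative placeholders, not Ele.me's actual settings.

```python
# Hypothetical three-stage schedule matching the pipeline described above.
# All concrete values (resolution, lr, data names) are illustrative.
STAGES = [
    {"name": "transformer_pretrain", "data": "large_weak_corpus",
     "resolution": 256, "lr": 1e-4, "goal": "pixel distribution + food semantics"},
    {"name": "prompt_alignment", "data": "caption_and_image_prompt_pairs",
     "resolution": 512, "lr": 5e-5, "goal": "align text and image conditions"},
    {"name": "aesthetic_finetune", "data": "curated_high_quality_set",
     "resolution": 1024, "lr": 1e-5, "goal": "improve visual aesthetics"},
]

def run_pipeline(train_stage):
    """Run the stages in order, warm-starting each from the previous checkpoint."""
    checkpoint = None
    for stage in STAGES:
        checkpoint = train_stage(stage, init_from=checkpoint)
    return checkpoint
```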
2.2 Model Architecture
The backbone is based on the DiT architecture, extended with both image and text conditioning. The text encoder combines a pretrained T5 encoder and a self‑developed E‑CLIP encoder to enhance domain‑specific textual understanding. The visual encoder uses an E‑CLIP image encoder trained on Ele.me’s domain data, followed by a projection layer and an image‑multi‑head cross‑attention layer. Classic LLM components such as RMSNorm, SwiGLU, RoPE, and QK‑norm are incorporated for training stability and speed.
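To make the LLM-style components concrete, here is a minimal PyTorch sketch of a DiT-style block using RMSNorm, SwiGLU, and QK-norm as named above. The dimensions and wiring are illustrative assumptions; the production model additionally carries RoPE, timestep conditioning, and the text/image cross-attention layers described in the following subsections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization, as used in modern LLMs."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(W1 x) * W2 x, projected back by W3."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate
        self.w2 = nn.Linear(dim, hidden, bias=False)  # value
        self.w3 = nn.Linear(hidden, dim, bias=False)  # output
    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class DiTBlock(nn.Module):
    """Self-attention block with QK-norm; cross-attention and RoPE omitted."""
    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # QK-norm: normalize queries and keys per head for training stability
        self.q_norm, self.k_norm = RMSNorm(dim // heads), RMSNorm(dim // heads)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.mlp = SwiGLU(dim, 4 * dim)
    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        shape = (b, n, self.heads, d // self.heads)
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        return x + self.mlp(self.norm2(x))
```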
2.2.1 Text Encoder – Utilises a pretrained T5 encoder for general semantic understanding and a custom E‑CLIP encoder to capture dish‑specific information. Their embeddings are concatenated and projected before entering the cross‑attention denoising network.
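One plausible reading of "concatenated and projected" is sketched below: each encoder's token sequence is projected to a shared conditioning width and the two sequences are joined, so both appear as cross-attention context. All dimensions here are hypothetical, and the article does not specify whether concatenation is along the feature or sequence axis.

```python
import torch
import torch.nn as nn

class TextConditionFuser(nn.Module):
    """Fuse T5 tokens (general semantics) with E-CLIP tokens (dish-specific
    semantics) into one conditioning sequence. Dims are illustrative."""
    def __init__(self, t5_dim=1024, eclip_dim=768, cond_dim=1152):
        super().__init__()
        self.t5_proj = nn.Linear(t5_dim, cond_dim)
        self.eclip_proj = nn.Linear(eclip_dim, cond_dim)
    def forward(self, t5_tokens, eclip_tokens):
        # (B, L1, t5_dim) + (B, L2, eclip_dim) -> (B, L1+L2, cond_dim)
        return torch.cat([self.t5_proj(t5_tokens),
                          self.eclip_proj(eclip_tokens)], dim=1)
```

Sequence-axis concatenation has the practical advantage that the two encoders may emit different token counts.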
2.2.2 Visual Encoder – The E‑CLIP image encoder extracts visual semantics; a small trainable projection layer converts these features into a sequence compatible with the DiT blocks, and an image‑multi‑head cross‑attention layer enables interaction with text features.
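The image-conditioning path described above can be sketched as a projection followed by a cross-attention layer in which latent tokens query the projected E-CLIP image features. Shapes and widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    """Latent tokens attend to projected E-CLIP image features.
    clip_dim/dim are hypothetical; the residual add is a common choice."""
    def __init__(self, dim=1152, clip_dim=768, heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, dim)               # projection layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, latent_tokens, clip_image_tokens):
        ctx = self.proj(clip_image_tokens)                 # (B, M, dim)
        out, _ = self.attn(latent_tokens, ctx, ctx)        # queries = latents
        return latent_tokens + out
```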
2.3 ControlNet
After achieving basic text‑to‑image generation, a ControlNet structure is added to provide fine‑grained control over layout, dish shape, and plating. Models for Canny, depth, and HED conditions are trained, as well as a ControlNet‑inpainting model for localized re‑painting of dishes.
3. Model Capability Evaluation
A domain‑specific dish evaluation dataset was built for assessment.
3.1 Objective Metrics – Focus on FID (comparing generated images with human‑matched ground‑truth dishes) and CLIP Alignment scores.
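The CLIP alignment metric reduces to an average cosine similarity between paired image and prompt embeddings. A sketch with the embedding extraction stubbed out (any CLIP variant, including a domain E-CLIP, could supply the vectors):

```python
import torch
import torch.nn.functional as F

def clip_alignment(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Mean cosine similarity over paired (N, D) image/text embeddings.
    Higher means generated images better match their prompts."""
    sims = F.cosine_similarity(image_embs, text_embs, dim=-1)
    return sims.mean().item()
```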
3.2 Subjective Evaluation – A custom AIGC dish‑evaluation rubric was designed for human rating, reflecting Ele.me’s practical use cases.
The work was contributed by Luo Te, Ke Lai, Qing Chang, Mo Li, Cai Ying, and Xuan Dong.
Ele.me Technology
Creating a better life through technology