From UI Sketch to Code: Frontend Intelligence Generates 79% of Double‑11 Modules
This article explains how Alibaba's Front‑End Intelligent project automatically converts UI design images into production‑ready code, covering layout analysis, background and foreground processing, a fusion of traditional image algorithms with deep‑learning detection, GAN‑based complex‑background extraction, experimental results and real‑world deployment.
Overview
The Front‑End Intelligent project, one of the four technical directions of Alibaba's Front‑End Committee, proved its value during the 2019 Double‑11 event, automatically generating 79.34% of the online code for new modules on Tmall and Taobao. This series shares the techniques and thoughts behind that achievement.
Why Use Images as Input
Images are the final deliverable, intuitive and deterministic, without upstream constraints.
Layout differences (e.g., listview, gridview) do not exist in visual drafts.
Image‑based pipelines support broader scenarios such as automated testing and competitor‑image reuse.
Layer stacking issues in design drafts are easier to handle when starting from images.
Layer Processing
In the D2C technical stack, the layer-processing layer identifies element categories and extracts their styles, providing data for the subsequent layout-algorithm layer.
Layout Analysis
Layout analysis splits UI images into foreground and background. Background analysis uses machine‑vision algorithms to detect color, gradient direction, and connected regions, while foreground analysis employs deep‑learning models to merge and recognize GUI fragments.
Background Analysis
Step 1: Detect background blocks with edge detectors (Sobel, Laplacian, Canny) to separate solid‑color and gradient regions. The Laplacian template is illustrated below.
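As an illustration, the Laplacian step can be sketched with a plain NumPy convolution (the function name here is illustrative, not from the project's codebase): a solid-color region yields a zero response, while edges and gradients respond strongly.

```python
import numpy as np

# Standard 3x3 Laplacian template (4-neighbour form)
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)

def laplacian_response(gray):
    """Convolve a 2-D grayscale image with the Laplacian template.

    A solid-colour background gives a zero response everywhere;
    edges and gradient boundaries produce non-zero values.
    """
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Replicate the border so the output keeps the input shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * LAPLACIAN)
    return out

# A flat image has no edges; a step edge responds strongly.
flat = np.full((5, 5), 128)
step = np.zeros((5, 5)); step[:, 3:] = 255
```

In practice `cv2.Laplacian` (or Sobel/Canny) does the same work far faster; the loop form just makes the template explicit.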
If a gradient background is detected, step 2 applies a flood‑fill algorithm to refine it.
<code>import cv2
import numpy as np

def fill_color_diffuse_water_from_img(task_out_dir, image, x, y,
                                      thres_up=(10, 10, 10),
                                      thres_down=(10, 10, 10),
                                      fill_color=(255, 255, 255)):
    # Obtain image height and width
    h, w = image.shape[:2]
    # Create the (h+2, w+2) single-channel mask required by cv2.floodFill
    mask = np.zeros([h + 2, w + 2], np.uint8)
    # Flood-fill from the seed point (x, y) with the given lower/upper
    # thresholds in fixed-range mode
    cv2.floodFill(image, mask, (x, y), fill_color,
                  thres_down, thres_up, cv2.FLOODFILL_FIXED_RANGE)
    cv2.imwrite(task_out_dir + "/ui/tmp2.png", image)
    return image, mask</code>
The original image and the processed output are shown for comparison.
Foreground Analysis
Foreground processing focuses on component integrity: connected-component analysis keeps each component from being fragmented, followed by machine-learning classification and merging until no residual fragments remain. An example is a complete item card in a waterfall-flow layout.
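The connected-component step can be sketched in plain Python (a minimal 4-connected BFS labeller; the function name and mask format are illustrative, not the project's actual implementation):

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected regions of truthy cells in a 2-D binary mask.

    Returns a dict mapping label -> list of (row, col) cells. Grouping
    foreground pixels this way keeps each GUI fragment intact before
    classification and merging.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    components, next_label = {}, 1
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not labels[r][c]:
                # Breadth-first flood from this unlabelled foreground cell
                queue = deque([(r, c)])
                labels[r][c] = next_label
                cells = []
                while queue:
                    y, x = queue.popleft()
                    cells.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                components[next_label] = cells
                next_label += 1
    return components

# Two separate fragments in a tiny mask
mask = [[1, 1, 0],
        [0, 0, 0],
        [0, 1, 1]]
```

Production code would use `cv2.connectedComponents` instead, but the logic is the same.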
Traditional edge‑gradient methods (CLAHE, Canny, morphological dilation, Douglas‑Peucker) are compared with deep‑learning detectors (Faster‑RCNN, YOLO, SSD). The fusion of both approaches yields high precision, recall, and localization (IOU).
Fusion Process
Run traditional image processing and deep-learning detection in parallel, obtaining two box sets: trbox and dlbox.
Filter trbox: keep boxes whose IOU with some dlbox exceeds a threshold (e.g., 0.8).
Filter dlbox: discard boxes whose IOU with a retained trbox exceeds the threshold.
Adjust each remaining dlbox edge to the nearest straight line within a pixel limit, without crossing trbox boundaries.
Output the union of the filtered trbox and adjusted dlbox sets as the final result.
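The IOU-based filtering steps above can be sketched as follows (the box format `(x1, y1, x2, y2)`, the 0.8 threshold, and the function names are assumptions for illustration; the edge-adjustment step is omitted):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse(trboxes, dlboxes, thresh=0.8):
    """Keep trboxes confirmed by a dlbox; drop dlboxes already covered."""
    # Step 2: keep trbox entries that some dlbox agrees with
    kept_tr = [t for t in trboxes
               if any(iou(t, d) > thresh for d in dlboxes)]
    # Step 3: discard dlbox entries duplicated by a retained trbox
    kept_dl = [d for d in dlboxes
               if all(iou(d, t) <= thresh for t in kept_tr)]
    # Step 5: the union of both filtered sets is the final result
    return kept_tr + kept_dl
```

Traditional boxes survive only when deep learning confirms them, while deep-learning boxes fill in whatever the traditional pass missed.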
Metrics
True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN) are defined, and the standard formulas for Precision = TP/(TP+FP), Recall = TP/(TP+FN), and IOU = intersection/union are used to evaluate the methods.
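These formulas translate directly into code (a trivial sketch; guard clauses for empty denominators are an addition, not part of the source):

```python
def precision(tp, fp):
    # Fraction of predicted boxes that are correct: TP / (TP + FP)
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of ground-truth boxes that were found: TP / (TP + FN)
    return tp / (tp + fn) if (tp + fn) else 0.0
```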
Experimental Results
On 50 randomly sampled Xianyu waterfall‑flow images (96 cards total), traditional methods detected 65 cards, deep‑learning detected 97, and the fused approach detected 98 with superior precision, recall, and IOU. Detailed tables and charts illustrate the comparison.
Complex Background Content Extraction
Extracting specific content from complex backgrounds is challenging for both traditional image processing (low recall) and semantic segmentation (no pixel‑level restoration). The proposed pipeline uses a detection network for content recall, gradient‑based region judgment, and a SR‑GAN to restore elements in complex regions.
Why GAN?
The SR‑GAN incorporates a feature‑map loss to preserve high‑frequency details, an adversarial loss to reduce false detections, and can reconstruct pixel values behind semi‑transparent overlays—something pure segmentation cannot achieve.
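The combined objective can be sketched schematically, following the SRGAN-style losses the text names (the weighting factor λ and the exact norm are assumptions):

```latex
\mathcal{L} =
\underbrace{\bigl\lVert \phi(I_{\text{gen}}) - \phi(I_{\text{gt}}) \bigr\rVert_2^2}_{\text{feature-map (content) loss}}
\;+\;
\lambda \underbrace{\bigl(-\log D(I_{\text{gen}})\bigr)}_{\text{adversarial loss}}
```

where $\phi$ is a fixed feature extractor (e.g., VGG feature maps), $D$ is the discriminator, $I_{\text{gen}}$ is the restored element, and $I_{\text{gt}}$ the ground truth. The feature-map term preserves high-frequency detail; the adversarial term penalizes implausible reconstructions.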
Training Flow
Business Deployments
The solution is already used in the imgcook image pipeline (≈73% accuracy for generic scenes, >92% for specific card layouts) and in Taobao’s automated testing for major promotions, achieving >97% precision and recall.
Future Work
Planned improvements include richer layout identification (listview, gridview, waterfall), higher accuracy for small objects via Feature Pyramid Networks and Cascade R‑CNN, broader page coverage beyond Xianyu and Taobao, and an image‑sample generator to lower onboarding effort.
Taobao Frontend Technology
The frontend landscape is constantly evolving, with rapid innovation across familiar languages, and our understanding of the frontend is continually refreshed along with it. Join us at Taobao, a vibrant, all-encompassing platform, to explore its limitless potential.