UI2CODE: Layout Analysis and Background/Foreground Extraction for UI Images
The UI2CODE system tackles UI layout analysis in two stages: it first extracts the background using Sobel, Laplacian and Canny edge detection together with a flood-fill algorithm, then isolates foreground components through connected-component analysis and a Faster R-CNN classifier. Fusing the two pipelines yields superior precision, recall and IoU on Xianyu app screenshots.
This article presents the UI2CODE project, focusing on the challenging step of layout analysis when converting complex UI screenshots into GUI elements.
The system is divided into two main modules: background analysis and foreground analysis.
Background analysis extracts the UI's background by applying edge‑detection algorithms such as Sobel, Laplacian and Canny. Gradient direction is used to distinguish solid‑color regions from gradient regions, and a discrete Laplacian template is employed to locate flat background areas.
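The flat-area detection described above can be sketched as follows. This is an illustrative example rather than the project's actual code: the function name `find_flat_regions`, the threshold value, and the use of a 4-neighbour Laplacian template (implemented here with `np.roll` so the snippet is self-contained) are all assumptions.

```python
import numpy as np

def find_flat_regions(gray, flat_thresh=5):
    """Mark pixels whose discrete Laplacian response is near zero as flat background."""
    g = gray.astype(np.float64)
    # 4-neighbour discrete Laplacian template: centre weight -4, neighbours +1.
    # The response is near zero wherever the image is locally flat.
    lap = (np.roll(g, 1, 0) + np.roll(g, -1, 0) +
           np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4 * g)
    return (np.abs(lap) < flat_thresh).astype(np.uint8)

img = np.full((40, 40), 128, np.uint8)                      # solid-colour background
img[10:30, 10:30] = np.indices((20, 20)).sum(0) % 2 * 255   # textured (checkerboard) patch
flat = find_flat_regions(img)
# solid-colour pixels are flagged 1, textured pixels 0
```

In the real pipeline this mask would be combined with the gradient-direction test to separate solid-colour regions from gradient regions before flood-filling.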
After identifying background blocks, a flood‑fill (diffuse‑water) algorithm removes gradient backgrounds. The core implementation is shown below:
```python
import cv2
import numpy as np

def fill_color_diffuse_water_from_img(task_out_dir, image, x, y,
                                      thres_up=(10, 10, 10),
                                      thres_down=(10, 10, 10),
                                      fill_color=(255, 255, 255)):
    # get image height and width
    h, w = image.shape[:2]
    # create mask (OpenCV requires shape (h+2, w+2), single-channel uint8)
    mask = np.zeros([h + 2, w + 2], np.uint8)
    # flood fill in fixed-range mode: each pixel is compared to the seed pixel
    cv2.floodFill(image, mask, (x, y), fill_color,
                  thres_down, thres_up, cv2.FLOODFILL_FIXED_RANGE)
    cv2.imwrite(task_out_dir + "/ui/tmp2.png", image)
    return image, mask
```

With the background cleaned, foreground analysis proceeds to extract GUI fragments. Connected-component analysis prevents fragmentation, while a deep-learning classifier identifies component types and merges fragments iteratively until no residual pieces remain.
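The connected-component step can be illustrated with a minimal sketch. This is not the UI2CODE implementation: the function `label_components` and its 4-connected BFS labelling are an assumed, simplified stand-in for grouping non-background pixels into GUI fragments.

```python
from collections import deque
import numpy as np

def label_components(mask):
    """Label 4-connected foreground regions; each label is one GUI fragment."""
    h, w = mask.shape
    labels = np.zeros((h, w), np.int32)
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and labels[sy, sx] == 0:
                count += 1
                labels[sy, sx] = count
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    # visit the 4 neighbours, labelling unvisited foreground pixels
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = count
                            q.append((ny, nx))
    return labels, count

m = np.zeros((10, 10), np.uint8)
m[1:4, 1:4] = 1   # fragment A
m[6:9, 6:9] = 1   # fragment B
labels, n = label_components(m)
# n == 2: two separate fragments, each carrying its own label
```

In practice each labelled fragment would then be classified and iteratively merged with its neighbours as described above.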
A concrete use case is the detection of waterfall‑flow cards in the Xianyu app. Traditional image‑processing steps include CLAHE contrast enhancement, Canny edge detection, morphological dilation, contour extraction, Douglas‑Peucker rectangle approximation, and horizontal/vertical projection to obtain smooth contours.
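The projection step at the end of that pipeline can be sketched in isolation. This is a hedged illustration, not the production code: the function `projection_boundaries` and the 0.6 fill-ratio threshold are assumptions, showing only how summing a binary edge map along rows and columns reveals the straight borders of waterfall-flow cards.

```python
import numpy as np

def projection_boundaries(edge_map, min_ratio=0.6):
    """Find rows/columns where enough edge pixels line up to form a card border."""
    h, w = edge_map.shape
    row_proj = edge_map.sum(axis=1)  # horizontal projection
    col_proj = edge_map.sum(axis=0)  # vertical projection
    rows = np.where(row_proj >= min_ratio * w)[0]
    cols = np.where(col_proj >= min_ratio * h)[0]
    return rows, cols

edges = np.zeros((20, 30), np.uint8)
edges[5, :] = 1    # top border of a card
edges[15, :] = 1   # bottom border
edges[:, 4] = 1    # left border
edges[:, 25] = 1   # right border
rows, cols = projection_boundaries(edges)
# rows -> [5, 15], cols -> [4, 25]: the card's bounding lines
```

The intersections of the detected rows and columns give the smooth rectangular contours the article refers to.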
For higher recall, a deep‑learning pipeline based on Faster R‑CNN is employed. The network extracts features with a backbone (e.g., ResNet), generates region proposals, performs RoI pooling, and then classifies each proposal and regresses its bounding box.
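The RoI pooling step can be made concrete with a toy example. This is an illustrative simplification, not the UI2CODE model: `roi_max_pool` and the single-channel feature map are assumptions, showing only how a variable-size proposal is reduced to a fixed-size feature grid.

```python
import numpy as np

def roi_max_pool(feature, box, out_size=2):
    """Max-pool the region `box` (x1, y1, x2, y2) into an out_size x out_size grid."""
    x1, y1, x2, y2 = box
    region = feature[y1:y2, x1:x2]
    h, w = region.shape
    # split the region into a grid of roughly equal cells
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # take the maximum activation inside each grid cell
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_max_pool(fmap, (0, 0, 4, 4))  # 4x4 proposal -> fixed 2x2 feature
```

Because every proposal comes out the same size, the downstream classification and box-regression heads can share fully connected layers.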
The two streams are fused: both methods run in parallel, their boxes are filtered by IoU thresholds, and the remaining boxes are refined by snapping edges to the nearest detected lines (within a pixel tolerance). This yields a final set of boxes that combine the high localization accuracy of traditional methods with the high recall of deep learning.
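The IoU-based filtering at the heart of the fusion can be sketched as follows. The fusion rule here is a hypothetical simplification of the article's description: keep every deep-learning box, but substitute a strongly overlapping traditional box when one exists, since the traditional pipeline localizes more tightly. The names `iou` and `fuse` and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def fuse(dl_boxes, trad_boxes, thresh=0.5):
    """Prefer a tightly matching traditional box; otherwise keep the DL box."""
    fused = []
    for d in dl_boxes:
        match = next((t for t in trad_boxes if iou(d, t) >= thresh), None)
        fused.append(match if match is not None else d)
    return fused

dl = [(10, 10, 50, 50), (60, 60, 100, 100)]
trad = [(12, 11, 49, 50)]  # tighter version of the first box
result = fuse(dl, trad)
# -> the first DL box is replaced by its traditional match; the second survives unchanged
```

The subsequent edge-snapping refinement would then adjust each surviving box toward the nearest detected line within a pixel tolerance.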
Experiments on 50 Xianyu screenshots (96 cards) show that the traditional pipeline detects 65 cards, the deep‑learning pipeline 97 cards, and the fused approach 98 cards, achieving superior precision, recall and IoU as illustrated in the result tables.
In conclusion, the hybrid approach demonstrates that integrating classic computer‑vision techniques with modern deep‑learning models can produce robust UI element extraction, while acknowledging remaining challenges such as edge‑case refinement.
Xianyu Technology
Official account of the Xianyu technology team