
Engineering Optimizations for Large‑Scale Advertising Recall Models: Full‑Cache Scoring and Index Flattening

Alimama's advertising platform modernized its Tree‑based Deep Model along two paths: a dual‑tower, full‑library DNN with aggressive pre‑filtering and custom GPU TopK kernels, and a flattened‑tree model that retains beam search with multi‑head attention. Memory‑aware techniques such as attention broadcast/transpose swapping, softmax approximation, tiled‑matmul splitting, TensorCore batching, INT8 quantization, and cache‑resident ad vectors deliver multi‑fold latency reductions with minimal recall loss.


The article describes how Alimama's advertising system upgraded its Tree‑based Deep Model (TDM) to meet modern latency and hardware constraints. Two major upgrade paths are presented: a full‑library scoring model that abandons the tree index in favor of a dual‑tower DNN, and an index‑flattening model that compresses the original multi‑level binary tree while retaining beam search.

Path 1 – Full‑Library Model

The dual‑tower architecture computes user and ad embeddings separately and ranks candidates by their inner product followed by a TopK filter. The main bottleneck is the TopK operation over millions of items; a pre‑filter based on percentile thresholds shrinks the candidate set by up to 1000×, and the filtered TopK is implemented with custom GPU kernels. Performance tables show several‑fold latency improvements over TensorFlow's native TopK.
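The percentile pre‑filter idea above can be sketched as follows. This is a hypothetical NumPy illustration, not Alimama's GPU kernel: a score threshold is estimated from a random sample so the full array is never sorted, most candidates are dropped, and TopK runs only on the survivors. The 2×k survivor margin and the fallback path are assumed heuristics.

```python
import numpy as np

def prefilter_topk(scores: np.ndarray, k: int, sample_size: int = 10_000) -> np.ndarray:
    """Return indices of the top-k scores, using a sampled percentile threshold."""
    n = scores.shape[0]
    # Estimate the cutoff from a random sample instead of sorting all n scores.
    sample = np.random.default_rng(0).choice(scores, size=min(sample_size, n), replace=False)
    # Aim for roughly 2*k expected survivors as a safety margin (assumed heuristic).
    quantile = 1.0 - min(1.0, 2.0 * k / n)
    threshold = np.quantile(sample, quantile)
    survivors = np.nonzero(scores >= threshold)[0]
    if survivors.size < k:
        # Threshold was too aggressive: fall back to TopK over everything.
        survivors = np.arange(n)
    # TopK only over the (much smaller) survivor set.
    top = survivors[np.argpartition(scores[survivors], -k)[-k:]]
    return top[np.argsort(scores[top])[::-1]]
```

Note that the result is still exact: if at least k items survive the threshold, the true top‑k are necessarily among them, so only the cost (not the answer) depends on the threshold estimate.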

Path 2 – Index‑Flattening Model

The deep tree is compressed to three or four layers: the first layer is expanded to thousands of nodes, which are scored exhaustively, and beam search proceeds from the second layer onward. The model adds multi‑head attention over user‑ad feature interactions. Challenges include representing the tree in TensorFlow, handling large candidate sets, and managing memory bandwidth.
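The layer‑by‑layer traversal can be sketched as a generic beam search over a flattened index. All names here (`children`, `score_fn`, the layer counts) are illustrative placeholders, not Alimama's API: the first layer is fully scored, then each remaining layer expands the beam to its children, re‑scores, and re‑prunes.

```python
def beam_search(first_layer, children, score_fn, num_layers, beam_width):
    """Beam search over a flattened tree index (illustrative sketch).

    first_layer: node ids of the expanded first layer (scored exhaustively)
    children:    dict mapping a node id to its child node ids
    score_fn:    scores a node for the current user (user features closed over)
    """
    # Layer 1: score every node, keep only the top beam_width.
    beam = sorted(first_layer, key=score_fn, reverse=True)[:beam_width]
    for _ in range(num_layers - 1):
        # Expand every beam node to its children, score, and re-prune.
        candidates = [c for node in beam for c in children.get(node, [])]
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_width]
    return beam
```

With only three or four layers, the loop runs at most two or three times, so almost all of the cost sits in the exhaustive first‑layer scoring and the per‑layer candidate scoring.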

Memory and Compute Optimizations

Several techniques are applied:
1. Swapping broadcast and transpose in attention to avoid extra memory movement.
2. Approximating softmax(AB)C as f(A)·(f(B)C) to shrink intermediate tensors.
3. Splitting the tile‑concat‑matmul in the first DNN layer to eliminate costly tiling of small‑batch user inputs.
4. Batching GEMMs on TensorCores for a 20–40% latency reduction.
5. INT8 quantization of the GEMMs, with careful handling of matrix‑layout transformations.
6. Aggressive cache residency for ad vectors, especially on ASICs with larger on‑chip memory.
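Optimization (3) rests on a simple linear‑algebra identity, shown below in a NumPy sketch with illustrative shapes: tiling one user vector across all N candidates and multiplying the concatenation by W is the same as splitting W row‑wise, computing the user half once, and broadcasting it over the ad half, so the large tiled matrix is never materialized.

```python
import numpy as np

rng = np.random.default_rng(0)
du, da, dout, n = 64, 32, 128, 1000          # illustrative dimensions
u = rng.standard_normal((1, du))             # one user embedding
A = rng.standard_normal((n, da))             # n candidate-ad embeddings
W = rng.standard_normal((du + da, dout))     # first DNN layer weights

# Naive: materialize the tiled/concatenated input, then one big matmul.
naive = np.concatenate([np.tile(u, (n, 1)), A], axis=1) @ W

# Split: user part computed once (1 x dout), broadcast-added to the ad part.
W_u, W_a = W[:du], W[du:]
split = (u @ W_u) + (A @ W_a)                # broadcasting replaces the tile

assert np.allclose(naive, split)
```

The split form replaces an (N × (du+da)) matmul with an (N × da) matmul plus a single (1 × du) one, which is exactly the saving the article attributes to avoiding tiling of small‑batch inputs.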

Beam‑Search Width Tuning

Reducing the beam width directly cuts scoring volume. Experiments show that halving the width from 150K to 75K candidates ("15 W" to "7.5 W" in the original, where W denotes 10,000) lowers recall only slightly, from 0.545 to 0.541, allowing dynamic latency‑recall trade‑offs without replacing the model.

The conclusion emphasizes the importance of co‑design between algorithms and systems: TopK pre‑filtering requires sufficient score separability in the data, while index flattening demands careful attention design and resource budgeting. The article offers practical insights for large‑scale recommendation workloads.

References: Zhu et al., 2018 (TDM); Johnson et al., 2019 (GPU similarity search); Zhou et al., 2018 (DIEN); Choromanski et al., 2020 (Performer attention).

Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
