Performance Optimization of Advertising Deep Learning Systems: Algorithm, System, and Hardware Co‑Design
The paper presents a holistic algorithm‑system‑hardware co‑design for advertising deep‑learning inference. It combines model pruning, approximate computing, kernel fusion, scheduling, and PCIe transfer optimizations with GPU and NPU upgrades, achieving up to a five‑fold speed‑up and significantly higher latency‑bounded QPS for large‑scale ad services.
Preface
In the era of global digitalization, digital advertising accounts for an increasing share of total ad spend (over $300 billion worldwide, $70 billion in China). Revenue from digital ads is tightly linked to click‑through‑rate (CTR) and conversion metrics, making deep‑learning‑based precise ad placement a high‑value problem.
1. Compute Demand and Supply
1.1 Compute Demand: Model Complexity
Online services must meet latency constraints, so deep‑learning inference typically runs on a CPU‑GPU/NPU heterogeneous system. For a CPU‑GPU single node (see Fig. 1.1), service capacity (QPS) can be expressed as:

QPS = Parallelism / Latency, where Latency = T_CPU + T_GPU

That is, QPS scales with request parallelism and inversely with per‑request latency, which is the sum of CPU and GPU compute times. The optimization goal for an online system is therefore latency‑bounded QPS.
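This capacity model can be sketched as a small function. A minimal sketch: the function name, the hard SLA cutoff, and the example numbers are illustrative assumptions, not the article's formulation.

```python
def latency_bounded_qps(t_cpu_ms: float, t_gpu_ms: float,
                        parallelism: int, latency_sla_ms: float) -> float:
    """Capacity-model sketch: per-request latency is the sum of CPU and
    GPU compute time; QPS scales with request parallelism, and capacity
    only counts while the latency SLA is met."""
    latency_ms = t_cpu_ms + t_gpu_ms
    if latency_ms > latency_sla_ms:
        return 0.0                      # SLA violated: no usable capacity
    return parallelism * 1000.0 / latency_ms

# Example: 10 ms CPU + 15 ms GPU, 32 concurrent requests, 50 ms SLA
print(latency_bounded_qps(10, 15, 32, 50))  # 1280.0
```

The model makes the trade‑off explicit: shaving either the CPU or the GPU term raises QPS, but only optimizations that keep total latency under the SLA count.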
1.2 Compute Supply: Heterogeneous Hardware
Processor specialization and heterogeneity are becoming trends. CPUs can no longer keep up with growing workloads, so dedicated AI processors (ASIC, NPU) are introduced. Architecture innovations (e.g., TensorCore, AVX‑512/AMX) and memory‑bandwidth advances (HBM, HMC) dramatically improve numerical performance (see Fig. 1.3).
Data‑access bandwidth remains a key limiter (the “memory wall”). Alibaba’s AI chip, the Hanguang 800 (含光800) NPU, integrates large on‑chip SRAM; NVIDIA GPUs use HBM; CPUs have lower memory bandwidth, so the dominant bottleneck varies by platform.
1.3 Problems and Optimization Methods
Advertising compute demand is rising rapidly. From DIEN to SIM to CAN models, FLOPs and memory traffic have increased 3‑fold, prompting the development of the XDL‑Blaze engine that tightly couples algorithm, system, and hardware to maximize latency‑bounded QPS.
2. Algorithm‑System‑Hardware Co‑Optimization
2.1 Algorithm Optimization
Three main directions are pursued: (1) model pruning to remove low‑impact structures; (2) approximate computing to replace expensive operators with lightweight alternatives; (3) computation compression to eliminate redundant calculations, especially during inference where batch size is small.
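As an illustration of direction (1), here is a minimal magnitude‑based pruning sketch. The criterion (smallest absolute weights) and the API are assumptions for illustration; the article does not specify which pruning method is used.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out (at least) the fraction `sparsity` of weights with the
    smallest absolute value -- a structure-agnostic pruning sketch.
    Ties at the threshold may prune slightly more than requested."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.1, -0.5, 0.02, 1.0])
print(magnitude_prune(w, 0.5))  # [ 0.  -0.5  0.   1. ]
```

In production, pruned structures would be removed from the graph entirely rather than zeroed, so the saved FLOPs actually translate into latency.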
2.2 System Optimization: GPU Example
2.2.1 Compute‑Intensive Operator Optimization
Standard cuBLAS performs well for typical GEMM sizes, but long‑thin GEMM shapes common in advertising suffer up to 3× slower execution. TVM‑generated kernels achieve >7× speed‑up for these cases. Mixed‑precision (FP32 + FP16) with TensorCore yields an additional 1.3‑2× acceleration.
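The numerical side of the mixed‑precision path can be emulated on CPU: inputs rounded to FP16, products accumulated in FP32, roughly the numeric contract of TensorCore FP16 GEMM. This sketch (including the long‑thin example shape, which is assumed) says nothing about speed, only about accuracy.

```python
import numpy as np

def mixed_precision_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Round inputs to FP16, then multiply-accumulate in FP32 --
    an emulation of TensorCore-style mixed-precision GEMM."""
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    return a16 @ b16

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 64)).astype(np.float32)  # long-thin M >> N, K
b = rng.standard_normal((64, 64)).astype(np.float32)
ref = a @ b
err = np.abs(mixed_precision_matmul(a, b) - ref).max() / np.abs(ref).max()
```

Running such a check offline is a cheap way to confirm that the FP16 input rounding stays within tolerance for a given model before enabling the fast path in serving.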
2.2.2 OP/Kernel Fusion
Kernel launch and memory traffic dominate performance. XLA/MLIR‑based kernel fusion and pattern‑fusion reduce both memory reads/writes and the number of kernel launches. For example, a Gather + BatchedMatMul hotspot (12.4 MB traffic) is fused into a custom IndicatorMatMul op, cutting global memory traffic by 96% and improving QPS by 2.6× (see Fig. 2.3).
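The semantics of that fusion can be sketched in NumPy. Names and shapes here are illustrative: the real IndicatorMatMul is a custom GPU kernel, and its win is in global‑memory traffic, which this functional sketch only hints at.

```python
import numpy as np

def gather_batched_matmul(x, weights, indices):
    """Unfused reference: Gather materializes a (B, K, N) copy of the
    selected weight matrices, then BatchedMatMul reads it back."""
    gathered = weights[indices]               # extra global-memory traffic
    return np.einsum('bk,bkn->bn', x, gathered)

def indicator_matmul(x, weights, indices):
    """Fused sketch: index each weight matrix inside the multiply itself,
    so the gathered intermediate never needs to hit global memory."""
    out = np.empty((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for b, idx in enumerate(indices):
        out[b] = x[b] @ weights[idx]
    return out
```

Both functions compute the same result; the fused form simply skips materializing the gathered tensor, which is where the reported 96% traffic reduction comes from on the GPU.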
2.2.3 Scheduling and Overhead Optimization
Online inference uses many small‑batch requests, making kernel launch overhead critical. Multi‑stream execution and multiple CUDA contexts reduce mutex contention (Fig. 2.4‑2.5). Virtual devices allow parallel requests to run on separate streams/contexts.
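One way to picture the virtual‑device idea is a round‑robin router that pins each incoming request to its own stream/context slot. This is a threading sketch with placeholder stream names, not real CUDA handles; the class and its API are assumptions for illustration.

```python
import itertools
import threading

class VirtualDeviceScheduler:
    """Sketch of 'virtual devices': route concurrent requests round-robin
    onto independent streams/contexts so they do not all serialize on a
    single global lock. Stream names here are stand-ins for CUDA streams."""

    def __init__(self, num_streams: int):
        self._streams = [f"stream-{i}" for i in range(num_streams)]
        self._rr = itertools.cycle(range(num_streams))
        self._lock = threading.Lock()  # only guards the tiny assignment step

    def assign(self) -> str:
        with self._lock:
            return self._streams[next(self._rr)]
```

The design point is that the only shared state is the round‑robin counter; all per‑request work then proceeds on its own stream without contending for a process‑wide mutex.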
2.2.4 PCIe Copy Optimization
Embedding features require frequent CPU‑GPU transfers. By aggregating small transfers, PCIe copy latency drops from 4.5 ms to 400 µs.
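The transfer‑aggregation idea can be sketched host‑side. NumPy arrays stand in for pinned buffers; real code would issue a single large asynchronous copy over the staging buffer instead of one copy per slice. Function names and the layout format are illustrative assumptions.

```python
import numpy as np

def pack_for_transfer(host_tensors):
    """Pack many small embedding slices into one contiguous staging buffer
    so a single large copy replaces many tiny PCIe transfers. Returns the
    buffer plus (offset, size) pairs for device-side unpacking."""
    sizes = [t.size for t in host_tensors]
    offsets = np.cumsum([0] + sizes)[:-1]
    staging = np.concatenate([t.ravel() for t in host_tensors])
    return staging, list(zip(offsets.tolist(), sizes))

def unpack(staging, layout, shapes):
    """Slice the staging buffer back into the original tensors."""
    return [staging[o:o + s].reshape(shape)
            for (o, s), shape in zip(layout, shapes)]
```

Each small transfer pays a roughly fixed launch/setup cost, so collapsing hundreds of them into one copy is what moves the latency from milliseconds toward the quoted 400 µs.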
2.3 Hardware Upgrade: NPU Example
Alibaba’s Hanguang 800 (含光800) NPU excels at low‑precision INT16/INT8 matrix multiplication, offering up to 2× speed‑up for fully‑connected ranking models (DQM). Quantization is performed offline; calibration data collected from online samples yields stable quantization parameters. Accuracy loss is <1% for 99% of values, which is negligible for CTR ranking.
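A minimal sketch of offline symmetric INT8 calibration in the spirit of the above. Max‑abs calibration is an assumption for illustration; the article does not specify the actual calibration scheme.

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits=8):
    """Offline calibration sketch: derive a symmetric quantization scale
    from the largest absolute activation seen in calibration samples."""
    max_abs = max(np.abs(b).max() for b in calibration_batches)
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    return max_abs / qmax

def quantize(x, scale, num_bits=8):
    """Round to the nearest integer level and clip to the INT8 range."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

For values inside the calibrated range, the round‑trip error is bounded by half a quantization step (scale/2), which is the kind of per‑value error budget behind the “<1% for 99% of values” figure.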
2.4 Performance Results
Figure 2.7 shows XDL‑Blaze’s gains across models and hardware. NPU outperforms T4/V100S on the simple DQM model (≈2×). For SIM and CAN, GPU upgrades (P100 → T4 → V100S) yield incremental improvements. Algorithmic graph merging, OP replacement, and system‑level tweaks deliver 4‑5× speed‑up for SIM/CAN and >2× over vanilla TensorFlow 1.15 + XLA.
Conclusion and Outlook
Continuous algorithmic innovation and business growth push the limits of deep‑learning engines. XDL‑Blaze unlocks the potential of tens of thousands of CPU cores and thousands of GPU/NPU cards to serve millions of QPS. Future work includes low‑precision/approximate computing (INT8, BFLOAT16, TF32, sparse), aggressive kernel fusion for memory‑bound operators, and building a comprehensive benchmark suite for heterogeneous AI processors to guide hardware selection.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.