Performance Optimization of Advertising Deep Learning Systems: Algorithm, System, and Hardware Co‑Design
The paper presents a holistic algorithm‑system‑hardware co‑design for advertising deep‑learning inference. It combines model pruning, approximate computing, kernel fusion, scheduling, and PCIe transfer optimizations with GPU and NPU upgrades, achieving up to a five‑fold speed‑up and significantly higher latency‑bounded QPS for large‑scale ad services.
Preface
In the era of global digitalization, digital advertising accounts for an increasing share of total ad spend (over $300 billion worldwide, $70 billion in China). Revenue from digital ads is tightly linked to click‑through‑rate (CTR) and conversion metrics, making deep‑learning‑based precise ad placement a high‑value problem.
1. Compute Demand and Supply
1.1 Compute Demand: Model Complexity
Online services must meet latency constraints, so deep‑learning inference typically runs on a CPU‑GPU/NPU heterogeneous system. For a CPU‑GPU single node (see Fig. 1.1), service capacity (QPS) can be expressed as:

QPS = Parallelism / Latency, where Latency = T_CPU + T_GPU

That is, QPS scales with request parallelism and inversely with per‑request latency, which is the sum of CPU and GPU compute times. The optimization goal for an online system is therefore latency‑bounded QPS.
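This capacity model can be sketched as a small function. A minimal sketch: the function name, the hard SLA cutoff, and the example numbers are illustrative assumptions, not the article's formulation.

```python
def latency_bounded_qps(t_cpu_ms: float, t_gpu_ms: float,
                        parallelism: int, latency_sla_ms: float) -> float:
    """Capacity-model sketch: per-request latency is the sum of CPU and
    GPU compute time; QPS scales with request parallelism, and capacity
    only counts while the latency SLA is met."""
    latency_ms = t_cpu_ms + t_gpu_ms
    if latency_ms > latency_sla_ms:
        return 0.0                      # SLA violated: no usable capacity
    return parallelism * 1000.0 / latency_ms

# Example: 10 ms CPU + 15 ms GPU, 32 concurrent requests, 50 ms SLA
print(latency_bounded_qps(10, 15, 32, 50))  # 1280.0
```

The model makes the trade‑off explicit: shaving either the CPU or the GPU term raises QPS, but only optimizations that keep total latency under the SLA count.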
1.2 Compute Supply: Heterogeneous Hardware
Processor specialization and heterogeneity are becoming trends. CPUs can no longer keep up with growing workloads, so dedicated AI processors (ASIC, NPU) are introduced. Architecture innovations (e.g., TensorCore, AVX‑512/AMX) and memory‑bandwidth advances (HBM, HMC) dramatically improve numerical performance (see Fig. 1.3).
Data‑access bandwidth remains a key limiter (the “memory wall”). Alibaba’s AI chip, the Hanguang 800 (含光800) NPU, integrates large on‑chip SRAM; NVIDIA GPUs use HBM; CPUs have lower memory bandwidth, so the dominant bottleneck varies by platform.
1.3 Problems and Optimization Methods
Advertising compute demand is rising rapidly. From DIEN to SIM to CAN models, FLOPs and memory traffic have increased 3‑fold, prompting the development of the XDL‑Blaze engine that tightly couples algorithm, system, and hardware to maximize latency‑bounded QPS.
2. Algorithm‑System‑Hardware Co‑Optimization
2.1 Algorithm Optimization
Three main directions are pursued: (1) model pruning to remove low‑impact structures; (2) approximate computing to replace expensive operators with lightweight alternatives; (3) computation compression to eliminate redundant calculations, especially during inference where batch size is small.
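As an illustration of direction (1), here is a minimal magnitude‑based pruning sketch. The criterion (smallest absolute weights) and the API are assumptions for illustration; the article does not specify which pruning method is used.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out (at least) the fraction `sparsity` of weights with the
    smallest absolute value -- a structure-agnostic pruning sketch.
    Ties at the threshold may prune slightly more than requested."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.1, -0.5, 0.02, 1.0])
print(magnitude_prune(w, 0.5))  # [ 0.  -0.5  0.   1. ]
```

In production, pruned structures would be removed from the graph entirely rather than zeroed, so the saved FLOPs actually translate into latency.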
2.2 System Optimization: GPU Example
2.2.1 Compute‑Intensive Operator Optimization
Standard cuBLAS performs well for typical GEMM sizes, but long‑thin GEMM shapes common in advertising suffer up to 3× slower execution. TVM‑generated kernels achieve >7× speed‑up for these cases. Mixed‑precision (FP32 + FP16) with TensorCore yields an additional 1.3‑2× acceleration.
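The numerical side of the mixed‑precision path can be emulated on CPU: inputs rounded to FP16, products accumulated in FP32, roughly the numeric contract of TensorCore FP16 GEMM. This sketch (including the long‑thin example shape, which is assumed) says nothing about speed, only about accuracy.

```python
import numpy as np

def mixed_precision_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Round inputs to FP16, then multiply-accumulate in FP32 --
    an emulation of TensorCore-style mixed-precision GEMM."""
    a16 = a.astype(np.float16).astype(np.float32)
    b16 = b.astype(np.float16).astype(np.float32)
    return a16 @ b16

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 64)).astype(np.float32)  # long-thin M >> N, K
b = rng.standard_normal((64, 64)).astype(np.float32)
ref = a @ b
err = np.abs(mixed_precision_matmul(a, b) - ref).max() / np.abs(ref).max()
```

Running such a check offline is a cheap way to confirm that the FP16 input rounding stays within tolerance for a given model before enabling the fast path in serving.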
2.2.2 OP/Kernel Fusion
Kernel launch and memory traffic dominate performance. XLA/MLIR‑based kernel fusion and pattern‑fusion reduce both memory reads/writes and the number of kernel launches. For example, a Gather + BatchedMatMul hotspot (12.4 MB traffic) is fused into a custom IndicatorMatMul op, cutting global memory traffic by 96% and improving QPS by 2.6× (see Fig. 2.3).
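The semantics of that fusion can be sketched in NumPy. Names and shapes here are illustrative: the real IndicatorMatMul is a custom GPU kernel, and its win is in global‑memory traffic, which this functional sketch only hints at.

```python
import numpy as np

def gather_batched_matmul(x, weights, indices):
    """Unfused reference: Gather materializes a (B, K, N) copy of the
    selected weight matrices, then BatchedMatMul reads it back."""
    gathered = weights[indices]               # extra global-memory traffic
    return np.einsum('bk,bkn->bn', x, gathered)

def indicator_matmul(x, weights, indices):
    """Fused sketch: index each weight matrix inside the multiply itself,
    so the gathered intermediate never needs to hit global memory."""
    out = np.empty((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for b, idx in enumerate(indices):
        out[b] = x[b] @ weights[idx]
    return out
```

Both functions compute the same result; the fused form simply skips materializing the gathered tensor, which is where the reported 96% traffic reduction comes from on the GPU.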
2.2.3 Scheduling and Overhead Optimization
Online inference uses many small‑batch requests, making kernel launch overhead critical. Multi‑stream execution and multiple CUDA contexts reduce mutex contention (Fig. 2.4‑2.5). Virtual devices allow parallel requests to run on separate streams/contexts.
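One way to picture the virtual‑device idea is a round‑robin router that pins each incoming request to its own stream/context slot. This is a threading sketch with placeholder stream names, not real CUDA handles; the class and its API are assumptions for illustration.

```python
import itertools
import threading

class VirtualDeviceScheduler:
    """Sketch of 'virtual devices': route concurrent requests round-robin
    onto independent streams/contexts so they do not all serialize on a
    single global lock. Stream names here are stand-ins for CUDA streams."""

    def __init__(self, num_streams: int):
        self._streams = [f"stream-{i}" for i in range(num_streams)]
        self._rr = itertools.cycle(range(num_streams))
        self._lock = threading.Lock()  # only guards the tiny assignment step

    def assign(self) -> str:
        with self._lock:
            return self._streams[next(self._rr)]
```

The design point is that the only shared state is the round‑robin counter; all per‑request work then proceeds on its own stream without contending for a process‑wide mutex.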
2.2.4 PCIe Copy Optimization
Embedding features require frequent CPU‑GPU transfers. By aggregating small transfers, PCIe copy latency drops from 4.5 ms to 400 µs.
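The transfer‑aggregation idea can be sketched host‑side. NumPy arrays stand in for pinned buffers; real code would issue a single large asynchronous copy over the staging buffer instead of one copy per slice. Function names and the layout format are illustrative assumptions.

```python
import numpy as np

def pack_for_transfer(host_tensors):
    """Pack many small embedding slices into one contiguous staging buffer
    so a single large copy replaces many tiny PCIe transfers. Returns the
    buffer plus (offset, size) pairs for device-side unpacking."""
    sizes = [t.size for t in host_tensors]
    offsets = np.cumsum([0] + sizes)[:-1]
    staging = np.concatenate([t.ravel() for t in host_tensors])
    return staging, list(zip(offsets.tolist(), sizes))

def unpack(staging, layout, shapes):
    """Slice the staging buffer back into the original tensors."""
    return [staging[o:o + s].reshape(shape)
            for (o, s), shape in zip(layout, shapes)]
```

Each small transfer pays a roughly fixed launch/setup cost, so collapsing hundreds of them into one copy is what moves the latency from milliseconds toward the quoted 400 µs.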
2.3 Hardware Upgrade: NPU Example
Alibaba’s Hanguang 800 (含光800) NPU excels at low‑precision INT16/INT8 matrix multiplication, offering up to 2× speed‑up for fully‑connected ranking models (DQM). Quantization is performed offline; calibration data collected from online samples yields stable quantization parameters. Accuracy loss is <1% for 99% of values, which is negligible for CTR ranking.
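A minimal sketch of offline symmetric INT8 calibration in the spirit of the above. Max‑abs calibration is an assumption for illustration; the article does not specify the actual calibration scheme.

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits=8):
    """Offline calibration sketch: derive a symmetric quantization scale
    from the largest absolute activation seen in calibration samples."""
    max_abs = max(np.abs(b).max() for b in calibration_batches)
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    return max_abs / qmax

def quantize(x, scale, num_bits=8):
    """Round to the nearest integer level and clip to the INT8 range."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

For values inside the calibrated range, the round‑trip error is bounded by half a quantization step (scale/2), which is the kind of per‑value error budget behind the “<1% for 99% of values” figure.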
2.4 Performance Results
Figure 2.7 shows XDL‑Blaze’s gains across models and hardware. NPU outperforms T4/V100S on the simple DQM model (≈2×). For SIM and CAN, GPU upgrades (P100 → T4 → V100S) yield incremental improvements. Algorithmic graph merging, OP replacement, and system‑level tweaks deliver 4‑5× speed‑up for SIM/CAN and >2× over vanilla TensorFlow 1.15 + XLA.
Conclusion and Outlook
Continuous algorithmic innovation and business growth push the limits of deep‑learning engines. XDL‑Blaze unlocks the potential of tens of thousands of CPU cores and thousands of GPU/NPU cards to serve millions of QPS. Future work includes low‑precision/approximate computing (INT8, BFLOAT16, TF32, sparse), aggressive kernel fusion for memory‑bound operators, and building a comprehensive benchmark suite for heterogeneous AI processors to guide hardware selection.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.