
Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.

JD Retail Technology

Amid JD's broader business transformation, JD Retail's advertising technology team investigated high-performance computing optimizations for online inference scenarios that demand high concurrency and low latency, focusing on heterogeneous computing frameworks and GPU-accelerated inference.

The challenges include diminishing marginal returns from resource scaling, the exponential growth of model parameters and complexity when moving from simple Wide & Deep networks to Transformer‑based architectures, and the difficulty of meeting real‑time requirements without sacrificing throughput.

To address these issues, the team designed a distributed graph‑partitioned heterogeneous computing framework that partitions models, distributes inference across CPU and GPU clusters, and aligns algorithmic structures with hardware capabilities, enabling exponential compute scaling.

At the inference‑engine level, fine‑grained GPU operator scheduling and computation‑logic optimizations were applied to fully exploit GPU performance.

The distributed framework separates sparse-model computation, which runs on CPU clusters and supports billions of parameters, from dense-model computation, which runs on GPU clusters and handles long user-behavior sequences; it also powers an online-learning scenario that adapts in real time to changes in users, items, and context.
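
This split can be illustrated with a minimal sketch. All names here (`sparse_lookup_cpu`, `dense_forward_gpu`, the toy embedding table and scoring function) are hypothetical, and both halves are simulated on CPU with NumPy; in the real framework the sparse lookup runs on a CPU parameter-server tier and the dense forward pass on GPU workers, with a network hop in between.

```python
import numpy as np

EMBED_DIM = 8
# CPU tier: a toy embedding table standing in for billions of sparse parameters.
embedding_table = np.random.RandomState(0).randn(1000, EMBED_DIM)

def sparse_lookup_cpu(feature_ids):
    """Gather embeddings for a request's sparse feature IDs (CPU cluster)."""
    return embedding_table[feature_ids]            # (num_features, EMBED_DIM)

def dense_forward_gpu(embedded):
    """Stand-in for the dense network that would run on the GPU cluster."""
    pooled = embedded.mean(axis=0)                 # simple pooling over features
    return float(1.0 / (1.0 + np.exp(-pooled.sum())))  # toy CTR-style score

def infer(feature_ids):
    # In the real framework this call crosses the network between clusters.
    return dense_forward_gpu(sparse_lookup_cpu(feature_ids))

score = infer([3, 17, 256])   # a probability-like score in (0, 1)
```

The key design point is that the memory-bound sparse half and the compute-bound dense half scale independently: CPU machines are added for embedding capacity, GPU machines for dense throughput.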

TensorBatch aggregates multiple inference requests into a single batch, reducing the number of kernel launches and boosting GPU throughput by up to two‑fold.
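
The aggregation idea can be sketched as follows. The names `tensor_batch` and `batched_forward` are illustrative, not the actual API: several per-request tensors are stacked into one batch so a single batched call replaces one kernel launch per request.

```python
import numpy as np

def tensor_batch(requests, max_batch=32):
    """Aggregate per-request feature vectors into one batched tensor.

    Instead of launching one GPU kernel per request, up to `max_batch`
    pending requests are stacked and run in a single forward pass.
    """
    return np.stack(requests[:max_batch])          # (B, num_features)

def batched_forward(batch):
    """One simulated kernel launch over the whole batch."""
    w = np.full(batch.shape[1], 0.1)               # toy weight vector
    return batch @ w                               # (B,) per-request scores

# Five waiting requests served with one batched call instead of five launches.
requests = [np.full(4, i, dtype=float) for i in range(5)]
scores = batched_forward(tensor_batch(requests))
```

In production the aggregator also has to bound the wait time for filling a batch, trading a small queueing delay against the launch-overhead savings.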

The deep‑learning compiler component (based on XLA) faced challenges with variable‑length inputs and excessive runtime compilation; the team introduced sub‑graph bucketing, padding, and pre‑compilation techniques to dramatically cut compilation count, memory usage, and latency.
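
A minimal sketch of the bucketing idea, with assumed bucket boundaries: each variable-length input is padded up to the nearest bucket, so the compiler only ever sees a small fixed set of shapes (one pre-compiled executable per bucket) instead of recompiling for every distinct length.

```python
import numpy as np

# Hypothetical bucket boundaries for user-behavior sequence length.
BUCKETS = [32, 64, 128, 256]

def bucket_for(length):
    """Smallest bucket that fits `length`, or None if out of bucket."""
    for b in BUCKETS:
        if length <= b:
            return b
    return None

def pad_to_bucket(seq):
    """Zero-pad a sequence up to its bucket's fixed length."""
    b = bucket_for(len(seq))
    if b is None:
        raise ValueError("sequence longer than largest bucket")
    return np.pad(np.asarray(seq, dtype=float), (0, b - len(seq)))

padded = pad_to_bucket([1.0] * 50)   # length 50 is padded to the 64 bucket
```

The trade-off is wasted compute on padding versus a bounded number of compiled executables; choosing bucket boundaries from the observed length distribution keeps the padding overhead small.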

Asynchronous compilation was added to handle out‑of‑bucket traffic, dynamically selecting between pre‑compiled XLA runtimes and the original graph while triggering background compilation for future requests.
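
The dispatch policy can be sketched like this (the `CompileCache` class and its methods are hypothetical, not the team's actual implementation): if a compiled executable exists for the request's shape, use it; otherwise serve the request through the original graph and start compilation in the background so later requests with the same shape hit the fast path.

```python
import threading

class CompileCache:
    def __init__(self, compile_fn, fallback_fn):
        self._compiled = {}            # shape -> compiled executable
        self._inflight = set()         # shapes currently being compiled
        self._lock = threading.Lock()
        self._compile_fn = compile_fn
        self._fallback_fn = fallback_fn

    def run(self, shape, inputs):
        with self._lock:
            exe = self._compiled.get(shape)
            if exe is None and shape not in self._inflight:
                self._inflight.add(shape)
                threading.Thread(target=self._compile, args=(shape,)).start()
        if exe is not None:
            return exe(inputs)             # fast path: pre-compiled runtime
        return self._fallback_fn(inputs)   # slow path: original graph

    def _compile(self, shape):
        exe = self._compile_fn(shape)      # expensive, off the request path
        with self._lock:
            self._compiled[shape] = exe
            self._inflight.discard(shape)

# Toy "graph": compiled executable and fallback compute the same result.
fallback = lambda xs: sum(xs) * 2
cache = CompileCache(compile_fn=lambda shape: fallback, fallback_fn=fallback)
first = cache.run((3,), [1, 2, 3])   # served via fallback; compile kicks off
```

Requests are never blocked on compilation: the worst case pays only the original graph's latency, and the cache converges to serving all in-bucket and recurring out-of-bucket shapes from compiled executables.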

Multi-stream computing creates multiple CUDA streams and contexts per GPU, eliminating kernel-launch contention between request threads; combined with NVIDIA's Multi-Process Service (MPS), this overcomes the memory bottlenecks of multiple contexts and achieves genuinely parallel kernel execution.
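
The scheduling pattern can be sketched in plain Python, with worker threads standing in for CUDA streams (the `StreamPool` class is illustrative; real multi-stream code would create `cudaStream_t` handles, and MPS would let the streams' kernels overlap on the device). The point is the queueing structure: each stream drains its own queue, so request threads no longer contend on a single serialized launch queue.

```python
import queue
import threading

class StreamPool:
    """Round-robin requests across independent per-stream work queues."""

    def __init__(self, num_streams, kernel):
        self._queues = [queue.Queue() for _ in range(num_streams)]
        self._kernel = kernel
        self._next = 0
        for q in self._queues:
            threading.Thread(target=self._worker, args=(q,),
                             daemon=True).start()

    def submit(self, x):
        """Enqueue a request on one stream; returns a handle to wait on."""
        q = self._queues[self._next % len(self._queues)]
        self._next += 1
        done = queue.Queue(1)
        q.put((x, done))
        return done

    def _worker(self, q):
        while True:
            x, done = q.get()
            done.put(self._kernel(x))   # "kernel launch" on this stream

pool = StreamPool(4, kernel=lambda x: x * x)
results = [pool.submit(i).get() for i in range(8)]   # 8 requests, 4 streams
```

Without MPS, multiple contexts time-slice the GPU and duplicate memory; MPS funnels them through one server process so their kernels can truly run concurrently.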

Overall, the integrated solutions—distributed heterogeneous framework, TensorBatch, deep‑learning compiler optimizations, and multi‑stream GPU architecture—delivered significant performance gains, allowing CTR models to scale to billions of parameters and improving recommendation effectiveness across JD’s advertising business.

Tags: GPU Acceleration · Recommendation Systems · Online Inference · Distributed Computing · Deep Learning Compiler · Heterogeneous Architecture
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
