JD Retail Technology
Jan 25, 2024 · Artificial Intelligence
Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration
This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.
GPU AccelerationOnline InferenceRecommendation systems
0 likes · 13 min read