High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization
This article describes how JD's advertising team tackled the high‑concurrency, low‑latency demands of online recommendation inference: a distributed graph heterogeneous computing framework, TensorBatch request aggregation and deep‑learning‑compiler techniques to cut GPU kernel launches, and a multi‑stream GPU architecture, together delivering significant throughput and latency improvements.
Online recommendation inference demands high concurrency and low latency; traditional scaling via resource expansion faces diminishing returns.
To address this, a distributed graph heterogeneous computing framework was built, splitting model graphs, deploying CPU‑GPU heterogeneous hardware, and enabling scalable inference for billions of parameters.
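The core idea of the framework is placing each part of the model graph on the hardware it suits: sparse, memory‑bound operators (such as embedding lookups over billions of parameters) on CPU, dense compute on GPU. The sketch below illustrates that partitioning step; the node names and the `CPU_OPS` set are assumptions for illustration, not JD's actual framework.

```python
# Hypothetical sketch: split a model graph into CPU and GPU subgraphs.
# Sparse / IO-bound ops stay on CPU; dense compute goes to GPU.
CPU_OPS = {"embedding_lookup", "feature_parse"}  # illustrative op types

def partition(graph):
    """Split a list of (op_name, op_type) nodes into per-device subgraphs."""
    subgraphs = {"cpu": [], "gpu": []}
    for name, op_type in graph:
        device = "cpu" if op_type in CPU_OPS else "gpu"
        subgraphs[device].append(name)
    return subgraphs

model = [("emb", "embedding_lookup"), ("fc1", "matmul"),
         ("relu", "relu"), ("fc2", "matmul")]
print(partition(model))  # → {'cpu': ['emb'], 'gpu': ['fc1', 'relu', 'fc2']}
```

In a real deployment each subgraph would then be serialized and served by a CPU or GPU worker, with the framework handling the cross‑device tensor transfer between them.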
On the inference engine side, GPU kernel launch overhead was reduced through TensorBatch request aggregation, deep‑learning‑compiler operator fusion and bucketed pre‑compilation, and multi‑stream GPU execution.
TensorBatch aggregates multiple requests into a single GPU kernel launch, cutting the launch count and doubling throughput. The deep‑learning compiler fuses operators, reducing kernel launches from 553 to 190 and latency from 14 ms to 9 ms.
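The payoff of TensorBatch‑style aggregation is that N small kernel launches collapse into one large one with identical results. A minimal NumPy sketch of the idea (shapes and names are illustrative, not JD's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))            # shared model weight

# Eight independent inference requests, each a (1, 16) feature tensor.
requests = [rng.standard_normal((1, 16)) for _ in range(8)]

# Naive path: one matmul ("kernel launch") per request.
naive = [x @ W for x in requests]

# TensorBatch path: stack requests into one tensor, run a single matmul,
# then split the result back out per request.
batch = np.concatenate(requests, axis=0)    # shape (8, 16)
fused = batch @ W                           # one launch for all requests
split = np.split(fused, len(requests), axis=0)

assert all(np.allclose(a, b) for a, b in zip(naive, split))
```

On a GPU the saving comes from the fixed per‑launch overhead: one launch amortizes it across all aggregated requests, which is where the reported throughput doubling comes from.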
The multi‑stream architecture dedicates separate CUDA streams and contexts to concurrent requests, eliminating contention in kernel scheduling and achieving true parallel execution; combined with NVIDIA MPS (Multi‑Process Service), it also overcomes GPU memory bottlenecks.
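The scheduling pattern can be sketched as one worker per stream, so requests never queue behind a single launch path. In a real engine each worker would own a CUDA stream (e.g. from `cudaStreamCreate`) and context; in this simplified Python stand‑in, plain threads play that role and the stream count and round‑robin assignment are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4  # assumed pool size; real engines tune this per GPU

def run_on_stream(stream_id, request):
    # Stand-in for launching the request's kernels on its own stream.
    return (stream_id, sum(request))

requests = [[i, i + 1] for i in range(8)]
with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
    # Round-robin each request onto a stream; streams run independently.
    results = list(pool.map(run_on_stream,
                            [i % NUM_STREAMS for i in range(len(requests))],
                            requests))
print(results)
```

Because each stream has its own command queue, kernels from different requests can overlap on the GPU instead of serializing behind one scheduler, which is the contention the multi‑stream design removes.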
These optimizations have been deployed across JD’s advertising services, scaling CTR models to trillions of parameters and delivering significant performance gains.
Future work will further integrate algorithm, compute, and architecture, and unify online and offline inference pipelines.
JD Tech