
High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

JD Tech

Online recommendation inference demands high concurrency and low latency; traditional scaling via resource expansion faces diminishing returns.

To address this, a distributed graph heterogeneous computing framework was built, splitting model graphs, deploying CPU‑GPU heterogeneous hardware, and enabling scalable inference for billions of parameters.
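The split between sparse and dense computation can be illustrated with a minimal sketch. The node representation, op names, and the lookup-vs-dense heuristic below are illustrative assumptions, not JD's actual framework API:

```python
def split_graph(nodes):
    """Partition model-graph nodes into a CPU subgraph (sparse embedding
    lookups over billion-parameter tables, served from CPU memory) and a
    GPU subgraph (dense compute), mirroring a CPU-GPU heterogeneous
    deployment. Op names here are hypothetical."""
    cpu_subgraph, gpu_subgraph = [], []
    for node in nodes:
        # Sparse lookups touch huge parameter tables and stay on the CPU
        # side; dense math (matmuls, activations) runs on the GPU.
        if node["op"] in {"embedding_lookup", "sparse_gather"}:
            cpu_subgraph.append(node)
        else:
            gpu_subgraph.append(node)
    return cpu_subgraph, gpu_subgraph

# Toy CTR-style graph: two sparse lookups feeding a dense tower.
graph = [
    {"name": "user_emb", "op": "embedding_lookup"},
    {"name": "item_emb", "op": "sparse_gather"},
    {"name": "mlp_1", "op": "matmul"},
    {"name": "ctr_head", "op": "softmax"},
]
cpu_part, gpu_part = split_graph(graph)
```

In a real deployment the cut points would be chosen by the framework, with the sparse side scaled out across parameter servers and the dense side pinned to GPU workers.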

On the inference engine side, GPU kernel launch overhead was reduced through TensorBatch request aggregation, deep‑learning‑compiler operator fusion and bucketed pre‑compilation, and multi‑stream GPU execution.

TensorBatch aggregates multiple requests into a single batched execution, cutting the number of kernel launches per request and roughly doubling throughput. Compiler-driven operator fusion reduced kernel launches from 553 to 190 and latency from 14 ms to 9 ms.
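The aggregation idea can be sketched as follows. This is a minimal model of the policy only; the bucket sizes, padding value, and function names are assumptions, and it also shows how bucketed pre-compilation avoids recompiling for every dynamic batch shape:

```python
BUCKETS = [8, 16, 32, 64]  # batch sizes assumed pre-compiled by the DL compiler

def pick_bucket(n):
    """Round an aggregated batch size up to the nearest pre-compiled
    bucket, so dynamic shapes never trigger JIT recompilation.
    (Sketch assumes batches never exceed the largest bucket;
    a real system would split oversized batches.)"""
    for b in BUCKETS:
        if n <= b:
            return b
    return BUCKETS[-1]

def tensor_batch(requests):
    """Merge per-request feature rows into one batch, so the whole batch
    pays one set of kernel launches instead of one set per request."""
    rows = [row for req in requests for row in req]
    real = len(rows)                     # actual rows before padding
    bucket = pick_bucket(real)
    pad_row = [0.0] * len(rows[0])
    rows += [pad_row] * (bucket - real)  # pad to the bucket's static shape
    return rows, real

# Two requests with 3 and 4 feature rows aggregate into one padded batch.
reqs = [[[1.0, 2.0]] * 3, [[3.0, 4.0]] * 4]
batch, real = tensor_batch(reqs)
```

Here the 7 real rows land in the pre-compiled batch-8 bucket, so the GPU sees one fixed-shape execution rather than two separate small ones.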

The multi‑stream architecture creates multiple CUDA streams and contexts and dispatches concurrent requests onto separate streams, eliminating kernel‑scheduling contention and achieving true parallelism; combined with NVIDIA MPS integration, it overcomes memory bottlenecks.
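The dispatch policy can be modeled with a simple stream pool. Plain integers stand in for stream handles here; in real CUDA code the pool would hold handles created with `cudaStreamCreate` (one per context when paired with MPS), and the round-robin policy is an assumption for illustration:

```python
import itertools

class StreamPool:
    """Sketch of per-request stream assignment: each incoming request is
    dispatched on its own stream from a fixed pool, so kernels from
    different requests do not serialize behind one another on a single
    default stream."""

    def __init__(self, num_streams):
        self.streams = list(range(num_streams))  # stand-ins for CUDA stream handles
        self._next = itertools.cycle(self.streams)

    def dispatch(self, request_id):
        """Round-robin the request onto the next stream in the pool."""
        stream = next(self._next)
        return (request_id, stream)

pool = StreamPool(num_streams=4)
assignments = [pool.dispatch(i) for i in range(8)]
# requests 0..7 land on streams 0, 1, 2, 3, 0, 1, 2, 3
```

A fixed pool (rather than one stream per request, unbounded) caps the number of concurrent contexts the GPU scheduler must juggle while still keeping independent requests off each other's critical path.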

These optimizations have been deployed across JD’s advertising services, scaling CTR models to trillions of parameters and delivering significant performance gains.

Future work will further integrate algorithm, compute, and architecture, and unify online and offline inference pipelines.

AI inference · high performance · GPU optimization · distributed computing · deep learning compiler
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
