High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization
This article describes how JD's advertising team tackled the high‑concurrency, low‑latency demands of online recommendation inference: a distributed graph heterogeneous computing framework, TensorBatch request aggregation and deep‑learning‑compiler techniques to cut GPU kernel launches, and a multi‑stream GPU architecture, together delivering significant throughput and latency improvements.
Online recommendation inference demands high concurrency and low latency; traditional scaling via resource expansion faces diminishing returns.
To address this, a distributed graph heterogeneous computing framework was built, splitting model graphs, deploying CPU‑GPU heterogeneous hardware, and enabling scalable inference for billions of parameters.
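The core idea of the framework is placing each part of the model graph on the hardware it suits: sparse, memory‑bound operators (such as embedding lookups over billions of parameters) on CPU, dense compute on GPU. The sketch below illustrates that partitioning step; the node names and the `CPU_OPS` set are assumptions for illustration, not JD's actual framework.

```python
# Hypothetical sketch: split a model graph into CPU and GPU subgraphs.
# Sparse / IO-bound ops stay on CPU; dense compute goes to GPU.
CPU_OPS = {"embedding_lookup", "feature_parse"}  # illustrative op types

def partition(graph):
    """Split a list of (op_name, op_type) nodes into per-device subgraphs."""
    subgraphs = {"cpu": [], "gpu": []}
    for name, op_type in graph:
        device = "cpu" if op_type in CPU_OPS else "gpu"
        subgraphs[device].append(name)
    return subgraphs

model = [("emb", "embedding_lookup"), ("fc1", "matmul"),
         ("relu", "relu"), ("fc2", "matmul")]
print(partition(model))  # → {'cpu': ['emb'], 'gpu': ['fc1', 'relu', 'fc2']}
```

In a real deployment each subgraph would then be serialized and served by a CPU or GPU worker, with the framework handling the cross‑device tensor transfer between them.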
On the inference engine side, GPU kernel launch overhead was reduced through TensorBatch request aggregation, deep‑learning‑compiler operator fusion and bucketed pre‑compilation, and multi‑stream GPU execution.
TensorBatch aggregates multiple requests into a single GPU kernel launch, cutting the launch count and doubling throughput. The deep‑learning compiler fuses operators, reducing kernel launches from 553 to 190 and latency from 14 ms to 9 ms.
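The payoff of TensorBatch‑style aggregation is that N small kernel launches collapse into one large one with identical results. A minimal NumPy sketch of the idea (shapes and names are illustrative, not JD's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))            # shared model weight

# Eight independent inference requests, each a (1, 16) feature tensor.
requests = [rng.standard_normal((1, 16)) for _ in range(8)]

# Naive path: one matmul ("kernel launch") per request.
naive = [x @ W for x in requests]

# TensorBatch path: stack requests into one tensor, run a single matmul,
# then split the result back out per request.
batch = np.concatenate(requests, axis=0)    # shape (8, 16)
fused = batch @ W                           # one launch for all requests
split = np.split(fused, len(requests), axis=0)

assert all(np.allclose(a, b) for a, b in zip(naive, split))
```

On a GPU the saving comes from the fixed per‑launch overhead: one launch amortizes it across all aggregated requests, which is where the reported throughput doubling comes from.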
The multi‑stream architecture dedicates separate CUDA streams and contexts to concurrent requests, eliminating contention in kernel scheduling and achieving true parallel execution; combined with NVIDIA MPS (Multi‑Process Service), it also overcomes GPU memory bottlenecks.
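The scheduling pattern can be sketched as one worker per stream, so requests never queue behind a single launch path. In a real engine each worker would own a CUDA stream (e.g. from `cudaStreamCreate`) and context; in this simplified Python stand‑in, plain threads play that role and the stream count and round‑robin assignment are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_STREAMS = 4  # assumed pool size; real engines tune this per GPU

def run_on_stream(stream_id, request):
    # Stand-in for launching the request's kernels on its own stream.
    return (stream_id, sum(request))

requests = [[i, i + 1] for i in range(8)]
with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
    # Round-robin each request onto a stream; streams run independently.
    results = list(pool.map(run_on_stream,
                            [i % NUM_STREAMS for i in range(len(requests))],
                            requests))
print(results)
```

Because each stream has its own command queue, kernels from different requests can overlap on the GPU instead of serializing behind one scheduler, which is the contention the multi‑stream design removes.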
These optimizations have been deployed across JD’s advertising services, scaling CTR models to trillions of parameters and delivering significant performance gains.
Future work will further integrate algorithm, compute, and architecture, and unify online and offline inference pipelines.
JD Tech