JD Retail's End‑to‑End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Real‑World Applications
This article details JD Retail's AI engine that seamlessly supports both GPU and domestic NPU hardware, describing its heterogeneous cluster architecture, unified training and inference APIs, performance optimizations, extensive model coverage, and multiple production use cases across e‑commerce, logistics, and intelligent assistance.
In recent years, the rapid rise of domestic AI chips in China has created a critical need for model adaptation, performance optimization, and practical deployment on these chips. JD Retail's Nine‑Number Algorithm Platform (九数算法中台) addresses this by building a full‑stack AI engine that is compatible with both GPU and domestic NPU, spanning from hardware clusters to algorithmic engines and multi‑scenario applications.
Challenges
Significant hardware architecture differences: Existing JD compute clusters were GPU‑centric, while domestic NPU architectures differ greatly, requiring a unified scheduling and resource management system.
Immature software ecosystem: Open‑source frameworks lack native NPU support, leading to high migration costs for precision validation and performance tuning.
Diverse and complex business scenarios: JD Retail’s varied workloads demand a single solution that can be flexibly applied across many use cases.
AI Engine Architecture
The platform builds a thousand‑card‑scale cluster with high‑performance networking, offering identical scheduling capabilities for NPU and GPU. A unified API supports mainstream models, so the same training and deployment code runs on either hardware with no migration cost.
Heterogeneous GPU‑NPU Scheduling System
Thousand‑card cluster: Visual monitoring, health checks, automatic fault isolation, and continuous HDK upgrades ensure stability.
Scheduling optimizations: NUMA‑aware and network‑topology‑aware scheduling, resource fragmentation minimization, and configurable priority eviction guarantee fairness and high utilization.
Efficient resource queues: Shared queues provide guaranteed and elastic resources for both NPU and GPU, maximizing overall cluster usage.
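One piece of the fragmentation-minimization story can be illustrated with a best-fit placement rule: among nodes that can host a job, prefer the one that leaves the fewest stranded cards, so large multi-card jobs are not blocked by scattered single free cards. This is a simplified sketch, not the platform's actual scheduler.

```python
# Best-fit placement sketch: minimize leftover free cards on the chosen node.
from typing import Dict, Optional

def place_job(free_cards: Dict[str, int], need: int) -> Optional[str]:
    """Pick the node whose free capacity most tightly fits the request."""
    candidates = {node: c for node, c in free_cards.items() if c >= need}
    if not candidates:
        return None  # no single node can host the job
    # Smallest remainder after placement => least fragmentation.
    return min(candidates, key=lambda node: candidates[node] - need)

cluster = {"node-a": 8, "node-b": 3, "node-c": 5}
print(place_job(cluster, 3))  # node-b: leaves zero fragment cards
```

A real scheduler would additionally weigh NUMA locality and network topology, as the list above notes, but the fragmentation objective is the same: keep large contiguous blocks of cards free.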
High‑Performance Training Engine
The engine supports over 40 mainstream LLM and multimodal base models (e.g., Qwen, Llama, ChatGLM) with a single API that allows seamless GPU‑NPU switching. Key techniques include MFU optimization, model quantization, Triton compilation, flash attention, dynamic input stitching, and pipeline parallelism, achieving up to 60% MFU on hundred‑card clusters and near‑linear scaling for trillion‑parameter models.
| Model | Scale | GPU Training | Domestic NPU Training | GPU Inference | NPU Inference |
| --- | --- | --- | --- | --- | --- |
| SR1.5 Search‑Ad Large Model | 3B/7B/15B | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 | 0.5B–14B | ✅ | ✅ | ✅ | ✅ |
| ChatGLM2 | 6B | ✅ | ✅ | ✅ | ✅ |

(Additional rows omitted for brevity.)
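The MFU figure quoted above can be understood with the common 6·N FLOPs-per-token approximation for dense transformer training: MFU is the ratio of FLOPs actually performed per second to the cluster's theoretical peak. The numbers below are purely illustrative, not measured results from the article.

```python
# Rough MFU (Model FLOPs Utilization) estimate for dense transformer training,
# using the standard ~6 * n_params FLOPs-per-token approximation.
def estimate_mfu(n_params: float, tokens_per_sec: float,
                 n_cards: int, peak_tflops_per_card: float) -> float:
    achieved = 6.0 * n_params * tokens_per_sec       # FLOP/s actually done
    peak = n_cards * peak_tflops_per_card * 1e12     # FLOP/s the hardware could do
    return achieved / peak

# Illustrative inputs only: a 7B model on a 128-card cluster.
mfu = estimate_mfu(n_params=7e9, tokens_per_sec=420_000,
                   n_cards=128, peak_tflops_per_card=280)
print(f"{mfu:.1%}")  # 49.2%
```

Raising MFU means raising achieved FLOP/s at fixed hardware, which is exactly what techniques like flash attention, dynamic input stitching, and pipeline parallelism target.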
High‑Performance Inference Engine
The inference engine provides MaaS (Model‑as‑a‑Service) with one‑click NPU deployment, supporting 20+ industry‑standard LLMs and offering a 20% performance boost over open‑source frameworks. Optimizations include GE graph compilation, ATB high‑performance operators, W8A8/W4A16 quantization, prefill/decode separation, KV‑cache, and comprehensive monitoring and alerting.
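As one example of the quantization step, here is a minimal sketch of symmetric per-tensor int8 weight quantization, the "W8" in W8A8. Production deployments typically use per-channel scales and calibration data; this is only an illustration of the core idea.

```python
# Symmetric per-tensor int8 quantization sketch: map weights to int8 with a
# single scale, then dequantize on the fly at compute time.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Return (int8 tensor, scale) such that w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())
print(q.dtype, err < s)  # int8 True -- worst-case error is half a quant step
```

Storing `q` instead of `w` cuts weight memory by 4x versus float32 (2x versus float16), which is where most of the inference speedup from W8A8 comes from on bandwidth-bound decode workloads.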
Real‑World Cases
Case 1: Video Tag Cloud Generation
Using Qwen2‑VL on NPU, JD Retail extracts multi‑modal keywords from videos for tag cloud generation, achieving comparable quality and latency to GPU deployments across dozens of NPU cards.
Case 2: Logistics Address Parsing
Fine‑tuned Qwen2‑7B on NPU reaches 91.03% accuracy in address parsing, matching GPU results and powering large‑scale POI classification and pre‑sorting tasks.
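The reported accuracy is naturally measured as exact match over tagged address segments. The sketch below is a hypothetical evaluation helper; the `token_TAG` format is inferred from the sample NPU/GPU outputs shown below, and the helper names are illustrative.

```python
# Hypothetical exact-match check between predicted and gold "token_TAG" lists,
# tolerant of whitespace and trailing commas in the model output.
from typing import List

def parse_segments(tagged: str) -> List[str]:
    return [seg.strip() for seg in tagged.strip().strip(",").split(",") if seg.strip()]

def exact_match(pred: str, gold: str) -> bool:
    return parse_segments(pred) == parse_segments(gold)

npu_out = "青海省_1,西宁市_3A,城北区_3A,三其村_4B"
gpu_out = "青海省_1, 西宁市_3A, 城北区_3A, 三其村_4B,"
print(exact_match(npu_out, gpu_out))  # True: formatting noise is ignored
```

Accuracy over a test set is then just the fraction of samples where `exact_match` holds, which is how NPU and GPU runs can be compared segment-for-segment.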
#Input_1
青海省西宁市城北区三其村。可以发圆通吗 谢谢。 (Sanqi Village, Chengbei District, Xining City, Qinghai Province. Can this be shipped via YTO Express? Thanks.)
#Output‑NPU (Domestic NPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,
#Output‑GPU (GPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK,
Case 3: Merchant Smart Assistant
Fine‑tuned Qwen1.5‑7B on NPU provides merchant‑focused QA assistance with 96% tool‑assignment consistency compared to GPU models.
Application Value
Core technology autonomy: Reduces reliance on foreign chips, ensuring security and controllability from hardware to applications.
Domestic chip applicability: Deployed in search, recommendation, ad creation, intelligent customer service, and data analysis, providing feedback to the domestic chip ecosystem.
Industry Impact
2024: Co‑founded the openMind community with Huawei Ascend.
July 2024: Presented five best‑practice scenarios at the Ascend AI Summit.
July 2024: Won JD Retail’s 618 Technical Courage Award.
September 2024: Received the Outstanding Ascend Native Developer award at Huawei CONNECT 2024.
Future Plans
By 2025, JD Retail aims to build a ten‑thousand‑card heterogeneous cluster with mixed GPU‑NPU scheduling, further optimizing resource prediction, dynamic scaling, and emergency pools to fully exploit domestic compute power.
Continued collaboration with leading domestic chip vendors will drive deeper HCCL communication optimizations, fused operator libraries, and open‑source contributions for both LLM and CTR training/inference workloads.
Recommended Reading:
Behind JD Takeaway: Mapping & Trajectory Technologies
Deep Dive into Double‑11 Logistics Guarantees
1‑Second Response, 90% Decision Accuracy: JD Merchant Smart Assistant
350,000+ Merchants Choose the JD AIGC Platform: What Makes Its Content Generation Great?
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.