
AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

JD Tech Talk

1. Introduction

With the widespread adoption of large models, AI compute power—one of the three pillars of artificial intelligence—has become a competitive focus. Computing resources affect every stage of a model's lifecycle, from training to inference, and are crucial for JD's massive data scenarios. Recent U.S. export restrictions on high‑end AI chips have raised concerns about compute security, prompting Chinese industry associations to call for reduced reliance on foreign chips and increased cooperation with domestic manufacturers.

Deploying domestic chips in JD's business scenarios faces three main challenges: significant hardware architecture differences between GPUs and domestic NPUs, an immature software ecosystem for NPUs, and diverse, complex business requirements.

2. AI Engine Technology Based on Domestic Chips

2.1 Overall Architecture

The overall architecture (diagram omitted here) is a unified AI engine that supports both GPU and domestic NPU resources in a thousand‑card cluster with RDMA interconnects.

2.2 Heterogeneous GPU‑NPU Scheduling System

The platform provides a unified quota and allocation system, allowing developers to schedule NPU and GPU resources transparently. Key features include:

Thousand‑card Cluster: Visual monitoring of NPU cards, network cards, and optical modules; health checks, automatic fault isolation, and alerting.

Scheduling Optimization: NUMA‑aware and network‑topology‑aware scheduling, resource‑fragment minimization (Gang, BinPack, reservation), and configurable priority‑preemption mechanisms.

Efficient Resource Usage: Shared resource queues guarantee minimum resources while allowing dynamic sharing of idle capacity, maximizing NPU utilization.
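To illustrate the resource‑fragment minimization above, here is a minimal sketch of BinPack‑style node scoring. This is not JD's actual scheduler; `Node`, `binpack_score`, and `pick_node` are hypothetical names. The idea is to place a job on the node that will be left most full, keeping whole nodes free for large gang‑scheduled jobs.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_npus: int
    total_npus: int

def binpack_score(node: Node, requested: int) -> float:
    """Score a candidate node: higher means the node is fuller after
    placement, which packs jobs tightly and reduces fragmentation."""
    if requested > node.free_npus:
        return -1.0  # job does not fit on this node
    used_after = node.total_npus - node.free_npus + requested
    return used_after / node.total_npus

def pick_node(nodes: List[Node], requested: int) -> Optional[Node]:
    """Pick the feasible node with the highest BinPack score."""
    scored = [(binpack_score(n, requested), n) for n in nodes]
    scored = [(s, n) for s, n in scored if s >= 0]
    return max(scored, key=lambda x: x[0])[1] if scored else None
```

For example, a 2‑card job offered a node with 2 free cards and a node with 6 free cards goes to the former (it becomes fully used), leaving the emptier node available for a larger job.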

2.3 High‑Performance Training Engine

The training engine supports over 40 mainstream base models (LLM, multimodal, text‑to‑image, etc.) with a zero‑cost, seamless switch between GPU and NPU via a highly abstracted API. It integrates model‑parallelism, sequence‑parallelism, low‑precision communication, and compute‑communication fusion to boost throughput. Notable capabilities:

Broad Model Coverage: 30+ LLMs and 10+ multimodal bases, enabling cost‑free migration between GPUs and NPUs.

Full LLM Training Workflow: End‑to‑end data handling, labeling, evaluation, and support for various data generation, instruction tuning, and evaluation types.

Deep Soft‑Hardware Co‑Optimization: Triton compilation and CANN fusion for hot operators (flash attention, rotary embedding, etc.), achieving up to 60% MFU on hundred‑card setups and near‑linear scaling for trillion‑parameter models.

High‑Availability Training: Token pre‑caching, minute‑level asynchronous checkpointing, and on‑demand snapshot delivery reduce startup time from hours to minutes and cut model storage time by over 90%.

Supported model matrix (GPU vs. domestic NPU) shows full compatibility across models such as SR1.5, Qwen2.5, ChatGLM2/3, GLM4, Llama series, YI series, Baichuan2, Bloom‑z, Gemma, etc.
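As a back‑of‑envelope way to sanity‑check MFU figures like the 60% quoted above, the standard approximation counts roughly 6 FLOPs per parameter per trained token (forward plus backward pass). The function and numbers below are illustrative, not measured JD results.

```python
def model_flops_utilization(params: float, tokens_per_s: float,
                            num_cards: int,
                            peak_tflops_per_card: float) -> float:
    """Rough MFU estimate for transformer training using the common
    ~6 * N * tokens approximation for total training FLOPs."""
    achieved = 6.0 * params * tokens_per_s           # achieved FLOPs/s
    peak = num_cards * peak_tflops_per_card * 1e12   # peak FLOPs/s
    return achieved / peak
```

For example, a 7B‑parameter model training at 10,000 tokens/s on 8 cards with a (hypothetical) 300 TFLOPS peak per card would sit at 17.5% MFU, showing how throughput, card count, and peak spec combine into the single utilization number.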

2.4 High‑Performance Inference Engine

The inference engine offers MaaS (Model‑as‑a‑Service) one‑click deployment on domestic NPUs, compatible with OpenAI and Triton APIs, and supports over 20 mainstream LLMs. Performance optimizations include:

Model Optimizations: GE graph compilation, ATB high‑performance operators, quantization (W8A8 SmoothQuant, W4A16 AWQ) and pipeline parallelism to hide scheduling overhead.

Framework Optimizations: Prefill/Decode separation, KV‑cache and Prefix‑cache techniques to accelerate inference.

Monitoring & Alerting: Visual dashboards for throughput, failure rate, latency, with customizable alerts.
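The Prefix‑cache technique mentioned above can be sketched with a toy in‑memory version (a real engine caches KV blocks on device and handles eviction): a new request reuses the attention KV state computed for its longest previously seen prompt prefix, so shared system prompts are not recomputed.

```python
from typing import Any, List, Tuple

class PrefixCache:
    """Toy sketch of prefix caching: store opaque KV state keyed by
    token-id prefixes and return the longest cached match for a new
    request, so only the uncached suffix needs prefill compute."""

    def __init__(self) -> None:
        self._cache: dict = {}  # tuple(token ids) -> KV state (opaque)

    def put(self, tokens: List[int], kv: Any) -> None:
        self._cache[tuple(tokens)] = kv

    def longest_prefix(self, tokens: List[int]) -> Tuple[tuple, Any]:
        """Scan from the full sequence down to length 1 and return the
        longest cached prefix plus its KV state, or ((), None)."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._cache:
                return key, self._cache[key]
        return (), None
```

In practice engines hash fixed‑size token blocks instead of scanning every prefix length, but the reuse principle is the same.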

Supported inference models include Baichuan, ChatGLM, Qwen, Llama, and Mistral‑7B, as well as text‑to‑image models such as SD1.5 and SDXL.
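Because the engine is OpenAI‑API compatible, a client only needs to build a standard chat‑completions payload and POST it; the endpoint URL and model name below are placeholders, not real JD addresses.

```python
import json

# Hypothetical endpoint: the engine exposes an OpenAI-compatible REST API.
ENDPOINT = "http://npu-inference.example.internal/v1/chat/completions"

def build_chat_request(model: str, user_msg: str,
                       stream: bool = False) -> str:
    """Build an OpenAI-style chat-completions JSON body; any OpenAI SDK
    or plain HTTP client can send this to the engine unchanged."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,
    }
    return json.dumps(payload)
```

API compatibility is what makes the one‑click migration claim credible: existing GPU‑backed clients switch to the NPU deployment by changing only the base URL.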

3. Deployment Cases

Case 1: Video Content Tag Generation – Using Qwen2‑VL on domestic NPU for multimodal video analysis, achieving latency and token throughput comparable to GPU using tens of NPU cards.

Case 2: Logistics Large Model – Fine‑tuning Qwen2‑7B for address parsing and classification on NPU, reaching 91.03% accuracy (GPU 91.08%) and deployed in address classification and pre‑sorting pipelines.

#Input_1
青海省西宁市城北区三其村。可以发圆通吗 谢谢。
(Sanqi Village, Chengbei District, Xining City, Qinghai Province. Can this be shipped via YTO Express? Thanks.)
#Output (Domestic NPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,
#Output (GPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK

Case 3: Merchant‑Side Intelligent Assistant – Fine‑tuning Qwen1.5‑7B on NPU for QA routing, achieving 96% agreement with GPU‑based routing.

#Input_1
上架宝贝数怎么看?
(How do I check the number of listed items?)
#Output (Domestic NPU)
{..."tool_name":"business_expert","query":"如何查看已上架的商品数量?"...}
#Output (GPU)
{..."tool_name":"business_expert","query":"如何查看已上架的商品数量?"...}
(Both queries: "How do I view the number of listed products?")
#Input_2
为啥我不能提报活动了?
(Why can't I submit to promotional campaigns anymore?)
#Output (Domestic NPU)
{..."tool_name":"business_expert","query":"为什么商家不能提报活动,以及如何解决提报问题?"...}
(Query: "Why can't merchants submit to campaigns, and how can submission issues be resolved?")
#Output (GPU)
{..."tool_name":"business_expert","query":"商家无法提报活动的可能原因及解决方案是什么?"...}
(Query: "What are the possible causes of, and solutions for, merchants being unable to submit to campaigns?")

4. Business Value

Core Technology Autonomy: Reduces dependence on foreign chips, ensuring security and controllability across the stack.

Domestic Chip Applicability: Deployed in search and recommendation, ad creative generation, intelligent customer service, and data analysis, providing practical feedback to the domestic chip ecosystem.

5. Industry Impact

2024: JD Retail co‑founded the Openmind open‑source community with Huawei Ascend.

July 2024: Participated in the Ascend AI Industry Summit, showcasing five best‑practice scenarios.

July 2024: Won the JD Retail Platform R&D Center 618 "Technical Dare‑to‑Win" Award.

September 2024: Received the Outstanding Ascend Native Developer award at Huawei Connect 2024.

6. Future Plans

Ten‑Thousand‑Card Cluster: Build a 10k‑card high‑performance network and scheduling capability by 2025, supporting mixed GPU‑NPU workloads.

Domestic Compute Ecosystem: Deepen collaboration with leading domestic chip vendors, enhance HCCL communication, develop unified operator libraries, and open source the training/inference frameworks for CTR and LLM scenarios.


Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.