JD Retail's End‑to‑End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Real‑World Applications
This article details JD Retail's AI engine that seamlessly supports both GPU and domestic NPU hardware, describing its heterogeneous cluster architecture, unified training and inference APIs, performance optimizations, extensive model coverage, and multiple production use cases across e‑commerce, logistics, and intelligent assistance.
In recent years, the rapid rise of domestic AI chips in China has created a critical need for model adaptation, performance optimization, and practical deployment on these chips. JD Retail's Nine‑Number Algorithm Platform (九数算法中台) addresses this by building a full‑stack AI engine that is compatible with both GPU and domestic NPU, spanning from hardware clusters to algorithmic engines and multi‑scenario applications.
Challenges
Significant hardware architecture differences: Existing JD compute clusters were GPU‑centric, while domestic NPU architectures differ greatly, requiring a unified scheduling and resource management system.
Immature software ecosystem: Open‑source frameworks lack native NPU support, leading to high migration costs for precision validation and performance tuning.
Diverse and complex business scenarios: JD Retail’s varied workloads demand a single solution that can be flexibly applied across many use cases.
AI Engine Architecture
The platform builds a thousand‑card‑scale cluster with high‑performance networking, offering identical scheduling capabilities for NPU and GPU. A unified API supports mainstream models, so the same training and deployment code runs on either hardware with no migration cost.
Heterogeneous GPU‑NPU Scheduling System
Thousand‑card cluster: Visual monitoring, health checks, automatic fault isolation, and continuous HDK upgrades ensure stability.
Scheduling optimizations: NUMA‑aware and network‑topology‑aware scheduling, resource fragmentation minimization, and configurable priority eviction guarantee fairness and high utilization.
Efficient resource queues: Shared queues provide guaranteed and elastic resources for both NPU and GPU, maximizing overall cluster usage.
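One piece of the fragmentation-minimization story can be illustrated with a best-fit placement rule: among nodes that can host a job, prefer the one that leaves the fewest stranded cards, so large multi-card jobs are not blocked by scattered single free cards. This is a simplified sketch, not the platform's actual scheduler.

```python
# Best-fit placement sketch: minimize leftover free cards on the chosen node.
from typing import Dict, Optional

def place_job(free_cards: Dict[str, int], need: int) -> Optional[str]:
    """Pick the node whose free capacity most tightly fits the request."""
    candidates = {node: c for node, c in free_cards.items() if c >= need}
    if not candidates:
        return None  # no single node can host the job
    # Smallest remainder after placement => least fragmentation.
    return min(candidates, key=lambda node: candidates[node] - need)

cluster = {"node-a": 8, "node-b": 3, "node-c": 5}
print(place_job(cluster, 3))  # node-b: leaves zero fragment cards
```

A real scheduler would additionally weigh NUMA locality and network topology, as the list above notes, but the fragmentation objective is the same: keep large contiguous blocks of cards free.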
High‑Performance Training Engine
The engine supports over 40 mainstream LLM and multimodal base models (e.g., Qwen, Llama, ChatGLM) with a single API that allows seamless GPU‑NPU switching. Key techniques include MFU optimization, model quantization, Triton compilation, flash attention, dynamic input stitching, and pipeline parallelism, achieving up to 60% MFU on hundred‑card clusters and near‑linear scaling for trillion‑parameter models.
| Model | Scale | GPU Training | Domestic NPU Training | GPU Inference | NPU Inference |
| --- | --- | --- | --- | --- | --- |
| SR1.5 Search‑Ad Large Model | 3B/7B/15B | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 | 0.5B–14B | ✅ | ✅ | ✅ | ✅ |
| ChatGLM2 | 6B | ✅ | ✅ | ✅ | ✅ |

(Additional rows omitted for brevity.)
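The MFU figure quoted above can be understood with the common 6·N FLOPs-per-token approximation for dense transformer training: MFU is the ratio of FLOPs actually performed per second to the cluster's theoretical peak. The numbers below are purely illustrative, not measured results from the article.

```python
# Rough MFU (Model FLOPs Utilization) estimate for dense transformer training,
# using the standard ~6 * n_params FLOPs-per-token approximation.
def estimate_mfu(n_params: float, tokens_per_sec: float,
                 n_cards: int, peak_tflops_per_card: float) -> float:
    achieved = 6.0 * n_params * tokens_per_sec       # FLOP/s actually done
    peak = n_cards * peak_tflops_per_card * 1e12     # FLOP/s the hardware could do
    return achieved / peak

# Illustrative inputs only: a 7B model on a 128-card cluster.
mfu = estimate_mfu(n_params=7e9, tokens_per_sec=420_000,
                   n_cards=128, peak_tflops_per_card=280)
print(f"{mfu:.1%}")  # 49.2%
```

Raising MFU means raising achieved FLOP/s at fixed hardware, which is exactly what techniques like flash attention, dynamic input stitching, and pipeline parallelism target.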
High‑Performance Inference Engine
The inference engine provides MaaS (Model‑as‑a‑Service) with one‑click NPU deployment, supporting 20+ industry‑standard LLMs and offering a 20% performance boost over open‑source frameworks. Optimizations include GE graph compilation, ATB high‑performance operators, W8A8/W4A16 quantization, prefill/decode separation, KV‑cache, and comprehensive monitoring and alerting.
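As one example of the quantization step, here is a minimal sketch of symmetric per-tensor int8 weight quantization, the "W8" in W8A8. Production deployments typically use per-channel scales and calibration data; this is only an illustration of the core idea.

```python
# Symmetric per-tensor int8 quantization sketch: map weights to int8 with a
# single scale, then dequantize on the fly at compute time.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Return (int8 tensor, scale) such that w ~= q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())
print(q.dtype, err < s)  # int8 True -- worst-case error is half a quant step
```

Storing `q` instead of `w` cuts weight memory by 4x versus float32 (2x versus float16), which is where most of the inference speedup from W8A8 comes from on bandwidth-bound decode workloads.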
Real‑World Cases
Case 1: Video Tag Cloud Generation
Using Qwen2‑VL on NPU, JD Retail extracts multi‑modal keywords from videos for tag cloud generation, achieving comparable quality and latency to GPU deployments across dozens of NPU cards.
Case 2: Logistics Address Parsing
Fine‑tuned Qwen2‑7B on NPU reaches 91.03% accuracy in address parsing, matching GPU results and powering large‑scale POI classification and pre‑sorting tasks.
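The reported accuracy is naturally measured as exact match over tagged address segments. The sketch below is a hypothetical evaluation helper; the `token_TAG` format is inferred from the sample NPU/GPU outputs shown below, and the helper names are illustrative.

```python
# Hypothetical exact-match check between predicted and gold "token_TAG" lists,
# tolerant of whitespace and trailing commas in the model output.
from typing import List

def parse_segments(tagged: str) -> List[str]:
    return [seg.strip() for seg in tagged.strip().strip(",").split(",") if seg.strip()]

def exact_match(pred: str, gold: str) -> bool:
    return parse_segments(pred) == parse_segments(gold)

npu_out = "青海省_1,西宁市_3A,城北区_3A,三其村_4B"
gpu_out = "青海省_1, 西宁市_3A, 城北区_3A, 三其村_4B,"
print(exact_match(npu_out, gpu_out))  # True: formatting noise is ignored
```

Accuracy over a test set is then just the fraction of samples where `exact_match` holds, which is how NPU and GPU runs can be compared segment-for-segment.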
#Input_1
青海省西宁市城北区三其村。可以发圆通吗 谢谢。 (Sanqi Village, Chengbei District, Xining City, Qinghai Province. Can this be shipped via YTO Express? Thanks.)
#Output‑NPU (Domestic NPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢_UNK,
#Output‑GPU (GPU)
青海省_1,西宁市_3A,城北区_3A,三其村_4B, _5A-1,可以发圆通吗 谢谢 _UNK,
Case 3: Merchant Smart Assistant
Fine‑tuned Qwen1.5‑7B on NPU provides merchant‑focused QA assistance with 96% tool‑assignment consistency compared to GPU models.
Application Value
Core technology autonomy: Reduces reliance on foreign chips, ensuring security and controllability from hardware to applications.
Domestic chip applicability: Deployed in search, recommendation, ad creation, intelligent customer service, and data analysis, providing feedback to the domestic chip ecosystem.
Industry Impact
2024: Co‑founded the openMind community with Huawei Ascend.
July 2024: Presented five best‑practice scenarios at the Ascend AI Summit.
July 2024: Won JD Retail’s 618 Technical Courage Award.
September 2024: Received the Outstanding Ascend Native Developer award at Huawei CONNECT 2024.
Future Plans
By 2025, JD Retail aims to build a ten‑thousand‑card heterogeneous cluster with mixed GPU‑NPU scheduling, further optimizing resource prediction, dynamic scaling, and emergency pools to fully exploit domestic compute power.
Continued collaboration with leading domestic chip vendors will drive deeper HCCL communication optimizations, fused operator libraries, and open‑source contributions for both LLM and CTR training/inference workloads.
Recommended Reading:
Behind JD Takeaway: Mapping & Trajectory Technologies
Deep Dive into Double‑11 Logistics Guarantees
1‑Second Response, 90% Decision Accuracy: JD Merchant Smart Assistant
350,000+ Merchants Choose the JD AIGC Platform: What Makes Its Content Generation Great?
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.