Huawei Ascend 950 NPU Architecture Deep Dive – Full Whitepaper Inside

The article provides a detailed technical analysis of Huawei's Ascend 950 NPU series, covering its one-chip, two-variant design for training and inference, SIMD/SIMT dual-mode compute, 128-byte memory access granularity, Prefill/Decode (PD) separation, native FP4 support, the high-bandwidth Lingqu 2.0 interconnect, and a fully self-developed yet CUDA-compatible ecosystem.

Architects' Tech Alliance

1. One‑Chip Dual‑Structure: Splitting Training and Inference

The Ascend 950 family builds two variants from a single die: the 950PR, optimized for large-model prefill and recommendation workloads, and the 950DT, targeting training and long-text decoding. The 950PR, mass-produced from March 2026, carries 128 GB of HiBL 1.0 high-bandwidth memory at 1.6 TB/s, supports the FP8, MXFP8, and HiF8 low-precision formats, and delivers 1 PFLOPS of FP8 compute for fast prefill and KV-cache generation. The 950DT, slated for Q4 2026, moves to 144 GB of memory at 4 TB/s, which is 2.5× the 950PR's bandwidth. The source credits it with a 1.5× performance gain over the 950PR and 2 PFLOPS of FP4 compute, relieving the memory-bandwidth bottleneck of token-by-token decoding.
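To see why memory bandwidth matters for decoding, the short sketch below estimates how long each variant needs to read its entire memory once at peak bandwidth, using only the capacity and bandwidth figures quoted above. It is a rough back-of-the-envelope bound, not a benchmark: it assumes decimal units (1 GB = 10^9 bytes), peak bandwidth that is fully sustained, and the function name `sweep_time_ms` is introduced here purely for illustration.

```python
def sweep_time_ms(capacity_gb: float, bandwidth_tb_s: float) -> float:
    """Milliseconds needed to read the whole memory once at peak bandwidth."""
    capacity_bytes = capacity_gb * 1e9            # decimal gigabytes (assumption)
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12  # decimal terabytes per second
    return capacity_bytes / bandwidth_bytes_per_s * 1e3


# (capacity in GB, bandwidth in TB/s) as quoted in the article
VARIANTS = {"950PR": (128, 1.6), "950DT": (144, 4.0)}

for name, (capacity, bandwidth) in VARIANTS.items():
    print(f"{name}: {sweep_time_ms(capacity, bandwidth):.1f} ms per full memory sweep")
```

On these figures a full sweep takes about 80 ms on the 950PR and about 36 ms on the 950DT. If every decode step has to stream most of the resident weights, that gap bounds the time per generated token.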

2. Architectural Revolution: From Da Vinci to "GPU‑like" Design

2.1 SIMD/SIMT Dual‑Mode Co‑existence

The compute cores implement a dual SIMD/SIMT programming model. In SIMD mode, one instruction operates on a whole vector in pipelined fashion, which maximizes throughput on regular workloads such as recommendation systems and computer vision. In SIMT mode, many threads run the same program on their own data and may take different control paths, which suits fragmented, irregular work such as long-text NLP and large-model decoding. The chip can therefore serve both structured and irregular workloads.
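The difference between the two modes can be illustrated with a small conceptual model in plain Python. This is only a sketch of the programming styles, not Ascend code or any Huawei API: `simd_add`, `simt_launch`, and `sum_kernel` are hypothetical names, and the loops run sequentially where real hardware would run lanes or threads in parallel.

```python
from typing import Callable, List


def simd_add(a: List[float], b: List[float]) -> List[float]:
    """SIMD style: a single 'add' applied to every lane of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("SIMD operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def simt_launch(kernel: Callable[[int], float], num_threads: int) -> List[float]:
    """SIMT style: every thread runs the same scalar kernel with its own thread id."""
    return [kernel(tid) for tid in range(num_threads)]


# Irregular workload: each thread reduces a sequence of a different length.
sequences = [[1.0], [1.0, 2.0, 3.0], [], [4.0, 5.0]]


def sum_kernel(tid: int) -> float:
    total = 0.0
    for value in sequences[tid]:  # loop length differs per thread (divergence)
        total += value
    return total


print(simd_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))  # regular data
print(simt_launch(sum_kernel, len(sequences)))                   # ragged data
```

The SIMD function needs operands of identical shape, while the SIMT launch tolerates per-thread loop lengths. That is the property that makes the second style a better fit for ragged inputs such as variable-length sequences.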

2.2 Memory Subsystem Optimized to 128‑Byte Granularity

Memory access granularity drops from the previous 512 bytes to 128 bytes. With sparse or scattered data, a smaller granule means fewer unneeded bytes are fetched alongside each useful value. The source reports an efficiency gain of more than 30% for large-model decoding and recommendation scenarios.
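The effect of access granularity on wasted bandwidth can be sketched with a toy model in which every touched granule is fetched in full. The access pattern below (64 reads of 32 bytes, spaced 4 KiB apart) is a made-up worst case chosen for illustration, and the helper names are hypothetical. Real gains depend on the workload, which is why the figure reported above is far smaller than this extreme case.

```python
from typing import Iterable, Tuple

Access = Tuple[int, int]  # (byte address, size in bytes)


def bytes_transferred(accesses: Iterable[Access], granule: int) -> int:
    """Total bytes moved when each touched granule is fetched whole, once."""
    touched = set()
    for addr, size in accesses:
        first = addr // granule
        last = (addr + size - 1) // granule
        touched.update(range(first, last + 1))
    return len(touched) * granule


def efficiency(accesses: Iterable[Access], granule: int) -> float:
    """Useful bytes divided by transferred bytes (assumes accesses do not overlap)."""
    accesses = list(accesses)
    useful = sum(size for _, size in accesses)
    return useful / bytes_transferred(accesses, granule)


# Hypothetical sparse pattern: 64 small reads scattered across memory.
sparse = [(i * 4096, 32) for i in range(64)]

for granule in (512, 128):
    print(f"{granule:>3} B granule: {bytes_transferred(sparse, granule)} bytes moved, "
          f"efficiency {efficiency(sparse, granule):.4f}")
```

For this pattern the 512-byte granule moves 32768 bytes to deliver 2048 useful ones, while the 128-byte granule moves 8192 bytes for the same data.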

2.3 PD Separation Architecture

Prefill/Decode (PD) separation decouples the compute and memory resources used by the two inference phases. Prefill is compute-heavy and needs comparatively little memory bandwidth; decode is bandwidth-heavy and needs comparatively little compute. Matching each phase to suitable resources is reported to cut inference latency by 50% and double concurrency, removing the classic "one card cannot serve all" limitation.
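A simple roofline-style estimate shows why the two phases stress different resources. The sketch assumes a dense layer that performs 2 FLOPs per parameter per token and reads each weight once per pass, ignores KV-cache and activation traffic, and uses the 950PR figures quoted in section 1 (1 PFLOPS FP8, 1.6 TB/s). The 2048-token prompt and the function names are illustrative assumptions, not values from the source.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per weight byte: 2 FLOPs (multiply + add) per parameter per token."""
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity above which the chip is limited by compute rather than bandwidth."""
    return peak_flops / bandwidth_bytes_per_s


def bound(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity >= ridge else "bandwidth-bound"


RIDGE_950PR = ridge_point(1e15, 1.6e12)   # FP8 peak and bandwidth from section 1

prefill = arithmetic_intensity(tokens_per_pass=2048, bytes_per_param=1.0)  # FP8
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=1.0)      # FP8

print(f"ridge point: {RIDGE_950PR:.0f} FLOP/byte")
print(f"prefill: {prefill:.0f} FLOP/byte -> {bound(prefill, RIDGE_950PR)}")
print(f"decode:  {decode:.0f} FLOP/byte -> {bound(decode, RIDGE_950PR)}")
```

Under these assumptions prefill lands far above the ridge point and single-stream decode far below it, which is the imbalance that motivates giving each phase its own hardware profile.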

2.4 Full‑Stack Self‑Developed + Ecosystem Compatibility

Every layer of the stack, from the instruction set to the interconnect protocol, is designed in-house, while compatibility with core CUDA APIs is maintained. According to the source, this lets large models developed overseas migrate directly without code rewrites, lowering the barrier to entering the ecosystem while preserving security and autonomy.

3. Low‑Precision Breakthrough: Native FP4 Support

Ascend 950 natively supports FP4 (4-bit ultra-low precision) alongside FP8, MXFP8, and HiF8. FP4 weights take one-quarter the memory of FP16 and half that of FP8, so a single card's 144 GB holds as many values as 576 GB would at FP16, or roughly 288 billion FP4 parameters. A trillion-parameter model still needs about 500 GB for FP4 weights alone, so it spans several cards rather than fitting on one chip. The source quotes 2 PFLOPS of FP4 compute against 0.543 PFLOPS for Nvidia H100, a ratio of about 3.7× on those figures (the source states 2.87×), and a 70% cut in high-concurrency inference latency.
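The capacity arithmetic can be checked with a few lines of Python. The sketch counts weight storage only and assumes exactly 4, 8, or 16 bits per parameter, decimal gigabytes, and no overhead for scaling factors, KV cache, or activations, so real deployments will fit fewer parameters than shown. The helper names are introduced here for illustration.

```python
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_bytes(num_params: float, fmt: str) -> float:
    """Bytes needed to store the weights alone in the given format."""
    return num_params * BITS_PER_PARAM[fmt] / 8


def max_params(capacity_bytes: float, fmt: str) -> float:
    """Upper bound on parameters that fit in the given capacity."""
    return capacity_bytes * 8 / BITS_PER_PARAM[fmt]


CAPACITY_950DT = 144e9  # 144 GB, decimal units (assumption)

for fmt in BITS_PER_PARAM:
    billions = max_params(CAPACITY_950DT, fmt) / 1e9
    print(f"{fmt}: up to {billions:.0f} billion parameters in 144 GB")

trillion_fp4_gb = weight_bytes(1e12, "FP4") / 1e9
print(f"1 trillion FP4 parameters need {trillion_fp4_gb:.0f} GB of weights")
```

The 4× ratio between the FP16 and FP4 parameter counts is the same ratio behind the 576 GB FP16-equivalent figure.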

4. Lingqu 2.0 Interconnect: 8192‑Card Full Mesh

The Lingqu 2.0 interconnect provides 2 TB/s of bandwidth and cuts single-hop latency from 2 µs to 200 ns, a 10× improvement. A fully optical mesh topology raises rack-to-rack bandwidth tenfold and holds cross-rack latency to 7 µs, enabling full-mesh clusters of 8192 cards. The Atlas 950 supernode directly connects 8192 cards for 16.3 PB/s of total bandwidth, which the source puts at 62× that of Nvidia NVLink, a scale aimed at training trillion-parameter models.
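The cluster-level figures follow from simple arithmetic on the per-card numbers, as the sketch below shows. It assumes the total is just cards multiplied by per-card bandwidth in decimal units, and it counts pairwise links as if the full mesh were a logical all-to-all graph. The source does not describe the physical wiring, so the link count is illustrative only.

```python
def aggregate_bandwidth_pb_s(cards: int, per_card_tb_s: float) -> float:
    """Sum of per-card bandwidth, converted from TB/s to PB/s (decimal)."""
    return cards * per_card_tb_s / 1000


def full_mesh_links(nodes: int) -> int:
    """Number of distinct node pairs in an all-to-all graph: n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2


def latency_improvement(old_ns: float, new_ns: float) -> float:
    return old_ns / new_ns


CARDS = 8192

print(f"aggregate bandwidth: {aggregate_bandwidth_pb_s(CARDS, 2.0):.3f} PB/s")
print(f"pairwise links in a logical full mesh: {full_mesh_links(CARDS):,}")
print(f"single-hop latency improvement: {latency_improvement(2000, 200):.0f}x")
```

The product of 8192 cards and 2 TB/s is 16.384 PB/s, in line with the roughly 16.3 PB/s total cited above.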

5. Breaking the Barrier: Autonomous AI Compute Ecosystem

Beyond the chip itself, Ascend 950 is the core of Huawei's end-to-end autonomous AI stack. Hardware, memory, interconnect, and software toolchain are described as 100% self-controlled, which removes supply-chain risk and vendor lock-in. The source puts its cost at one-quarter of Nvidia H2 while claiming superior performance. The ecosystem spans native large-model support, domestic servers, and operating systems, forming a complete "chip-server-model-application" chain.

Conclusion

The Ascend 950's design aims to be precise, efficient, autonomous, and open: a one-chip dual-structure for scenario-specific optimization, dual-mode SIMD/SIMT flexibility, FP4 low-precision efficiency, and the Lingqu 2.0 interconnect for massive clusters. It positions itself not as a follower but as a standard-setter for Chinese AI chips, providing a compute foundation for the next wave of trillion-parameter models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
