In‑Depth Analysis of Loongson GS464E CPU Architecture and Performance

This article provides a comprehensive technical review of the Chinese Loongson GS464E processor, covering its micro‑architectural design choices, instruction‑fetch and out‑of‑order execution units, cache hierarchy, benchmark results, manufacturing details, and the challenges it faces in competing with mainstream Intel and AMD CPUs.

Architects' Tech Alliance

Sep 28, 2020

The Loongson (龙芯) project has long been a focal point of Chinese CPU development. The newly disclosed GS464E micro‑architecture, used in the 3A2000/3B2000 series, has sparked intense interest in how its performance compares with mainstream Intel and AMD processors.

Historically, Loongson adopted the MIPS ISA because the project's early design goals (circa 2000) pre‑dated high‑performance ARM cores. At that time the MIPS and DEC Alpha ecosystems offered more mature foundations for high‑performance designs. The project has remained ISA‑compatible ever since while pursuing independent micro‑architectural and layout design.

The front‑end fetch unit shows notable advances. It has a 64 KB, four‑way set‑associative instruction cache (larger than the L1 instruction cache of IBM's POWER7) and fetches 8 instructions per cycle (32 bytes per cycle). It also adds a loop detector and loop buffer, similar to those in Intel's Sandy Bridge, that can hold up to 56 loop instructions.
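The fetch figures above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes fixed 4-byte MIPS instructions (which is what makes 8 instructions equal 32 bytes) and a 64-byte cache line, a value the article does not state. All function and constant names are hypothetical.

```python
import math

# Figures taken from the text.
FETCH_WIDTH = 8              # instructions fetched per cycle
ICACHE_BYTES = 64 * 1024     # 64 KB L1 instruction cache
ICACHE_WAYS = 4              # four-way set associative
LOOP_BUFFER_ENTRIES = 56     # loop instructions the loop buffer can hold

# Assumptions, not stated in the article.
INSTRUCTION_BYTES = 4        # ASSUMPTION: fixed-width 32-bit MIPS instructions
ASSUMED_LINE_BYTES = 64      # ASSUMPTION: cache line size


def fetch_bytes_per_cycle(width=FETCH_WIDTH, insn_bytes=INSTRUCTION_BYTES):
    """Bytes delivered by the fetch unit each cycle."""
    return width * insn_bytes


def cache_sets(capacity_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * line_bytes)


def fetch_groups_for_loop(entries=LOOP_BUFFER_ENTRIES, width=FETCH_WIDTH):
    """Fetch groups needed to cover the largest loop the buffer can hold."""
    return math.ceil(entries / width)


if __name__ == "__main__":
    print("fetch bytes/cycle:", fetch_bytes_per_cycle())
    print("I-cache sets (assumed 64 B lines):",
          cache_sets(ICACHE_BYTES, ICACHE_WAYS, ASSUMED_LINE_BYTES))
    print("fetch groups for a 56-instruction loop:", fetch_groups_for_loop())
```

With a different line size the set count changes proportionally; the 32 bytes per cycle figure depends only on the instruction width.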

Conversely, the design also reveals weaknesses: a shared 16‑entry MSHR for instruction and data caches, a 64‑entry fully‑associative L1 instruction TLB without a second‑level TLB, and a modest issue width (estimated 4–6 instructions) compared with Intel’s wider dispatch pipelines.
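To see why a small TLB and a shared MSHR pool matter, two standard rules of thumb are sketched below: TLB reach (entries times page size) and a Little's-law ceiling on miss bandwidth (outstanding misses times line size, divided by miss latency). The page sizes, line size, and miss latency used here are hypothetical placeholders, not GS464E specifications; only the 64-entry and 16-entry counts come from the text.

```python
ITLB_ENTRIES = 64    # from the text: 64-entry L1 instruction TLB
MSHR_ENTRIES = 16    # from the text: 16 MSHRs shared by both L1 caches


def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be touched without a TLB miss: entries x page size."""
    return entries * page_bytes


def mshr_bandwidth_bound(mshrs, line_bytes, miss_latency_ns):
    """Little's-law ceiling on miss bandwidth, in bytes per second.

    At most `mshrs` misses can be in flight at once, and each one
    returns `line_bytes` after `miss_latency_ns` nanoseconds.
    """
    return mshrs * line_bytes / (miss_latency_ns * 1e-9)


if __name__ == "__main__":
    # ASSUMPTION: page sizes chosen only for illustration.
    for page_kib in (4, 16):
        reach = tlb_reach_bytes(ITLB_ENTRIES, page_kib * 1024)
        print(f"{page_kib} KiB pages -> reach {reach // 1024} KiB")

    # ASSUMPTION: 64-byte lines and a 100 ns miss latency are placeholders.
    bound = mshr_bandwidth_bound(MSHR_ENTRIES, 64, 100)
    print(f"miss bandwidth ceiling: {bound / 1e9:.2f} GB/s")
```

Because the MSHRs are shared, instruction misses and data misses draw from the same pool of 16, so a burst of one kind lowers the ceiling available to the other.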

The cache hierarchy comprises a 64 KB, four‑way L1 data cache with serial tag‑then‑data access, a 256 KB “Victim Cache” that functions as a private L2, and a 1 MB per‑core SCache (the four cores' slices combine into a 4 MB shared L3). This sliced last‑level cache indicates a move toward NoC‑style interconnects for future many‑core scaling.
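The idea of a victim cache acting as a private L2 can be illustrated with a toy model: lines evicted from L1 are placed in the victim cache, and an L1 miss that hits there moves the line back to L1. The sketch below is a simplification under stated assumptions, not the GS464E design. Both levels are modeled as fully associative with LRU replacement, sizes are counted in lines rather than bytes, and the class names are hypothetical.

```python
from collections import OrderedDict


class LruStore:
    """Tiny fully associative store with LRU replacement (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # oldest entry first

    def hit(self, line):
        """Return True on a hit and mark the line most recently used."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def remove(self, line):
        self.lines.pop(line, None)

    def insert(self, line):
        """Insert a line; return the evicted line, or None if nothing left."""
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            evicted, _ = self.lines.popitem(last=False)
            return evicted
        return None


class VictimHierarchy:
    """L1 backed by an exclusive victim cache filled only by L1 evictions."""

    def __init__(self, l1_lines, victim_lines):
        self.l1 = LruStore(l1_lines)
        self.victim = LruStore(victim_lines)

    def access(self, line):
        """Return 'l1', 'victim', or 'miss' for one line address."""
        if self.l1.hit(line):
            return "l1"
        if line in self.victim.lines:
            self.victim.remove(line)   # the line moves back up to L1
            where = "victim"
        else:
            where = "miss"             # would go on to the next level
        evicted = self.l1.insert(line)
        if evicted is not None:
            self.victim.insert(evicted)  # L1 victims refill the victim cache
        return where


if __name__ == "__main__":
    h = VictimHierarchy(l1_lines=2, victim_lines=2)
    print([h.access(a) for a in (0, 1, 2, 0, 1, 0)])
```

In this toy trace, lines that fall out of the two-entry L1 are recovered from the victim cache instead of going to the next level, which is the benefit a victim-style L2 aims for.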

Benchmark data (SPEC CPU2000, Dhrystone, CoreMark, Whetstone, etc.) collected at a 1 GHz clock show integer IPC gains of ~104% and floating‑point gains of ~278% over the previous generation. At equal frequency, performance approaches that of a Sandy Bridge Core i5‑2300 running at 1 GHz. Absolute throughput, however, remains roughly 20–30% of a comparable Haswell core because of the low clock frequency and the 40 nm process.
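Percentage gains are easy to misread, so it helps to convert them to speedup factors: a gain of 104% means roughly 2.04 times the previous IPC, and 278% means roughly 3.78 times. The sketch below does that conversion and also shows how an IPC ratio and a clock ratio multiply into relative throughput. The inputs to `relative_throughput` in the example are hypothetical placeholders, not measured values from the article.

```python
def speedup_from_gain(gain_percent):
    """A gain of g% means the new value is (1 + g/100) times the old one."""
    return 1.0 + gain_percent / 100.0


def relative_throughput(ipc_ratio, clock_ratio):
    """Throughput scales as IPC x clock, so the two ratios multiply."""
    return ipc_ratio * clock_ratio


if __name__ == "__main__":
    # Gains quoted in the text, measured at 1 GHz.
    print(f"integer IPC speedup: {speedup_from_gain(104):.2f}x")
    print(f"floating-point IPC speedup: {speedup_from_gain(278):.2f}x")

    # HYPOTHETICAL inputs, for illustration only: a core with 0.8x the
    # IPC of a rival and one third of its clock frequency.
    print(f"relative throughput: {relative_throughput(0.8, 1 / 3):.2f}")
```

The second function makes the article's point concrete: even near IPC parity, a large clock deficit pulls absolute throughput down to a fraction of the rival's.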

While Loongson has secured military and aerospace contracts, its civilian market prospects remain limited by the performance gap with Intel/AMD; future improvements will depend on moving to finer‑node processes (e.g., 28 nm) and further micro‑architectural refinements.
