Implementation and Performance Evaluation of a Domestic ARM‑Based High‑Performance Computing Cluster at Shanghai Jiao Tong University
The article describes how Shanghai Jiao Tong University built a campus‑level HPC platform using Huawei Kunpeng 920 ARM processors, detailing system architecture, unified storage and scheduling, containerized software deployment, network topology, Lustre file system integration, and performance results of LAMMPS and GATK compared with traditional X86 clusters.
China has achieved notable progress in high‑performance computing, but most university‑level clusters still rely on X86 CPUs; to promote domestic processor adoption, Shanghai Jiao Tong University constructed its first campus‑wide ARM‑based HPC platform using Huawei Kunpeng 920 processors.
The project addressed three major challenges: unfamiliar user workflows on ARM, the need to recompile and adapt mainstream X86 software, and the lack of performance‑tuned applications for the new architecture.
Three key solutions were implemented: (1) mounting a unified parallel file system (Lustre) and a SLURM job scheduler to provide a consistent user experience across heterogeneous clusters; (2) using containers (Singularity) to rapidly deploy pre‑compiled ARM‑compatible HPC applications; and (3) performing correctness verification and performance tuning of the pre‑compiled software.
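The combination of SLURM scheduling and Singularity containers described above might look like the following job script. This is an illustrative sketch only: the partition name, image path, and input file are assumptions, not details from the article.

```shell
#!/bin/bash
#SBATCH --job-name=lmp_arm          # hypothetical job name
#SBATCH --partition=arm128          # assumed name for the ARM partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128      # one MPI rank per core
#SBATCH --time=01:00:00

# Launch a pre-built aarch64 LAMMPS image through Singularity; the
# image location under the shared Lustre mount is illustrative.
module load openmpi singularity
mpirun singularity exec /lustre/share/images/lammps-aarch64.sif \
    lmp -in in.eam
```

Because the container carries its own ARM-compiled binaries and libraries, users keep the same `sbatch`/`srun` workflow they know from the X86 cluster.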
The ARM cluster comprises 100 compute nodes, each equipped with dual Kunpeng 920 CPUs (128 cores per node) and 192 GB of DDR4‑2933 memory; it connects to the existing X86 and GPU clusters through a shared Lustre file system and high‑speed InfiniBand (100 Gbps) and Omni‑Path (100 Gbps) networks.
The network topology integrates five 40‑port InfiniBand switches and three LNet router nodes, forming a fat‑tree architecture that delivers up to 10 TB/s aggregate bandwidth between the access layer and compute nodes, ensuring 100 Gbps available bandwidth between any two nodes.
Lustre client installation on the ARM nodes required compiling version 2.12.4 against the custom CentOS 7.6 kernel and configuring distinct LNET labels to isolate the ARM cluster from X86 and storage networks.
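A client build and LNET setup of the kind described might proceed roughly as follows. This is a hedged sketch: the repository URL and build steps follow standard Lustre practice, while the interface name and the `o2ib1` network label are hypothetical stand-ins for the cluster's actual labels.

```shell
# Build the Lustre 2.12.4 client against the running CentOS 7.6 kernel
# (client-only build; server components are not needed on compute nodes).
yum install -y kernel-devel-"$(uname -r)" gcc make libtool automake autoconf
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release && git checkout 2.12.4
sh autogen.sh
./configure --disable-server --with-linux=/usr/src/kernels/"$(uname -r)"
make -j"$(nproc)" && make install

# Give the ARM cluster its own LNET network label so its traffic is
# isolated from the X86 and storage networks; "o2ib1" and "ib0" are
# illustrative, not the article's actual values.
cat > /etc/modprobe.d/lustre.conf <<'EOF'
options lnet networks="o2ib1(ib0)"
EOF
modprobe lustre
```

The distinct LNET label is what lets the LNet router nodes forward traffic between the ARM fabric and the storage fabric without merging the two networks.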
Performance validation used LAMMPS (the standard EAM and LJ benchmarks) and GATK 4.2. On ARM, LAMMPS ran roughly twice as fast as a baseline Intel Xeon platform without Intel‑specific acceleration packages, and maintained a 1.5× advantage when scaled to 16 nodes; once the Intel acceleration package was enabled on X86, ARM performance fell to about 60 % of the accelerated X86 platform. GATK modules lacking Intel‑specific optimizations ran at 50–70 % of X86 speed.
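The LAMMPS comparison used the stock benchmark inputs shipped in the `bench/` directory of the LAMMPS source tree, so a multi-node run of the kind reported could be reproduced along these lines. The partition name, core count, and image name are assumptions, not details from the article.

```shell
# Run the standard LAMMPS EAM and LJ benchmarks across 16 ARM nodes;
# partition name, ranks per node, and image path are illustrative.
for input in in.eam in.lj; do
    srun --partition=arm128 --nodes=16 --ntasks-per-node=128 \
        singularity exec /lustre/share/images/lammps-aarch64.sif \
        lmp -in "bench/${input}"
done
```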
Overall, the new topology enables the ARM cluster to share the same parallel file system as existing X86 resources, provides containerized access to over 30 HPC applications, and attains 60‑70 % of X86 performance while achieving over 70 % average monthly utilization during the 2021 pilot run.
Architects' Tech Alliance