Performance Evaluation of Cloud Music Online Estimation System on NUMA Architecture
Evaluating the Cloud Music online estimation system on NUMA-based servers showed that pinning one service instance to each memory node dramatically boosts throughput on high-end 96-core machines, by up to 75% for complex models, while low-end servers gain only modestly. The results confirm that NUMA-aware scheduling is critical for CPU-intensive inference workloads.
In recommendation services, the classic three‑layer logic consists of recall, ranking, and re‑ranking, with the ranking layer being the most complex and challenging. The online estimation system provides real‑time inference for the ranking stage.
The Cloud Music online estimation system has been iterated for over three years and now serves multiple scenarios such as music, search, live streaming, and innovative services.
The estimation pipeline includes three stages:
Feature query: retrieve user, scene, and item features from distributed storage.
Feature extraction: compute model‑required inputs (e.g., embeddings, hashes) from the queried features.
Forward inference: feed the extracted features into a machine‑learning library, perform matrix operations, and output a score for each user‑item pair.
Feature query is I/O‑bound, while feature extraction and forward inference are CPU‑bound, with the inference stage demanding intensive matrix multiplications and heavy memory allocation.
The system runs on dozens of network clusters comprising hundreds of physical machines. Early deployments used Intel Xeon E5 56‑core servers; later upgrades switched to Xeon Gold 96‑core servers. The high‑end machines have at least twice the compute capacity of the low‑end ones, but performance scaling is not linear in practice.
To investigate why the high‑end machines do not achieve the expected linear speedup, the architecture of the servers was examined. Modern servers have transitioned from Uniform Memory Access (UMA) to Non‑Uniform Memory Access (NUMA). In UMA, all CPUs share the same memory and bus, leading to limited scalability as CPU count grows. NUMA introduces separate memory nodes per CPU, providing fast local memory access while remote memory access incurs higher latency.
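On Linux, the NUMA layout described above can be inspected directly before deployment. A minimal check, assuming the `numactl` package is installed on the host:

```shell
# Show NUMA nodes, the CPUs and memory attached to each node,
# and the inter-node distance matrix (larger = slower remote access).
numactl --hardware

# A shorter summary: NUMA node count and per-node CPU lists.
lscpu | grep -i numa
```

On the 2-node production servers described here, `numactl --hardware` would report two nodes, each with its own CPU list and memory size.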
All production servers are configured with a 2-node NUMA layout. To assess the impact of NUMA locality on the estimation system, three deployment modes were tested: single-node, dual-node without CPU pinning, and dual-node with CPU pinning (using the numactl command).
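As a concrete sketch of the three modes on a 2-node box (the binary name `predict_server` and the ports are hypothetical; the pinning pattern is what matters):

```shell
# Mode 1: single node — confine one instance's CPUs and memory to node 0.
numactl --cpunodebind=0 --membind=0 ./predict_server --port 8000

# Mode 2: dual node without pinning — one instance free to use all CPUs;
# the kernel may migrate threads across nodes, causing remote memory accesses.
./predict_server --port 8000

# Mode 3: dual node with pinning — one instance per node, each bound to its
# local CPUs and memory, so hot allocations stay node-local.
numactl --cpunodebind=0 --membind=0 ./predict_server --port 8000 &
numactl --cpunodebind=1 --membind=1 ./predict_server --port 8001 &
```

In mode 3, a load balancer in front of the two ports would split traffic between the instances.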
Two models were used: a complex scenario (Model A) and a simple scenario (Model B). For each deployment mode, the request processing capacity was measured when CPU utilization reached 60%.
Low‑End Machine Results
Model A and Model B showed modest gains with dual-node pinning (roughly a 10-20% increase in request throughput), though Model A's latency rose slightly. Dual-node without pinning sometimes reduced throughput (about a 10% drop for Model B). Overall, NUMA locality had little effect on the low-end machines: with limited compute resources, memory contention was low to begin with.
High‑End Machine Results
For both models, dual‑node pinning yielded substantial throughput improvements: 75% for Model A and 49% for Model B compared with single‑node deployment. Dual‑node without pinning performed the worst due to increased thread count, leading to higher memory contention and thread‑switch overhead.
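This kind of contention shows up in the kernel's NUMA allocation counters. One way to confirm it under load (`numastat` ships with the numactl package; the process name is hypothetical):

```shell
# Per-node allocation counters: growing numa_miss / numa_foreign counts mean
# memory is being allocated off the requesting CPU's local node.
numastat

# Per-process view: how much of the service's resident memory sits on each node.
numastat -p "$(pgrep -f predict_server | head -n1)"
```

Under the unpinned dual-node mode one would expect the miss counters to climb, while the pinned mode keeps almost all allocations node-local.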
Conclusion
On the high-end machines, dual-node pinning increased request processing capacity by 169% for Model A and 112% for Model B relative to the low-end machines, in line with the expectation that at least doubling the compute resources should at least double throughput. Across the three deployment modes, the performance hierarchy is:
Dual-node pinning > Single-node > Dual-node without pinning.
The findings demonstrate that NUMA‑aware CPU pinning can significantly improve the performance of CPU‑intensive online inference workloads on modern multi‑core servers.
NetEase Cloud Music Tech Team