Mastering NUMA on Linux: Optimize Memory Allocation with numactl
This guide explains NUMA memory hierarchy, shows how to install and use the numactl command, interprets hardware and NUMA statistics, and presents memory allocation strategies to improve performance on multi‑node Linux systems.
Preparing the Environment
The examples assume Ubuntu 16.04 but work on other Linux distributions. The test machine has 32 CPUs and 64 GB RAM.
NUMA Storage Hierarchy
The NUMA storage hierarchy has four layers:
1) Processor layer: a single physical core.
2) Local node layer: all processors within the same node.
3) Home node layer: nodes adjacent to the local node.
4) Remote node layer: non-local, non-adjacent nodes.
Access latency increases with node distance, so keeping a process on a single CPU module can greatly improve performance.
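The kernel exposes these inter-node distances through sysfs. A minimal sketch for inspecting them, assuming a Linux system with NUMA topology information available:

```shell
# Print the ACPI SLIT distances from node 0 to every node.
# By convention 10 means "local"; larger values mean higher access latency.
if [ -r /sys/devices/system/node/node0/distance ]; then
    cat /sys/devices/system/node/node0/distance
else
    echo "no NUMA topology exposed on this machine"
fi
```

On a single-node machine this prints just `10`; on a multi-node machine the remote entries are typically 20 or more.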
CPU chip composition (Kunpeng 920 example)
The Kunpeng 920 SoC groups six core clusters, two I/O clusters, and four DDR controllers into a single chip. Each chip integrates four 72‑bit DDR4 channels (up to 3200 MT/s) supporting up to 512 GB × 4 DDR. L3 cache is split into TAG and DATA parts; TAG resides in each core cluster to reduce latency, while DATA connects to the on‑chip bus. The Hydra Home Agent handles cache coherence across chips, and a GICD module provides interrupt distribution compatible with ARM GICv4. Only one GICD is visible to the OS when multiple clusters exist.
Using numactl
Install the numactl tool (not installed by default) on Ubuntu:
<code>sudo apt install numactl -y</code>
Check the manual with <code>man numactl</code> or <code>numactl --help</code>. View the system's NUMA configuration:
<code>numactl --hardware</code>
Sample output shows four nodes, each with eight CPUs and about 16 GB of memory, plus the L3 cache allocation per node.
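For reference, the output on such a four-node machine looks roughly like the following. These are illustrative values reconstructed from the description above (the distance matrix in particular is a typical 10/20 pattern, not captured from the original system):

```
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
```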
The <code>numastat</code> command reports statistics such as <code>numa_hit</code>, <code>numa_miss</code>, <code>numa_foreign</code>, <code>interleave_hit</code>, <code>local_node</code>, and <code>other_node</code>. A high <code>numa_miss</code> value indicates the allocation policy needs adjustment, for example by binding processes to specific CPUs.
<code>root@ubuntu:~# numastat
node0 node1 node2 node3
numa_hit 19480355292 11164752760 12401311900 12980472384
numa_miss 5122680 122652623 88449951 7058
numa_foreign 122652643 88449935 7055 5122679
interleave_hit 12619 13942 14010 13924
local_node 19480308881 11164721296 12401264089 12980411641
other_node 5169091 122684087 88497762 67801</code>
NUMA Memory Allocation Strategies
Common options for <code>numactl</code>:
<code>--localalloc</code> or <code>-l</code>: allocate memory from the local node.
<code>--membind=nodes</code> or <code>-m nodes</code>: restrict allocation to the specified nodes.
<code>--preferred=node</code>: prefer the given node, falling back to others if it has no free memory.
<code>--interleave=nodes</code> or <code>-i nodes</code>: allocate memory round-robin across the specified nodes.
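As a sketch, the four policies are applied by prefixing the workload's command line. Here <code>./app</code> is a placeholder for any program:

```shell
numactl --localalloc ./app        # allocate on whichever node the task runs
numactl --membind=0,1 ./app       # allocations must come from nodes 0-1, or fail
numactl --preferred=1 ./app       # try node 1 first, then fall back to other nodes
numactl --interleave=0-3 ./app    # spread pages round-robin across nodes 0-3
```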
<code>numactl --interleave=all mongod -f /etc/mongod.conf</code>
Because the default NUMA policy prefers local memory, an imbalance can push a node with insufficient free memory into swap, the so-called "swap insanity" phenomenon, causing severe performance degradation. Operators should monitor per-node memory distribution and tune system parameters (e.g., memory reclaim behavior and swap tendency) to avoid excessive swapping.
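Two kernel knobs commonly inspected in this situation are <code>vm.zone_reclaim_mode</code> and <code>vm.swappiness</code>. A read-only sketch, assuming a Linux system:

```shell
# 0 means the kernel may satisfy an allocation from a remote node instead
# of reclaiming (and possibly swapping) locally first.
cat /proc/sys/vm/zone_reclaim_mode
# Lower swappiness makes the kernel less eager to swap anonymous pages.
cat /proc/sys/vm/swappiness
```

Whether either value should be changed depends on the workload; the point is to check them before blaming the application.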
Node → Socket → Core → Processor
Modern CPUs are packaged into sockets; each socket contains multiple cores, and hyper‑threading creates logical processors (threads). In terminology, a socket corresponds to a NUMA node, a core is a physical CPU, and a thread is a logical CPU (processor).
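This socket/core/processor mapping can be read directly from sysfs. A sketch assuming the standard Linux per-CPU topology files:

```shell
# For logical CPU 0, show which socket (physical package) and
# which physical core it belongs to.
t=/sys/devices/system/cpu/cpu0/topology
if [ -d "$t" ]; then
    echo "cpu0 -> socket $(cat "$t"/physical_package_id), core $(cat "$t"/core_id)"
fi
```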
Using lscpu
Typical output fields:
Architecture
CPU(s): logical CPU count
Thread(s) per core
Core(s) per socket
Socket(s)
L1d cache, L1i cache, L2 cache, L3 cache
NUMA node0 CPU(s), etc.
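Besides this human-readable summary, lscpu can emit one line per logical CPU, which makes the thread/core/socket/node mapping explicit (assuming the util-linux <code>lscpu</code> is available):

```shell
# Parsable topology: each data line is CPU,Core,Socket,NUMA-node.
if command -v lscpu >/dev/null 2>&1; then
    lscpu -p=CPU,CORE,SOCKET,NODE | head -n 8
fi
```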
Example:
<code>root@ubuntu:~# lscpu
Architecture: x86_64
CPU(s): 32
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 4
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31</code>
Preview
Next, we will discuss how binding CPUs to processes can further boost program performance.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.