Can't Master the Linux Kernel Without Understanding NUMA?
This article explains the core principles of NUMA architecture, how it is deeply integrated into Linux kernel memory management, process scheduling, and system calls, and provides practical commands and real‑world examples to diagnose and optimize NUMA‑related performance issues.
1. What Is NUMA?
NUMA (Non‑Uniform Memory Access) divides memory into multiple nodes, each tightly coupled with a set of CPU cores. A core reaches its own node's memory fastest; accesses to another node's memory typically take about 1.5‑2× as long and see lower bandwidth, with the exact penalty depending on the hardware.
2. NUMA System Architecture Details
The architecture consists of processor cores, memory controllers, memory nodes, and high‑speed interconnects (e.g., Intel QPI, AMD Infinity Fabric). Local accesses bypass the interconnect, while remote accesses traverse it, adding latency and potential bandwidth bottlenecks.
Local access is faster: shortest path, lowest latency.
Remote access is slower: traverses the interconnect, with latency typically 1.5‑2× that of a local access.
An analogy: each NUMA node is like a building with its own file cabinets (local memory). Retrieving a file from your own cabinet is immediate; fetching from another building requires walking corridors and elevators.
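To see what the remote penalty means for a whole workload, the average latency can be estimated from the share of accesses that leave the local node. The sketch below is only a back-of-the-envelope model: the 80 ns local latency is an assumed figure, not a measurement, and the function name is hypothetical.

```python
def effective_latency_ns(local_ns: float, remote_factor: float,
                         remote_fraction: float) -> float:
    """Average latency when `remote_fraction` of accesses hit a remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_factor
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    # Assumed: 80 ns local latency, remote accesses 1.5x to 2x slower.
    for factor in (1.5, 2.0):
        for fraction in (0.0, 0.25, 0.5, 1.0):
            avg = effective_latency_ns(80.0, factor, fraction)
            print(f"factor={factor} remote={fraction:.0%} avg={avg:.1f} ns")
```

Under these assumptions, a process with half of its accesses going remote at a 2× penalty averages 120 ns instead of 80 ns, which is why the rest of this article focuses on keeping accesses local.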
3. NUMA Implementation in the Linux Kernel
3.1 Memory Management
The kernel automatically detects hardware topology via ACPI tables (SRAT for CPU and memory affinity, SLIT for inter‑node distances) and builds a NUMA node map during boot. The pg_data_t structure describes each node’s layout. Functions such as numa_init set up node descriptors, assign physical memory ranges to nodes, and record the distances between nodes.
Two mechanisms keep memory close to the CPUs that use it:
Preferred local allocation: alloc_pages_current tries to allocate from the current node’s pool first, falling back to remote nodes only when necessary.
Memory migration: the kernel can move pages between nodes (e.g., via migrate_pages) to balance load and reduce remote accesses.
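The local‑first behaviour can be illustrated with a toy model. This is not kernel code: the real allocator walks per‑node zonelists, while the sketch below simply tries nodes in order of distance from the current node. The function names are hypothetical, and the distance table assumes a two‑node machine with node IDs 0 and 1.

```python
from typing import Dict, List, Optional

# Assumed distances for a two-node machine (10 = local, 21 = remote).
DISTANCES: Dict[int, List[int]] = {0: [10, 21], 1: [21, 10]}


def fallback_order(node: int, distances: Dict[int, List[int]]) -> List[int]:
    """Nodes sorted by distance from `node`; the node itself comes first."""
    return sorted(distances, key=lambda other: distances[node][other])


def allocate_page(current_node: int, free_pages: Dict[int, int],
                  distances: Dict[int, List[int]]) -> Optional[int]:
    """Take one page from the nearest node that still has free pages."""
    for node in fallback_order(current_node, distances):
        if free_pages.get(node, 0) > 0:
            free_pages[node] -= 1
            return node
    return None  # every node is exhausted


if __name__ == "__main__":
    free = {0: 1, 1: 2}
    # A task running on node 0 asks for four pages.
    print([allocate_page(0, free, DISTANCES) for _ in range(4)])
```

With one free page on node 0 and two on node 1, the first request is served locally, the next two fall back to node 1, and the fourth fails.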
3.2 Process Scheduling
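The mems_allowed nodemask in task_struct records which nodes a task may allocate memory from. The system calls sched_setaffinity and sched_getaffinity set and query CPU affinity only; they do not bind memory. Because the default policy allocates on the node where the task is running, a process pinned to one node’s CPUs still ends up using mostly local memory. Explicit memory placement is handled by the memory‑policy calls in section 3.3.
The kernel runs a migration thread per CPU; when the scheduler’s load balancer detects an imbalance, it moves tasks to less‑loaded CPUs and prefers moves that stay within the same node. Automatic NUMA balancing goes further: it samples a task’s memory‑access pattern and either moves the task to the node it accesses most frequently or migrates the pages toward the task.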
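Since sched_setaffinity accepts a set of CPUs rather than a node number, pinning a process to a node means looking up that node’s CPUs first. The sketch below uses Python’s Linux‑only os.sched_setaffinity wrapper and reads the CPU list from sysfs; the helper names parse_cpulist and pin_to_node are hypothetical, and the code assumes the machine exposes /sys/devices/system/node.

```python
import os
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Expand a kernel CPU list such as '0-3,8-11' into a set of CPU numbers."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int, pid: int = 0) -> Set[int]:
    """Restrict `pid` (0 = calling process) to the CPUs of one NUMA node."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(pid, cpus)  # wraps the sched_setaffinity system call
    return os.sched_getaffinity(pid)
```

Calling pin_to_node(0) early in a program keeps it on node 0’s CPUs, so memory it touches afterwards is allocated locally under the default policy. Pages allocated before the call stay where they are.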
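On mainline kernels, the access‑pattern‑driven migration is controlled by the kernel.numa_balancing sysctl. The sketch below reads it through /proc; the function name is hypothetical, and the file exists only on kernels built with automatic NUMA balancing support.

```python
from pathlib import Path
from typing import Optional


def numa_balancing_mode(
        path: str = "/proc/sys/kernel/numa_balancing") -> Optional[int]:
    """Return the kernel.numa_balancing value, or None if the file is absent."""
    sysctl = Path(path)
    if not sysctl.is_file():
        return None
    return int(sysctl.read_text().strip())


if __name__ == "__main__":
    mode = numa_balancing_mode()
    if mode is None:
        print("automatic NUMA balancing is not available on this system")
    else:
        print("kernel.numa_balancing =", mode, "(0 means disabled)")
```

Checking this value first is useful when a process has been bound by hand, because automatic balancing and manual binding can otherwise work against each other.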
3.3 NUMA‑Aware System Calls
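The calls get_mempolicy, set_mempolicy, and mbind let applications query or enforce memory policies: set_mempolicy sets the default policy for the calling thread, while mbind applies a policy to a specific address range, for example binding it to chosen nodes. The /proc/<pid>/numa_maps file shows how each of a process’s memory mappings is distributed across nodes.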
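The per‑process numa_maps file lists one mapping per line, with fields of the form N<node>=<pages> giving the page count on each node. The sketch below totals those fields per node. The helper names are hypothetical and the sample lines are illustrative, not captured from a real system. Page counts are in units of each mapping’s page size, so mixed huge‑page and regular mappings would need weighting that is omitted here.

```python
import re
from collections import Counter
from typing import Iterable

NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(lines: Iterable[str]) -> Counter:
    """Sum the N<node>=<pages> fields of numa_maps lines per node."""
    totals: Counter = Counter()
    for line in lines:
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return totals


def read_numa_maps(pid: int) -> Counter:
    """Per-node page totals for a live process (needs read permission)."""
    with open(f"/proc/{pid}/numa_maps") as handle:
        return pages_per_node(handle)


if __name__ == "__main__":
    sample = [  # illustrative lines only
        "00400000 default file=/usr/bin/app mapped=12 N0=12 kernelpagesize_kB=4",
        "7f2c00000000 default anon=300 dirty=300 N0=100 N1=200 kernelpagesize_kB=4",
    ]
    for node, pages in sorted(pages_per_node(sample).items()):
        print(f"node {node}: {pages} pages")
```

A process that runs on node 0 but shows most of its pages under N1 is a candidate for rebinding or page migration.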
4. Application Scenarios
4.1 Databases
Database workloads are memory‑intensive. Experiments with PostgreSQL showed that proper NUMA tuning (using numactl and mpol=prefer_local) increased query throughput by ~20% and reduced transaction latency by ~15%.
InnoDB buffer pools can be split and bound to individual nodes, improving cache hit rates; a large e‑commerce platform reported >30% QPS improvement after such tuning.
4.2 Virtualization
A virtual machine’s vCPUs and memory are ultimately backed by the host’s NUMA nodes. Aligning a VM’s vCPUs and memory to the same physical node avoids “pseudo‑NUMA” penalties. In KVM, using numactl or libvirt node affinity raised VM CPU utilization by ~25% and cut memory latency by ~18%.
Advanced placement algorithms (e.g., MarVNFP) further reduce cross‑node traffic, achieving up to 40% less data transfer and ~30% lower latency compared with baseline methods.
5. Practical Commands to View and Configure NUMA
5.1 View NUMA Topology
# Show NUMA hardware information
numactl --hardware
Typical output on a dual‑node system:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 65408 MB
node 0 free: 42300 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 45600 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
5.2 Check Process NUMA Affinity
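# Show the CPU affinity of the process with PID 1234
taskset -cp 1234
# Show the per-node memory usage of the process with PID 1234
numastat -p 1234
# Show system-wide per-node NUMA allocation statistics
numastat
Key metrics: numa_hit counts allocations that landed on the intended node, numa_miss counts allocations placed on a node other than the one the process preferred, and numa_foreign counts allocations intended for a node but satisfied elsewhere. These are allocation counters, not access counters. A steadily growing numa_miss or numa_foreign indicates a need for tuning.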
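The same counters are exposed per node in /sys/devices/system/node/node<N>/numastat as name and value pairs. The sketch below turns them into a miss ratio, where numa_miss counts allocations that landed on a node other than the preferred one. The function names and the sample numbers are hypothetical.

```python
from typing import Dict


def parse_numastat(text: str) -> Dict[str, int]:
    """Parse 'name value' lines from a per-node numastat file."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def miss_ratio(counters: Dict[str, int]) -> float:
    """Share of allocations on this node that were not on the preferred node."""
    hit = counters.get("numa_hit", 0)
    miss = counters.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


if __name__ == "__main__":
    sample = "numa_hit 900\nnuma_miss 100\nnuma_foreign 40\n"  # made-up values
    counters = parse_numastat(sample)
    print(f"miss ratio: {miss_ratio(counters):.1%}")
```

Because the counters only ever increase, comparing two readings taken some time apart says more about current behaviour than a single snapshot.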
5.3 Bind a Process to a Specific Node
# Bind at launch to node 0 (CPU + memory)
numactl --cpunodebind=0 --membind=0 ./your_application
# Rebind a running process (PID = 1234) to node 0. numactl only applies to programs it launches, so pin the CPUs with taskset and move existing pages with migratepages
taskset -a -cp 0-15 1234
migratepages 1234 1 0
# Bind to specific CPUs within node 0
numactl --physcpubind=0-7 --membind=0 ./your_application
Java services can enable NUMA‑aware heap allocation with a JVM flag (JDK 8u262+):
java -XX:+UseNUMA -jar your_app.jar
6. Common NUMA Pitfalls
Cross‑node memory access bottleneck: A high‑concurrency service on a dual‑node server without NUMA binding may run on node 0 CPUs while allocating memory from node 1, causing latency spikes and low CPU utilization. A real‑world MySQL case showed a 30% QPS increase after binding the process to a single node.
Virtualization “pseudo‑NUMA”: Misaligned vCPU/vMemory leads to frequent remote accesses. Aligning VM resources to a physical node restores performance.
Disabling NUMA is not a cure: Turning NUMA off reverts the system to SMP, which can worsen contention on the shared bus for many‑core CPUs.
The correct approach is to keep NUMA enabled and tune node‑to‑process and memory allocations for maximal local access.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
