
Mastering Linux Memory: Reclaim, Huge Pages, and NUMA Optimization

This article explains common Linux memory‑related performance bottlenecks—such as memory reclamation, page‑cache pressure, huge‑page usage, and cross‑NUMA access—and provides practical tuning methods to improve latency and throughput on servers and applications.

ByteDance SYS Tech

Introduction

Performance problems often manifest as slow UI response on phones or missed service‑level objectives on servers. In Linux, memory is a primary factor, with issues like memory reclaim, increased page faults, and cross‑NUMA accesses degrading user‑visible performance.

Memory Reclamation

The kernel caches disk data in page cache to speed up reads. When memory is scarce, it reclaims this cache, which can cause noticeable latency if the reclaimed pages are needed again.

Memory reclamation operates at two levels: the whole system and memory cgroups.

Per‑zone Watermarks

Three watermarks (min, low, high) control reclamation. When free memory falls below low, kswapd reclaims asynchronously in the background until free memory reaches high. When free memory falls below min, allocations stall in direct (synchronous) reclaim, adding latency and risking OOM kills.

The watermarks can be tuned via /proc/sys/vm/watermark_scale_factor, whose valid range is 0-1000 (default 10). A larger value raises the low watermark, so asynchronous reclaim starts earlier, which helps page-cache-heavy workloads.
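As a back-of-the-envelope check (assuming a 64 GiB machine; the factor value 100 is illustrative), the watermark gap implied by a given scale factor can be computed directly, since the unit is 1/10000 of memory:

```shell
# watermark_scale_factor is in units of 1/10000 of memory:
# the default of 10 means the gap between watermarks is ~0.1% of memory
mem_kib=$((64 * 1024 * 1024))   # assume a 64 GiB machine
factor=100                      # proposed setting (default is 10)
gap_kib=$((mem_kib * factor / 10000))
echo "watermark gap: ${gap_kib} KiB"   # 671088 KiB, roughly 655 MiB

# Applying the setting requires root:
# echo 100 > /proc/sys/vm/watermark_scale_factor
```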

Figure 1. per‑zone watermark

Memory cgroup Reclaim

When a memory cgroup reaches its limit, allocations stall while the kernel reclaims within the cgroup. Since kernel 5.19, the memory.reclaim interface lets userspace trigger proactive reclaim, reducing the chance of stalling on the allocation path.
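A sketch of proactive reclaim through memory.reclaim, assuming a cgroup-v2 hierarchy mounted at /sys/fs/cgroup and a hypothetical cgroup named example (requires root and kernel 5.19 or later):

```shell
# Ask the kernel to reclaim roughly 1 GiB from the cgroup "example"
echo 1G > /sys/fs/cgroup/example/memory.reclaim

# Check the cgroup's current memory usage afterwards
cat /sys/fs/cgroup/example/memory.current
```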

Huge Pages

Linux allocates memory lazily: the first access to a page triggers a page fault that allocates a 4 KB base page. Huge pages (2 MB on x86-64) reduce both the fault count and TLB pressure, which can dramatically speed up allocation and address translation, at the cost of higher initialization time and potentially higher memory usage.
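To see why the fault count drops, compare how many pages are needed to map the same region at each page size (simple arithmetic, no kernel interaction):

```shell
# Pages required to map 1 GiB with 4 KiB base pages vs 2 MiB huge pages
gib=$((1024 * 1024 * 1024))
base_pages=$((gib / 4096))                 # 4 KiB base pages
huge_pages=$((gib / (2 * 1024 * 1024)))    # 2 MiB huge pages
echo "4 KiB pages: $base_pages"            # 262144 faults in the worst case
echo "2 MiB pages: $huge_pages"            # 512 faults
```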

Fault counts can be measured with <code>perf stat -e page-faults -p <pid> -- sleep 5</code>, which counts the page faults of process <pid> over five seconds.

Static Huge Pages

Static huge pages (HugeTLB) are reserved at boot via the kernel command line, e.g. hugepagesz=2M hugepages=512, or at runtime via /proc/sys/vm/nr_hugepages and the /sys/kernel/mm/hugepages interfaces.

<code>echo 20 > /proc/sys/vm/nr_hugepages</code>

Applications can allocate them with mmap(MAP_HUGETLB) or use libhugetlbfs to avoid code changes.
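A sketch of reserving static huge pages at runtime and exposing them through a hugetlbfs mount (requires root; the mount point /mnt/huge is an arbitrary choice):

```shell
# Reserve 512 huge pages of the default size (2 MiB on most x86-64 systems)
echo 512 > /proc/sys/vm/nr_hugepages

# Verify the reservation (HugePages_Total, HugePages_Free, etc.)
grep -i hugepages /proc/meminfo

# Mount hugetlbfs so applications can mmap files backed by huge pages
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
```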

Drawbacks: explicit reservation, potential OOM if over‑reserved, and higher memory consumption.

Transparent Huge Pages (THP)

In THP's always mode, the kernel tries to allocate a huge page on each fault; if that fails, it falls back to 4 KB pages, and the khugepaged thread later collapses them into huge pages in the background. THP can instead be set to madvise mode, where only applications that explicitly request huge pages via madvise(MADV_HUGEPAGE) get them.

<code>echo madvise > /sys/kernel/mm/transparent_hugepage/enabled</code>

THP may increase memory usage, cause reclamation spikes, and hold long-lasting write locks on mmap_lock, hurting performance.
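The current THP mode and actual huge-page usage can be checked from sysfs and procfs (read-only; smaps_rollup requires kernel 4.14 or later):

```shell
# Active THP mode: the bracketed value is current, e.g. [always] madvise never
cat /sys/kernel/mm/transparent_hugepage/enabled

# System-wide anonymous huge page usage
grep AnonHugePages /proc/meminfo

# Per-process usage (substitute a real PID for <pid>):
# grep AnonHugePages /proc/<pid>/smaps_rollup
```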

mmap_lock

mmap_lock protects the core memory-management structures of a process. Write-lock contention can arise from mmap/munmap, mremap, and THP collapsing. Releasing memory with madvise(MADV_DONTNEED) or madvise(MADV_FREE) takes mmap_lock in read mode rather than the write mode munmap needs, reducing write-lock contention.

Processes stuck in uninterruptible sleep (D state), often a symptom of lock contention, can be identified and their kernel stacks dumped:

<code>for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do echo $pid; cat /proc/$pid/stack; done</code>

Tracing can be done with bpftrace:

<code>bpftrace -e 'tracepoint:mmap_lock:mmap_lock_start_locking /args->write == true/{ @[comm, kstack] = count();}'</code>

Cross‑NUMA Memory Access

Local-node memory access is faster than remote access. Remote accesses can be monitored with numastat (see the numa_miss and numa_foreign counters); for a live, sorted view:

<code>watch -n 1 numastat -s</code>

Node Binding

Bind a process to a specific node and its CPUs with numactl to force local memory allocation, though this can limit memory availability and create CPU bottlenecks.
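A sketch of node binding with numactl (node 0 and the program name ./server are illustrative):

```shell
# Show the node topology and free memory per node
numactl --hardware

# Run a workload with both CPUs and memory pinned to node 0
numactl --cpunodebind=0 --membind=0 ./server
```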

NUMA Balancing

Enable kernel‑wide automatic page migration via /proc/sys/kernel/numa_balancing or the numa_balancing= cmdline flag. Migration incurs page‑fault overhead and may increase cache misses.

Enable it only after confirming the workload benefits.
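Checking and toggling NUMA balancing at runtime (writing requires root):

```shell
# Current state: 1 means automatic page migration is enabled
cat /proc/sys/kernel/numa_balancing

# Enable it after confirming the workload benefits
echo 1 > /proc/sys/kernel/numa_balancing
```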

Conclusion

Memory tuning involves trade‑offs; no single setting fits all workloads. Analyze specific bottlenecks before applying reclamation thresholds, huge‑page policies, or NUMA optimizations, and avoid aggressive changes when performance is stable.

Tags: Memory Management, Performance Tuning, Linux, NUMA, Huge Pages
Written by

ByteDance SYS Tech

Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.
