
Understanding Linux Kernel Physical Memory Management and the memblock Allocator

Linux does not expose all physical memory to users; this article explains why the kernel reserves memory, detailing the early memblock allocator, crash kernel reservations, page structure overhead, and the handoff to the buddy system, illustrating how these mechanisms consume several hundred megabytes of RAM.


1. Early memblock allocator

The kernel uses two physical-memory managers: an early allocator used during boot and the buddy system used once the system is up. Early Linux kernels used bootmem, but since 2010 it has been replaced by the memblock allocator. On the example QEMU VM, dmidecode reports 16 GiB of installed physical memory:

# dmidecode
Memory Device
  Total Width: Unknown
  Data Width: Unknown
  Size: 16384 MB
  Manufacturer: QEMU
......

After the hardware-detection phase, the kernel calls e820__memory_setup to store the detected ranges in the global e820_table, then builds the memblock allocator with e820__memblock_setup.

//file:arch/x86/kernel/setup.c
void __init setup_arch(char **cmdline_p) {
    ...
    e820__memory_setup();
    e820__memblock_setup();
    ...
}

The memblock structure simply keeps two arrays – one for usable regions and one for reserved regions – each entry describing a physical address range.

//file:mm/memblock.c
struct memblock memblock __initdata_memblock = {
    .memory.regions  = memblock_memory_init_regions,
    .memory.cnt      = 1, /* empty dummy entry */
    .memory.max      = INIT_MEMBLOCK_MEMORY_REGIONS,
    .memory.name     = "memory",

    .reserved.regions = memblock_reserved_init_regions,
    .reserved.cnt     = 1, /* empty dummy entry */
    .reserved.max     = INIT_MEMBLOCK_RESERVED_REGIONS,
    .reserved.name    = "reserved",

    .bottom_up        = false,
    .current_limit   = MEMBLOCK_ALLOC_ANYWHERE,
};

#define INIT_MEMBLOCK_REGIONS   128
#define INIT_MEMBLOCK_RESERVED_REGIONS  INIT_MEMBLOCK_REGIONS
#define INIT_MEMBLOCK_MEMORY_REGIONS    INIT_MEMBLOCK_REGIONS

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;

During creation, the kernel iterates over every entry in the e820_table. Entries of type E820_TYPE_SOFT_RESERVED are placed on the reserved list via memblock_reserve; usable RAM entries are added to the memory list via memblock_add. After the loop the allocator prints its state with memblock_dump_all().

//file:arch/x86/kernel/e820.c
void __init e820__memblock_setup(void) {
    for (i = 0; i < e820_table->nr_entries; i++) {
        struct e820_entry *entry = &e820_table->entries[i];

        if (entry->type == E820_TYPE_SOFT_RESERVED)
            memblock_reserve(entry->addr, entry->size);

        /* only usable RAM goes onto the memory list */
        if (entry->type != E820_TYPE_RAM &&
            entry->type != E820_TYPE_RESERVED_KERN)
            continue;

        memblock_add(entry->addr, entry->size);
    }
    memblock_dump_all();
}

Enabling the debug output requires adding memblock=debug to the kernel command line (e.g., in /boot/grub/grub.cfg ) and rebooting. The resulting dmesg shows the total physical memory, the amount reserved, and each region’s start/end addresses.

# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.143.bsk.8-amd64 ... memblock=debug
[    0.010238] MEMBLOCK configuration:
[    0.010239]  memory size = 0x00000003fff78c00 reserved size = 0x0000000003c6d144
[    0.010240]  memory.cnt  = 0x3
[    0.010241]  memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes flags: 0x0
[    0.010243]  memory[0x1] [0x0000000000100000-0x00000000bffd9fff], 0x00000000bfeda000 bytes flags: 0x0
[    0.010244]  memory[0x2] [0x0000000100000000-0x000000043fffffff], 0x0000000340000000 bytes flags: 0x0
[    0.010245]  reserved.cnt  = 0x4
[    0.010246]  reserved[0x0] [0x0000000000000000-0x0000000000000fff], 0x0000000000001000 bytes flags: 0x0
[    0.010247]  reserved[0x1] [0x00000000000f5a40-0x00000000000f5b83], 0x0000000000000144 bytes flags: 0x0
[    0.010248]  reserved[0x2] [0x0000000001000000-0x000000000340cfff], 0x000000000240d000 bytes flags: 0x0
[    0.010249]  reserved[0x3] [0x0000000034f31000-0x000000003678ffff], 0x000000000185f000 bytes flags: 0x0

2. Memory consumption by kernel subsystems

2.1 Crash kernel reservation

Linux reserves memory for a second capture kernel (the crash kernel) used by kdump. The reservation happens early, in reserve_crashkernel_low and reserve_crashkernel, which allocate from the memblock allocator and log the amount reserved.

//file:arch/x86/kernel/setup.c
static int __init reserve_crashkernel_low(void) {
    ...
    low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, CRASH_ADDR_LOW_MAX);
    pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (low RAM limit: %ldMB)\n",
            (unsigned long)(low_size >> 20),
            (unsigned long)(low_base >> 20),
            (unsigned long)(low_mem_limit >> 20));
    ...
}

static void __init reserve_crashkernel(void) {
    ...
    crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                           CRASH_ALIGN, CRASH_ADDR_HIGH_MAX);
    pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB)\n",
            (unsigned long)(crash_size >> 20),
            (unsigned long)(crash_base >> 20),
            (unsigned long)(total_mem >> 20));
    ...
}

On the example VM the logs show two 128 MiB reservations (one in low memory, one in high memory), totaling 256 MiB that remain unavailable to user programs. The reservation is also visible as the "Crash kernel" range in /proc/iomem.

[    0.010832] Reserving 128MB of low memory at 2928MB for crashkernel (System low RAM: 3071MB)
[    0.010835] Reserving 128MB of memory at 17264MB for crashkernel (System RAM: 16383MB)

2.2 Page‑structure overhead

Linux manages physical memory in 4 KiB pages. Each page is represented by a struct page (commonly 64 bytes). The allocation of one struct page per page consumes about 1.56 % of total RAM. For a 16 GiB system this amounts to roughly 256 MiB.

//file:include/linux/mm_types.h
struct page {
    unsigned long flags;
    ...
};

The page‑management initialization is performed by paging_init , which is reached via the boot path start_kernel → setup_arch → x86_init.paging.pagetable_init → paging_init . After allocating all struct page objects, the kernel hands the usable memory over to the buddy allocator.

start_kernel
-> setup_arch
   -> e820__memory_setup   // store e820 table
   -> e820__memblock_setup // build memblock allocator
   -> x86_init.paging.pagetable_init (native_pagetable_init)
      -> paging_init        // initialise page management
-> mm_init
   -> mem_init
      -> memblock_free_all  // hand over to buddy system

3. Handing memory to the buddy system

When memblock_free_all runs, it first initializes the struct pages of the reserved regions, then iterates over all free ranges and calls __free_memory_core to hand them to the buddy allocator. The number of pages handed over is added to the global _totalram_pages counter.

//file:mm/memblock.c
void __init memblock_free_all(void) {
    unsigned long pages;
    ...
    pages = free_low_memory_core_early();
    totalram_pages_add(pages);
}

static unsigned long __init free_low_memory_core_early(void) {
    unsigned long count = 0;
    phys_addr_t start, end;
    u64 i;

    // initialise the struct pages of reserved regions
    memmap_init_reserved_pages();
    // hand usable ranges over to the buddy allocator
    for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL)
        count += __free_memory_core(start, end);
    return count;
}

The _totalram_pages variable is defined in mm/page_alloc.c and exported for other parts of the kernel.

//file:mm/page_alloc.c
atomic_long_t _totalram_pages __read_mostly;
EXPORT_SYMBOL(_totalram_pages);

4. Summary

The Linux kernel never makes the entire physical RAM available to user space; a portion is always consumed by the kernel itself for early memory management, crash‑kernel reservation, page‑structure metadata, NUMA zones, and other internal data structures. Understanding the memblock allocator, the crash‑kernel reservation, and the page‑structure overhead explains why free -m often reports less memory than the hardware specification.

Beyond the kernel, user-space runtimes carry their own memory-management overhead. For example, Go's runtime allocator, which descends from TCMalloc, uses an mspan structure whose linked-list pointers and bitmap fields are not directly usable by applications.

// src/runtime/mheap.go
type mspan struct {
    next *mspan   // next span in list
    prev *mspan   // previous span

    allocBits  *gcBits // allocation bitmap
    gcmarkBits *gcBits // GC mark bitmap
    ...
}
Tags: Memory Management · Kernel · Linux · Crash Kernel · MemBlock · Page Structures
Written by

Refining Core Development Skills

Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
