
Linux Memory Management: Bootmem, Buddy, and Slab Allocators Explained

This article provides an in‑depth overview of Linux memory management, covering the three primary allocators—bootmem, buddy, and slab—including their data structures, initialization, allocation and free mechanisms, code examples, and the role of memblock and zone watermarks.


Linux Memory Management Overview

Linux memory management is the process by which the operating system efficiently utilizes and manages memory resources, encompassing memory allocation, page replacement, and memory mapping. The kernel employs three major allocators: the boot memory allocator (bootmem), the buddy allocator, and the slab allocator.

1. Boot Memory Allocator (bootmem)

The boot memory allocator is used early in kernel initialization before the buddy system becomes operational. It relies on a bitmap to track page usage and provides functions for initialization, reservation, freeing, and allocation of memory.

1.1 Data Structure

typedef struct bootmem_data {
    unsigned long node_min_pfn;
    unsigned long node_low_pfn;
    void *node_bootmem_map;
    unsigned long last_end_off;
    unsigned long hint_idx;
    struct list_head list;
} bootmem_data_t;

Key fields:

node_min_pfn / node_low_pfn – start and end PFN of the node.

node_bootmem_map – bitmap where each bit represents a page (0 = free, 1 = reserved).

last_end_off – offset of the last allocated region.

hint_idx – index hint for the next allocation.

list – links all bootmem_data structures.

1.2 Core Functions

Initialization

static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
    unsigned long mapstart, unsigned long start, unsigned long end)
{
    unsigned long mapsize;
    mminit_validate_memmodel_limits(&start, &end);
    bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));
    bdata->node_min_pfn = start;
    bdata->node_low_pfn = end;
    link_bootmem(bdata);
    mapsize = bootmap_bytes(end - start);
    memset(bdata->node_bootmem_map, 0xff, mapsize);
    bdebug("nid=%td start=%lx map=%lx end=%lx mapsize=%lx\n",
        bdata - bootmem_node_data, start, mapstart, end, mapsize);
    return mapsize;
}

Linking

static void __init link_bootmem(bootmem_data_t *bdata)
{
    bootmem_data_t *ent;
    list_for_each_entry(ent, &bdata_list, list) {
        if (bdata->node_min_pfn < ent->node_min_pfn) {
            list_add_tail(&bdata->list, &ent->list);
            return;
        }
    }
    list_add_tail(&bdata->list, &bdata_list);
}

Freeing Reserved Pages

static void __init __free(bootmem_data_t *bdata,
            unsigned long sidx, unsigned long eidx)
{
    unsigned long idx;
    if (bdata->hint_idx > sidx)
        bdata->hint_idx = sidx;
    for (idx = sidx; idx < eidx; idx++)
        if (!test_and_clear_bit(idx, bdata->node_bootmem_map))
            BUG();
}

Reserving Pages

static int __init __reserve(bootmem_data_t *bdata, unsigned long sidx,
            unsigned long eidx, int flags)
{
    unsigned long idx;
    int exclusive = flags & BOOTMEM_EXCLUSIVE;
    for (idx = sidx; idx < eidx; idx++)
        if (test_and_set_bit(idx, bdata->node_bootmem_map)) {
            if (exclusive) {
                __free(bdata, sidx, idx);
                return -EBUSY;
            }
            bdebug("silent double reserve of PFN %lx\n", idx + bdata->node_min_pfn);
        }
    return 0;
}

Allocation

static void * __init alloc_bootmem_bdata(struct bootmem_data *bdata,
                    unsigned long size, unsigned long align,
                    unsigned long goal, unsigned long limit)
{
    unsigned long fallback = 0;
    unsigned long min, max, start, sidx, midx, step;
    void *region;
    // ... (logic to find a suitable zero-bit range, align it, and reserve it)
    return region;
}

In modern kernels the bootmem allocator has been superseded by the more scalable memblock allocator.

2. Memblock Allocator

Memblock is the modern early-boot memory allocator that replaced bootmem. It manages two main types of regions: memory (available RAM) and reserved (areas that must not be used for general allocation). The core data structures are struct memblock_type and struct memblock_region.

2.1 Data Structures

struct memblock_type {
    unsigned long cnt;    /* number of regions */
    unsigned long max;    /* allocated array size */
    phys_addr_t total_size;    /* total size of all regions */
    struct memblock_region *regions;
    char *name;
};

struct memblock_region {
    phys_addr_t base;
    phys_addr_t size;
    unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};

Key operations include adding, removing, and iterating over regions.

2.2 Adding a Region

int __init_memblock memblock_add_range(struct memblock_type *type,
                phys_addr_t base, phys_addr_t size,
                int nid, unsigned long flags)
{
    // Handles empty array case, expands the region array if needed,
    // and inserts the new region while preserving order.
    // ... (implementation details)
    return 0;
}

2.3 Removing a Region

static int __init_memblock memblock_remove_range(struct memblock_type *type,
                      phys_addr_t base, phys_addr_t size)
{
    int start_rgn, end_rgn, i, ret;
    ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
    if (ret)
        return ret;
    for (i = end_rgn - 1; i >= start_rgn; i--)
        memblock_remove_region(type, i);
    return 0;
}

Memblock also provides functions such as memblock_find_in_range and memblock_reserve that are used by later allocators.

3. Buddy Allocator

Once early boot completes and memblock hands memory over, the buddy system manages physical pages. It groups contiguous pages into power-of-two blocks, where an order of n means a block of 2ⁿ pages. Allocation splits larger blocks until a suitably sized block is obtained; freeing merges a block with its buddy when both are free.

3.1 Allocation Process

The allocator searches the free list of the requested order. If none are available, it looks at higher orders, splits a larger block, places the remainder into the appropriate lower‑order free list, and returns the allocated block.

static __always_inline struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
            int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = list_first_entry_or_null(&area->free_list[migratetype],
                  struct page, lru);
        if (!page)
            continue;
        list_del(&page->lru);
        rmv_page_order(page);
        area->nr_free--;
        expand(zone, page, order, current_order, area, migratetype);
        set_pcppage_migratetype(page, migratetype);
        return page;
    }
    return NULL;
}

3.2 Watermarks and Zones

Each memory zone (DMA, DMA32, Normal, etc.) defines three watermarks: high, low, and min. An allocation checks whether the number of free pages remaining after the request stays above the selected watermark; otherwise it may trigger reclaim or fall back to other zones.

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

Zone selection is performed via zonelists, which can be ordered by node distance or zone priority, providing the fallback mechanism when the preferred zone cannot satisfy the request.

4. Slab Allocator

The slab allocator builds on top of the buddy system to efficiently allocate small objects (e.g., task_struct, inode). It creates a cache (kmem_cache) for each object type, where each cache consists of one or more slabs—contiguous pages divided into equal-sized objects.

4.1 Core Data Structures

struct kmem_cache {
    struct kmem_cache_cpu __percpu *cpu_slab;   /* per‑CPU fast path */
    unsigned long flags;
    unsigned long min_partial;
    int size;                /* object size including metadata */
    int object_size;         /* size without metadata */
    int offset;              /* free‑pointer offset */
    struct kmem_cache_order_objects oo; /* preferred order */
    struct kmem_cache_order_objects max;
    struct kmem_cache_order_objects min;
    gfp_t allocflags;
    int refcount;
    void (*ctor)(void *);
    const char *name;
    struct list_head list;
    struct kmem_cache_node *node[MAX_NUMNODES];
};

struct kmem_cache_cpu {
    void **freelist;   /* next free object */
    unsigned long tid; /* transaction id */
    struct page *page; /* current slab */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    struct page *partial; /* partially used slab */
#endif
};

struct kmem_cache_node {
    spinlock_t list_lock;
    struct list_head slabs_partial; /* partially used slabs */
    struct list_head slabs_full;    /* fully used slabs */
    struct list_head slabs_free;    /* empty slabs */
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;
    struct array_cache *shared;
    struct alien_cache **alien;
    unsigned long next_reap;
    int free_touched;
};

4.2 Allocation Path

The fast path obtains an object from the per-CPU freelist. If the per-CPU slab is empty, the allocator falls back to the node’s partial list, then to the global slab lists, and finally allocates a new slab from the buddy allocator.

static __always_inline void *slab_alloc_node(struct kmem_cache *s,
    gfp_t gfpflags, int node, unsigned long addr)
{
    void *object;
    struct kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
    object = c->freelist;
    if (unlikely(!object || !node_match(c->page, node)))
        object = __slab_alloc(s, gfpflags, node, addr, c);
    return object;
}

4.3 Slab Creation

When no suitable slab exists, a new slab is allocated from the buddy system using alloc_slab_page , which respects the cache’s preferred order and may fall back to a lower order if memory is fragmented.

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
    struct kmem_cache_order_objects oo = s->oo;
    struct page *page = alloc_slab_page(s, flags, node, oo);
    if (unlikely(!page)) {
        oo = s->min;
        page = alloc_slab_page(s, flags, node, oo);
        if (unlikely(!page))
            return NULL;
    }
    return page;
}

4.4 User API

Kernel code allocates objects via kmalloc / kfree (wrappers around the slab cache) or via explicit cache functions:

void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *obj);
struct kmem_cache *kmem_cache_create(const char *name, size_t size,
    size_t align, unsigned long flags, void (*ctor)(void *));
void *kmem_cache_alloc(struct kmem_cache *cache, gfp_t flags);
void kmem_cache_free(struct kmem_cache *cache, void *obj);

Statistics and debugging information can be inspected through /proc/slabinfo. The various GFP_* flags (e.g., GFP_KERNEL, GFP_ATOMIC, GFP_DMA) control allocation behavior such as whether the caller may sleep, whether I/O is permitted, and which zone is preferred.

5. Summary

Linux’s memory management hierarchy starts with the early‑boot bootmem allocator, transitions to the flexible memblock system, then relies on the buddy allocator for page‑level management, and finally uses the slab allocator to efficiently serve small object allocations. Together they provide scalability, low fragmentation, and fast allocation paths essential for kernel performance.

Tags: Memory Management, Kernel, Linux, Slab Allocator, Buddy Allocator, Allocators
Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
