
Design and Implementation of a High‑Concurrency Memory Pool in C++

This article presents a comprehensive design and implementation of a high‑concurrency memory pool in C++, covering concepts such as fixed‑size block allocation, thread‑local caches, central and page caches, lock‑free techniques, span management, and performance benchmarking against standard malloc.

Deepin Linux

The article introduces memory pools (fixed‑size‑block allocation) as an alternative to direct new / malloc calls, explaining how pre‑allocating large memory chunks reduces fragmentation and improves performance.

Basic concepts cover the pool itself, dynamic allocation, internal fragmentation (bytes wasted inside an allocated block, e.g. by alignment padding), external fragmentation (free memory scattered in pieces too small to satisfy a request), and the role of malloc as the language's built‑in general‑purpose pool.

Fixed‑size memory pool design follows three steps: opening memory with malloc or system page allocation, allocating objects by carving them off the front of the large block, and releasing objects back onto a free list whose links are stored in the first few bytes of each freed block.
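Step one, "opening memory", can bypass malloc and take pages straight from the OS. The SystemAlloc helper used by the code below is not shown in the article; a plausible Linux sketch using mmap, assuming the project's 8 KB (1 << 13) page size:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <new>

// Hypothetical SystemAlloc: fetch kpage pages of 8 KB each directly from the OS.
inline void* SystemAlloc(size_t kpage) {
    void* ptr = mmap(nullptr, kpage << 13, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        throw std::bad_alloc();
    return ptr;
}
```

On Windows the same role would be played by VirtualAlloc.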

#pragma once
#include <iostream>   // cout, endl
#include <new>        // placement new, std::bad_alloc
#include <cstddef>    // size_t
using std::cout;
using std::endl;

template <class T>
class myObject {
public:
    T* New() {
        T* obj = nullptr;
        if (_freeList) {
            void* next = *(void**)_freeList;
            obj = (T*)_freeList;
            _freeList = next;
        } else {
            if (_remainBytes < sizeof(T)) {
                _remainBytes = 128 * 1024;  // refill 128 KB at a time
                // SystemAlloc requests pages from the OS; >> 13 converts bytes to 8 KB pages
                _memory = (char*)SystemAlloc(_remainBytes >> 13);
                if (_memory == nullptr) throw std::bad_alloc();
            }
            obj = (T*)_memory;
            // carve at least sizeof(void*) so a freed block can hold the next pointer
            size_t objSize = sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
            _memory += objSize;
            _remainBytes -= objSize;
        }
        new(obj) T;
        return obj;
    }
    void Delete(T* obj) {
        obj->~T();                  // run the destructor explicitly
        *(void**)obj = _freeList;   // head-insert the block into the free list
        _freeList = obj;
    }
    }
private:
    char* _memory = nullptr;
    size_t _remainBytes = 0;
    void* _freeList = nullptr;
};

Thread‑local cache (ThreadCache) provides lock‑free allocation for small objects. Each thread owns a ThreadCache with an array of FreeList buckets indexed by size class. Allocation first checks the local free list; if empty, it fetches a batch from the central cache using a slow‑start algorithm.
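The FreeList bucket type is not listed in the article; a minimal sketch, inferred from the calls the ThreadCache code below makes (the shape is an assumption, not the project's exact class):

```cpp
#include <cassert>
#include <cstddef>

// Reads/writes the next pointer stored in an object's first bytes.
static void*& NextObj(void* obj) { return *(void**)obj; }

// Minimal FreeList bucket: a singly linked list threaded through the blocks.
class FreeList {
public:
    void Push(void* obj) {          // head-insert one object
        NextObj(obj) = _freeList;
        _freeList = obj;
        ++_size;
    }
    void* Pop() {                   // remove and return the head object
        assert(_freeList != nullptr);
        void* obj = _freeList;
        _freeList = NextObj(obj);
        --_size;
        return obj;
    }
    void PushRange(void* start, void* end, size_t n) {  // splice a linked batch
        NextObj(end) = _freeList;
        _freeList = start;
        _size += n;
    }
    bool Empty() const { return _freeList == nullptr; }
    size_t& MaxSize() { return _maxSize; }  // slow-start cap, grown by the caller
private:
    void* _freeList = nullptr;
    size_t _size = 0;
    size_t _maxSize = 1;  // starting batch size for slow start
};
```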

void* ThreadCache::Allocate(size_t size) {
    assert(size <= MAX_BYTES);
    size_t alignSize = SizeClass::RoundUp(size);
    size_t index = SizeClass::Index(size);
    if (!_freeLists[index].Empty())
        return _freeLists[index].Pop();
    else
        return FetchFromCentralCache(index, alignSize);
}

void* ThreadCache::FetchFromCentralCache(size_t index, size_t size) {
    size_t batchNum = std::min(_freeLists[index].MaxSize(), SizeClass::NumMoveSize(size));
    // slow start: raise the cap by one each time, so hot buckets fetch ever larger batches
    if (batchNum == _freeLists[index].MaxSize())
        ++_freeLists[index].MaxSize();
    void* start = nullptr; void* end = nullptr;
    size_t actualNum = CentralCache::GetInstance()->FetchRangeObj(start, end, batchNum, size);
    if (actualNum > 1)
        _freeLists[index].PushRange(NextObj(start), end, actualNum - 1);
    return start;
}
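Allocate and FetchFromCentralCache lean on SizeClass helpers the article does not list. A deliberately simplified sketch (the real tcmalloc-style rules use tiered alignment: 8 bytes up to 128 bytes, 16 bytes up to 1 KB, and so on, rather than a flat 8-byte step):

```cpp
#include <algorithm>
#include <cstddef>

// Simplified SizeClass: align every small request to 8 bytes.
struct SizeClass {
    // Round a request up to the next multiple of 8.
    static size_t RoundUp(size_t size) { return (size + 7) & ~size_t(7); }
    // Map an aligned size to its free-list bucket index (8 -> 0, 16 -> 1, ...).
    static size_t Index(size_t size) { return (RoundUp(size) >> 3) - 1; }
    // Upper bound on how many objects to move between caches in one batch:
    // aim for roughly 64 KB of objects, clamped to [2, 512].
    static size_t NumMoveSize(size_t size) {
        size_t num = 64 * 1024 / size;
        return std::min<size_t>(512, std::max<size_t>(2, num));
    }
};
```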

Central cache (CentralCache) holds spans of memory obtained from the page cache. It supplies batches of objects to thread caches and receives returned objects for reuse. Span selection, locking, and batch fetching are handled here.

size_t CentralCache::FetchRangeObj(void*& start, void*& end, size_t batchNum, size_t size) {
    size_t index = SizeClass::Index(size);
    _spanLists[index]._mtx.lock();
    Span* span = GetOneSpan(_spanLists[index], size);
    assert(span && span->_freeList);
    start = span->_freeList;
    end = start;
    size_t i = 0, actualNum = 1;
    // take up to batchNum objects, stopping early if the span's list runs out
    while (i < batchNum - 1 && NextObj(end) != nullptr) {
        end = NextObj(end);
        ++i; ++actualNum;
    }
    span->_freeList = NextObj(end);  // detach the taken range from the span
    NextObj(end) = nullptr;
    span->_useCount += actualNum;
    _spanLists[index]._mtx.unlock();
    return actualNum;
}

Page cache (PageCache) manages memory at the page granularity (8 KB pages). It allocates large spans from the OS, splits them into smaller spans when needed, and merges adjacent free spans to reduce external fragmentation.
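The Span and SpanList types threaded through the code can be pictured as follows; the field set is inferred from the snippets in this article, and the full project adds more bookkeeping:

```cpp
#include <cstddef>
#include <mutex>

using PAGE_ID = size_t;

// Descriptor for a run of contiguous 8 KB pages, passed between all three caches.
struct Span {
    PAGE_ID _pageId = 0;         // number of the span's first page
    size_t  _n = 0;              // number of pages in the span
    Span*   _next = nullptr;     // doubly linked list hooks
    Span*   _prev = nullptr;
    void*   _freeList = nullptr; // carved objects not yet handed to a ThreadCache
    size_t  _useCount = 0;       // objects currently checked out
    size_t  _objSize = 0;        // size class this span serves
    bool    _isUse = false;      // prevents merging a span that is in use
};

// Each CentralCache bucket is a SpanList guarded by its own bucket lock (_mtx).
struct SpanList {
    Span* _head = nullptr;
    std::mutex _mtx;
};
```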

Span* PageCache::NewSpan(size_t k) {
    assert(k > 0);
    if (k > NPAGES - 1) {
        void* ptr = SystemAlloc(k);
        Span* span = _spanPool.New();
        span->_pageId = ((PAGE_ID)ptr >> PAGE_SHIFT);
        span->_n = k;
        _idSpanMap.set(span->_pageId, span);
        return span;
    }
    if (!_spanLists[k].Empty())
        return _spanLists[k].PopFront();
    // Otherwise: search larger buckets and split one of their spans, or (omitted
    // here) allocate a fresh 128-page span from the OS, insert it, and retry.
    // ... (omitted for brevity) ...
    return NewSpan(k);
}
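The omitted splitting step is mostly page-id arithmetic: cut k pages off the front of a larger span and return the remainder to its bucket. A minimal sketch with a stripped-down Span:

```cpp
#include <cassert>
#include <cstddef>

using PAGE_ID = size_t;

// Stripped-down Span for illustration: just a start page and a page count.
struct Span {
    PAGE_ID _pageId = 0;
    size_t  _n = 0;
};

// Split an n-page span into a k-page front span (returned) and the remainder,
// which stays in `big` and would be re-inserted into bucket _spanLists[big._n].
static Span Split(Span& big, size_t k) {
    assert(big._n > k);
    Span front{big._pageId, k};
    big._pageId += k;  // the remainder now starts k pages later
    big._n -= k;
    return front;
}
```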

Global allocation interface replaces malloc/free with ConcurrentAlloc and ConcurrentFree. For requests larger than MAX_BYTES (256 KB) it allocates whole pages via PageCache; otherwise it uses the thread‑local cache.

static void* ConcurrentAlloc(size_t size) {
    if (size > MAX_BYTES) {
        size_t alignSize = SizeClass::RoundUp(size);
        size_t kPage = alignSize >> PAGE_SHIFT;
        Span* span = PageCache::GetInstance()->NewSpan(kPage);
        span->_objSize = size;
        return (void*)(span->_pageId << PAGE_SHIFT);
    } else {
        if (pTLSThreadCache == nullptr) {
            // a static fixed-size pool hands out ThreadCache objects without calling malloc
            static myObject<ThreadCache> tcPool;
            pTLSThreadCache = tcPool.New();
        }
        return pTLSThreadCache->Allocate(size);
    }
}

static void ConcurrentFree(void* ptr) {
    Span* span = PageCache::GetInstance()->MapObjectToSpan(ptr);
    size_t size = span->_objSize;
    if (size > MAX_BYTES) {
        PageCache::GetInstance()->_pageMtx.lock();
        PageCache::GetInstance()->ReleaseSpanToPageCache(span);
        PageCache::GetInstance()->_pageMtx.unlock();
    } else {
        assert(pTLSThreadCache);
        pTLSThreadCache->Deallocate(ptr, size);
    }
}

The implementation also integrates a radix‑tree based page‑to‑span mapping (TCMalloc_PageMap2) to replace a hash map, reducing lock contention and improving scalability on multi‑core systems.
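A simplified two-level page map in the spirit of TCMalloc_PageMap2 can illustrate the idea; this is a sketch, not the project's exact code. The template parameter BITS is the number of significant bits in a page id (e.g. 32 − 13 = 19 for a 32-bit address space with 8 KB pages):

```cpp
#include <cstddef>
#include <cstdlib>

// Two-level radix tree mapping a page id to an opaque span pointer.
// Lookups are lock-free reads once a leaf exists, which is the point of
// replacing a locked hash map.
template <int BITS>
class PageMap2 {
    static const int ROOT_BITS = 5;
    static const int ROOT_LEN  = 1 << ROOT_BITS;
    static const int LEAF_BITS = BITS - ROOT_BITS;
    static const int LEAF_LEN  = 1 << LEAF_BITS;
    struct Leaf { void* values[LEAF_LEN]; };
    Leaf* _root[ROOT_LEN] = {};  // top level: pointers to lazily created leaves
public:
    void set(size_t k, void* v) {
        size_t i1 = k >> LEAF_BITS;        // index into the root
        size_t i2 = k & (LEAF_LEN - 1);    // index into the leaf
        if (_root[i1] == nullptr)
            _root[i1] = (Leaf*)std::calloc(1, sizeof(Leaf));  // leaked in this sketch
        _root[i1]->values[i2] = v;
    }
    void* get(size_t k) const {
        size_t i1 = k >> LEAF_BITS;
        size_t i2 = k & (LEAF_LEN - 1);
        return _root[i1] ? _root[i1]->values[i2] : nullptr;
    }
};
```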

Performance tests compare the custom allocator against the system malloc under multi‑threaded workloads. Results show a significant reduction in allocation/deallocation time, demonstrating the efficiency gains of the three‑level cache design.
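The shape of such a benchmark is easy to reproduce. This hypothetical harness times raw std::malloc/std::free; swapping those two calls for ConcurrentAlloc/ConcurrentFree yields the comparison the article describes (timings vary by machine, so no numbers are claimed here):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// nthreads workers, each performing rounds x ntimes alloc/free pairs of mixed
// small sizes; returns total elapsed wall-clock milliseconds.
inline long long BenchAllocMs(size_t nthreads, size_t rounds, size_t ntimes) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([=] {
            std::vector<void*> v(ntimes);
            for (size_t r = 0; r < rounds; ++r) {
                for (size_t i = 0; i < ntimes; ++i)
                    v[i] = std::malloc((i % 8 + 1) * 16);  // 16..128 byte objects
                for (size_t i = 0; i < ntimes; ++i)
                    std::free(v[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
}
```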

Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
