Design and Implementation of a High‑Concurrency Memory Pool in C++
This article presents a comprehensive design and implementation of a high‑concurrency memory pool in C++, covering concepts such as fixed‑size block allocation, thread‑local caches, central and page caches, lock‑free techniques, span management, and performance benchmarking against standard malloc.
The article introduces memory pools (fixed‑size‑block allocation) as an alternative to direct new / malloc calls, explaining how pre‑allocating large memory chunks reduces fragmentation and improves performance.
The basic concepts cover what a pool is, dynamic allocation, internal versus external fragmentation, and the role of malloc itself as a general‑purpose memory pool sitting between programs and the OS.
The fixed‑size memory pool design outlines three steps: reserving a large chunk of memory with malloc or a system page allocation, allocating objects by carving them out of that chunk, and releasing objects onto a free list whose links are stored in the first pointer‑sized bytes of each freed block.
#pragma once
#include <iostream>
#include <new>        // placement new, std::bad_alloc
using std::cout;
using std::endl;

// Fixed-size object pool: carves T-sized blocks out of large chunks and
// recycles freed blocks through an intrusive singly linked free list.
template<class T>
class myObject {
public:
    T* New() {
        T* obj = nullptr;
        if (_freeList) {
            // Reuse a freed block; its first pointer-sized bytes hold the next link.
            void* next = *(void**)_freeList;
            obj = (T*)_freeList;
            _freeList = next;
        } else {
            if (_remainBytes < sizeof(T)) {
                // Refill: grab 128 KB of pages from the OS (PAGE_SHIFT is 13, i.e. 8 KB pages).
                _remainBytes = 1024 * 128;
                _memory = (char*)SystemAlloc(_remainBytes >> 13);   // SystemAlloc: VirtualAlloc/mmap wrapper defined elsewhere
                if (_memory == nullptr) throw std::bad_alloc();
            }
            obj = (T*)_memory;
            // Carve at least sizeof(void*) bytes so a freed block can store a link.
            size_t objSize = sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
            _memory += objSize;
            _remainBytes -= objSize;
        }
        new (obj) T;   // placement-construct T in the raw block
        return obj;
    }
    void Delete(T* obj) {
        obj->~T();                    // run the destructor but keep the memory
        *(void**)obj = _freeList;     // push the block onto the free list
        _freeList = obj;
    }
private:
    char* _memory = nullptr;       // unused tail of the current big chunk
    size_t _remainBytes = 0;       // bytes left in the current chunk
    void* _freeList = nullptr;     // head of the freed-block list
};

The thread‑local cache (ThreadCache) provides lock‑free allocation for small objects. Each thread owns a ThreadCache holding an array of FreeList buckets indexed by size class. Allocation first checks the local free list; if it is empty, the cache fetches a batch of objects from the central cache, sizing the batch with a slow‑start algorithm.
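The snippets below rely on a FreeList type and a NextObj helper that this excerpt does not show. A minimal sketch, inferred from how the code calls them (Push, Pop, PushRange, and a MaxSize() that returns a mutable reference for the slow‑start counter; the starting cap of 1 is an assumption):

```cpp
#include <cassert>
#include <cstddef>

// Every free block is at least pointer-sized, so its first bytes can store
// the link to the next free block (an intrusive free list).
static void*& NextObj(void* obj) { return *(void**)obj; }

// Sketch of the FreeList bucket type used by ThreadCache.
class FreeList {
public:
    void Push(void* obj) {
        NextObj(obj) = _freeList;
        _freeList = obj;
        ++_size;
    }
    void* Pop() {
        void* obj = _freeList;
        _freeList = NextObj(obj);
        --_size;
        return obj;
    }
    // Splice a pre-linked chain of n objects [start, end] onto the list.
    void PushRange(void* start, void* end, size_t n) {
        NextObj(end) = _freeList;
        _freeList = start;
        _size += n;
    }
    bool Empty() const { return _freeList == nullptr; }
    size_t& MaxSize() { return _maxSize; }    // mutable: caller bumps it for slow start
private:
    void* _freeList = nullptr;
    size_t _size = 0;
    size_t _maxSize = 1;   // slow-start cap for central-cache batch fetches
};
```

Because the links live inside the freed blocks themselves, the list costs no memory beyond the blocks it tracks.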
void* ThreadCache::Allocate(size_t size) {
    assert(size <= MAX_BYTES);
    size_t alignSize = SizeClass::RoundUp(size);   // round up to the size class
    size_t index = SizeClass::Index(size);         // bucket index for this size
    if (!_freeLists[index].Empty())
        return _freeLists[index].Pop();            // fast path: no locks at all
    else
        return FetchFromCentralCache(index, alignSize);
}

void* ThreadCache::FetchFromCentralCache(size_t index, size_t size) {
    // Slow start: the batch grows by one each time this bucket hits its cap,
    // so rarely used sizes stay small while hot sizes ramp up.
    size_t batchNum = std::min(_freeLists[index].MaxSize(), SizeClass::NumMoveSize(size));
    if (batchNum == _freeLists[index].MaxSize())
        ++_freeLists[index].MaxSize();
    void* start = nullptr;
    void* end = nullptr;
    size_t actualNum = CentralCache::GetInstance()->FetchRangeObj(start, end, batchNum, size);
    if (actualNum > 1)   // hand one object to the caller, cache the rest
        _freeLists[index].PushRange(NextObj(start), end, actualNum - 1);
    return start;
}

The central cache (CentralCache) holds spans of memory obtained from the page cache. It supplies batches of objects to thread caches and receives returned objects for reuse. Span selection, locking, and batch fetching are handled here.
size_t CentralCache::FetchRangeObj(void*& start, void*& end, size_t batchNum, size_t size) {
    size_t index = SizeClass::Index(size);
    _spanLists[index]._mtx.lock();   // bucket lock: different size classes don't contend
    Span* span = GetOneSpan(_spanLists[index], size);
    assert(span && span->_freeList);
    // Walk up to batchNum objects off the span's free list; the span may hold fewer.
    start = span->_freeList;
    end = start;
    size_t i = 0, actualNum = 1;
    while (i < batchNum - 1 && NextObj(end) != nullptr) {
        end = NextObj(end);
        ++i;
        ++actualNum;
    }
    span->_freeList = NextObj(end);
    NextObj(end) = nullptr;          // detach the batch from the span
    span->_useCount += actualNum;    // objects outstanding; span is reclaimable at zero
    _spanLists[index]._mtx.unlock();
    return actualNum;
}

The page cache (PageCache) manages memory at page granularity (8 KB pages). It allocates large spans from the OS, splits them into smaller spans on demand, and merges adjacent free spans to reduce external fragmentation.
Span* PageCache::NewSpan(size_t k) {
    assert(k > 0);
    if (k > NPAGES - 1) {
        // Larger than the biggest bucket: go straight to the OS.
        void* ptr = SystemAlloc(k);
        Span* span = _spanPool.New();
        span->_pageId = ((PAGE_ID)ptr >> PAGE_SHIFT);
        span->_n = k;
        _idSpanMap.set(span->_pageId, span);
        return span;
    }
    if (!_spanLists[k].Empty())
        return _spanLists[k].PopFront();
    // Otherwise: search the larger buckets and split one, or allocate a
    // 128-page span from the OS, insert it, and retry.
    // ... (omitted for brevity) ...
    return NewSpan(k);
}

The global allocation interface replaces malloc / free with ConcurrentAlloc and ConcurrentFree. For requests larger than MAX_BYTES (256 KB) it allocates whole pages directly from the PageCache; otherwise it goes through the thread‑local cache.
static void* ConcurrentAlloc(size_t size) {
    if (size > MAX_BYTES) {
        // Big request: round up to whole pages and ask the page cache directly.
        size_t alignSize = SizeClass::RoundUp(size);
        size_t kPage = alignSize >> PAGE_SHIFT;
        Span* span = PageCache::GetInstance()->NewSpan(kPage);
        span->_objSize = size;
        return (void*)(span->_pageId << PAGE_SHIFT);
    } else {
        if (pTLSThreadCache == nullptr) {
            // The ThreadCache objects themselves come from the fixed-size pool,
            // so the allocator never calls new/malloc for its own bookkeeping.
            static myObject<ThreadCache> tcPool;
            pTLSThreadCache = tcPool.New();
        }
        return pTLSThreadCache->Allocate(size);
    }
}
static void ConcurrentFree(void* ptr) {
    // The object's size is recovered from its owning span, so free
    // needs no size argument, matching the signature of free().
    Span* span = PageCache::GetInstance()->MapObjectToSpan(ptr);
    size_t size = span->_objSize;
    if (size > MAX_BYTES) {
        // Big objects bypass the thread cache and return straight to the page cache.
        PageCache::GetInstance()->_pageMtx.lock();
        PageCache::GetInstance()->ReleaseSpanToPageCache(span);
        PageCache::GetInstance()->_pageMtx.unlock();
    } else {
        assert(pTLSThreadCache);
        pTLSThreadCache->Deallocate(ptr, size);
    }
}

The implementation also integrates a radix‑tree based page‑to‑span mapping (TCMalloc_PageMap2) in place of a hash map, reducing lock contention and improving scalability on multi‑core systems.
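The mapping can be sketched as a two‑level table in the spirit of TCMalloc_PageMap2. The 5‑bit root split and the calloc backing store below are illustrative assumptions (the real allocator carves its nodes from an internal fixed‑size pool to avoid re‑entering itself); reads take no lock because a leaf, once installed, is never moved or freed:

```cpp
#include <cstdint>
#include <cstdlib>

// Two-level radix tree mapping page IDs to span pointers (stored as void*
// here to keep the sketch self-contained). With 8 KB pages on a 32-bit
// address space, BITS = 32 - 13 = 19.
template<int BITS>
class TCMalloc_PageMap2 {
    static constexpr int ROOT_BITS = 5;                  // assumed split
    static constexpr int ROOT_LENGTH = 1 << ROOT_BITS;
    static constexpr int LEAF_BITS = BITS - ROOT_BITS;
    static constexpr int LEAF_LENGTH = 1 << LEAF_BITS;
    struct Leaf { void* values[LEAF_LENGTH]; };
    Leaf* root_[ROOT_LENGTH] = {};                       // second level, grown lazily
public:
    // Lock-free read: leaves are only ever added, never removed or resized.
    void* get(uintptr_t k) const {
        const uintptr_t i1 = k >> LEAF_BITS;
        const uintptr_t i2 = k & (LEAF_LENGTH - 1);
        if ((k >> BITS) > 0 || root_[i1] == nullptr) return nullptr;
        return root_[i1]->values[i2];
    }
    // Writers are serialized by the page cache's lock in the full allocator.
    void set(uintptr_t k, void* v) {
        const uintptr_t i1 = k >> LEAF_BITS;
        if (root_[i1] == nullptr)
            root_[i1] = (Leaf*)calloc(1, sizeof(Leaf));  // real code uses an internal pool
        root_[i1]->values[k & (LEAF_LENGTH - 1)] = v;
    }
};
```

Because get never takes a lock, MapObjectToSpan on the free path stops contending with NewSpan on the allocation path, which is where the scalability win comes from.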
Performance tests compare the custom allocator against the system malloc under multi‑threaded workloads. Results show a significant reduction in allocation/deallocation time, demonstrating the efficiency gains of the three‑level cache design.
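The article's benchmark harness is not reproduced above; a minimal sketch of such a multi‑threaded timing loop, parameterized over the allocator so one harness can time both malloc/free and ConcurrentAlloc/ConcurrentFree (the thread counts and size mix here are illustrative assumptions):

```cpp
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>
#include <vector>

// nworks threads each run `rounds` rounds of `ntimes` allocate/free pairs;
// per-thread wall time is summed into costMs. Passing the alloc/free pair as
// callables lets the same loop time different allocators head to head.
template<class Alloc, class Free>
long long BenchMark(Alloc alloc, Free dealloc,
                    size_t nworks, size_t rounds, size_t ntimes) {
    std::atomic<long long> costMs{0};
    std::vector<std::thread> threads;
    for (size_t k = 0; k < nworks; ++k) {
        threads.emplace_back([&]() {
            std::vector<void*> v;
            v.reserve(ntimes);
            long long local = 0;
            for (size_t j = 0; j < rounds; ++j) {
                auto begin = std::chrono::steady_clock::now();
                for (size_t i = 0; i < ntimes; ++i)
                    v.push_back(alloc((16 * i) % 8192 + 1));  // mixed sizes under MAX_BYTES
                for (void* p : v)
                    dealloc(p);
                v.clear();
                auto end = std::chrono::steady_clock::now();
                local += std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
            }
            costMs += local;
        });
    }
    for (auto& t : threads) t.join();
    return costMs.load();
}
```

Running it once with std::malloc / std::free and once with ConcurrentAlloc / ConcurrentFree under identical parameters yields the kind of head‑to‑head numbers the tests report.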