
TCMalloc: Architecture, Principles, Usage, and Performance Comparison

This article provides a comprehensive overview of Google’s TCMalloc memory allocator, detailing its three‑level cache architecture, allocation and reclamation strategies, installation methods, configuration options, and performance advantages over other allocators in C++ backend, game, and database applications.


TCMalloc Overview

In the digital era, application performance is critical, especially for C++ developers who need efficient memory management. TCMalloc (Thread‑Caching Malloc), an open‑source allocator from Google, addresses multithreaded allocation bottlenecks by using per‑thread caches and a three‑level hierarchy to improve speed and reduce lock contention.

1. Introduction to TCMalloc

1.1 What is TCMalloc?

TCMalloc replaces the default system allocator (malloc, free, new, delete) with a high‑performance allocator that provides per‑thread local caches (ThreadCache) and manages memory in three layers: ThreadCache, CentralCache, and PageHeap.

1.2 Background and Development

With the rise of multicore and hyper‑threading, traditional allocators like glibc's ptmalloc2 suffer from lock contention. Google created TCMalloc to reduce this contention by giving each thread its own cache, resulting in significant speed gains for high‑concurrency servers and real‑time games.

2. TCMalloc System Architecture

2.1 Detailed Architecture

Front‑end : Provides fast allocation and deallocation using per‑thread and per‑CPU caches.

Middle‑end : Supplies memory to the front‑end when its cache is exhausted.

Back‑end : Obtains memory from the operating system and feeds it to the middle‑end.

Each thread has an independent ThreadCache . If its cache lacks sufficient blocks, it requests memory from CentralCache , which in turn obtains memory from PageHeap . When PageHeap cannot satisfy a request, it asks the OS for more pages.

Page : TCMalloc's unit of memory management, typically 8 KB (larger than the usual 4 KB OS page).

Span : A contiguous run of pages managed as a single unit.

ThreadCache : Per‑thread cache holding free lists for the different size classes.

Size Class : Groups objects of similar size; small allocations (<256 KB) are rounded up and mapped to a size class.

CentralCache : Shared cache that redistributes memory among threads.

PageHeap : Manages spans and interacts directly with the OS.

2.2 Front‑end

The front‑end handles requests for blocks of a given size class using a cache that only the owning thread (or logical CPU, in per‑CPU mode) can access, so no locks are needed on the fast path. If the cache is empty, it pulls a batch of objects from the middle‑end.

2.3 Middle‑end

The middle‑end consists of a TransferCache and central free lists (one per size class). It mediates between the front‑end and back‑end and is protected by mutexes.

2.4 Back‑end

The back‑end manages large unused blocks, obtains new memory from the OS when needed, and returns excess memory to the OS. It supports both a legacy PageHeap and a huge‑page‑aware PageHeap.

3. Principles of TCMalloc

3.1 Three‑Level Cache Mechanism

ThreadCache provides lock‑free allocation for small objects (up to 256 KB in current releases) by keeping private free lists per thread. CentralCache is a shared cache protected by a spinlock; it refills ThreadCaches when they run low. PageHeap is the top level: it obtains large regions from the OS and splits them into spans for the lower layers.

3.2 Allocation Strategies

For small allocations, TCMalloc checks ThreadCache first, then CentralCache, then PageHeap. Large allocations (>256 KB) bypass the per‑thread caches and are served directly by PageHeap in whole pages; only when PageHeap itself is exhausted does TCMalloc request more memory from the OS (e.g., via mmap or sbrk ).

3.3 Reclamation Strategies

Freed small blocks return to the originating ThreadCache; when a ThreadCache grows too large, excess blocks are returned to CentralCache. Once all objects in a span are free, the span is handed back to PageHeap, which can coalesce adjacent free spans and release whole pages to the OS.

4. Using TCMalloc

4.1 Installation

Source installation (TCMalloc) :

TCMalloc builds with Bazel. On CentOS/RHEL 7, add a Copr repository for Bazel, e.g. /etc/yum.repos.d/bazel.repo :
[copr:copr.fedorainfracloud.org:vbatts:bazel]
name=Copr repo for bazel owned by vbatts
baseurl=https://download.copr.fedorainfracloud.org/results/vbatts/bazel/epel-7-$basearch/
...

Install Bazel:

yum install bazel3

Clone and build TCMalloc:

git clone https://github.com/google/tcmalloc.git
cd tcmalloc && bazel test //tcmalloc/...

gperftools installation (which includes TCMalloc):

git clone https://github.com/gperftools/gperftools.git

Generate build tools:

./autogen.sh

Configure and compile:

./configure --disable-debugalloc --enable-minimal
make -j4
make install

On 64‑bit Linux, install libunwind before gperftools to avoid deadlocks, or enable frame‑pointer support with -fno-omit-frame-pointer and --enable-frame-pointers .

4.2 Basic Usage Example

#include <cstdio>
#include <cstdlib>  // malloc, free
int main() {
    void* ptr = malloc(1024);
    if (ptr) {
        // use memory
        free(ptr);
    }
    return 0;
}

Link with -ltcmalloc to replace the standard malloc/free with TCMalloc’s implementations, or preload the installed libtcmalloc.so at runtime via LD_PRELOAD without relinking.

4.3 Configuration and Tuning

Environment variables such as TCMALLOC_RELEASE_RATE (default 1.0) control how aggressively memory is returned to the OS, and TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES limits total thread‑cache size.

5. Application Cases

5.1 Game Development

In a large 3D RPG, switching to TCMalloc reduced frame‑rate drops caused by frequent small allocations, improving average FPS by 15‑20%.

5.2 Database Scenarios

Replacing MySQL’s allocator with TCMalloc lowered response times by 30‑40% and increased throughput by 25‑35% under 500 concurrent connections.

6. Comparison with Other Allocators

6.1 vs. glibc malloc

TCMalloc outperforms glibc malloc in multithreaded small‑allocation workloads (e.g., 6× faster in a 1‑million‑operation benchmark) due to its per‑thread caches and finer‑grained locking.

6.2 vs. Other Allocators (JeMalloc, dlmalloc, ptmalloc)

JeMalloc uses arena‑based allocation and excels in low‑fragmentation large‑data workloads, while TCMalloc remains superior for high‑concurrency small‑object allocation. dlmalloc and ptmalloc are older designs that generally lag behind TCMalloc in modern multithreaded scenarios.

Written by Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.