
Unlock Faster C++ Performance: Practical Jemalloc Optimization Techniques

This article explains the fundamentals of Linux memory allocation, introduces Jemalloc’s core algorithms and data structures, and provides concrete optimization steps—including arena tuning, tcache configuration, and slab size adjustments—to achieve measurable CPU savings in high‑concurrency C++ services.

ByteDance SYS Tech

Introduction

Jemalloc is a high‑performance malloc implementation widely used in multithreaded, high‑concurrency services. ByteDance’s STE team identified it as a top CPU hotspot and began deep optimization in 2019.

Memory Allocation Basics

Linux provides system calls (brk, sbrk, mmap, munmap) for heap management, but direct use is error‑prone. Allocators such as ptmalloc wrap these calls, offering malloc / free interfaces that hide complexity and reduce fragmentation.
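To make the wrapping concrete, here is a minimal Linux-specific sketch of the mmap path that allocators hide behind malloc/free (the names `raw_page_alloc` and `raw_page_free` are illustrative, not part of any real allocator's API):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Grab pages straight from the kernel, the way an allocator's backend does.
// A real allocator amortizes this syscall across many malloc() calls and
// carves the mapping into smaller blocks itself.
void* raw_page_alloc(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

void raw_page_free(void* p, std::size_t bytes) {
    munmap(p, bytes);
}
```

Every `raw_page_alloc` here costs a kernel transition and is page-granular; the whole point of an allocator like ptmalloc or jemalloc is to call this rarely and serve most requests from memory it already holds.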

Memory fragmentation (internal and external) remains a key metric for allocator quality.

Common Allocation Algorithms

First fit – allocate the first block large enough for the request.

Next fit – start searching from the last allocated address.

Best fit – sort free blocks and pick the smallest that fits.

Buddy allocation – splits and merges blocks in powers‑of‑two.

Slab allocation – uses fixed‑size slots within a slab.
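As a point of reference for the simplest of these strategies, first fit can be sketched in a few lines. The `FreeBlock` list below is a toy model for illustration, not any real allocator's data structure:

```cpp
#include <cstddef>
#include <list>

// Toy free list in address order. First fit scans from the front and
// carves the request out of the first block that is large enough.
struct FreeBlock { std::size_t addr, size; };

// Returns the allocated address, or SIZE_MAX-equivalent if nothing fits.
std::size_t first_fit(std::list<FreeBlock>& free_list, std::size_t want) {
    for (auto it = free_list.begin(); it != free_list.end(); ++it) {
        if (it->size >= want) {
            std::size_t addr = it->addr;
            it->addr += want;              // shrink the block from the front
            it->size -= want;
            if (it->size == 0) free_list.erase(it);
            return addr;
        }
    }
    return static_cast<std::size_t>(-1);   // no block large enough
}
```

Next fit differs only in where the scan starts; best fit replaces the linear scan with a search over blocks ordered by size.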

Buddy Allocation

Buddy allocation manages memory in power‑of‑two block sizes. A request is rounded up to the next power of two; if no free block of that size exists, a larger free block is split in half repeatedly until a block of the right size is obtained. When a block is freed, it is merged with its buddy (the adjacent block it was split from) whenever that buddy is also free.

Buddy allocation keeps external fragmentation low, but internal fragmentation can approach 50%: a request of 2 KB + 1 B is rounded up to a 4 KB block, wasting almost half of it.
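The arithmetic behind that worst case is easy to check. A minimal sketch (helper names are illustrative):

```cpp
#include <cstddef>

// Buddy allocators round every request up to the next power of two.
std::size_t buddy_block_size(std::size_t request) {
    std::size_t block = 1;
    while (block < request) block <<= 1;
    return block;
}

// Fraction of the block wasted by the rounding, in [0, 0.5).
double internal_fragmentation(std::size_t request) {
    std::size_t block = buddy_block_size(request);
    return static_cast<double>(block - request) / block;
}
```

For a request of 2049 bytes (2 KB + 1 B), `buddy_block_size` returns 4096 and nearly half the block is internal fragmentation.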

Slab Allocation

Slab allocation keeps memory in contiguous slabs divided into equal‑size slots tracked by a bitmap, so most allocations are served from a slab without further kernel‑mode transitions. The slab size is the least common multiple of the slot size and the page size, so a whole number of slots exactly fills a whole number of pages with no wasted tail.
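The LCM rule can be computed directly with the standard library (the function name `slab_size` is illustrative):

```cpp
#include <cstddef>
#include <numeric>  // std::lcm (C++17)

// A slab must hold a whole number of slots and span a whole number of
// pages, so its size is the least common multiple of the two.
std::size_t slab_size(std::size_t slot, std::size_t page = 4096) {
    return std::lcm(slot, page);
}
```

For 128 B slots the slab is a single 4 KB page (32 slots); for 160 B slots the slab grows to 20 KB (128 slots across 5 pages), matching the examples later in this article.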

Jemalloc Overview

Jemalloc implements malloc with high efficiency, low fragmentation, built‑in profiling, and many tunable parameters that can be adjusted without recompiling.

Jemalloc Core Algorithms and Data Structures

Separate handling of large and small sizes (threshold ≈ 3.5 pages) to reduce fragmentation.

Prefer low‑address reuse to keep memory on fewer pages.

Define size classes and slab classes to limit fragmentation.

Strictly limit Jemalloc’s own metadata size.

Use multiple arenas, each managing a subset of threads, to minimize lock contention.

Extent

An extent is a memory object managed by an arena. Large allocations use the buddy algorithm on extents; small allocations use slab allocation within extents. Each extent’s size is a multiple of the page size, tracked with a bitmap, and classified as active, dirty, muzzy, or retained, forming a multilevel cache.

Small‑size Alignment and Slab Size

Jemalloc aligns small sizes by dividing them into a group (the highest set bit) and a mod (the two bits below it). All sizes in the same group share the same step size after alignment. The slab size is then the least common multiple of the aligned size and the page size (e.g., 128 B → 4 KB slab, 160 B → 20 KB slab).
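The group/mod rounding can be modeled in a few lines. This is a simplified sketch of the rule described above, not jemalloc's actual implementation (real jemalloc additionally quantum-spaces its smallest classes):

```cpp
#include <cstddef>

// The "group" is the highest set bit of the size; the two bits below it
// split each group into four steps, so every size rounds up to one of
// four classes per power-of-two range.
std::size_t align_small_size(std::size_t size) {
    if (size <= 16) return 16;                  // smallest classes are quantum-sized
    std::size_t group = 1;
    while ((group << 1) < size) group <<= 1;    // highest power of two below size
    std::size_t step = group >> 2;              // four classes per group
    return (size + step - 1) & ~(step - 1);     // round up to a multiple of step
}
```

Under this model a 160 B request is already a class boundary (group 128, step 32), while 161 B rounds up to 192 B; combined with the LCM rule this yields the 20 KB slab mentioned above.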

Tcache and Arena

Jemalloc reduces lock contention by combining per‑thread caches (tcache) with multiple arenas. Allocation flow: align the requested size, locate the corresponding bin, allocate from tcache if a slot is available, otherwise request a fill from the arena. When tcache is empty, the arena fills it; when tcache is full, a flush returns half the cached memory to the arena.
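The fill/flush cycle can be illustrated with a toy model. Everything below (`ToyTcacheBin`, the integer "slot ids") is a deliberately simplified stand-in for jemalloc's real bin structures:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of one tcache bin: allocations are served lock-free from the
// per-thread cache; an empty bin batch-fills from the shared arena, and an
// overfull bin flushes half of its slots back.
struct ToyTcacheBin {
    std::size_t ncached_max;        // capacity of this bin
    std::vector<int> slots;         // cached slot ids

    int alloc(std::deque<int>& arena) {
        if (slots.empty()) {                        // fill: batch-refill
            std::size_t want = ncached_max / 2;
            while (want-- && !arena.empty()) {
                slots.push_back(arena.front());
                arena.pop_front();
            }
        }
        if (slots.empty()) return -1;               // arena exhausted
        int id = slots.back();
        slots.pop_back();
        return id;
    }

    void free(int id, std::deque<int>& arena) {
        slots.push_back(id);
        if (slots.size() > ncached_max) {           // flush: return half
            std::size_t keep = ncached_max / 2;
            while (slots.size() > keep) {
                arena.push_back(slots.back());
                slots.pop_back();
            }
        }
    }
};
```

The batching is the point: the arena lock is taken once per fill or flush rather than once per malloc/free, which is why raising the bin capacity (discussed below) trades memory for fewer lock acquisitions.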

Optimization Strategies

Increase the arena count (e.g., MALLOC_CONF=narenas:128 ) to reduce the number of threads sharing each arena.

Bind heavy threads to exclusive arenas via mallctl.

Adjust slab sizes to match workload characteristics.

Tune dirty_decay_ms and muzzy_decay_ms to control reclamation latency.

Raise ncached_max for tcache bins to reduce arena fills.
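The arena-binding step can be done programmatically through the mallctl namespace ("arenas.create" and "thread.arena" are documented in the jemalloc man page). The sketch below is guarded so it also builds where jemalloc is not linked in; the function name is illustrative:

```cpp
#include <cstddef>
#if __has_include(<jemalloc/jemalloc.h>)
#  include <jemalloc/jemalloc.h>
#  define HAVE_JEMALLOC 1
#endif

// Create a fresh arena and bind the calling thread to it, so this thread
// stops contending with others. Returns the new arena index, or -1 when
// jemalloc is unavailable or the mallctl calls fail.
int bind_current_thread_to_new_arena() {
#ifdef HAVE_JEMALLOC
    unsigned arena_id;
    std::size_t sz = sizeof(arena_id);
    if (mallctl("arenas.create", &arena_id, &sz, nullptr, 0) != 0)
        return -1;
    if (mallctl("thread.arena", nullptr, nullptr,
                &arena_id, sizeof(arena_id)) != 0)
        return -1;
    return static_cast<int>(arena_id);
#else
    return -1;  // built without jemalloc: nothing to bind
#endif
}
```

A typical pattern is to call this once at the start of each identified heavy thread, leaving the shared arenas to the long tail of lighter threads.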

Practical Tuning Steps

Dump statistics with malloc_stats_print or by setting MALLOC_CONF=stats_print:true, then analyze:

Arena‑to‑thread ratio.

Mutex contention in extents.

Bin fill counts (e.g., high nfills for sizes 512 B, 1 KB, 2 KB, 4 KB).

Typical configuration commands:

<code>// reference: https://jemalloc.net/jemalloc.3.html
void malloc_stats_print(void (*write_cb)(void *, const char *), void *cbopaque, const char *opts);
</code>
<code>export MALLOC_CONF=stats_print:true</code>
<code>export MALLOC_CONF=narenas:128</code>
<code>export MALLOC_CONF=dirty_decay_ms:10000,muzzy_decay_ms:5000</code>
<code>export MALLOC_CONF=tcache_nslots_small_min:20,tcache_nslots_small_max:200,lg_tcache_nslots_mul:1</code>

Case Study

In a service with 1 776 threads and 256 arenas, increasing arenas to 1 024 reduced CPU usage by 4.5% while memory grew modestly. Binding ~80 high‑load threads to exclusive arenas and tuning decay parameters saved an additional 4% CPU, with memory usage remaining stable.

Conclusion

Jemalloc is a versatile allocator; systematic monitoring and parameter tuning can consistently deliver 3‑4% CPU savings across many ByteDance services.
