
Understanding Netty's Memory Management and Allocation Strategies

This article explains how Netty implements memory management by borrowing concepts from Jemalloc and Tcmalloc, detailing the hierarchy of arenas, chunks, pages and sub‑pages, the allocation algorithms for both large and small buffers, and the role of thread‑local caches in reducing fragmentation and improving performance.


Background

Netty's memory management is not built from scratch; it draws inspiration from the Jemalloc allocator, which in turn borrows ideas such as thread-local caches and page-based size classes from Google's Tcmalloc and adds red-black-tree tracking of memory runs. While most allocators pursue the same goals of efficient allocation, efficient reclamation and low fragmentation, Jemalloc's classification of requests into Small, Large and Huge categories is what helps it keep fragmentation low in large-allocation scenarios.

In Linux, physical memory is managed in 4 KB pages. Internal fragmentation occurs when a request smaller than a page still consumes a whole page, wasting the unused remainder; external fragmentation appears when allocations and frees of varying sizes leave non-contiguous gaps between used blocks that are individually too small to satisfy larger requests.

Basic Concepts

Netty classifies memory by location (heap vs. direct) and by whether it is pooled. Each thread receives a private memory cache (the PoolThreadCache), while multiple threads can share an arena. An arena manages a set of PoolChunkLists, tinySubpagePools and smallSubpagePools to allocate memory efficiently.

Memory size categories are Huge, Normal, Small and Tiny. Allocation always starts from a Chunk (default 16 MB). Within a chunk, Netty defines the Page (8 KB) and SubPage units for finer granularity.

Chunk: the unit Netty requests from the OS; a chunk's pages are organized as a complete binary tree with 2048 leaves.

Page: an 8 KB block inside a chunk; multiple contiguous pages are combined when a request exceeds 8 KB.

SubPage: a subdivision of a page used for allocations smaller than 8 KB. Tiny sub-pages cover sizes from 16 B up to 496 B in 16 B steps; Small sub-pages cover 512 B, 1 KB, 2 KB and 4 KB.
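The size categories above imply a normalization step before any allocation: the requested size is rounded up to the nearest size class. The sketch below is a simplified, illustrative model of that rounding (Netty's actual method is PoolArena.normalizeCapacity; the class and method names here are ours, and the Huge path is ignored).

```java
// Illustrative sketch of Netty-style size normalization (not Netty source):
// tiny requests (< 512 B) round up to a multiple of 16 B; larger requests
// round up to the next power of two.
final class SizeClassSketch {
    static int normalize(int reqCapacity) {
        if (reqCapacity >= 512) {
            // round up to the next power of two, e.g. 1000 -> 1024
            int n = reqCapacity - 1;
            n |= n >>> 1; n |= n >>> 2; n |= n >>> 4; n |= n >>> 8; n |= n >>> 16;
            return n + 1;
        }
        // tiny: round up to a multiple of 16, e.g. 20 -> 32
        return (reqCapacity & 15) == 0 ? reqCapacity : (reqCapacity & ~15) + 16;
    }
}
```

Under this scheme a 20 B request becomes a 32 B Tiny allocation and a 1000 B request becomes a 1 KB Small allocation, which is the behaviour the examples later in this article rely on.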

PoolArena

Netty adopts Jemalloc's arena design: a fixed number of arenas (by default derived from the number of CPU cores, up to twice the core count) is created to reduce contention. When a thread allocates pooled memory for the first time, it is bound to the arena currently serving the fewest threads and sticks to that arena for its lifetime, which reduces lock contention and improves cache locality.
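The binding step can be sketched as follows. This is a hedged model in the spirit of Netty's least-used-arena selection; the Arena and ArenaBinder types here are illustrative, not Netty's classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative arena-binding sketch: pick the arena with the fewest bound
// thread caches, then count the current thread against it.
final class Arena {
    final AtomicInteger numThreadCaches = new AtomicInteger();
}

final class ArenaBinder {
    static Arena leastUsed(Arena[] arenas) {
        Arena best = arenas[0];
        for (Arena a : arenas) {
            if (a.numThreadCaches.get() < best.numThreadCaches.get()) {
                best = a;
            }
        }
        best.numThreadCaches.incrementAndGet(); // this thread is now bound here
        return best;
    }
}
```

In Netty the result of this lookup is stored in a thread-local, so the (cheap but not free) scan over arenas happens only once per thread.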

The arena contains two PoolSubpage arrays (for the Tiny and Small sizes) and six PoolChunkLists, each representing a range of chunk usage percentages. The six lists form a doubly-linked chain.

// Memory usage 100%: full chunks
q100 = new PoolChunkList&lt;T&gt;(this, null, 100, Integer.MAX_VALUE, chunkSize);

// Memory usage 75-100%
q075 = new PoolChunkList&lt;T&gt;(this, q100, 75, 100, chunkSize);

// Memory usage 50-100%
q050 = new PoolChunkList&lt;T&gt;(this, q075, 50, 100, chunkSize);

// Memory usage 25-75%
q025 = new PoolChunkList&lt;T&gt;(this, q050, 25, 75, chunkSize);

// Memory usage 1-50%
q000 = new PoolChunkList&lt;T&gt;(this, q025, 1, 50, chunkSize);

// Initial list for newly created chunks (a chunk never moves back here)
qInit = new PoolChunkList&lt;T&gt;(this, q000, Integer.MIN_VALUE, 25, chunkSize);

q100.prevList(q075);
q075.prevList(q050);
q050.prevList(q025);
q025.prevList(q000);
q000.prevList(null);
qInit.prevList(qInit);

The allocation algorithm tries the q050 list first, because chunks at 50-100% usage balance high utilization with a reasonable chance of a successful allocation. It then falls back to q025, q000 and qInit, and tries q075 last since its chunks are nearly full; q100 chunks are full and are never probed. qInit holds newly created chunks that are never reclaimed from that list.
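The search order above can be made concrete with a small sketch. This is an illustrative model of the probe sequence, not Netty source; the pick method and the boolean map standing in for "this list can satisfy the request" are our simplifications.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the chunk-list probe order used for normal
// allocations: q050 first, q075 last, a fresh chunk only as a last resort.
final class AllocOrderSketch {
    static String pick(Map<String, Boolean> canAllocate) {
        for (String list : new String[] {"q050", "q025", "q000", "qInit", "q075"}) {
            if (canAllocate.getOrDefault(list, Boolean.FALSE)) {
                return list; // first list able to satisfy the request wins
            }
        }
        return "newChunk"; // every list failed: allocate a brand-new 16 MB chunk
    }
}
```

Note that q000 beats q075 in this order even though its chunks are emptier; preferring moderately full chunks keeps overall utilization high, while nearly full chunks are a poor bet for anything but small requests.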

PoolChunkList

A PoolChunkList manages a set of PoolChunks whose usage lies between minUsage and maxUsage. When a chunk's usage exceeds maxUsage, it moves to the next list; when it falls below minUsage, it moves to the previous list. The usage ranges of adjacent lists deliberately overlap, which prevents a chunk whose usage hovers near a boundary from bouncing back and forth between two lists.
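The usage percentage that drives these moves is derived from the chunk's free byte count. The sketch below is a simplified model of that computation (Netty's version lives in PoolChunk.usage(); the class name here is ours).

```java
// Illustrative sketch of how a chunk's usage percentage is derived from its
// remaining free bytes, rounding so that an almost-full chunk reports 99%.
final class ChunkUsage {
    static int usage(int freeBytes, int chunkSize) {
        if (freeBytes == 0) {
            return 100;                                   // completely full
        }
        int freePercentage = (int) (freeBytes * 100L / chunkSize);
        if (freePercentage == 0) {
            return 99;                                    // nearly full, but not 100%
        }
        return 100 - freePercentage;
    }
}
```

The owning list compares this percentage against its minUsage/maxUsage bounds after every allocation and free, and hands the chunk to a neighbouring list when it falls outside them.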

final class PoolChunkList&lt;T&gt; implements PoolChunkListMetric {
    private final PoolArena&lt;T&gt; arena;
    private final PoolChunkList&lt;T&gt; nextList;
    private final int minUsage;
    private final int maxUsage;
    private final int maxCapacity;
    private PoolChunk&lt;T&gt; head;
    private PoolChunkList&lt;T&gt; prevList;
    // ... other members and methods ...
}

PoolChunk

A PoolChunk (default 16 MB) holds the actual memory. It maintains a binary-tree view of its pages using two arrays: memoryMap (the allocation state of each node) and depthMap (the depth of each node). Each chunk also holds an array of PoolSubpages for allocations smaller than a page.

final class PoolChunk&lt;T&gt; implements PoolChunkMetric {
    final PoolArena&lt;T&gt; arena;
    final T memory;
    private final byte[] memoryMap; // allocation state per node
    private final byte[] depthMap;  // depth of each node
    private final PoolSubpage&lt;T&gt;[] subpages;
    private int freeBytes;
    // ... other members ...
}

The binary tree has 2048 leaf nodes (pages). memoryMap is initialized with the same values as depthMap. When a node is allocated, its value is set to the sentinel maxOrder + 1 (12 by default) and each ancestor is updated to the minimum of its two children, so a search can skip fully allocated subtrees in logarithmic time.
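The min-propagation step can be sketched as follows. This is a simplified, illustrative model of the invariant (Netty's update logic is spread across updateParentsAlloc and friends; the names below are ours), showing only the upward walk after one node is marked used.

```java
// Illustrative sketch of the memoryMap invariant: after allocating node `id`,
// every ancestor stores the minimum of its children, so memoryMap[node] tells
// in O(1) the shallowest depth still free anywhere under that node.
final class MemoryMapSketch {
    static final byte UNUSABLE = 12; // larger than any depth: node fully used

    static void markAllocated(byte[] memoryMap, int id) {
        memoryMap[id] = UNUSABLE;
        while (id > 1) {
            id >>>= 1; // walk up to the parent (array index halves)
            byte left = memoryMap[id << 1];
            byte right = memoryMap[(id << 1) + 1];
            memoryMap[id] = left < right ? left : right;
        }
    }
}
```

A later search for a node of depth d then simply descends from the root, following any child whose stored value is at most d, and fails fast if the root's value already exceeds d.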

PoolSubpage

final class PoolSubpage&lt;T&gt; implements PoolSubpageMetric {
    final PoolChunk&lt;T&gt; chunk;
    private final int memoryMapIdx;
    private final int runOffset;
    private final long[] bitmap; // 1 bit per block
    PoolSubpage&lt;T&gt; prev;
    PoolSubpage&lt;T&gt; next;
    int elemSize;
    private int maxNumElems;
    private int numAvail;
    // ... other members ...
}

Each sub-page manages its blocks with a bitmap in which each bit indicates whether a block is free (0) or used (1). For 32 B blocks inside an 8 KB page there are 256 blocks, which require 4 long values of bitmap (256 / 64).
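The bitmap-sizing arithmetic generalizes to any element size, and is worth checking once. The helper below is an illustrative reimplementation of the calculation (the name BitmapSizing is ours).

```java
// Worked check of the bitmap arithmetic: one long covers 64 blocks, so the
// bitmap needs ceil(maxNumElems / 64) longs.
final class BitmapSizing {
    static int bitmapLength(int pageSize, int elemSize) {
        int maxNumElems = pageSize / elemSize; // blocks per page
        int len = maxNumElems >>> 6;           // whole longs (divide by 64)
        if ((maxNumElems & 63) != 0) {
            len++;                             // round up for a partial long
        }
        return len;
    }
}
```

So 32 B blocks need 4 longs (256 bits), 16 B blocks need 8 longs (512 bits), and sizes that do not divide the page evenly round the length up.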

Allocation Strategies

Normal allocations (≥ 8 KB)

Netty allocates one page or a run of contiguous pages by navigating the binary tree. The target node depth is computed as d = maxOrder - (log2(normCapacity) - pageShifts). The algorithm searches the chunk's tree for a free node at that depth, marks it as used, updates freeBytes, and returns the node id as the allocation handle.

private long allocateRun(int normCapacity) {
    int d = maxOrder - (log2(normCapacity) - pageShifts);
    int id = allocateNode(d);
    if (id < 0) return id;
    freeBytes -= runLength(id);
    return id;
}

Examples: allocating 8 KB, 16 KB and another 8 KB in sequence marks leaf node 2048, then depth-10 node 1025 (which spans leaves 2050 and 2051), then leaf node 2049, with the ancestors of each node updated accordingly.
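The depth formula is easy to verify with Netty's defaults, maxOrder = 11 (2^11 = 2048 leaves) and pageShifts = 13 (log2 of the 8 KB page size). The sketch below is an illustrative helper, assuming normCapacity has already been normalized to a power of two.

```java
// Worked check of d = maxOrder - (log2(normCapacity) - pageShifts) with the
// default maxOrder = 11 and pageShifts = 13.
final class DepthSketch {
    static final int MAX_ORDER = 11;
    static final int PAGE_SHIFTS = 13;

    static int depthFor(int normCapacity) {
        // log2 of a power of two via leading-zero count
        int log2 = 31 - Integer.numberOfLeadingZeros(normCapacity);
        return MAX_ORDER - (log2 - PAGE_SHIFTS);
    }
}
```

An 8 KB request lands at depth 11 (a single leaf page), a 16 KB request at depth 10 (a node spanning two pages), and a full 16 MB request at depth 0 (the root, i.e. the whole chunk).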

Small and Tiny allocations (< 8 KB)

For requests smaller than a page, Netty first obtains the head of the appropriate sub-page pool (tiny or small) from the arena, then allocates a leaf node from the binary tree, creates a PoolSubpage for it if necessary, splits the page into equal-size blocks, links the sub-page into the arena's doubly-linked pool list, and finally allocates one block via the bitmap.

private long allocateSubpage(int normCapacity) {
    PoolSubpage&lt;T&gt; head = arena.findSubpagePoolHead(normCapacity);
    int d = maxOrder; // sub-pages are always carved out of a leaf-level page
    synchronized (head) {
        int id = allocateNode(d);
        if (id < 0) return id;
        final PoolSubpage&lt;T&gt;[] subpages = this.subpages;
        final int pageSize = this.pageSize;
        freeBytes -= pageSize;
        int subpageIdx = subpageIdx(id);
        PoolSubpage&lt;T&gt; subpage = subpages[subpageIdx];
        if (subpage == null) {
            subpage = new PoolSubpage&lt;T&gt;(head, this, id, runOffset(id), pageSize, normCapacity);
            subpages[subpageIdx] = subpage;
        } else {
            subpage.init(head, normCapacity);
        }
        return subpage.allocate();
    }
}

For a 20 B request, Netty rounds up to 32 B (Tiny). It finds a free leaf (e.g., node 2049), creates a PoolSubpage that splits the 8 KB page into 256 blocks, links it into tinySubpagePools[2] (index 32 / 16 = 2), and allocates one 32 B block.
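The pool index used in that example comes from a simple division, since tiny sizes are multiples of 16 B. The sketch below is an illustrative reimplementation of that lookup (Netty's version is PoolArena.tinyIdx).

```java
// Illustrative tiny-pool slot lookup: tiny sizes are 16 B, 32 B, ..., 496 B,
// so the slot index is simply size / 16.
final class TinyIdxSketch {
    static int tinyIdx(int normCapacity) {
        return normCapacity >>> 4; // divide by 16
    }
}
```

So 16 B requests share slot 1, 32 B requests slot 2, and the largest tiny size, 496 B, lands in slot 31; each slot heads a doubly-linked list of sub-pages carved to that element size.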

Thread‑local cache

When a thread repeatedly allocates the same size, the PoolThreadCache keeps recently freed blocks of that size so they can be reused without touching the arena. After a configurable number of allocations (default 8192), the cache invokes trim() to release rarely used entries back to the arena, reducing memory pressure.

boolean allocated = cache.allocate(buf, reqCapacity);
if (++allocations >= freeSweepAllocationThreshold) {
    allocations = 0;
    trim();
}

void trim() {
    trim(tinySubPageDirectCaches);
    trim(smallSubPageDirectCaches);
    // ... other caches ...
}

Summary

The article provides an in‑depth look at Netty’s memory allocation pipeline, from high‑level arena selection down to binary‑tree node management, sub‑page bitmap handling, and thread‑local caching. Understanding these components helps developers reason about performance, fragmentation, and tuning of Netty‑based servers.

References

https://netty.io/


Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
