
Oak Off‑Heap Key‑Value Map and Its Application in Apache Druid for Real‑Time and Batch Ingestion

The article introduces Oak, an off‑heap concurrent key‑value map, explains its design and performance benefits over ConcurrentSkipListMap, and details extensive offline and real‑time ingestion experiments in Apache Druid that demonstrate reduced memory usage, lower CPU consumption, and faster data loading.

Beike Product & Technology

Oak (Off‑heap Allocated Keys) is a scalable, concurrent key‑value map that stores all keys and values outside the JVM heap, allowing up to three times more data to be kept in the same memory footprint and providing strong atomic semantics for read, write, and range‑scan operations.

The map’s index is organized as a list of contiguous memory chunks, which improves cache locality, while keys and values are copied into self‑managed off‑heap byte arrays. A builder configures the serializers, the comparator, and the memory capacity, as shown in the following code:

// Application side: configure the map through the builder, then build it.
OakMapBuilder builder = new OakMapBuilder()
    .setKeySerializer(new MyAppKeySerializer())
    .setValueSerializer(new MyAppValueSerializer())
    .setMinKey(...)
    .setKeysComparator(new MyAppKeyComparator())
    .setMemoryCapacity(...);
OakMap oakMap = builder.build();

// Excerpt from OakMapBuilder: build() hands the configured components
// to the OakMap constructor.
public OakMap build() {
    ...
    return new OakMap<>(minKey, keySerializer, valueSerializer, comparator, chunkMaxItems, valuesMemoryManager, keysMemoryManager);
}
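The idea of copying serialized data into off‑heap memory can be illustrated with plain JDK classes. The following is a minimal sketch, not Oak's implementation: the class OffHeapSlotSketch and its write and read methods are hypothetical names, and a single direct ByteBuffer stands in for Oak's own memory managers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: Oak manages its off-heap memory itself and
// differently. This shows just the copy-in / copy-out principle.
public class OffHeapSlotSketch {
    // A direct buffer is allocated outside the JVM heap.
    private final ByteBuffer arena;

    public OffHeapSlotSketch(int capacityBytes) {
        this.arena = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Serializes the value into the arena and returns the offset of its slot. */
    public int write(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int offset = arena.position();
        arena.putInt(bytes.length); // 4-byte length prefix
        arena.put(bytes);           // payload copied off-heap
        return offset;
    }

    /** Deserializes the value stored at the given offset. */
    public String read(int offset) {
        ByteBuffer view = arena.duplicate(); // shares memory, has its own position
        view.position(offset);
        byte[] bytes = new byte[view.getInt()];
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffHeapSlotSketch store = new OffHeapSlotSketch(1024);
        int first = store.write("dim=US");
        int second = store.write("dim=CN");
        System.out.println(store.read(first));
        System.out.println(store.read(second));
    }
}
```

Only the small offset lives on the heap; the serialized bytes do not become heap objects until they are read back. The sketch is single‑threaded and never frees space, both of which a real off‑heap map has to handle.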

Oak has three key advantages over the traditional ConcurrentSkipListMap: fine‑grained synchronization that scales better with the number of threads, off‑heap storage that keeps the data out of the garbage collector’s reach and makes datasets larger than 50 GB practical, and a richer API for atomic data access.
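For reference, the on‑heap baseline looks roughly like the sketch below. It uses only the JDK's ConcurrentSkipListMap; the class name SkipListBaselineSketch and its helper methods are made up for illustration. Every key and value here is an ordinary heap object that the garbage collector must track.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical baseline sketch built on the JDK's on-heap ordered concurrent map.
public class SkipListBaselineSketch {

    /** Builds a small map with keys 1..5 and then updates one entry in place. */
    public static ConcurrentSkipListMap<Integer, Long> sample() {
        ConcurrentSkipListMap<Integer, Long> map = new ConcurrentSkipListMap<>();
        for (int key = 1; key <= 5; key++) {
            map.put(key, key * 10L);
        }
        // Read-modify-write of a single entry. Per the JDK documentation, the
        // function is not guaranteed to be applied exactly once atomically.
        map.merge(3, 1L, Long::sum);
        return map;
    }

    /** Sums the values of all keys in [from, to] using an ordered range scan. */
    public static long sumRange(ConcurrentNavigableMap<Integer, Long> map, int from, int to) {
        long sum = 0;
        for (long value : map.subMap(from, true, to, true).values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumRange(sample(), 2, 4));
    }
}
```

The range scan is weakly consistent, and merge may re‑apply its function under contention. These are the kinds of guarantees the article has in mind when it refers to Oak's richer atomic API.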

In the Druid community, Oak’s incremental index has been evaluated as a replacement for the on‑heap IncrementalIndex in both batch (hadoop‑index) and streaming (kafka‑index) ingestion pipelines.

Offline ingestion tests used two data sources (≈20 M rows and >1 B rows) on a 128 GB, 48‑core machine. Oak was integrated by replacing the ConcurrentSkipListMap in the IndexGenerator stage. Results showed that, for large data sets, Oak reduced average ingestion time to 80‑85 % of the on‑heap baseline while using less memory.

Real‑time ingestion tests ran a Kafka index job over a 4 GB JSON data set under a 48 GB memory limit. Oak allowed 12 parallel index tasks versus 10 for the on‑heap index, used roughly 60 % of the memory and CPU of the on‑heap baseline, and delivered nearly double the throughput.

Based on these findings, the Druid 0.18.1 codebase was modified: keys and values are stored off‑heap via OakMap, the implementation no longer maintains a row‑to‑actual‑row mapping, and the index remains ordered. Several interface changes were made to IncrementalIndex and IncrementalIndexRow, adding methods such as public Object getDim(int index), public int getdimlength(), public boolean isDimNull(int index), and public IndexedInts getStringDim(final int dimIndex).

The core data‑ingestion logic in OakIncrementalIndex.addFacts was updated to handle roll‑up mode by assigning row indices only on insertion, avoiding duplicate keys.
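The roll‑up behavior described above can be sketched with JDK classes alone. This is not the code of OakIncrementalIndex.addFacts: RollupSketch, Row, and the string key format are hypothetical, and a ConcurrentSkipListMap stands in for OakMap. The point is that a row index is kept only when a key is first inserted, while later rows with the same key are aggregated into the existing entry.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of roll-up ingestion; not Druid's actual implementation.
public class RollupSketch {
    /** Aggregated state for one distinct key (timestamp plus dimensions). */
    static final class Row {
        final int rowIndex;
        final LongAdder sum = new LongAdder();
        Row(int rowIndex) { this.rowIndex = rowIndex; }
    }

    private final ConcurrentSkipListMap<String, Row> facts = new ConcurrentSkipListMap<>();
    private final AtomicInteger nextRowIndex = new AtomicInteger();

    /** Adds one input row and returns the index of the stored (rolled-up) row. */
    public int add(String key, long metric) {
        Row row = facts.get(key);
        if (row == null) {
            // The key looks new: draw an index and try to insert.
            Row candidate = new Row(nextRowIndex.getAndIncrement());
            Row existing = facts.putIfAbsent(key, candidate);
            // If another thread inserted first, its row wins and the candidate's
            // index is discarded (a gap in the numbering, never a duplicate key).
            row = (existing != null) ? existing : candidate;
        }
        row.sum.add(metric); // aggregate into the single row for this key
        return row.rowIndex;
    }

    public int rowCount() { return facts.size(); }

    public long sum(String key) { return facts.get(key).sum.sum(); }

    public static void main(String[] args) {
        RollupSketch index = new RollupSketch();
        index.add("2020-01-01|US", 3);
        index.add("2020-01-01|CN", 5);
        index.add("2020-01-01|US", 4);
        System.out.println(index.rowCount() + " rows, US sum = " + index.sum("2020-01-01|US"));
    }
}
```

In single‑threaded use, indices are dense and grow only when a new key appears, so the stored row count equals the number of distinct keys.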

To enable Oak in Druid, the tuning configuration must specify "appendableIndexSpec": { "type": "oak" }. This activates the off‑heap incremental index for both batch and streaming jobs.

Overall, the experiments demonstrate that Oak’s off‑heap indexing significantly improves memory efficiency and CPU utilization, making it a compelling alternative for high‑throughput, large‑scale analytics workloads.
