Big Data 18 min read

Understanding BitMap and Roaring BitMap: Principles, Containers, and Java API Usage

The article explains BitMap fundamentals and introduces Roaring BitMap’s compressed container architecture—Array, BitMap, and Run containers—detailing their conversion logic, Java implementation snippets, performance advantages over traditional BitSets, and practical API usage for high‑performance, memory‑efficient big‑data applications.

vivo Internet Technology
vivo Internet Technology
vivo Internet Technology
Understanding BitMap and Roaring BitMap: Principles, Containers, and Java API Usage

This article, authored by a member of the Vivo Internet Server Team, provides a comprehensive overview of BitMap and Roaring BitMap, focusing on their underlying principles, storage mechanisms, container types, and practical usage in Java.

1. Introduction

While Bloom filters and HyperLogLog are often used for deduplication and cardinality estimation in big‑data scenarios, they introduce approximation errors. For exact results, BitMap can be employed, but traditional BitMap may consume excessive memory for sparse data. Roaring BitMap, a compressed sparse bitmap, addresses this issue and is widely adopted in databases and big‑data engines such as Hive, Spark, Doris, and Kylin.

2. BitMap Principle

BitMap uses individual bits to represent the presence of elements. For example, a single byte with bits set at positions 1, 2, 4, and 6 represents the set {1,2,4,6}. In Java, storing four integers with a plain int array would require 16 bytes, whereas a BitMap can represent the same information with a single byte.

Advantages: low memory footprint compared to arrays, fast bit‑wise operations. Disadvantages: cannot store duplicate values and wastes space for highly sparse data.

3. Roaring BitMap Principle

Roaring BitMap partitions a 32‑bit integer into high 16 bits (used as an index) and low 16 bits (the actual value). Each high‑16‑bit index points to a container that stores the corresponding low‑16‑bit values. Up to 2 16 containers can exist, each holding up to 2 16 values.

The three container types are:

ArrayContainer : stores low‑16‑bit values in a sorted short array. Suitable for sparse data; memory usage grows linearly (2 bytes per entry). Search complexity O(log N). Maximum size 4096 entries (≈8 KB).

BitMapContainer : a plain BitMap of 2 16 bits stored in a long array (8 KB). Ideal for dense data; operations are O(1).

RunContainer : uses run‑length encoding to compress consecutive values. In the best case (all 65536 values present) it occupies only 4 bytes; in the worst case up to 128 KB. Search complexity O(log N).

The container selection logic is:

Start with ArrayContainer . When the element count exceeds 4096, it converts to BitMapContainer .

If the count falls back to ≤4096, the BitMapContainer may revert to ArrayContainer .

Conversion to RunContainer occurs only after explicit user‑initiated optimization.

4. Core Java Implementations

Below are key snippets from the Roaring BitMap Java library.

public void add(final int x) {
    final short hb = Util.highbits(x);
    final int i = highLowContainer.getIndex(hb);
    if (i >= 0) {
        highLowContainer.setContainerAtIndex(i,
            highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
    } else {
        final ArrayContainer newac = new ArrayContainer();
        highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
    }
}

The add method extracts the high 16 bits, locates the appropriate container via binary search, and either inserts the low 16‑bit value or creates a new ArrayContainer when necessary.

protected int getIndex(short x) {
    if ((size == 0) || (keys[size - 1] == x)) {
        return size - 1;
    }
    return this.binarySearch(0, size, x);
}

Container‑specific add implementations illustrate the different strategies:

// ArrayContainer add (simplified)
if (cardinality == 0 || (cardinality > 0 && toIntUnsigned(x) > toIntUnsigned(content[cardinality - 1]))) {
    if (cardinality >= DEFAULT_MAX_SIZE) {
        return toBitMapContainer().add(x);
    }
    if (cardinality >= this.content.length) {
        increaseCapacity();
    }
    content[cardinality++] = x;
} else {
    int loc = Util.unsignedBinarySearch(content, 0, cardinality, x);
    if (loc < 0) {
        // insert at proper position
        if (cardinality >= DEFAULT_MAX_SIZE) {
            return toBitMapContainer().add(x);
        }
        if (cardinality >= this.content.length) {
            increaseCapacity();
        }
        System.arraycopy(content, -loc - 1, content, -loc, cardinality + loc + 1);
        content[-loc - 1] = x;
        ++cardinality;
    }
}
return this;
// BitMapContainer add
final int x = Util.toIntUnsigned(i);
final long previous = BitMap[x / 64];
long newval = previous | (1L << x);
BitMap[x / 64] = newval;
if (USE_BRANCHLESS) {
    cardinality += (previous ^ newval) >>> x;
} else if (previous != newval) {
    ++cardinality;
}
return this;
// RunContainer add (simplified)
int index = unsignedInterleavedBinarySearch(valueslength, 0, nbrruns, k);
if (index >= 0) return this; // already present
index = -index - 2;
// ... handle merging runs, extending runs, or inserting a new run ...
makeRoomAtIndex(index + 1);
setValue(index + 1, k);
setLength(index + 1, (short) 0);
return this;

5. Performance Comparison

A benchmark comparing Java's native BitSet with Roaring BitMap shows that, for one billion entries, Roaring BitMap uses only ~149 KB in its best‑case compressed form versus 128 MB for an uncompressed BitMap—a compression ratio of roughly 1:1000.

6. Java API Usage

Typical operations include:

// Add a single value
public void add(final int x);
// Add a range
public void add(final long rangeStart, final long rangeEnd);
// Remove a value
public void remove(final int x);
// Iterate
public void forEach(IntConsumer ic);
// Membership test
public boolean contains(final int x);
// Cardinality
public int getCardinality();
// Set operations (and, or, xor, andNot) – both mutable and immutable variants
public void and(final RoaringBitMap x2);
public static RoaringBitMap and(final RoaringBitMap x1, final RoaringBitMap x2);
// Serialization
public void serialize(DataOutput out) throws IOException;
public void deserialize(DataInput in) throws IOException;

7. Real‑World Scenario

In a production environment, Roaring BitMap can be used to build large‑scale user‑tag sets. For example, to compute the success rate of a device‑interface query, the intersection of the device‑tag set with the daily active‑user set is obtained via Roaring BitMap, reducing query time from minutes (SQL on Hive) to about one second.

8. Conclusion

The article consolidates the principles of BitMap and Roaring BitMap, explains the three container types, and demonstrates Java API usage. It highlights Roaring BitMap’s superior compression and performance, making it suitable for big‑data and database applications.

JavaBig DataBitmapdata compressioncontainersRoaring Bitmap
vivo Internet Technology
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.