Using Bitmap and Bloom Filter for Large-Scale Data Deduplication in Java
The article explains how to store and deduplicate billions of identifiers efficiently by using a bitmap backed by Redis and extending it with a Bloom filter implementation in Java, highlighting memory calculations, practical commands, and code examples.
Before diving into solutions, the article calculates the memory required to store 4 billion QQ numbers using an unsigned 32‑bit integer, which would need roughly 15 GB, noting that Java does not provide native unsigned integer types.
To handle such massive data more elegantly, the article introduces the bitmap (位图) data structure, which uses a single bit per possible value, allowing constant‑time existence checks and drastically reducing memory consumption.
For example, to determine whether the QQ number 2924357571 exists, one simply checks the bit at index 2924357571 ; a value of 1 indicates presence.
Covering the full 32‑bit range (0 to 2^32‑1) requires 2^32 bits, i.e., about 512 MB of memory (2^32 / 8 / 1024 / 1024 ≈ 512 MB), which is far more space‑efficient than storing full integers.
In Java projects, bitmaps are often managed via Redis, using common bitmap commands such as SETBIT , GETBIT , and BITCOUNT (illustrated by the included image of Redis command references).
The article also lists other practical applications of bitmaps: large‑scale deduplication (e.g., order IDs, URL crawling), real‑time statistics (online user counts, video view tracking), and as the foundation for Bloom filters.
Bloom filters extend the bitmap idea by applying multiple hash functions; only when all corresponding bits are set to 1 does the filter claim the element exists. The article shows the BitMapBloomFilter class from the Hutool library, which provides five default hash functions, and includes the full constructor code wrapped in a pre block.
Advantages of Bloom filters over plain bitmaps include the ability to handle arbitrary strings or objects via hashing, while disadvantages involve false positives due to hash collisions, leading to a controllable error rate.
Selected Java Interview Questions
A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.