How to Deduplicate 4 Billion QQ Numbers Using a Bitmap Under 1 GB
This article explains how to efficiently remove duplicates from 4 billion QQ numbers within a 1 GB memory limit by replacing the naïve HashSet approach with a memory‑saving Bitmap data structure, complete with calculations, algorithm steps, Java code, and a discussion of its pros and cons.
1. Conventional Approach
In everyday development, the first thought for deduplication is to use a
HashSet, which automatically removes duplicates:
<code>Set<Long> qqSet = new HashSet<>();
qqSet.add(qqNumber); // automatic deduplication</code>However, with a 1 GB memory limit, storing 4 billion QQ numbers in a
HashSetis infeasible. Assuming each QQ number occupies 8 bytes (64‑bit integer), the total memory required is about 30 GB, far exceeding the limit.
2. Bitmap Solution
2.1 What is a Bitmap?
A Bitmap is a highly efficient data structure that uses a single bit to indicate the presence of a number, making it ideal for large‑scale deduplication and query problems.
For example, a Bitmap of length 10 can represent numbers 0‑9, where each bit set to 1 means the corresponding number exists.
If the 0th bit is 1, number 0 exists.
If the 1st bit is 1, number 1 exists.
If the 2nd bit is 1, number 2 exists.
…and so on.
Images illustrate the bitmap representation:
When recording QQ numbers 9, 5, 2, 7, the bitmap looks like this:
Using a traditional 4‑byte integer for each number would require 16 bytes for four numbers, whereas the bitmap needs only 10 bits, demonstrating significant space savings.
2.2 Applying Bitmap to 4 Billion QQ Numbers
2.2.1 Memory Estimate
Creating a bitmap with 4 billion bits requires:
<code>4,000,000,000 / 8 = 500,000,000 bytes
500,000,000 / 1024 / 1024 / 1024 ≈ 0.466 GB</code>This fits comfortably within the 1 GB limit.
2.2.2 Deduplication Process
Initialize a bitmap with 4 billion bits.
Iterate over each QQ number, map it to a bit position, and set that bit to 1.
For example, QQ number 326443281 sets the corresponding bit to 1.
Traverse the bitmap and collect all indices whose bits are 1; these indices correspond to the unique QQ numbers.
2.3 Simple Java Implementation
The following code demonstrates bitmap‑based deduplication:
<code>import java.util.*;
public class QQDeduplication {
// Bitmap size: 2^32 bits ≈ 0.5 GB
private static final long BITMAP_SIZE = 1L << 32;
private static final int BYTE_SIZE = 8; // bits per byte
private static List<Long> deduplicateQQNumbers(long[] qqNumbers) {
byte[] bitmap = new byte[(int) (BITMAP_SIZE / BYTE_SIZE)];
for (long qqNumber : qqNumbers) {
if (qqNumber >= 0 && qqNumber < BITMAP_SIZE) {
int index = (int) (qqNumber / BYTE_SIZE);
int bitPosition = (int) (qqNumber % BYTE_SIZE);
bitmap[index] |= (1 << bitPosition);
}
}
List<Long> uniqueQQNumbers = new ArrayList<>();
for (int i = 0; i < bitmap.length; i++) {
for (int j = 0; j < BYTE_SIZE; j++) {
if ((bitmap[i] & (1 << j)) != 0) {
long qqNumber = (long) i * BYTE_SIZE + j;
uniqueQQNumbers.add(qqNumber);
}
}
}
return uniqueQQNumbers;
}
}
</code>2.4 Advantages and Disadvantages of Bitmap
Advantages
High space efficiency – only 1 bit per element.
Compared with a hash table, Bitmap uses just 1 bit per element, offering excellent utilization for dense data such as sequential QQ numbers.
Fast operations – insertion and query are O(1) using bitwise operations.
Both setting and checking a bit are constant‑time operations, suitable for real‑time processing of massive data.
Simplified deduplication logic – just set bits and later read the positions.
No complex data structures are needed; a single pass suffices.
Disadvantages
Memory consumption depends on the value range.
If the range is huge but sparse (e.g., QQ numbers up to 1 trillion), the bitmap would require prohibitive memory (≈125 GB).
Cannot store additional information.
Bitmap only records presence, not counts or other metadata.
Conclusion
While a Bloom filter could also achieve sub‑1 GB deduplication for 4 billion QQ numbers, the bitmap approach offers deterministic results and straightforward implementation.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.