How to Deduplicate 4 Billion QQ Numbers Using Only 1 GB of Memory
This article walks through four practical techniques—sorting, hashmap, file splitting, and bitmap—to remove duplicate QQ numbers from a 4‑billion‑record file within a 1 GB memory limit, and provides extended exercises for sorting, median, top‑K, and duplicate detection.
Hello, I'm NiuNiu. Today we discuss a common interview problem that appears in Tencent's third‑round interview: given a file containing 4 billion QQ numbers, design an algorithm to deduplicate them while keeping only one copy of each, with a memory limit of 1 GB.
Method 1: Sorting
The simplest idea is to sort all QQ numbers; duplicates become adjacent, so we keep the first occurrence and discard the rest.
Original QQ numbers:
Sorted QQ numbers:
Deduplicated result:
However, sorting's time complexity is too high to pass the interview.
Method 2: HashMap
Using a hashmap to record each QQ number eliminates duplicates automatically.
mapFlag[123] = true mapFlag[567] = true mapFlag[123] = true mapFlag[890] = trueAfter inserting, the hashmap contains only unique keys:
mapFlag[123] = true mapFlag[567] = true mapFlag[890] = trueBut storing 4 billion entries exceeds the 1 GB memory limit.
Method 3: File Splitting
For massive data, one may split the file into smaller chunks and process each separately, using external merge sort or bucket sort. Yet this still incurs high I/O overhead and may not meet the interview constraints.
Method 4: Bitmap
Optimizing the hashmap idea, we can use a bitmap to represent the existence of each possible QQ number.
An unsigned char (8 bits) can represent numbers 0‑7; a value of 255 means all eight numbers exist, while 254 indicates numbers 1‑7 exist but 0 does not.
Similarly, an unsigned int (32 bits) can represent numbers 0‑31, and two unsigned ints can cover 0‑63.
Since a QQ number fits within 32 bits (max 2^32‑1 ≈ 4.3 billion), a bitmap of 512 MB is sufficient to mark the presence of all possible QQ numbers.
Implementation example:
bitmapFlag[123] = 1 bitmapFlag[567] = 1 bitmapFlag[890] = 1Scanning the bitmap from low to high and outputting indices with value 1 yields a deduplicated, sorted list, which satisfies the interview requirements.
Extended Exercise 1
Design an algorithm to sort 4 billion distinct QQ numbers with 1 GB memory. Using the bitmap to mark existence automatically yields a sorted output when scanning the bitmap.
Extended Exercise 2
Find the median of 4 billion distinct QQ numbers under the same memory constraint. The bitmap approach again provides a direct way to locate the middle element.
Extended Exercise 3
Retrieve the top‑K QQ numbers from the 4 billion distinct entries using 1 GB memory. After bitmap construction, scanning from the highest index and collecting the first K hits solves the problem.
Extended Exercise 4
Given 8 billion QQ numbers, determine whether any duplicates exist with 1 GB memory. By the pigeonhole principle, because the total possible distinct QQ numbers are about 4.3 billion, duplicates must exist.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
NiuNiu MaTe
Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
