Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications
The article explains how space‑efficient probabilistic structures such as BloomFilter and Count‑min Sketch enable large‑scale data deduplication, join pruning, real‑time idempotent filtering, and approximate top‑K analytics by trading modest accuracy loss for dramatically reduced storage and faster computation.
As data volumes grow to billions or trillions of records, efficiently obtaining statistical results and deduplicated aggregate metrics with fewer resources and less time has become a frequent business requirement. Often, obtaining exact numbers is unnecessary; instead, understanding overall data trends or macro descriptive statistical features across dimensions is sufficient for decision-making and business effect evaluation. This has made research into probability algorithms a热门研究方向.
This article introduces two effective algorithms that evolved from HASH algorithms, discussing their applications in big data development:
1. BloomFilter
BloomFilter is a space-efficient probabilistic data structure based on HASH algorithms. It can represent large datasets using minimal storage by compressing data through multiple HASH functions.
Key characteristics:
If an element is not in the set, it definitely is not in the set (no false negatives)
If an element is in the set, it may or may not be in the set (possible false positives)
Compressed data is difficult to restore to original form
Strong compression ratio - millions of records can be represented in just KB or MB
Supports aggregation operations across different dimensions
Applications:
Deduplication metrics calculation: In offline scenarios, first map target objects to BF structure, then use BF characteristics to calculate deduplication across different dimensions. In Hive 2.7+, both standard BloomFilter and BloomKFilter implementations are available.
JOIN optimization: For scenarios like A LEFT JOIN B where both tables are large, use BF to filter A's JOIN keys that exist in B, reducing unnecessary network shuffle operations.
Real-time scenarios: Combined with distributed caching systems (or HBase) for metric calculation and message idempotent filtering.
2. Count-min Sketch
Count-min Sketch is a frequency estimation algorithm derived from BloomFilter. It uses multiple HASH functions to map input data to an array, recording the count of each hash collision. The algorithm returns the minimum value among all HASH results to reduce overestimation errors.
Key characteristics:
Minimal storage independent of data volume - tens of thousands of records require only tens of KB
Non-exact probabilistic calculation
Frequency results tend to be overestimated
Fast computation speed
Applications:
TOP-K ranking with acceptable error margins in both offline and real-time scenarios
Leaderboards, topic rankings, and complaint statistics
These algorithms trade some accuracy for significant improvements in storage efficiency and computational speed, making them valuable for large-scale data processing scenarios where approximate results are acceptable.
NetEase Yanxuan Technology Product Team
The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.