Understanding Bloom Filters: Principles, Applications, and Java Implementations with Guava and Redis
This article explains Bloom filters, their core principles, typical use cases like cache penetration and large‑scale membership testing, and demonstrates practical Java implementations using Guava and Redis, including code examples, performance analysis, and discussion of their advantages and limitations.
This article introduces Bloom filters from a beginner’s perspective, describing the motivation behind the data structure and its growing popularity in modern software systems.
Typical application scenarios are presented, such as preventing cache penetration in Redis‑based caches and efficiently testing membership when the data set size far exceeds available memory.
A Bloom filter is defined as a long binary vector where each bit is either 0 or 1. Elements are added by hashing them with several independent hash functions and setting the corresponding bits to 1. Queries check whether all those bits are set, yielding fast O(k) checks with a controllable false‑positive rate.
The article explains the insertion process (setting bits for each hash), the query process (checking bits), and why deletions are problematic because multiple elements may share the same bits, leading to potential false deletions.
Advantages: very low memory consumption, fast insert and query operations.
Disadvantages: false‑positive rate grows with the number of inserted items, cannot delete elements reliably, and can only guarantee that an element is definitely not present.
Below is a Guava‑based Java implementation of a Bloom filter. The Maven dependency is added first, followed by example code that creates a filter, inserts one million integers, and measures the false‑positive count.
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>
private static int size = 1000000; // expected insertions
private static double fpp = 0.01; // desired false‑positive probability
private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);
public static void main(String[] args) {
// insert data
for (int i = 0; i < 1000000; i++) {
bloomFilter.put(i);
}
int count = 0;
// test for false positives
for (int i = 1000000; i < 2000000; i++) {
if (bloomFilter.mightContain(i)) {
count++;
System.out.println(i + "误判了");
}
}
System.out.println("总共的误判数:" + count);
}The code defines the expected number of elements and the target false‑positive rate, creates the filter, inserts a range of integers, and then checks another range to count how many false positives occur, printing the result.
A Redis‑based implementation follows, using a bitmap (stored as a Redis string) to simulate the Bloom filter. Custom hash functions are defined, and the bitmap length and number of hash functions are calculated based on the expected insertions and false‑positive rate.
public class RedisMain {
static final int expectedInsertions = 100;
static final double fpp = 0.01;
private static long numBits;
private static int numHashFunctions;
static {
numBits = optimalNumOfBits(expectedInsertions, fpp);
numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
}
public static void main(String[] args) {
Jedis jedis = new Jedis("192.168.0.109", 6379);
for (int i = 0; i < 100; i++) {
long[] indexs = getIndexs(String.valueOf(i));
for (long index : indexs) {
jedis.setbit("codebear:bloom", index, true);
}
}
// query phase omitted for brevity
}
private static long[] getIndexs(String key) { /* hash calculations */ }
private static long hash(String key) { /* murmur3 hash */ }
private static int optimalNumOfHashFunctions(long n, long m) { /* formula */ }
private static long optimalNumOfBits(long n, double p) { /* formula */ }
}The Redis version manually computes hash indices, sets bits in the bitmap with SETBIT , and later reads them with GETBIT to determine possible membership, demonstrating how a Bloom filter can be shared across multiple services.
Test results show that the Guava implementation produces a false‑positive count close to the expected 1% rate, while the Redis version correctly flags potential duplicates. The article concludes that Bloom filters are simple yet powerful tools for memory‑efficient membership testing, but their parameters must be carefully chosen to balance space usage and error rate.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.