
Understanding Bloom Filters and Their Role in Preventing Cache Penetration

This article explains cache penetration, introduces Bloom filters as a space‑efficient probabilistic data structure, demonstrates how to implement them with Guava and Redis in Java, and discusses practical usage scenarios and performance considerations.


Cache penetration occurs when repeated queries for non‑existent keys bypass the cache and hit the database, causing unnecessary load. A simple mitigation is to cache a placeholder (e.g., an empty string) for missing keys, but this fails when attackers use many distinct bogus keys.
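The placeholder approach can be sketched as follows (a minimal in-memory stand-in for the real cache and database; the class name, the `CACHE_MISS` sentinel, and the `dbHits` counter are illustrative, not from any particular library):

```java
import java.util.HashMap;
import java.util.Map;

public class PlaceholderCache {
    // Hypothetical sentinel cached for keys that do not exist in the database.
    private static final String CACHE_MISS = "__NULL__";
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();
    int dbHits = 0; // counts how often the database is actually queried

    public PlaceholderCache() {
        database.put("alice", "Alice"); // one real record for demonstration
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return CACHE_MISS.equals(cached) ? null : cached;
        }
        dbHits++;
        String value = database.get(key); // null for bogus keys
        // Cache the placeholder so repeated misses for this key stop hitting the database.
        cache.put(key, value == null ? CACHE_MISS : value);
        return value;
    }

    public static void main(String[] args) {
        PlaceholderCache c = new PlaceholderCache();
        System.out.println(c.get("missing")); // queries the database once
        System.out.println(c.get("missing")); // served from the placeholder
        System.out.println("database hits: " + c.dbHits); // prints 1
    }
}
```

Repeated lookups of the *same* bogus key now cost one database query, but each *distinct* bogus key still penetrates once, which is exactly the weakness exploited by attackers using many different keys.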

Bloom filters provide a probabilistic way to test set membership with low memory overhead, allowing systems to quickly reject queries for keys that definitely do not exist while accepting a small false‑positive rate.

The article starts with an interview‑style question: how to determine whether an element exists among billions of unordered items. It introduces the bitmap (bit array) concept, where each position holds 0 (absent) or 1 (present), and explains that hash functions map elements to bitmap positions. Multiple hash functions reduce collision probability, balancing space and time costs.
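The bitmap-plus-hashes idea can be sketched with java.util.BitSet. This is an illustrative toy, not Guava's implementation: the seed-mixing scheme (hashCode combined with a per-function seed) is an assumption chosen for brevity.

```java
import java.util.BitSet;

public class TinyBloom {
    private final BitSet bits;
    private final int size;
    private final int k; // number of hash functions

    public TinyBloom(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th bit position by mixing a per-function seed into the hash.
    private int position(String value, int seed) {
        int h = value.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void put(String value) {
        for (int i = 0; i < k; i++) bits.set(position(value, i));
    }

    public boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(value, i))) return false; // definitely absent
        }
        return true; // possibly present (could be a false positive)
    }

    public static void main(String[] args) {
        TinyBloom bf = new TinyBloom(1024, 3);
        bf.put("hello");
        System.out.println(bf.mightContain("hello")); // true, guaranteed
        System.out.println(bf.mightContain("world")); // almost certainly false
    }
}
```

Note the asymmetry the article describes: an inserted element is always reported present, while an absent element is rejected as soon as any one of its k bits is unset.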

Bloom filters were first described by Burton Bloom in 1970. The structure uses several hash functions to set bits in a bitmap; when querying, if all corresponding bits are 1 the element is possibly present, otherwise it is definitely absent. The article highlights false positives (when bits are set by other elements) and guarantees no false negatives.
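The space/accuracy trade-off is governed by two standard sizing formulas: m = -n·ln(p) / (ln 2)² bits for n insertions at false-positive rate p, and k = (m/n)·ln 2 hash functions. A quick check against the Guava example below (n = 1,000,000, p = 0.02); the class name here is illustrative:

```java
public class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2 : bits needed for n insertions at false-positive rate p
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2 : hash-function count that minimizes the false-positive rate
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        double p = 0.02;
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        // Roughly 8.1 million bits (~1 MiB) and 6 hash functions
        System.out.println(m + " bits (~" + m / 8 / 1024 + " KiB), " + k + " hash functions");
    }
}
```

About 1 MiB suffices to track a million elements at a 2% error rate, which is why Bloom filters are attractive at the scales the article discusses.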

Guava implementation

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>21.0</version>
</dependency>
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class BloomFilterTest {
    // Number of elements to insert
    private static final int insertions = 1_000_000;
    // Desired false-positive rate
    private static final double fpp = 0.02;

    public static void main(String[] args) {
        BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions, fpp);
        Set<String> sets = new HashSet<>(insertions);
        List<String> lists = new ArrayList<>(insertions);
        for (int i = 0; i < insertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bf.put(uuid);
            sets.add(uuid);
            lists.add(uuid);
        }
        int rightNum = 0;  // truly present elements reported present
        int wrongNum = 0;  // absent elements falsely reported present
        for (int i = 0; i < 10000; i++) {
            // Every 100th probe is a real element; the other 9900 are random absent keys
            String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
            if (bf.mightContain(data)) {
                if (sets.contains(data)) {
                    rightNum++;
                    continue;
                }
                wrongNum++;
            }
        }
        BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        System.out.println("Of 100 truly present elements, reported present: " + rightNum);
        System.out.println("Of 9900 absent elements, falsely reported present: " + wrongNum
                + ", rejection rate: " + bingo + ", false-positive rate: " + percent);
    }
}

The test shows that all 100 truly present elements are always reported as present (no false negatives), while a small number of the 9,900 absent elements (e.g., 216, about 2.2%) are falsely reported as present, in line with the configured 2% false-positive rate.

Redis implementation

/**
 * Bloom filter helper: computes the bitmap positions for a value.
 */
public class BloomFilterHelper<T> {
    private final int numHashFunctions;
    private final int bitSize;
    private final Funnel<T> funnel;

    @SuppressWarnings("unchecked")
    public BloomFilterHelper(int expectedInsertions) {
        this.funnel = (Funnel<T>) Funnels.stringFunnel(Charset.defaultCharset());
        // 0.03 is the default false-positive rate; the sizing formulas mirror Guava's internals
        this.bitSize = optimalNumOfBits(expectedInsertions, 0.03);
        this.numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
    }
    // ... (murmurHashOffset and the optimalNumOfBits/optimalNumOfHashFunctions methods omitted for brevity)
}
public class RedisBloomFilter<T> {
    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /** Set every bit computed for the value under the given Redis key. */
    public void add(BloomFilterHelper<T> helper, String key, T value) {
        int[] offset = helper.murmurHashOffset(value);
        for (int i : offset) {
            redisTemplate.opsForValue().setBit(key, i, true);
        }
    }

    /** Return false as soon as any bit is unset: the value is definitely absent. */
    public boolean contains(BloomFilterHelper<T> helper, String key, T value) {
        int[] offset = helper.murmurHashOffset(value);
        for (int i : offset) {
            Boolean bit = redisTemplate.opsForValue().getBit(key, i);
            if (bit == null || !bit) {
                return false;
            }
        }
        return true;
    }
    // Batch variants using Redis pipelining omitted for brevity
}

The Redis version stores the bitmap in Redis itself, so multiple application instances can share the same filter. Note the performance gap between issuing one network round trip per setBit call and batching many bit operations into a single pipelined request.

Finally, the article lists common application scenarios for Bloom filters, such as URL deduplication in web crawlers, spam detection, malicious URL detection in browsers, recommendation systems, and reducing unnecessary lookups in storage systems like BigTable and Cassandra.

In summary, Bloom filters are a lightweight, probabilistic tool that can effectively mitigate cache penetration and many other large‑scale membership‑testing problems, provided that the false‑positive rate is understood and configured appropriately.

Tags: Java, Redis, Guava, data structures, Bloom filter, cache penetration
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
