Skip to content
Low Level Design Mastery Logo
LowLevelDesign Mastery

Bloom Filters

Space-efficient membership testing with probabilistic guarantees

In many systems, we need to quickly check if an element belongs to a set:

  • Cache lookup: Is this key in the cache?
  • Duplicate detection: Have we seen this request before?
  • Database query: Does this record exist?
  • Spell checking: Is this word in the dictionary?

Traditional Approach:

  • Use a hash set: O(1) lookup, but requires storing all elements
  • For 1 billion elements, need ~8GB of memory (assuming 8 bytes per element)

The Challenge: What if we need to check membership for billions of elements but can’t afford to store them all? What if we can tolerate occasional false positives but never false negatives?

A Bloom filter is a probabilistic data structure that provides space-efficient membership testing. It can tell you:

  • “Definitely not in set” - 100% accurate
  • “Possibly in set” - might be a false positive
Diagram

Key Properties:

  • Space-efficient: Uses only a few bits per element
  • Fast operations: O(k) time complexity (k = number of hash functions)
  • No false negatives: If filter says “no”, element is definitely not in set
  • False positives possible: If filter says “yes”, element might not be in set (need to verify)

Think of a Bloom filter like a guest list at a party:

  • The bit array is like a checklist with many boxes
  • Adding an element is like checking boxes for that guest’s name
  • Checking membership is like verifying if boxes are checked for a name
  • If boxes are checked, the guest might be on the list (could be coincidence)
  • If boxes are not checked, the guest is definitely not on the list
  1. Create a bit array of size m (all bits initialized to 0)
  2. Choose k independent hash functions
  3. Each hash function maps elements to positions in the bit array (0 to m-1)
  1. Hash the element with all k hash functions
  2. Set the k corresponding bits to 1
Diagram
  1. Hash the element with all k hash functions
  2. Check if all k corresponding bits are 1
  3. If all bits are 1: Element is possibly in the set (might be false positive)
  4. If any bit is 0: Element is definitely not in the set

Why False Positives Occur:

When checking membership, if all bits are set to 1, it could mean:

  • The element was added (correct)
  • Other elements happened to set those same bits (false positive)

Why No False Negatives:

If an element was added, all its bits were set to 1. If we check and find any bit is 0, the element was definitely not added.

The false positive rate depends on:

  • m: Size of bit array
  • k: Number of hash functions
  • n: Number of elements added

Optimal values:

  • m = -n × ln(p) / (ln(2)²) where p = desired false positive rate
  • k = (m/n) × ln(2)

Example:

  • 1 million elements, 1% false positive rate
  • m ≈ 9.6 million bits ≈ 1.2 MB
  • k ≈ 7 hash functions
  • Only ~1.2 bytes per element!

Before checking expensive cache (like Redis), use Bloom filter to quickly check if key might exist. Reduces cache lookup overhead.

Example: Check Bloom filter first. If “no”, skip cache lookup. If “yes”, check actual cache (might be false positive, but that’s okay).

Track which requests/items have been processed. Bloom filter says “definitely new” or “possibly seen”. For “possibly seen”, verify with actual storage.

Before expensive database query, check Bloom filter. If “definitely not”, skip query. If “possibly yes”, execute query.

Check if route exists before expensive routing table lookup. Bloom filter provides fast pre-filtering.

Standard Bloom filters can’t remove elements. Counting Bloom filter uses counters instead of bits, enabling deletion:

  • Each position stores a counter instead of a bit
  • Add: Increment counters
  • Remove: Decrement counters
  • Check: Verify all counters > 0

Trade-off: Uses more space (counters instead of bits) but enables deletion.

Advantages:

  • Space-efficient (few bits per element)
  • Fast operations (O(k) time)
  • No false negatives
  • Scales well

Disadvantages:

  • False positives possible
  • Can’t remove elements (unless counting Bloom filter)
  • Can’t list elements
  • Requires multiple hash functions

Space Efficient

Uses only a few bits per element. For 1M elements with 1% false positive rate, needs only ~1.2MB.

No False Negatives

If Bloom filter says “no”, element is definitely not in set. If says “yes”, might be false positive.

Fast Operations

Add and check operations are O(k) where k is number of hash functions (typically 3-10).

Probabilistic

Answers “possibly in set” or “definitely not in set”. False positives possible, false negatives impossible.