Part 1: Motivation

How do we check if something is in a set — fast?

The simplest way is a List:

if x in items:
  ...

But this is O(n) — too slow for large-scale systems.

A HashSet improves to O(1) lookups on average, but it stores the full elements, requiring more memory than raw data — especially for strings or objects.

So what if we trade a little accuracy for massive savings?

What if a structure could:

  • Use only ~9.6 bits per element (significantly smaller than storing object or string),

  • Be wrong only 1% of the time (false positives),

  • And never say “no” to something that’s truly there?

That’s the power of the Bloom Filter.

Where is it used?

Bloom filters are quietly at the heart of many systems:

  • LSM trees, the foundation of modern NoSQL databases like Apache Cassandra, MongoDB, use Bloom filters to skip disk reads — asking:

“Does this file probably contain the key?”

  • Platforms like Quora, Medium, and Yahoo use Bloom filters to:

    • Prevent duplicate content,

    • Avoid redundant processing,

    • Speed up internal caching systems.

Even if you don’t see them — Bloom filters are working behind the scenes, making large systems fast and efficient.

Part 2: What is a false positive? (Definition + Example)

What is a false positive?

✅ When you check an element that was inserted, the Bloom filter says “yes” — that’s a true positive, and it’s 100% guaranteed correct.

⚠️ But sometimes it says “yes” to something that was never inserted — that’s a false positive.

A false positive happens when a new element matches the bit pattern of others — even though it was never added.

The filter sees all bits set and says “Probably yes” — but it’s wrong.

That’s the trade-off for speed and space — and you’ll see it clearly in the next example.

How does it work?

A Bloom filter is built with:

  • m: a bit array of length m, all bits start at 0

  • k: k independent hash functions

  • n: number of elements inserted

To insert an element O(1):

  • Hash it using all k functions

  • Flip the corresponding k bits to 1

To check membership O(1):

  • Hash the element using the same k functions

  • If any of the k bits is 0 → the element is definitely not in the set

  • If all are 1 → the element is possibly in the set (could be a false positive)

Example: When a False Positive Happens

Let’s step through a small Bloom filter:

Setup:

  • m = 10 (bit array of size 10)

  • k = 3 hash functions

  • Inserted elements: ‘apple’ and ‘banana’ so n = 2

Start with an empty bit array:

Initial: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Step 1: Insert ‘apple’

Let’s say:

  • h₁(apple) = 2

  • h₂(apple) = 4

  • h₃(apple) = 7

Update the bit array:

After insert apple: [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

Step 2: Insert ‘banana’

Let’s say:

  • h₁(banana) = 1

  • h₂(banana) = 4

  • h₃(banana) = 8

Update the bit array:

After insert banana: [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]

Step 3: Check ‘banana’ (True Positive)

  • h₁(banana) = 1 → bit is 1

  • h₂(banana) = 4 → bit is 1

  • h₃(banana) = 8 → bit is 1

All bits are 1 → Bloom filter says “Yes” → ✅ correct!

Step 4: Check ‘mango’ (False Positive)

  • h₁(mango) = 2

  • h₂(mango) = 4

  • h₃(mango) = 7

Check bits:

  • 2 → 1 ✅

  • 4 → 1 ✅

  • 7 → 1 ✅

Bloom filter says “Yes”, but ‘mango’ was never inserted → ❌ false positive

This happens because ‘apple’ and ‘banana’ already set those bits.

Part 3: Mathematical Proof

Okay — now you understand what a false positive is. Let’s take it a step deeper and explore the probability theory behind Bloom filters.

Don’t worry — it only uses basic math that every student learns in university. We’ll walk through this step-by-step, keeping things intuitive.

What is our goal?

We want to answer:

What’s the probability that a new element (not in the set) returns a false positive?

That means: all the k bits checked during the query are already 1— even though the element was never inserted, not even once among the n inserted items.

Step 1: Probability a Bit is Still 0 After One Flip

Suppose we have an array of m bits, all starting as 0. Now we flip 1 random bit to 1. The chance that a specific bit stays 0 is:

Why? Because we only had a 1/m chance to hit it.

Step 2: We Perform k × n Bit Flips

We insert n elements, each hashed with k functions. So we flip bits k × n times.

The probability that a specific bit is still 0 after all those flips is:

Step 3: Exponential Limit Approximation

When m is large and kn is not too huge, we can approximate this with the exponential function:

This comes from the identity:

Step 4: Probability Bit is 1 (i.e. Flipped at Least Once)

This tells us how likely a bit is to be 1 after inserting n elements.

Step 5: False Positive = All k Bits Are 1

Now, suppose we query a new element that wasn’t inserted. Its k hash functions give us k bit positions. The probability that all those k bits are already 1 — just by chance — is:

And that’s the final formula for Bloom filter false positives.

Bloom Filter Example: 1 Million Items (n = 1,000,000)

Size per element (m/n) Total Size Hash Functions (k) False Positive Rate
6.25 bits 0.78 MB 4 4.99%
7.5 bits 0.94 MB 5 2.70%
9.58 bits 1.20 MB 6 1.00%
12.5 bits 1.56 MB 8 0.25%

Part 4: What’s Next — Other Smart Probabilistic Structures for Big Data Problems

Bloom filters solve set membership efficiently — but what about other fundamental questions in computer science?

  • How many unique users have viewed this video? → HyperLogLog

  • How many times did user X access this page? → Count-Min Sketch

  • What are the 50th, 90th, and 99th percentiles of the measured latencies? → t-digest

Want more? → Explore more probabilistic data structures.

References