Back to DAG

Bloom Filter

databases

What is a Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you:

  • "Definitely not in the set" — guaranteed correct (zero false negatives)
  • "Probably in the set" — might be wrong (false positives are possible)

How it works

  1. Allocate a bit array of m bits, all initialized to 0.
  2. Choose k independent hash functions, each mapping an element to one of the m positions.
  3. Add(element): compute k hash positions, set all k bits to 1.
  4. MightContain(element): compute k hash positions. If all k bits are 1, return true (probably in set). If any bit is 0, return false (definitely not in set).

Why false positives happen

Different elements can set overlapping bits. After enough insertions, a query element might find all its k bits already set by other elements — a false positive.

Math

The false positive probability after inserting n elements into a filter with m bits and k hash functions is approximately:

p ≈ (1 - e^(-kn/m))^k

The optimal number of hash functions: k = (m/n) · ln(2)

Complexity

OperationTimeSpace
AddO(k)O(m)
MightContainO(k)O(m)

Bloom filters do not support deletion (unsetting a bit could affect other elements). Counting Bloom filters solve this by using counters instead of bits.

Real-Life: Web Crawler Deduplication

Real-World Example

Google's web crawler needs to track billions of URLs it has already visited. Storing every URL in a hash set would require terabytes of RAM. Instead, it uses a Bloom filter:

  • Before fetching a URL, check the Bloom filter. If it says "not seen", fetch and crawl it, then add it to the filter.
  • If it says "probably seen", skip it. The occasional false positive just means a rarely re-crawled page — acceptable.

Other real-world uses:

  • Databases (LSM trees): Before searching an SSTable on disk, check its Bloom filter. If the key is definitely not there, skip the expensive disk read. LevelDB, RocksDB, and Cassandra all use this.
  • Chrome's Safe Browsing: your browser checks URLs against a local Bloom filter of known-malicious URLs before sending anything to Google's servers.
  • Medium: uses Bloom filters to avoid recommending articles a user has already read.

Bloom Filter: Add and Query

Bit Array (m=16) 0 11 2 13 4 5 16 7 8 19 10 111 12 13 14 15 add("cat") h1=1, h2=6, h3=11 add("dog") h1=3, h2=9, h3=11 query("fox") h1=1, h2=3, h3=8 Bit 8 = 0 → DEFINITELY NOT in set query("rat") h1=1, h2=6, h3=3 (all bits happen to be 1) All bits = 1 → PROBABLY in set (false positive!)

Interactive Bloom Filter

Loading demo...
Step 1 of 3