What is a Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you:
- "Definitely not in the set" — guaranteed correct (zero false negatives)
- "Probably in the set" — might be wrong (false positives are possible)
How it works
- Allocate a bit array of m bits, all initialized to 0.
- Choose k independent hash functions, each mapping an element to one of the m positions.
- Add(element): compute k hash positions, set all k bits to 1.
- MightContain(element): compute k hash positions. If all k bits are 1, return true (probably in set). If any bit is 0, return false (definitely not in set).
Why false positives happen
Different elements can set overlapping bits. After enough insertions, a query element might find all its k bits already set by other elements — a false positive.
Math
The false positive probability after inserting n elements into a filter with m bits and k hash functions is approximately:
p ≈ (1 - e^(-kn/m))^k
The optimal number of hash functions: k = (m/n) · ln(2)
Complexity
| Operation | Time | Space |
|---|---|---|
| Add | O(k) | O(m) |
| MightContain | O(k) | O(m) |
Bloom filters do not support deletion (unsetting a bit could affect other elements). Counting Bloom filters solve this by using counters instead of bits.
Real-Life: Web Crawler Deduplication
Google's web crawler needs to track billions of URLs it has already visited. Storing every URL in a hash set would require terabytes of RAM. Instead, it uses a Bloom filter:
- Before fetching a URL, check the Bloom filter. If it says "not seen", fetch and crawl it, then add it to the filter.
- If it says "probably seen", skip it. The occasional false positive just means a rarely re-crawled page — acceptable.
Other real-world uses:
- Databases (LSM trees): Before searching an SSTable on disk, check its Bloom filter. If the key is definitely not there, skip the expensive disk read. LevelDB, RocksDB, and Cassandra all use this.
- Chrome's Safe Browsing: your browser checks URLs against a local Bloom filter of known-malicious URLs before sending anything to Google's servers.
- Medium: uses Bloom filters to avoid recommending articles a user has already read.