Bloom Filters
A probabilistic set that answers “definitely NOT in set” or “POSSIBLY in set” — never false negatives, allows false positives.
False Positive Rate — Derivation
The false positive probability after inserting n elements into a filter of m bits using k hash functions:
FPR ≈ (1 - e^(-kn/m))^k
Intuition: Each hash operation sets one (approximately uniform) bit, so after a single hash the probability that a specific bit is still 0 is (1 - 1/m). After kn hash operations (k hashes × n elements), the probability it’s still 0 is (1 - 1/m)^(kn) ≈ e^(-kn/m). A false positive requires all k queried bits to be 1, each with probability ≈ 1 - e^(-kn/m), so FPR ≈ (1 - e^(-kn/m))^k.
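The approximation can be sanity-checked with a quick Monte Carlo sketch. This is an idealized model (each hash is simulated as a uniformly random bit position, rather than hashing real keys), with illustrative parameters m = 10,000, k = 7, n = 1,000:

```python
import math
import random

def empirical_fpr(m, k, n, trials=10_000):
    """Set k random bits per inserted element, then measure how often
    a never-inserted key finds all k of its probed bits set to 1."""
    bits = [0] * m
    rng = random.Random(42)  # fixed seed for reproducibility
    for _ in range(n):
        for _ in range(k):
            bits[rng.randrange(m)] = 1  # each hash sets one uniform bit
    # Each query probes k uniform positions; a hit on all k is a false positive
    hits = sum(all(bits[rng.randrange(m)] for _ in range(k))
               for _ in range(trials))
    return hits / trials

m, k, n = 10_000, 7, 1_000
predicted = (1 - math.exp(-k * n / m)) ** k
print(f"predicted FPR: {predicted:.4f}")
print(f"measured  FPR: {empirical_fpr(m, k, n):.4f}")
```

With these parameters the formula predicts an FPR of roughly 0.8%, and the simulation lands close to it.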
Optimal parameters:
- Optimal number of hash functions: k = (m/n) × ln(2) ≈ 0.693 × (m/n)
- For 1% FPR: ~9.6 bits per element (independent of dataset size)
- General formula: m ≈ -1.44 × n × log₂(FPR)
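The two formulas above translate directly into a small sizing helper (a sketch; the function and variable names are illustrative):

```python
import math

def bloom_parameters(n, target_fpr):
    """Return (m, k): bit count and hash count for n elements
    at the target false positive rate, using m ≈ -1.44 n log2(p)
    and k ≈ (m/n) ln(2)."""
    m = math.ceil(-1.44 * n * math.log2(target_fpr))
    k = round((m / n) * math.log(2))
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(f"1M elements at 1% FPR: {m} bits "
      f"({m / 1_000_000:.2f} bits/element), k = {k}")
# -> ~9.57 bits per element with k = 7, matching the rule of thumb above
```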
How Real Systems Use This
Cassandra
Cassandra maintains a per-SSTable Bloom filter configured via bloom_filter_fp_chance (default 0.01 for most compaction strategies, 0.1 for leveled compaction). When reading a key, Cassandra first checks the Bloom filter; if the filter returns “definitely not present,” the system skips disk I/O entirely and returns “not found.” If the filter returns “maybe present,” Cassandra performs a disk lookup. At a 10% false positive chance, this single check eliminates ~90% of unnecessary disk seeks for reads that miss the dataset, dramatically improving read performance for negative lookups. The false positive rate is tunable per table based on workload characteristics.
Chrome Safe Browsing
Google Chrome maintains a local Bloom filter of malicious URLs and phishing sites. When you visit a website, Chrome immediately checks the local Bloom filter for the domain. If the filter returns “definitely not dangerous,” the page loads normally. If “maybe dangerous,” Chrome makes a quick server request to double-check against Google’s authoritative database. This design avoids contacting Google’s servers for 99%+ of legitimate sites, reducing latency and privacy concerns. The local filter is periodically updated with new malicious URLs.
LevelDB / RocksDB
LevelDB (Google’s embedded key-value store) and RocksDB (Facebook’s fork of it) create a Bloom filter for each SSTable file with a configurable bits-per-key parameter (default 10). During point lookups, the database checks the Bloom filter first; a negative result proves the key is not in that file, avoiding an expensive disk seek. With tens of thousands of SSTables, this optimization is critical for write-heavy workloads where most queries miss the dataset. The bits-per-key parameter trades space for false positive rate: 10 bits/key achieves ~1% FPR.
Medium
Medium uses Bloom filters to avoid recommending articles users have already read. When computing personalized recommendations, the system queries a Bloom filter of each user’s read history (millions of articles per user, stored compactly). If the filter says “user has NOT read this article,” it’s safe to recommend. If “maybe read,” Medium performs a precise lookup in the primary database. This avoids expensive database queries for 99%+ of articles, reducing recommendation latency from seconds to milliseconds.
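A back-of-envelope calculation shows why “stored compactly” matters here. The numbers below are purely illustrative (Medium’s actual parameters are not public), using the sizing formula from earlier:

```python
import math

# Hypothetical per-user read-history filter
n = 1_000_000    # articles in one user's read history (illustrative)
fpr = 0.01       # 1% chance of wrongly withholding a recommendation
bits = -1.44 * n * math.log2(fpr)        # m ≈ -1.44 n log2(p)
mb = bits / 8 / 1_000_000
print(f"~{mb:.1f} MB per user")          # vs. many MB for exact article IDs
```

At ~1.2 MB per heavy user, a filter fits comfortably in a recommendation service’s memory, whereas a set of exact 64-bit article IDs for the same history would take 8 MB and a database lookup per candidate article would be far slower still.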
Bitcoin SPV Nodes
Bitcoin Simplified Payment Verification (SPV) nodes use Bloom filters to request relevant transactions from full nodes without revealing their entire address set. An SPV node sends a Bloom filter of its watched addresses to peers; peers then send only transactions matching the filter. This trades off privacy (peer learns approximately which transactions matter to the node) for bandwidth: SPV nodes download only ~2% of the blockchain instead of the full 400GB+. The false positive rate is configurable to balance privacy and bandwidth.
Implementation
```python
import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        """
        Initialize a Bloom filter.

        Args:
            size: number of bits in the filter (larger = lower FPR)
            num_hashes: number of independent hash functions (typically 3-7)
        """
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [0] * size

    def _hash(self, item, seed):
        """Generate a hash position for an item using a seeded hash."""
        h = hashlib.md5(f"{item}_{seed}".encode()).hexdigest()
        return int(h, 16) % self.size

    def add(self, item):
        """Add an item to the Bloom filter."""
        for i in range(self.num_hashes):
            pos = self._hash(item, i)
            self.bit_array[pos] = 1  # Set bit to 1

    def contains(self, item):
        """
        Check if an item is in the filter.

        Returns:
            True if item is MAYBE in the filter (possibly false positive)
            False if item is DEFINITELY NOT in the filter (never false negative)
        """
        for i in range(self.num_hashes):
            pos = self._hash(item, i)
            if self.bit_array[pos] == 0:
                # Found a 0 bit -> item definitely not in filter
                return False
        # All bits are 1 -> item might be in filter
        return True


# Example usage
if __name__ == "__main__":
    bf = BloomFilter(size=100, num_hashes=3)

    # Add some items
    items_added = ["apple", "banana", "orange"]
    for item in items_added:
        bf.add(item)
        print(f"Added: {item}")

    # Check for items (with false positive possibility)
    print("\nMembership checks:")
    for item in ["apple", "grape", "banana"]:
        result = bf.contains(item)
        print(f"  {item}: {'possibly in' if result else 'definitely NOT in'} filter")

    # Display bit array (first 50 bits)
    print(f"\nBit array (first 50): {bf.bit_array[:50]}")
```
Double Hashing Optimization
The implementation above uses seeded MD5 for clarity, but production systems typically use the double hashing technique (Kirsch & Mitzenmacher, 2006), generating k hash positions from just 2 base hash functions:
```python
import struct
import hashlib

def double_hash_bloom(item, k, m):
    """
    Generate k hash positions using double hashing:

        g_i(x) = h1(x) + i * h2(x)  (mod m)

    Only 2 hash computations instead of k.
    """
    raw = hashlib.md5(item.encode()).digest()      # 16-byte digest
    h1 = struct.unpack('<Q', raw[:8])[0]           # First 8 bytes as uint64
    h2 = struct.unpack('<Q', raw[8:16])[0]         # Last 8 bytes as uint64
    positions = []
    for i in range(k):
        pos = (h1 + i * h2) % m
        positions.append(pos)
    return positions
```
In production, replace MD5 with MurmurHash3 (128-bit variant gives you h1 and h2 in one call) or xxHash for maximum throughput.
References
- Space/Time Trade-offs in Hash Coding with Allowable Errors — Burton H. Bloom (1970) — Original paper introducing Bloom filters; foundational for understanding false positive guarantees.
- Less Hashing, Same Performance — Kirsch & Mitzenmacher (2006) — Proves double hashing with 2 functions matches k independent functions.
- ByteByteGo — Bloom Filters — Clear visual explanation of insert/query operations with real-world system examples.
- Network Applications of Bloom Filters: A Survey — Broder & Mitzenmacher (2005) — Comprehensive survey covering CDNs, P2P networks, routing, and caching applications.
- Wikipedia: Bloom Filter — Quick reference with FPR formula, variants (Counting Bloom Filter), and algorithm pseudo-code.
- LevelDB Architecture — Google Documentation — Details on SSTable Bloom filters and bits-per-key configuration.