Bloom Filters

A probabilistic set that answers "definitely NOT in set" or "POSSIBLY in set" — never false negatives, allows false positives.

False Positive Rate — Derivation

The false positive probability after inserting n elements into a filter of m bits using k hash functions:

FPR ≈ (1 - e^(-kn/m))^k

Intuition: After a single hash insertion, the probability that a specific bit is still 0 is (1 - 1/m). After kn hash operations (k hashes × n elements), that probability is (1 - 1/m)^(kn) ≈ e^(-kn/m), so any given bit is 1 with probability ≈ 1 - e^(-kn/m). A false positive requires all k queried bits to be 1, and treating those bits as independent (a standard approximation) gives FPR ≈ (1 - e^(-kn/m))^k.

Optimal parameters:

  • Optimal k (hash functions) = (m/n) × ln(2) ≈ 0.693 × (m/n)
  • For 1% FPR: ~9.6 bits per element (independent of dataset size)
  • General formula: m ≈ -1.44 × n × log₂(FPR)
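These formulas translate directly into a small sizing helper. A minimal sketch (the function name is illustrative, not from any library):

```python
import math

def bloom_sizing(n, target_fpr):
    """Compute Bloom filter parameters for n elements and a target FPR.

    Returns (m, k, predicted_fpr): bit count, hash-function count, and
    the FPR the formula predicts for those rounded parameters.
    """
    # m = -1.44 * n * log2(FPR), i.e. -n * ln(FPR) / (ln 2)^2
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    # Optimal k = (m/n) * ln(2), rounded to a whole number of hashes
    k = max(1, round((m / n) * math.log(2)))
    predicted = (1 - math.exp(-k * n / m)) ** k
    return m, k, predicted

m, k, fpr = bloom_sizing(1_000_000, 0.01)
print(f"{m} bits ({m / 1_000_000:.2f} bits/element), k={k}, predicted FPR={fpr:.4f}")
```

For one million elements at 1% FPR this lands on roughly 9.6 bits per element and k = 7, matching the figures above.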

How Real Systems Use This

Cassandra

Cassandra maintains a per-SSTable Bloom filter configured via bloom_filter_fp_chance (default 0.1, or 10% false positive rate). When reading a key, Cassandra first checks the Bloom filter; if the filter returns “definitely not present,” the system skips disk I/O entirely and returns “not found.” If the filter returns “maybe present,” Cassandra performs a disk lookup. This single optimization eliminates 90% of unnecessary disk seeks for reads that miss the dataset, dramatically improving read performance for negative lookups. The false positive rate is tunable per table based on workload characteristics.
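The per-table setting can be adjusted with ordinary CQL; for example, tightening a read-heavy table from the 10% default to 1% (keyspace and table names are illustrative):

```sql
-- Lower the per-SSTable false positive chance; costs more memory per SSTable
ALTER TABLE my_keyspace.users WITH bloom_filter_fp_chance = 0.01;
```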

Chrome Safe Browsing

Google Chrome maintains a local Bloom filter of malicious URLs and phishing sites. When you visit a website, Chrome immediately checks the local Bloom filter for the domain. If the filter returns “definitely not dangerous,” the page loads normally. If “maybe dangerous,” Chrome makes a quick server request to double-check against Google’s authoritative database. This design avoids contacting Google’s servers for 99%+ of legitimate sites, reducing latency and privacy concerns. The local filter is periodically updated with new malicious URLs.

LevelDB / RocksDB

LevelDB (Google’s embedded key-value store) and RocksDB (Facebook’s fork of it) create a Bloom filter for each SSTable file with a configurable bits-per-key parameter (default 10 in RocksDB). On a point lookup, the database checks the Bloom filter first; a negative result immediately proves the key is not in that file, avoiding an expensive disk seek. With hundreds or thousands of SSTables on disk, this optimization is critical for write-heavy workloads where most per-file probes miss. The bits-per-key parameter trades space for false positive rate: 10 bits/key achieves ~1% FPR.
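The read path these stores share can be sketched in a few lines of Python. This is a toy model, not LevelDB's actual code: the SSTable class and its in-memory dict stand in for a real on-disk file, and the filter is a small seeded-hash Bloom filter:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k seeded MD5 hashes over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)
    def _positions(self, key):
        return [int(hashlib.md5(f"{key}:{i}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]
    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1
    def may_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

class SSTable:
    """Stand-in for an on-disk sorted file: a dict plus its Bloom filter."""
    def __init__(self, records):
        self.records = dict(records)
        self.filter = TinyBloom()
        for key in self.records:
            self.filter.add(key)
        self.disk_reads = 0          # counts simulated disk seeks
    def get(self, key):
        self.disk_reads += 1         # a real store would seek and read here
        return self.records.get(key)

def lookup(sstables, key):
    """Consult each SSTable's filter; only touch files the filter can't rule out."""
    for table in sstables:
        if table.filter.may_contain(key):
            value = table.get(key)
            if value is not None:
                return value
    return None

tables = [SSTable({"a": 1, "b": 2}), SSTable({"c": 3})]
print(lookup(tables, "c"))     # 3 (no false negatives, so hits are never missed)
print(lookup(tables, "zzz"))   # None, usually without touching either file
print([t.disk_reads for t in tables])
```

The key property on display: a present key is always found (no false negatives), while absent keys are almost always rejected by the filters alone, leaving `disk_reads` untouched.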

Medium

Medium uses Bloom filters to avoid recommending articles users have already read. When computing personalized recommendations, the system queries a Bloom filter of each user’s read history (stored compactly, even for long read histories). If the filter says “user has NOT read this article,” it’s safe to recommend. If “maybe read,” Medium performs a precise lookup in the primary database. This avoids expensive database queries for 99%+ of articles, reducing recommendation latency from seconds to milliseconds.

Bitcoin SPV Nodes

Bitcoin Simplified Payment Verification (SPV) nodes use Bloom filters to request relevant transactions from full nodes without revealing their entire address set. An SPV node sends a Bloom filter of its watched addresses to peers; peers then send only transactions matching the filter. This trades off privacy (peer learns approximately which transactions matter to the node) for bandwidth: SPV nodes download only ~2% of the blockchain instead of the full 400GB+. The false positive rate is configurable to balance privacy and bandwidth.
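The bandwidth side of that trade-off is easy to quantify: with false positive rate p, a block of N transactions yields about p·N irrelevant transactions on top of the true matches. A rough back-of-envelope calculation (the transaction counts and sizes are illustrative, not protocol constants):

```python
def spv_extra_download(n_tx_per_block, fpr, avg_tx_bytes=400):
    """Expected bytes of non-matching transactions a peer sends per block."""
    return n_tx_per_block * fpr * avg_tx_bytes

# ~3000 transactions per block; compare a tight filter vs a privacy-leaning one
for p in (0.001, 0.05):
    extra = spv_extra_download(3000, p)
    print(f"FPR {p:.3f}: ~{extra / 1024:.1f} KiB/block of extra traffic")
```

A higher FPR costs bandwidth but improves privacy, since the extra matching transactions obscure which ones the node actually cares about.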

Implementation

import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        """
        Initialize a Bloom filter.

        Args:
            size: number of bits in the filter (larger = lower FPR)
            num_hashes: number of independent hash functions (typically 3-7)
        """
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [0] * size

    def _hash(self, item, seed):
        """Generate a hash position for an item using a seeded hash."""
        h = hashlib.md5(f"{item}_{seed}".encode()).hexdigest()
        return int(h, 16) % self.size

    def add(self, item):
        """Add an item to the Bloom filter."""
        for i in range(self.num_hashes):
            pos = self._hash(item, i)
            self.bit_array[pos] = 1  # Set bit to 1

    def contains(self, item):
        """
        Check if an item is in the filter.

        Returns:
            True if item is MAYBE in the filter (possibly false positive)
            False if item is DEFINITELY NOT in the filter (never false negative)
        """
        for i in range(self.num_hashes):
            pos = self._hash(item, i)
            if self.bit_array[pos] == 0:
                # Found a 0 bit -> item definitely not in filter
                return False

        # All bits are 1 -> item might be in filter
        return True

# Example usage
if __name__ == "__main__":
    bf = BloomFilter(size=100, num_hashes=3)

    # Add some items
    items_added = ["apple", "banana", "orange"]
    for item in items_added:
        bf.add(item)
        print(f"Added: {item}")

    # Check for items (with false positive possibility)
    print("\nMembership checks:")
    for item in ["apple", "grape", "banana"]:
        result = bf.contains(item)
        print(f"  {item}: {'possibly in' if result else 'definitely NOT in'} filter")

    # Display bit array (first 50 bits)
    print(f"\nBit array (first 50): {bf.bit_array[:50]}")

Double Hashing Optimization

The implementation above uses seeded MD5, but production systems use the double hashing technique (Kirsch-Mitzenmacher 2006) — generating k hash positions from just 2 base hash functions:

import struct
import hashlib

def double_hash_bloom(item, k, m):
    """
    Generate k hash positions using double hashing.
    gi(x) = h1(x) + i * h2(x) mod m

    Only 2 hash computations instead of k.
    """
    raw = hashlib.md5(item.encode()).digest()
    h1 = struct.unpack('<Q', raw[:8])[0]   # First 8 bytes as uint64
    h2 = struct.unpack('<Q', raw[8:16])[0]  # Last 8 bytes as uint64

    positions = []
    for i in range(k):
        pos = (h1 + i * h2) % m
        positions.append(pos)
    return positions

In production, replace MD5 with MurmurHash3 (128-bit variant gives you h1 and h2 in one call) or xxHash for maximum throughput.
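A quick empirical check that double hashing behaves as the theory predicts: build a filter at ~10 bits/key with k = 7, probe it with items that were never inserted, and compare the measured false positive rate against (1 - e^(-kn/m))^k (this reuses the MD5-based scheme above; MurmurHash3 would be a drop-in replacement):

```python
import math
import struct
import hashlib

def double_hash_positions(item, k, m):
    """k positions from two base hashes: gi(x) = h1(x) + i*h2(x) mod m."""
    raw = hashlib.md5(item.encode()).digest()
    h1 = struct.unpack('<Q', raw[:8])[0]
    h2 = struct.unpack('<Q', raw[8:16])[0]
    return [(h1 + i * h2) % m for i in range(k)]

n, m, k = 1000, 10_000, 7          # ~10 bits/key, near-optimal k
bits = bytearray(m)
for j in range(n):
    for pos in double_hash_positions(f"key-{j}", k, m):
        bits[pos] = 1

# Probe 10,000 items that were never inserted and count false positives
trials = 10_000
false_pos = sum(
    all(bits[p] for p in double_hash_positions(f"absent-{j}", k, m))
    for j in range(trials)
)
predicted = (1 - math.exp(-k * n / m)) ** k
print(f"measured FPR = {false_pos / trials:.4f}, predicted ≈ {predicted:.4f}")
```

The measured rate should land close to the predicted ~0.8%, and every inserted key still reports present, confirming the no-false-negatives guarantee survives the double hashing shortcut.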

This post is licensed under CC BY 4.0 by the author.