HyperLogLog
Count distinct elements in a stream using O(log log n) memory, with a typical error of ~2%, versus O(n) for an exact hash set.
Core Idea
Intuition: hash each element; a hash value with many leading zeros is a rare event, so observing one implies that many distinct elements have been seen. Split the hash space into buckets and track the maximum rarity (leading-zero count) seen per bucket.
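The per-element step can be sketched in a few lines: the top bits of the hash pick a bucket, and the rank of the first 1-bit in the remaining bits measures rarity. The precision of 14 bits and the SHA-1 hash here are illustrative choices, not part of the algorithm's definition:

```python
import hashlib

def bucket_and_rho(item: str, p: int = 14):
    """Split a 64-bit hash into a bucket index (top p bits) and
    rho = 1 + number of leading zeros in the remaining 64 - p bits."""
    h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
    index = h >> (64 - p)                 # which register to update
    rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
    rho = (64 - p) - rest.bit_length() + 1 if rest else (64 - p) + 1
    return index, rho

idx, rho = bucket_and_rho("alice")
print(idx, rho)  # bucket in [0, 2^14), rho in [1, 51]
```

Each bucket remembers only the largest rho it has ever seen, so re-adding the same element changes nothing.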
Accuracy vs Memory
| Approach | Memory | Error |
|---|---|---|
| Exact hash set | O(n) | 0% |
| HyperLogLog (b=10) | ~1 KB | ~2% |
| HyperLogLog (b=14) | ~16 KB | ~0.8% |
| HyperLogLog++ | ~2.5 KB | ~1% |
Key Properties
| Property | Value |
|---|---|
| Memory | O(log log n) |
| Typical error | 0.81% to 2% |
| Insert | O(1) |
| Query (estimate) | O(1) |
| Merge two HLLs | O(m) — union without re-scanning |
| Deletions | Not supported |
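The merge row above holds exactly, not approximately: taking the elementwise max of two register arrays yields the same registers that scanning the union of both streams would have produced. A standalone sketch (helper name and precision are illustrative):

```python
import hashlib

P = 10           # 2^10 = 1024 registers (illustrative precision)
M = 1 << P

def registers_for(items):
    """Build an HLL register array for an iterable of string items."""
    regs = [0] * M
    for item in items:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rho = (64 - P) - rest.bit_length() + 1 if rest else (64 - P) + 1
        regs[idx] = max(regs[idx], rho)
    return regs

shard_a = [f"user-{i}" for i in range(0, 600)]
shard_b = [f"user-{i}" for i in range(400, 1000)]   # overlaps shard_a

merged = [max(x, y) for x, y in zip(registers_for(shard_a), registers_for(shard_b))]
assert merged == registers_for(shard_a + shard_b)   # identical to scanning the union
```

Because max is associative and commutative, shards can be merged in any order, which is what makes HLL practical for distributed counting.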
Highlights
- Streaming cardinality: Process elements one-by-one, never store all data
- Merge-friendly: Combine HLL counts from multiple streams (shards)
- Tiny footprint: Kilobytes of storage for millions of distinct elements
- Math-driven: Probabilistic guarantees with known error bounds
When to Use / Avoid
- Use to count unique visitors / users at scale
- Use for distinct query counts in analytics
- Use when memory is constrained and ~2% error is acceptable
- Use when you need to merge counts across shards
- Avoid when you need exact counts
- Avoid for small datasets (just use a hash set)
Real Systems
| System | How Used |
|---|---|
| Redis | PFADD / PFCOUNT commands |
| PostgreSQL | hll extension (hll_add_agg, hll_cardinality) |
| Presto / Spark | Approximate DISTINCT aggregation |
| | Trending topic cardinality |
| Google Analytics | Unique visitor estimation |
Implementation
```python
import hashlib
from math import log


class HyperLogLog:
    """
    HyperLogLog: probabilistic cardinality estimation.
    Uses O(log log n) memory to estimate distinct elements with
    ~0.81% standard error at precision=14.
    """

    def __init__(self, precision=14):
        """
        Initialize HyperLogLog with the given precision.

        Args:
            precision: Number of bits for register indexing (typically 10-16).
                More bits = more registers = better accuracy but more memory.
                m = 2^precision registers, alpha_m = 0.7213 / (1 + 1.079/m).
        """
        self.precision = precision
        self.m = 1 << precision  # 2^precision registers
        self.registers = [0] * self.m
        # Bias-correction constant (alpha_m) for the harmonic mean
        if self.m >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)
        elif self.m >= 64:
            self.alpha = 0.709
        elif self.m >= 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673

    def _hash(self, item):
        """Hash item to a 64-bit integer."""
        if isinstance(item, str):
            item = item.encode('utf-8')
        return int(hashlib.sha1(item).hexdigest(), 16) & ((1 << 64) - 1)

    def _leading_zero_count(self, bits):
        """Rank of the first 1-bit in the bits after the register index (1-indexed)."""
        # Extract the remaining bits after the register index
        remaining = bits & ((1 << (64 - self.precision)) - 1)
        if remaining == 0:
            return 64 - self.precision + 1
        # Position of the first 1-bit = leading zeros + 1
        return (64 - self.precision) - remaining.bit_length() + 1

    def add(self, item):
        """
        Add an item to the HyperLogLog.

        Process:
            1. Hash item to a 64-bit value
            2. Use the first 'precision' bits as the register index
            3. Count leading zeros in the remaining bits
            4. Update the register with the maximum value seen
        """
        hash_value = self._hash(item)
        # First 'precision' bits determine the register index
        register_index = hash_value >> (64 - self.precision)
        # rho = position of the first 1-bit in the remaining bits
        rho = self._leading_zero_count(hash_value)
        self.registers[register_index] = max(self.registers[register_index], rho)

    def count(self):
        """
        Estimate cardinality using the harmonic mean of register values.

        Raw estimate: alpha_m * m^2 / sum(2^(-register[i])).
        Applies bias corrections for the small and large ranges.
        """
        raw_sum = sum(2.0 ** -x for x in self.registers)
        raw_estimate = self.alpha * (self.m ** 2) / raw_sum
        # Small-range correction: fall back to linear counting (natural log)
        if raw_estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros != 0:
                return self.m * log(self.m / float(zeros))
        # No correction needed below 2^32 / 30
        if raw_estimate <= (1.0 / 30.0) * (1 << 32):
            return raw_estimate
        # Large-range correction for 32-bit hash collisions
        return -(1 << 32) * log(1 - raw_estimate / (1 << 32))

    def merge(self, other):
        """Merge another HyperLogLog into this one (union operation)."""
        if self.m != other.m:
            raise ValueError("Cannot merge HyperLogLogs with different precision")
        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])
```
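As a sanity check on the estimator, the sketch below repeats the core pipeline standalone (hash, bucket, rho, harmonic mean) and compares the estimate against a known true count; the item names and precision are arbitrary, and at this cardinality neither range correction fires:

```python
import hashlib

p = 14
m = 1 << p
registers = [0] * m
alpha = 0.7213 / (1 + 1.079 / m)   # bias-correction constant for large m

true_n = 200_000
for i in range(true_n):
    h = int(hashlib.sha1(f"item-{i}".encode()).hexdigest(), 16) & ((1 << 64) - 1)
    idx = h >> (64 - p)
    rest = h & ((1 << (64 - p)) - 1)
    rho = (64 - p) - rest.bit_length() + 1 if rest else (64 - p) + 1
    registers[idx] = max(registers[idx], rho)

# Raw HLL estimate: alpha * m^2 / sum(2^-register)
estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
error = abs(estimate - true_n) / true_n
print(f"estimate={estimate:.0f}, relative error={error:.2%}")
```

With precision 14 the expected standard error is 1.04/sqrt(2^14) ≈ 0.81%, so the observed error should land well under a few percent while using only 16,384 registers.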
Redis PFCOUNT
Redis implements HyperLogLog with the PFADD, PFCOUNT, and PFMERGE commands. Each key uses a 12 KB register set (16,384 registers x 6 bits each), stored in a specially optimized format that packs bits efficiently. The PFMERGE operation enables distributed counting across shards — merge HyperLogLog data from multiple servers without re-scanning raw elements. Standard error is ~0.81% with precision=14.
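A short redis-cli session showing the three commands; the key names are illustrative. At such tiny cardinalities Redis's sparse representation returns exact counts:

```
> PFADD visitors:2024-05-01 alice bob carol
(integer) 1
> PFCOUNT visitors:2024-05-01
(integer) 3
> PFADD visitors:2024-05-02 bob dave
(integer) 1
> PFMERGE visitors:total visitors:2024-05-01 visitors:2024-05-02
OK
> PFCOUNT visitors:total
(integer) 4
```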
Presto & Trino
Both engines provide approx_distinct(column) aggregation functions built on HyperLogLog. Configurable precision allows tuning the trade-off between accuracy and memory. The relative standard deviation (error) is approximately 1.04/sqrt(m). Particularly useful in query optimization — cardinality estimation guides join order and filter pushdown decisions without materializing result sets.
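A hedged example against a hypothetical `pageviews` table; in Presto/Trino, `approx_distinct` takes an optional second argument setting the maximum standard error:

```sql
-- Hypothetical table and column names.
-- The second argument caps the standard error (default is 0.023).
SELECT approx_distinct(user_id, 0.01) AS unique_users
FROM pageviews
WHERE day = DATE '2024-05-01';
```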
Apache Spark
Spark’s approx_count_distinct() aggregate function uses HyperLogLog++ internally. It accepts an rsd parameter to control the relative standard deviation, and integrates with the DataFrame API for large-scale analytics, enabling approximate aggregations across billions of records in seconds.
Apache Flink
Flink uses HyperLogLog for sliding/tumbling window cardinality estimation in real-time stream processing. Enables low-latency approximate distinct count queries on unbounded streams without storing all unique values in state.
BigQuery
Google’s BigQuery exposes APPROX_COUNT_DISTINCT() in its SQL dialect, built on HyperLogLog++; the related HLL_COUNT functions expose reusable sketches with configurable precision. Well suited to interactive analytics where approximate results suffice and low latency is critical.
References
- Flajolet et al., “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” (2007)
- Heule et al., “HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm” (2013), introducing HyperLogLog++ with bias correction
- Redis University: HyperLogLog Internals
- Wikipedia: HyperLogLog
- Antirez Blog: HyperLogLog implementation details