HyperLogLog

Count distinct elements in a stream using O(log log n) memory -- with ~2% error rate, vs O(n) for an exact set.

Core Idea

Intuition: a rare event (a hash value with many leading zeros) implies that many distinct elements have been seen. Each bucket tracks the rarest event -- the maximum leading-zero count -- it has observed.
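To make this concrete, here is a small self-contained sketch (`leading_zeros_64` and `max_rarity` are helper names invented for this post): the maximum leading-zero count across n hashed items tends to sit near log2(n).

```python
import hashlib

def leading_zeros_64(h):
    """Leading zeros of h viewed as a 64-bit integer."""
    return 64 - h.bit_length()

def max_rarity(items):
    """Maximum leading-zero count over the 64-bit hashes of the items."""
    best = 0
    for x in items:
        h = int(hashlib.sha1(x.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        best = max(best, leading_zeros_64(h))
    return best

# A hash with k leading zeros occurs with probability 2^-k, so over n
# distinct items the maximum sits near log2(n) -- the rarity encodes n.
small = max_rarity(f"item-{i}" for i in range(1_000))    # near log2(1000) ~ 10
large = max_rarity(f"item-{i}" for i in range(100_000))  # near log2(100000) ~ 17
print(small, large)
```

A single maximum is a noisy estimator, which is why HyperLogLog averages many buckets.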

Accuracy vs Memory

Approach             Memory    Error
Exact hash set       O(n)      0%
HyperLogLog (b=10)   ~1 KB     ~2%
HyperLogLog (b=14)   ~16 KB    ~0.8%
HyperLogLog++        ~2.5 KB   ~1%

Key Properties

Property           Value
Memory             O(log log n)
Typical error      0.81% to 2%
Insert             O(1)
Query (estimate)   O(1)
Merge two HLLs     O(m) — union without re-scanning
Deletions          Not supported

Highlights

  • Streaming cardinality: Process elements one-by-one, never store all data
  • Merge-friendly: Combine HLL counts from multiple streams (shards)
  • Tiny footprint: Kilobytes of storage for millions of distinct elements
  • Math-driven: Probabilistic guarantees with known error bounds

When to Use / Avoid

  • Use to count unique visitors / users at scale
  • Use for distinct query counts in analytics
  • Use when memory is constrained, ~2% error acceptable
  • Use when you need to merge counts across shards
  • Avoid when you need exact counts
  • Avoid for small datasets (just use a hash set)

Real Systems

System             How Used
Redis              PFADD / PFCOUNT / PFMERGE commands
PostgreSQL         postgresql-hll extension (hll type)
Presto / Spark     approx_distinct() / approx_count_distinct()
Twitter            Trending topic cardinality
Google Analytics   Unique visitor estimation

Implementation

import hashlib
from math import log

class HyperLogLog:
    """
    HyperLogLog: probabilistic cardinality estimation.
    Uses O(log log n) memory to estimate distinct elements
    (~0.81% standard error at precision=14).
    """

    def __init__(self, precision=14):
        """
        Initialize HyperLogLog with the given precision.

        Args:
            precision: Number of bits used for register indexing (typically 10-16).
                       More bits = more registers = better accuracy, more memory.
                       m = 2^precision registers; alpha_m = 0.7213 / (1 + 1.079/m).
        """
        self.precision = precision
        self.m = 1 << precision  # 2^precision registers
        self.registers = [0] * self.m

        # Bias correction constant (alpha_m) for the harmonic mean
        if self.m >= 128:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)
        elif self.m >= 64:
            self.alpha = 0.709
        elif self.m >= 32:
            self.alpha = 0.697
        else:
            self.alpha = 0.673

    def _hash(self, item):
        """Hash item to a 64-bit integer."""
        if isinstance(item, str):
            item = item.encode('utf-8')
        return int(hashlib.sha1(item).hexdigest(), 16) & ((1 << 64) - 1)

    def _leading_zero_count(self, bits):
        """Position (1-indexed) of the leftmost 1-bit after the register index."""
        # Keep only the (64 - precision) bits that follow the register index
        remaining = bits & ((1 << (64 - self.precision)) - 1)

        # All zeros: rho takes its maximum possible value
        if remaining == 0:
            return 64 - self.precision + 1

        # Leading zeros + 1 = position of the first 1-bit
        return (64 - self.precision) - remaining.bit_length() + 1

    def add(self, item):
        """
        Add an item to the HyperLogLog.

        Process:
        1. Hash the item to a 64-bit value.
        2. Use the first 'precision' bits as the register index.
        3. Count leading zeros in the remaining bits.
        4. Update the register with the maximum value seen.
        """
        hash_value = self._hash(item)

        # First 'precision' bits determine the register index
        register_index = hash_value >> (64 - self.precision)

        # Position of the first 1-bit in the remaining bits
        rho = self._leading_zero_count(hash_value)

        # Keep the maximum rho seen for this register
        self.registers[register_index] = max(self.registers[register_index], rho)

    def count(self):
        """
        Estimate cardinality using the harmonic mean of register values.

        Raw estimate: alpha_m * m^2 / sum(2^(-register[i])).
        Applies bias corrections for the small and large ranges.
        """
        raw_sum = sum(2.0 ** -x for x in self.registers)
        raw_estimate = self.alpha * (self.m ** 2) / raw_sum

        # Small-range correction: fall back to linear counting
        if raw_estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros != 0:
                return self.m * log(self.m / float(zeros))

        # Intermediate range: no correction needed
        if raw_estimate <= (1.0 / 30.0) * (1 << 32):
            return raw_estimate

        # Large-range correction (from the original 32-bit formulation)
        return -(1 << 32) * log(1 - raw_estimate / (1 << 32))

    def merge(self, other):
        """Merge another HyperLogLog into this one (union operation)."""
        if self.m != other.m:
            raise ValueError("Cannot merge HyperLogLogs with different precision")

        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])
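To see insert, estimate, and merge working end to end, here is a compact self-contained variant (`MiniHLL` is a name invented for this demonstration; it drops the large-range correction for brevity) that unions two overlapping shards:

```python
import hashlib
from math import log

class MiniHLL:
    """Compact HyperLogLog sketch for demonstration (no large-range correction)."""
    def __init__(self, p=12):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                    # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining bits
        rho = (64 - self.p) - rest.bit_length() + 1  # position of first 1-bit
        self.reg[idx] = max(self.reg[idx], rho)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * log(self.m / zeros)      # linear counting
        return est

    def merge(self, other):
        # Register-wise max = union of the two underlying sets
        self.reg = [max(x, y) for x, y in zip(self.reg, other.reg)]

# Two shards counting overlapping user sets
a, b = MiniHLL(), MiniHLL()
for i in range(50_000):
    a.add(f"user-{i}")
for i in range(25_000, 75_000):
    b.add(f"user-{i}")
a.merge(b)
print(round(a.count()))  # ~75,000 distinct users, within a few percent
```

The register-wise max is why merging is lossless: the merged sketch is identical to the one you would get by feeding both streams into a single HLL.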

Redis PFCOUNT

Redis implements HyperLogLog behind the PFADD, PFCOUNT, and PFMERGE commands. A dense key uses a 12 KB register set (16,384 registers x 6 bits each); small cardinalities start in a compact sparse encoding and are promoted to the dense format as they grow. PFMERGE enables distributed counting across shards: merge HyperLogLog data from multiple servers without re-scanning raw elements. Standard error is ~0.81% at precision=14.

Presto & Trino

Both engines provide approx_distinct(column) aggregation functions built on HyperLogLog. Configurable precision allows tuning the trade-off between accuracy and memory. The relative standard deviation (error) is approximately 1.04/sqrt(m). Particularly useful in query optimization — cardinality estimation guides join order and filter pushdown decisions without materializing result sets.
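The 1.04/sqrt(m) relationship is easy to tabulate (a small sketch; `hll_std_error` is a helper defined here, not a Presto/Trino API):

```python
from math import sqrt

def hll_std_error(precision):
    """Relative standard error ~ 1.04 / sqrt(m), with m = 2^precision registers."""
    return 1.04 / sqrt(1 << precision)

for p in (10, 12, 14, 16):
    print(f"precision={p}: m={1 << p}, error={hll_std_error(p):.3%}")
```

Each extra precision bit doubles m and shrinks the error by a factor of sqrt(2).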

Apache Spark

Spark’s approx_count_distinct() aggregate function uses HyperLogLog internally. Supports a relativeSD parameter to control the relative standard deviation. Seamlessly integrates with the DataFrame API for large-scale analytics, enabling approximate aggregations across billions of records in seconds.

Apache Flink

Flink uses HyperLogLog for sliding/tumbling window cardinality estimation in real-time stream processing. Enables low-latency approximate distinct count queries on unbounded streams without storing all unique values in state.

BigQuery

Google’s BigQuery exposes APPROX_COUNT_DISTINCT() in its SQL dialect. The engine automatically selects HyperLogLog precision based on the query and data distribution. Perfect for interactive analytics where approximate results suffice and sub-second latency is critical.

This post is licensed under CC BY 4.0 by the author.