
Count-Min Sketch

A 2D hash table that estimates the frequency of any element in a stream using fixed memory -- estimates may overcount within a provable bound, but never undercount.


Key Properties

| Property | Value |
|----------|-------|
| Space | O(d × w) — fixed |
| Update | O(d) — one increment per hash function |
| Query | O(d) |
| Error bound | overcount ≤ ε·N (ε = e/w) with probability ≥ 1 − δ, where δ = e^(−d) |
| False positives | ✅ Possible (overcount) |
| False negatives | ❌ Never (never undercounts) |
| Deletions | ✅ Supported (decrement) |
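
Inverting the error bound gives the sizing rule: for a target error ε and failure probability δ, choose w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. A small helper illustrating the arithmetic (the function name is ours, not from any library):

```python
import math

def cms_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """Return (width, depth) so that overcount <= epsilon * N with probability >= 1 - delta."""
    width = math.ceil(math.e / epsilon)       # from epsilon = e / width
    depth = math.ceil(math.log(1.0 / delta))  # from delta = e^(-depth)
    return width, depth

# A 0.1% error bound with 99.9% confidence needs only ~19k counters (~150 KB at 8 bytes each):
w, d = cms_dimensions(epsilon=0.001, delta=0.001)
print(w, d)  # 2719 7
```

Note that depth grows only logarithmically in the confidence, which is why real deployments rarely need more than a handful of rows.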

vs HyperLogLog

|            | Count-Min Sketch | HyperLogLog |
|------------|------------------|-------------|
| Answers    | “How often is X?” | “How many distinct?” |
| Error type | Overcount | ~2% cardinality error |
| Deletions  | ✅ Yes | ❌ No |

When to Use ✅ / Avoid ❌

- ✅ Heavy-hitter detection (top trending queries, DDoS IPs)
- ✅ Frequency estimation in high-speed streams
- ✅ Memory is strictly bounded
- ❌ Need exact counts (use a hash map)
- ❌ Need to find ALL heavy hitters (combine with a heap)
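
The last point deserves a sketch: a CMS can only answer point queries, so to *enumerate* heavy hitters you maintain a small candidate heap alongside it. A minimal illustration of the pattern, with a stripped-down CMS inside (class name and sizes are ours):

```python
import hashlib
import heapq

class TopK:
    """Approximate top-k heavy hitters: a CMS for counts plus a min-heap of candidates."""

    def __init__(self, k: int, width: int = 1024, depth: int = 4):
        self.k, self.width, self.depth = k, width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.heap: list[tuple[int, str]] = []  # (estimated count, item), smallest at root
        self.members: set[str] = set()         # items currently in the heap

    def _hash(self, item: str, row: int) -> int:
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        # Update the sketch and compute the fresh point estimate (min across rows).
        est = float("inf")
        for row in range(self.depth):
            col = self._hash(item, row)
            self.table[row][col] += 1
            est = min(est, self.table[row][col])
        est = int(est)
        if item in self.members:
            # Refresh the item's heap entry (O(k) rebuild; fine for small k).
            self.heap = [(c, x) for c, x in self.heap if x != item]
            heapq.heapify(self.heap)
            heapq.heappush(self.heap, (est, item))
        elif len(self.heap) < self.k:
            heapq.heappush(self.heap, (est, item))
            self.members.add(item)
        elif est > self.heap[0][0]:
            # New candidate beats the weakest current one: evict it.
            _, evicted = heapq.heapreplace(self.heap, (est, item))
            self.members.discard(evicted)
            self.members.add(item)

    def top(self) -> list[tuple[str, int]]:
        return sorted(((x, c) for c, x in self.heap), key=lambda p: -p[1])

tk = TopK(k=2)
for word in ["a"] * 5 + ["b"] * 3 + ["c"]:
    tk.add(word)
print(tk.top())  # [('a', 5), ('b', 3)]
```

Only O(k) extra memory is needed on top of the sketch, preserving the bounded-memory property.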

Real Systems

| System | How Used |
|--------|----------|
| Redis | Count-Min Sketch data type (CMS.INCRBY) |
| Twitter | Trending topics frequency |
| Akamai | Detecting hot content / DDoS |
| Flink / Spark | Streaming heavy-hitter detection |
| Ad platforms | Frequency capping per user |

Implementation

```python
import hashlib
from typing import List

class CountMinSketch:
    """
    A probabilistic data structure for estimating frequency of elements in a stream.
    Uses d hash functions and a 2D table to bound overcount error while never undercounting.
    """

    def __init__(self, width: int, depth: int):
        """
        Initialize Count-Min Sketch with width and depth parameters.
        - width: number of columns (larger = lower collision probability)
        - depth: number of rows / hash functions (larger = lower error probability)
        Memory usage: O(width * depth) — fixed and bounded.
        """
        self.width = width
        self.depth = depth
        # 2D table: table[i][j] = count in row i, column j
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]
        # Deterministic seeds, one per row, so each row hashes independently
        self.seeds = [i * 31 + 17 for i in range(depth)]

    def _hash(self, item: str, seed: int) -> int:
        """
        Hash item+seed to a column index.
        Uses hashlib (not the built-in hash()) so results are stable across
        processes and runs, regardless of PYTHONHASHSEED.
        Returns: column index in [0, width)
        """
        digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        """
        Add count occurrences of item to the sketch.
        For each row i, compute col = hash(item, seed[i]) and increment table[i][col].
        Time: O(depth).
        We increment ALL d positions, so a collision can only inflate a counter:
        overcount is possible, undercount is not.
        """
        for i in range(self.depth):
            col = self._hash(item, self.seeds[i])
            self.table[i][col] += count

    def estimate(self, item: str) -> int:
        """
        Estimate the frequency of item: the MINIMUM count across all d rows.
        Why minimum? Each row independently overestimates (collisions only add),
        so the smallest row value is the tightest upper bound on the true count.
        """
        min_count = float("inf")
        for i in range(self.depth):
            col = self._hash(item, self.seeds[i])
            min_count = min(min_count, self.table[i][col])
        return int(min_count) if min_count != float("inf") else 0

    def __repr__(self) -> str:
        """Pretty-print the table for debugging."""
        lines = ["Count-Min Sketch Table:"]
        for i, row in enumerate(self.table):
            lines.append(f"Row {i}: {row}")
        return "\n".join(lines)

# Example usage
if __name__ == "__main__":
    cms = CountMinSketch(width=6, depth=3)

    # Add items
    cms.add("apple", 3)
    cms.add("banana", 1)
    cms.add("apple", 1)  # apple now has 4 total

    # Estimate frequencies (with width=6, collisions may inflate these slightly)
    print(f"Estimated 'apple': {cms.estimate('apple')}")    # true count: 4
    print(f"Estimated 'banana': {cms.estimate('banana')}")  # true count: 1
    print(f"Estimated 'cherry' (not added): {cms.estimate('cherry')}")  # true count: 0
    print()
    print(cms)
```

Twitter (Trending Topics)

Twitter ingests approximately 500 million tweets per day into a real-time firehose. To compute trending hashtags, the service needs to maintain per-hashtag frequency counts. A traditional hash map would require O(distinct hashtags) memory — potentially 10+ million entries at peak. Instead, Twitter uses Count-Min Sketch with time-decaying counts: frequencies decay exponentially over minutes, so “trending” reflects recent viral activity, not historical totals. A 5MB sketch provides sufficient precision while being replicated across data centers for fault tolerance.
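
One way to realize such time-decaying counts is to multiply every counter by a decay factor at each window boundary. A toy sketch of that idea (half-life scheme and all parameters are illustrative, not Twitter's actual design):

```python
import hashlib

class DecayingCMS:
    """Count-Min Sketch whose counters decay, so estimates favor recent activity."""

    def __init__(self, width: int = 64, depth: int = 3, decay: float = 0.5):
        self.width, self.depth, self.decay = width, depth, decay
        self.table = [[0.0] * width for _ in range(depth)]

    def _hash(self, item: str, row: int) -> int:
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: float = 1.0) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item: str) -> float:
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

    def tick(self) -> None:
        """Call once per time window: scale every counter by the decay factor."""
        for row in self.table:
            for j in range(len(row)):
                row[j] *= self.decay

cms = DecayingCMS()
cms.add("#old", 8)
cms.tick(); cms.tick()   # two windows pass: 8 decays to 2
cms.add("#viral", 4)     # smaller burst, but recent
print(cms.estimate("#viral") > cms.estimate("#old"))  # recent burst outranks older peak
```

Decay keeps the never-undercount guarantee relative to the decayed stream, since every counter is scaled uniformly.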

Redis (RedisBloom Module)

Redis offers CMS.INITBYDIM key width depth (manual sizing) and CMS.INITBYPROB key error probability (auto-sizes parameters). Operations include CMS.INCRBY key item count and CMS.QUERY key item. This is widely deployed in gaming leaderboards (track high-score frequency to prevent fraud), e-commerce inventory tracking (estimate demand per SKU in real-time), and event frequency analysis in user telemetry. The fixed memory footprint makes it predictable for operations and capacity planning.

Flink / Spark (Streaming Frameworks)

Both streaming frameworks integrate Count-Min Sketch into their DataStream and Spark Streaming APIs. When aggregating events across distributed partitions, exact frequency counts would require centralizing state and shuffling massive datasets. Instead, sketches are computed per partition, then merged via addition (CMS is mergeable: summing sketches built with the same dimensions and hashes yields the sketch of the combined stream). Used for windowed frequency estimation where memory is critical — e.g., identifying heavy hitters in click streams without materializing exact counts.
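
The merge step relies on CMS being linear: element-wise addition of two same-shaped tables. A minimal function-style illustration (names and sizes are ours):

```python
import hashlib

WIDTH, DEPTH = 32, 3

def _hash(item: str, row: int) -> int:
    digest = hashlib.md5(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def new_sketch():
    return [[0] * WIDTH for _ in range(DEPTH)]

def add(table, item, count=1):
    for row in range(DEPTH):
        table[row][_hash(item, row)] += count

def estimate(table, item):
    return min(table[row][_hash(item, row)] for row in range(DEPTH))

def merge(a, b):
    """Element-wise sum: the result is the sketch of the concatenated streams."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Two partitions count independently, then an aggregator merges them.
p1, p2 = new_sketch(), new_sketch()
add(p1, "click", 3)
add(p2, "click", 2)
print(estimate(merge(p1, p2), "click"))  # 5
```

The only requirement is that all partitions agree on width, depth, and hash seeds; the merge itself is a cheap O(width × depth) addition.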

DDoS Detection (Akamai, Cloudflare)

CDN edge nodes defend against DDoS by tracking per-source-IP request frequencies within a time window. At the edge, a single node may see traffic from millions of IPs simultaneously. A 4MB Count-Min Sketch can track millions of distinct sources while consuming minimal cache bandwidth. When estimate(ip) > threshold within a rolling window, the IP is flagged for rate limiting or blocking. Exact per-IP counters would consume gigabytes per edge location.
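
The per-edge check boils down to incrementing on every request and comparing the running estimate to a threshold. A toy sketch with a zero-on-rotate window (threshold, sizes, and the rotation scheme are illustrative):

```python
import hashlib

WIDTH, DEPTH, THRESHOLD = 4096, 4, 1000
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(ip: str, row: int) -> int:
    digest = hashlib.md5(f"{row}:{ip}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def on_request(ip: str) -> bool:
    """Record one request; return True if this source should be rate-limited."""
    est = float("inf")
    for row in range(DEPTH):
        col = _hash(ip, row)
        table[row][col] += 1
        est = min(est, table[row][col])
    return est > THRESHOLD

def rotate_window() -> None:
    """Call at each window boundary: zero the sketch (simplest rolling scheme)."""
    for row in table:
        for j in range(len(row)):
            row[j] = 0

# One noisy source sends 1500 requests in a window; the first 1000 pass, the rest flag.
flagged = sum(on_request("203.0.113.9") for _ in range(1500))
print(flagged)  # 500
```

Because CMS never undercounts, an attacker cannot slip under the threshold via collisions; at worst an innocent IP that collides heavily gets throttled early.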

Ad-Tech — Frequency Capping

Ad platforms enforce “show this ad at most 5 times per user per day” to prevent ad fatigue. For billions of active users, maintaining exact impression counts per (user, campaign) pair in a database is expensive. Instead, data centers run dedicated CMS instances: one sketch per region, updated via log aggregation. A 10MB sketch can track hundreds of millions of users with <1% error rate, enabling frequency cap enforcement at query-time via estimate(user_id).
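
The query-time check reduces to an estimate against the cap; since CMS never undercounts, a collision can only withhold an ad early, never over-serve it. A toy sketch (key scheme, cap, and sizes are illustrative):

```python
import hashlib

WIDTH, DEPTH, CAP = 1024, 4, 5
table = [[0] * WIDTH for _ in range(DEPTH)]

def _hash(key: str, row: int) -> int:
    digest = hashlib.md5(f"{row}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def record_impression(user_id: str, campaign: str, day: str) -> None:
    key = f"{user_id}|{campaign}|{day}"  # scope counts to user + campaign + day
    for row in range(DEPTH):
        table[row][_hash(key, row)] += 1

def may_show(user_id: str, campaign: str, day: str) -> bool:
    key = f"{user_id}|{campaign}|{day}"
    est = min(table[row][_hash(key, row)] for row in range(DEPTH))
    return est < CAP  # an overcount can only make the cap stricter, never looser

for _ in range(5):
    record_impression("u42", "shoes", "2024-06-01")
print(may_show("u42", "shoes", "2024-06-01"))  # False
print(may_show("u42", "books", "2024-06-01"))  # a fresh campaign is unaffected
```

Embedding the day in the key means old counts simply stop being queried after midnight, so no explicit reset is needed for daily caps.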



This post is licensed under CC BY 4.0 by the author.