Post

Binary Search & Variations

Divide search space in half each step -- O(log n) on any sorted or monotonic space. Far more powerful than sorted array lookups: query optimization, version control, logging systems, and rate limiting.

Binary Search & Variations

Problem: Find a target value in a sorted array.

Algorithm:

1
2
3
4
5
6
7
8
9
10
left, right = 0, len(arr) - 1
while left <= right:
    mid = (left + right) // 2
    if arr[mid] == target:
        return mid                 # Found!
    elif arr[mid] < target:
        left = mid + 1             # Search right half
    else:
        right = mid - 1            # Search left half
return -1                          # Not found
Complexity: O(log n) time O(1) space

Key invariant: After each step, target (if it exists) is in [left, right].

Variations — When to Use Each

Variation What It Solves Key Change Example
Classic Find exact value in sorted array Return on == Find key in B-Tree node
Leftmost Find first occurrence (duplicates) When found: right = mid - 1 Find start of range in database
Rightmost Find last occurrence (duplicates) When found: left = mid + 1 Find end of range in database
On answer space “Minimum/maximum X satisfying condition” Binary search on result range, not array Minimum bandwidth to serve load
Rotated array Sorted array rotated at unknown pivot Check which half is sorted Search in rotated sorted array
2D matrix Row × col as linear index mid maps to [mid//cols][mid%cols] Search in sorted matrix

Binary Search on Answer Space (Most Useful in System Design)

Pattern: You want to find the minimum (or maximum) value of X such that some condition holds.

Approach:

  1. Define the search space: [lo, hi] = possible range of X values
  2. For candidate X, write a feasibility check: can we achieve the goal with this X?
  3. If feasibility check returns True, try smaller X (right = mid - 1); else try larger X (left = mid + 1)
  4. Answer is the leftmost (or rightmost) X that passes

Example: Minimum Bandwidth for Request Service

Problem: Given N requests arriving over time T, what is the minimum bandwidth B needed to serve all requests?

Approach:

1
2
3
4
5
6
7
8
9
10
lo = 0, hi = MAX_BANDWIDTH
while lo < hi:
    mid = (lo + hi) // 2
    if can_serve_all_requests(requests, T, mid):
        # mid bandwidth is enough; try smaller
        hi = mid
    else:
        # mid bandwidth insufficient; need more
        lo = mid + 1
answer = lo

Key Properties Table

Property Value
Time Complexity O(log n) for search + O(f(n)) for feasibility check per iteration
Space Complexity O(1) if iterative
Requires Sorted array or monotonic predicate
Works on Integer arrays, answer spaces, ranges

When to Use / Avoid

Use binary search when:

  • Data is sorted or sorted order is easy to check
  • Need to find a value, range boundary, or satisfy a condition
  • Time budget is tight (log n is vastly better than linear)
  • Can define a clear predicate: “does X work?” → true/false

Use binary search on answer space when:

  • Goal is to minimize/maximize a value X
  • Can efficiently check if a candidate X satisfies the goal
  • Answer space is large (linear search infeasible)
  • Problem has monotonic property (if X works, all larger X work; if X fails, all smaller X fail)

Avoid binary search when:

  • Data is unsorted and sorting cost exceeds benefit
  • Need all occurrences (linear scan may be simpler)
  • Feasibility check is O(n) — total becomes O(n log n) (may not beat sorting once)
  • Predicate is not monotonic (binary search correctness fails)

System Design Applications

Use Case How Binary Search Applies Real System Impact
Database Index Seek B-Tree node search is binary search within each node PostgreSQL, MySQL, RocksDB O(log n) row lookups instead of O(n) table scan
Version Control Debugging Binary search commit history for first bad version Git bisect Find regression in 1000 commits in ~10 tests instead of 500
Kafka Log Lookup Binary search offset index to find message by timestamp Apache Kafka Find message in 400GB+ broker logs in microseconds
Rate Limiter Window Binary search request timestamps to find window boundary Cloudflare, Stripe Determine if request exceeds rate limit in O(log n)
Linux Memory Management Binary search virtual address to find page table entry Linux kernel page tables Translate virtual→physical address in 4-5 memory lookups
Capacity Planning Binary search fleet size: “minimum servers to handle load” AWS Auto-Scaling Find optimal instance count without manual tuning

How Real Systems Use This

PostgreSQL & RocksDB — B-Tree Index Seeks

When you query SELECT * FROM users WHERE id = 42 in PostgreSQL, the database uses a B-Tree index to locate the row. The B-Tree structure is hierarchical: the root node points to child nodes, which point to grandchild nodes, and so on. At each level, PostgreSQL performs binary search within the node to find which child pointer to follow. For a typical B-Tree with 8KB page size and 100-byte entries, each node holds ~80 keys. A 1-billion-row table has a B-Tree depth of 4-5 levels. Binary search in each node takes log₂(80) ≈ 7 comparisons per level; total = ~35 comparisons to locate one row. Without the index, a full table scan would require 1 billion row comparisons. Real impact: indexed queries complete in microseconds; sequential scans take seconds. PostgreSQL’s implementation uses the bsearch() family of functions at each B-Tree node to find the correct child pointer.

Git Bisect — Finding First Bad Commit

A developer notices production is broken but doesn’t know when it broke. Git has 2000 commits in master branch since the last known-good version. Using git bisect, the developer runs: git bisect start, git bisect bad (current HEAD), git bisect good v2.1.0 (last known-good tag). Git binary searches the commit history: checks out the commit at the midpoint (commit 1000), and asks “is this good or bad?” Developer tests and replies. Git eliminates 1000 commits. On next iteration, checks commit 500, eliminates 500 more. After ~log₂(2000) ≈ 11 bisections, Git pinpoints the exact commit that introduced the bug. Without bisect, developer would manually test commits sequentially: ~1000 tests. With bisect: 11 tests. Real time savings: 2-3 hours of manual testing reduced to 15 minutes. For massive codebases (Linux kernel with 1M+ commits), bisect becomes indispensable.

Apache Kafka — Binary Search on Log Offsets

Kafka brokers store messages in immutable log files (segments). A consumer wants to read all messages produced after timestamp T (e.g., “give me messages from the last hour”). The broker maintains two indices per segment: an offset index (maps logical offset → byte position in log file) and a timestamp index (maps timestamp → offset). When a consumer requests messages after time T, Kafka uses binary search on the timestamp index to find the corresponding offset, then binary searches the offset index to find the byte position in the log file. Real numbers: a Kafka broker with 50 partitions, each with 48-hour retention, stores ~100GB per partition. Each segment is ~1GB. With 100 segments per partition, binary search finds a message in log₂(100) ≈ 7 comparisons instead of scanning linearly. For 1 billion messages per day per partition, this is the difference between microsecond lookups and multi-second scans. The index.interval.bytes parameter (default 4096) controls index density: lower density = fewer index entries = faster binary search, but coarser granularity within segments.

Git Bisect — Automated Regression Testing

More sophisticated: developers write a test script test-regression.sh that returns 0 (pass) or 1 (fail). git bisect run ./test-regression.sh automates the entire bisect: Git checks out commits and runs the script, interpreting results automatically. For a 1000-commit regression: manual bisect = 11 human iterations × 2 minutes each = 22 minutes. Automated bisect = 11 script runs × 30 seconds each = 5.5 minutes. Real systems (Kubernetes, Linux kernel) rely on bisect for continuous integration regression detection.

Linux Kernel — Binary Search in Page Tables

Virtual address translation: when a CPU accesses a virtual address, the kernel must translate it to a physical address. Modern x86-64 systems use a 5-level page table hierarchy. To translate address 0x7fff800000, the kernel navigates: L1 table → L2 table → L3 table → L4 table → L5 table. Each table is stored in memory (4KB page, 512 entries per page). Instead of linear search, the kernel uses binary search within each page table entry array to find the correct pointer. Real impact: address translation happens on every memory access (billions per second). Without binary search, CPU would need linear scans. Binary search (log₂(512) ≈ 9 comparisons) vs linear (avg 256 comparisons) per memory access = massive speedup. Most modern processors use a Translation Lookaside Buffer (TLB) cache to avoid this lookup, but when TLB misses occur, binary search is critical.

Cloudflare Rate Limiting — Window Boundary Detection

Cloudflare rate limits requests per IP address per time window (e.g., 100 requests/minute). For each IP, the system maintains a log of request timestamps over the last minute. When a new request arrives at timestamp T, the system must check: “how many requests occurred in [T-60s, T]?” Using linear scan, this requires iterating through all requests in the window (potentially thousands). With binary search on the timestamp log, Cloudflare finds the leftmost timestamp ≥ (T - 60s) in O(log n) time. Real-world scenario: DDoS attack sends 50,000 requests/second from different IPs. For each request, rate limiter must check if the IP is over quota. Binary search on the timestamp log makes this O(log n) per request; linear scan would be O(n) = O(10,000s) of computations per second, causing the rate limiter to become the bottleneck. Binary search enables Cloudflare to handle 50,000 RPS per edge server.

DynamoDB partitions data across shards by hash of the partition key. When a query comes in with partition key K, DynamoDB must determine which shard owns K. The shard range is divided into tokens: [0, 2^64-1]. Each shard owns a contiguous range of tokens. DynamoDB maintains a sorted list of shard ranges; binary search finds which shard owns K in O(log S) time (S = number of shards). Without binary search, lookup is O(S). At scale (Amazon runs millions of DynamoDB partitions), this matters: binary search keeps per-request latency at microseconds.

Implementation

Classic Binary Search

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
def binary_search(arr, target):
    """
    Find target in sorted array.

    Args:
        arr: sorted array
        target: value to find

    Returns:
        index of target, or -1 if not found

    Time: O(log n)
    Space: O(1)
    """
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1


def binary_search_leftmost(arr, target):
    """
    Find leftmost (first) occurrence of target.

    Useful when array has duplicates.
    """
    left, right = 0, len(arr) - 1
    result = -1

    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            result = mid
            right = mid - 1  # Continue searching left
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return result


def binary_search_rightmost(arr, target):
    """
    Find rightmost (last) occurrence of target.
    """
    left, right = 0, len(arr) - 1
    result = -1

    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            result = mid
            left = mid + 1  # Continue searching right
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return result


# Example usage
if __name__ == "__main__":
    arr = [1, 3, 3, 3, 5, 7, 9, 11, 13]

    print(binary_search(arr, 3))            # Output: 2 (any occurrence)
    print(binary_search_leftmost(arr, 3))   # Output: 1 (first)
    print(binary_search_rightmost(arr, 3))  # Output: 3 (last)
    print(binary_search(arr, 6))            # Output: -1 (not found)

Binary Search on Answer Space

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
def min_bandwidth_for_requests(requests, time_limit, throughput_options):
    """
    Find minimum throughput (requests/sec) needed to serve all requests
    within time_limit.

    Args:
        requests: list of request sizes (bytes)
        time_limit: time window (seconds)
        throughput_options: sorted list of available throughputs

    Returns:
        minimum throughput from options that can serve all requests

    Time: O(log T * n) where T = number of throughput options, n = number of requests
    """
    def can_serve(throughput):
        """Check if given throughput can serve all requests in time_limit."""
        time_used = sum(req / throughput for req in requests)
        return time_used <= time_limit

    left, right = 0, len(throughput_options) - 1
    result = -1

    while left <= right:
        mid = (left + right) // 2
        if can_serve(throughput_options[mid]):
            result = throughput_options[mid]
            right = mid - 1  # Try smaller throughput
        else:
            left = mid + 1   # Need larger throughput

    return result


def min_servers_for_load(num_requests, request_distribution, time_window,
                         requests_per_server):
    """
    Find minimum number of servers needed to handle peak load.

    Args:
        num_requests: total requests in time_window
        request_distribution: peak requests/sec expected
        time_window: time window (seconds)
        requests_per_server: max RPS one server can handle

    Returns:
        minimum number of servers

    Binary search on answer space: [0, num_requests]
    """
    def can_handle(num_servers):
        """Can num_servers handle the peak load?"""
        total_capacity = num_servers * requests_per_server
        return total_capacity >= request_distribution

    left, right = 1, num_requests

    while left < right:
        mid = (left + right) // 2
        if can_handle(mid):
            right = mid         # Try fewer servers
        else:
            left = mid + 1      # Need more servers

    return left


# Example usage
if __name__ == "__main__":
    # Minimum throughput
    requests = [100, 200, 150, 300]  # sizes in bytes
    time_limit = 1.0                  # 1 second
    throughputs = [100, 200, 500, 1000]  # bytes/sec

    min_tp = min_bandwidth_for_requests(requests, time_limit, throughputs)
    print(f"Minimum throughput: {min_tp} bytes/sec")

    # Minimum servers
    peak_rps = 10000                  # 10k requests/sec peak
    rps_per_server = 1000             # each server handles 1k RPS
    min_svr = min_servers_for_load(100000, peak_rps, 100, rps_per_server)
    print(f"Minimum servers: {min_svr}")

Common Pitfalls

❌ Overflow: mid = (left + right) // 2 in languages like C/C++ can overflow if left and right are very large integers. Use mid = left + (right - left) // 2 instead. (Python handles big integers automatically, so this is not a concern in Python.)

❌ Off-by-one errors: The condition left <= right vs left < right differs by variant. For leftmost/rightmost searches, track what your loop invariant means at termination.

❌ Assuming monotonicity: Binary search only works if the predicate is monotonic: “if X works, all larger X work” or “if X fails, all smaller X fail.” If the predicate has a “valley” (works, then fails, then works again), binary search will fail.

❌ Not verifying the predicate: Always write a clear feasibility check can_achieve(X) and test it independently before using in binary search.

References

This post is licensed under CC BY 4.0 by the author.