
Sorting Algorithms

Choosing the right sort determines latency, memory usage, and scalability in distributed systems -- O(n log n) comparison sorts for general in-memory data, O(n + k) linear-time sorts for bounded integer keys, and external merge sort for datasets that exceed RAM.

Key Properties

| Algorithm | Time (Avg) | Time (Worst) | Space | Stable? | In-Place? | Best For |
|---|---|---|---|---|---|---|
| Insertion Sort | O(n²) | O(n²) | O(1) | Yes | Yes | Small arrays, nearly sorted data, base case for hybrid sorts |
| Merge Sort | O(n log n) | O(n log n) | O(n) | Yes | No | External sort, linked lists, guaranteed performance, distributed sort |
| Quick Sort | O(n log n) | O(n²) | O(log n) | No | Yes | In-memory, cache-friendly, fastest on average, common production default |
| Heap Sort | O(n log n) | O(n log n) | O(1) | No | Yes | In-place, guaranteed O(n log n), kernel/embedded systems |
| Tim Sort | O(n log n) (O(n) best) | O(n log n) | O(n) | Yes | No | Real-world data with runs, Python/Java default, adaptive sorting |
| Intro Sort | O(n log n) | O(n log n) | O(log n) | No | Yes | C++ std::sort, hybrid speed + guaranteed performance |
| Counting Sort | O(n + k) | O(n + k) | O(n + k) | Yes | No | Small integer range (ages 1–120, scores 0–100) |
| Radix Sort | O(nk) | O(nk) | O(n + k) | Yes | No | Fixed-length integers/strings, no comparisons needed |
| Bucket Sort | O(n + k) | O(n²) | O(n + k) | Yes | No | Uniformly distributed data, float ranges |

When to Use / Avoid

✅ Use Comparison-Based Sort (O(n log n))

  • General-purpose sorting — no constraints on data distribution
  • Mixed or custom data types — only a single comparator needs to be defined
  • Guaranteed performance — can accept O(n log n) worst-case
  • Small datasets — performance difference negligible below ~10K elements

❌ Avoid Comparison-Based Sort

  • Integers with small range (1–1000) — switch to counting/radix sort for O(n)
  • Strict memory limit — merge sort needs O(n) extra space; use heap sort O(1) instead
  • Linked lists — quick sort performs poorly; use merge sort
  • Worst-case critical — quick sort has O(n²) worst case; use heap sort or intro sort

✅ Use Merge Sort

  • External sorting — data doesn’t fit in RAM; optimal for k-way merge
  • Stable sort required — deduplicate while sorting, preserve original order
  • Linked lists — naturally suited, no random access needed
  • Distributed sorting — MapReduce, Spark; built-in shuffle phase uses external merge

❌ Avoid Merge Sort

  • Strict space constraints — needs O(n) temporary storage
  • In-memory only — quick sort faster on arrays with good cache locality
  • Partially sorted data — TimSort adapts better, runs in O(n) best case

✅ Use Tim Sort

  • Real-world data — contains natural runs (already-sorted sequences)
  • Language runtime — Python sorted() / list.sort(), Java Arrays.sort(objects)
  • Adaptive sorting — up to 2-3x faster than a non-adaptive O(n log n) sort on partially sorted data
  • Stability + speed — best of both worlds for production systems

✅ Use Linear-Time Sort (Counting/Radix)

  • Bounded integer range — ages (1–120), scores (0–100), digits (0–9)
  • Need O(n) performance — comparison sort bottleneck unacceptable
  • Not comparing objects — only keys of fixed width
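
The counting-sort case is small enough to sketch directly. Below is a stable version using prefix sums; it is illustrative and assumes non-negative integer keys bounded by k:

```python
def counting_sort(keys, k):
    """Sort integer keys in range [0, k] in O(n + k) time, stably."""
    counts = [0] * (k + 1)
    for key in keys:
        counts[key] += 1
    # Prefix sums: counts[v] becomes the index just past value v's slots
    for v in range(1, k + 1):
        counts[v] += counts[v - 1]
    out = [0] * len(keys)
    # Walk the input right-to-left so equal keys keep their relative order
    for key in reversed(keys):
        counts[key] -= 1
        out[counts[key]] = key
    return out

# Example: exam scores bounded by 100
scores = [88, 42, 100, 42, 7, 99]
print(counting_sort(scores, 100))  # [7, 42, 42, 88, 99, 100]
```

The right-to-left pass is what buys stability; a left-to-right pass with the same prefix sums would reverse ties.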

✅ Use Heap Sort (Kernel/Embedded)

  • Guaranteed O(n log n) — security-critical code; prevents algorithmic-complexity (DoS) attacks that exploit quicksort’s O(n²) worst case
  • Constant space O(1) — embedded systems, kernel code
  • No extra allocations — strictly in-place, no malloc calls

How Real Systems Use This

Python & Java Runtime (Tim Sort — Adaptive)

Python’s list.sort() and sorted() use TimSort, designed by Tim Peters in 2002. The algorithm identifies natural “runs” (maximal monotonic subsequences) in the input and merges them intelligently. For partially sorted data (which dominates real-world usage), TimSort runs in O(n) instead of O(n log n), delivering 2-3x speedup. Java 7+ uses identical logic for Arrays.sort(Object[]). The minimum run size is 32 elements (computed dynamically), chosen empirically; smaller runs trigger insertion sort, reducing comparisons. Real-world benchmark: sorting a list of 1 million pre-sorted integers takes ~5ms with TimSort vs. ~50ms with quicksort. Reference: Tim Peters’ original TimSort analysis.
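
As a rough illustration of that adaptivity, sorting an already-sorted list with Python's built-in TimSort is much cheaper than sorting a shuffled copy. Timings are machine-dependent and the sizes here are arbitrary:

```python
import random
import timeit

n = 100_000
presorted = list(range(n))
shuffled = presorted[:]
random.shuffle(shuffled)

# TimSort finds one long "run" in the pre-sorted input, so it only
# verifies order: ~n comparisons instead of ~n log n.
t_runs = timeit.timeit(lambda: sorted(presorted), number=5)
t_rand = timeit.timeit(lambda: sorted(shuffled), number=5)
print(f"pre-sorted: {t_runs:.4f}s  shuffled: {t_rand:.4f}s")
```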

MapReduce — External Merge Sort (Distributed)

MapReduce’s shuffle-and-sort phase uses external merge sort to handle datasets larger than any single machine’s RAM. The mapper emits (key, value) pairs sorted in-memory using quicksort; unsorted spills are written to local disk. During the reduce phase, MapReduce k-way merges hundreds of sorted segments from different mappers into a final sorted stream. For a 1TB dataset with 256MB per-node memory and 100 mappers: each mapper spills 4-5 sorted 256MB chunks; reducers perform a 400-way merge. This keeps I/O sequential (fast) and memory bounded. Complexity: O(n log n) for initial sort + O(n log k) for k-way merge where k is the number of sorted chunks. Reference: MapReduce Shuffle & Sort Architecture.
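
The reduce-side k-way merge can be sketched with Python's heapq.merge, which streams any number of sorted runs through a min-heap of size k. This is illustrative; a real shuffle reads its runs from disk rather than from in-memory lists:

```python
import heapq

# Each "mapper spill" is a sorted run; heapq.merge streams them
# through a min-heap without materializing the full dataset.
runs = [
    [1, 4, 9],
    [2, 3, 11],
    [0, 5, 6, 20],
]
merged = list(heapq.merge(*runs))
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 9, 11, 20]
```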

PostgreSQL — Internal Quick Sort + External Merge for ORDER BY

PostgreSQL sorts rows using quicksort for in-memory data (O(n log n) average, but limits recursion depth to prevent O(n²) worst-case). When result set exceeds work_mem (default 4MB), PostgreSQL spills to disk and performs external merge sort with automatic k-way merge. For a 1TB table with ORDER BY revenue DESC, the planner estimates: if <4MB fits in memory, quick sort in-memory; if 100GB output, spill to disk and k-way merge sorted runs. Modern PostgreSQL (13+) uses incremental sort — if input already partially sorted by index, it sorts only unsorted segments, reducing work. Reference: PostgreSQL ORDER BY Documentation.

Apache Spark — Sort-Merge Join (Distributed)

Spark’s sortByKey() and sort-merge join use external merge sort across distributed nodes. Each partition is sorted in-memory; if it exceeds executor memory, Spark spills to disk and performs external merge. For a join of two 500GB tables: Spark sorts both sides by join key using external sort, then streams through both sorted inputs simultaneously in O(n) merge phase. This avoids bringing full tables into memory (unlike hash join). Total complexity: O(n log n) sort + O(n) merge across distributed data. Reference: Apache Spark Sort & Shuffle Architecture.
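
A minimal sketch of the streaming half of a sort-merge join, assuming unique keys on each side for brevity (real engines also handle duplicate-key groups):

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Join two row lists on a key: O(n log n) sort + O(n + m) merge."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    i = j = 0
    out = []
    # Advance whichever side has the smaller key; emit on a match
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl == kr:
            out.append((left[i], right[j]))
            i += 1
            j += 1
        elif kl < kr:
            i += 1
        else:
            j += 1
    return out

users = [(2, "bo"), (1, "al"), (3, "cy")]
orders = [(3, 250), (1, 99)]
print(sort_merge_join(users, orders))
# [((1, 'al'), (1, 99)), ((3, 'cy'), (3, 250))]
```

The merge phase never rewinds either input, which is why both sides can be streamed from disk after the external sort.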

Linux Kernel — Heap Sort (Safety-Critical)

The Linux kernel uses heapsort in lib/sort.c, not quicksort, because the kernel prioritizes guaranteed O(n log n) worst-case performance and constant O(1) extra space. Quicksort’s O(n²) worst-case could be exploited by malicious userspace to trigger kernel slowdown (denial-of-service). HeapSort provides an upper bound guarantee and is fully in-place, requiring no malloc (critical in kernel context). For sorting 1000 inode numbers during a filesystem operation, heapsort guarantees completion in ~10,000 comparisons; quicksort could require 1,000,000. Reference: Linux Kernel Sort Implementation.
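
A Python sketch of the same in-place strategy (the kernel's version is C, of course; this mirrors the build-heap-then-sift-down approach):

```python
def heap_sort(arr):
    """In-place heapsort: O(n log n) worst case, O(1) extra space."""
    n = len(arr)

    def sift_down(root, end):
        # Push arr[root] down until the max-heap property holds
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and arr[child] < arr[child + 1]:
                child += 1
            if arr[root] < arr[child]:
                arr[root], arr[child] = arr[child], arr[root]
                root = child
            else:
                return

    # Heapify: sift down every non-leaf node, bottom-up
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n - 1)
    # Repeatedly move the max to the end and restore the heap
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]
        sift_down(0, end - 1)

data = [5, 1, 4, 2, 8]
heap_sort(data)
print(data)  # [1, 2, 4, 5, 8]
```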

C++ Standard Library — Intro Sort (Hybrid for Speed + Safety)

C++’s std::sort() uses introsort (introspective sort), a hybrid of quicksort + heapsort + insertion sort. It starts with quicksort (fast average case, cache-friendly), monitors recursion depth, and switches to heapsort if depth exceeds 2*log(n) (preventing O(n²) worst-case). Base case (arrays <16 elements) uses insertion sort (fewer comparisons, better cache). Real-world impact: sorting 10 million random integers runs in ~500ms with introsort vs. ~600ms with pure quicksort. Reference: IntroSort Overview.
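
A toy introsort in Python, assuming a plain Lomuto partition and a heap-based fallback; real std::sort implementations differ in detail (they also switch to insertion sort for small ranges):

```python
import heapq
import math

def intro_sort(arr):
    """Quicksort that bails out to a heap-based sort once recursion
    depth exceeds 2*log2(n), capping the worst case at O(n log n)."""
    max_depth = 2 * max(1, int(math.log2(max(len(arr), 1))))

    def partition(lo, hi):
        # Lomuto partition around the last element
        pivot = arr[hi]
        i = lo - 1
        for j in range(lo, hi):
            if arr[j] < pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[hi] = arr[hi], arr[i + 1]
        return i + 1

    def sort(lo, hi, depth):
        if hi - lo <= 0:
            return
        if depth > max_depth:
            # Heap fallback (copies the slice for brevity; still O(n log n))
            h = arr[lo:hi + 1]
            heapq.heapify(h)
            for k in range(lo, hi + 1):
                arr[k] = heapq.heappop(h)
            return
        p = partition(lo, hi)
        sort(lo, p - 1, depth + 1)
        sort(p + 1, hi, depth + 1)

    sort(0, len(arr) - 1, 0)

data = [9, 1, 8, 2, 7, 3]
intro_sort(data)
print(data)  # [1, 2, 3, 7, 8, 9]
```

An already-sorted input is quicksort's worst case with a last-element pivot, so it is exactly the input that trips the depth limit here.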

Implementation

Merge Sort with External Sort Support

```python
def merge_sort(arr, left=0, right=None):
    """
    Stable merge sort with O(n log n) guaranteed time and O(n) space.

    Args:
        arr: list to sort (modifies in-place)
        left: start index
        right: end index

    Why merge sort:
    - Guaranteed O(n log n) — no worst-case degradation
    - Stable — preserves order of equal elements (important for deduplication)
    - Natural for external sort — two sorted files merge trivially
    """
    if right is None:
        right = len(arr) - 1

    if left < right:
        mid = (left + right) // 2
        merge_sort(arr, left, mid)
        merge_sort(arr, mid + 1, right)
        merge(arr, left, mid, right)

def merge(arr, left, mid, right):
    """
    Merge two sorted subarrays arr[left:mid+1] and arr[mid+1:right+1].
    Time: O(n), Space: O(n) for temporary storage.
    """
    left_part = arr[left:mid + 1]
    right_part = arr[mid + 1:right + 1]

    i = j = 0
    k = left

    # Two-pointer merge: pick smaller element each iteration
    while i < len(left_part) and j < len(right_part):
        if left_part[i] <= right_part[j]:  # <= maintains stability
            arr[k] = left_part[i]
            i += 1
        else:
            arr[k] = right_part[j]
            j += 1
        k += 1

    # Copy remaining elements — use bounded slice to avoid overwriting
    # beyond the merge region (critical when sorting subarrays)
    remaining = left_part[i:] + right_part[j:]
    arr[k:k + len(remaining)] = remaining

# Example usage
if __name__ == "__main__":
    data = [38, 27, 43, 3, 9, 82, 10]
    merge_sort(data)
    print(f"Sorted: {data}")  # [3, 9, 10, 27, 38, 43, 82]
```

External Merge Sort (Big Data)

```python
import heapq
import os

def external_merge_sort(input_file, output_file, chunk_size=1000, temp_dir="/tmp"):
    """
    External merge sort for files larger than RAM.

    Algorithm:
    1. Read chunk_size lines, sort in-memory, write to temp file
    2. Perform k-way merge on all temp files using min-heap

    Time: O(n log n), Space: O(chunk_size)

    Args:
        input_file: path to unsorted data (one number per line)
        output_file: path to sorted output
        chunk_size: lines to hold in memory (tune based on available RAM)
        temp_dir: directory for temporary sorted chunks
    """

    # Phase 1: Split and sort chunks
    chunk_files = []
    with open(input_file, 'r') as f:
        chunk_num = 0
        chunk = []
        for line in f:
            chunk.append(int(line.strip()))
            if len(chunk) >= chunk_size:
                # Sort chunk in-memory and write to disk
                chunk.sort()
                chunk_path = os.path.join(temp_dir, f"chunk_{chunk_num}.txt")
                with open(chunk_path, 'w') as cf:
                    cf.write('\n'.join(map(str, chunk)))
                chunk_files.append(chunk_path)
                chunk_num += 1
                chunk = []

        # Don't forget last partial chunk
        if chunk:
            chunk.sort()
            chunk_path = os.path.join(temp_dir, f"chunk_{chunk_num}.txt")
            with open(chunk_path, 'w') as cf:
                cf.write('\n'.join(map(str, chunk)))
            chunk_files.append(chunk_path)

    # Phase 2: K-way merge using min-heap
    # Open all chunk files and maintain heap of (value, chunk_index, file_handle)
    file_handles = [open(f, 'r') for f in chunk_files]
    heap = []

    # Initialize heap with first element from each chunk
    for idx, fh in enumerate(file_handles):
        line = fh.readline().strip()
        if line:
            heapq.heappush(heap, (int(line), idx, fh))

    # Merge: repeatedly pop minimum and push next element from same chunk
    with open(output_file, 'w') as out:
        while heap:
            val, chunk_idx, fh = heapq.heappop(heap)
            out.write(f"{val}\n")

            # Try to push next element from same chunk
            line = fh.readline().strip()
            if line:
                heapq.heappush(heap, (int(line), chunk_idx, fh))

    # Cleanup
    for fh in file_handles:
        fh.close()
    for chunk_file in chunk_files:
        os.remove(chunk_file)

    print(f"External sort complete: {output_file}")
```

Quick Sort for In-Memory Performance

```python
def quick_sort(arr, left=0, right=None):
    """
    Quick sort with average O(n log n) and cache-friendly performance.
    Worst-case O(n²) mitigated by median-of-three pivot selection.

    Why quicksort:
    - Fastest average case for in-memory data
    - In-place (O(log n) extra for recursion stack)
    - Excellent cache locality (partitioning is sequential)
    - Production default: std::sort in C++, Arrays.sort(primitives) in Java
    """
    if right is None:
        right = len(arr) - 1

    if left < right:
        # Partition and get pivot index
        pivot_idx = partition(arr, left, right)
        # Recursively sort left and right partitions
        quick_sort(arr, left, pivot_idx - 1)
        quick_sort(arr, pivot_idx + 1, right)

def partition(arr, left, right):
    """
    Partition array around pivot (median-of-three to avoid worst-case).
    Time: O(n), single pass through subarray.
    """
    # Median-of-three: choose pivot from first, middle, last
    mid = (left + right) // 2
    pivot_idx = median_of_three_idx(arr, left, mid, right)

    # Swap pivot to end
    arr[pivot_idx], arr[right] = arr[right], arr[pivot_idx]
    pivot_val = arr[right]

    # Partition: elements < pivot on left, >= on right
    i = left - 1
    for j in range(left, right):
        if arr[j] < pivot_val:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]

    # Place pivot in final position
    arr[i + 1], arr[right] = arr[right], arr[i + 1]
    return i + 1

def median_of_three_idx(arr, a, b, c):
    """Return index of median of arr[a], arr[b], arr[c]."""
    if arr[a] <= arr[b]:
        if arr[b] <= arr[c]:
            return b
        elif arr[a] <= arr[c]:
            return c
        else:
            return a
    else:
        if arr[a] <= arr[c]:
            return a
        elif arr[b] <= arr[c]:
            return c
        else:
            return b
```

Sorting Decision Tree

Problem: I need to sort N elements. Which algorithm?

  1. Does data fit in memory?
    • No → External merge sort (MapReduce, Spark, databases)
    • Yes → Continue
  2. Are elements integers in a small range [0, k]?
    • Yes, k < 1000 → Counting sort O(n + k)
    • Yes, fixed-width → Radix sort O(nk)
    • No → Continue
  3. Can I afford O(n) extra space?
    • Yes → Tim sort (Python/Java) or Merge sort (stable, linked lists)
    • No → Continue
  4. Worst-case guarantee critical? (kernel, security, embedded)
    • Yes → Heap sort O(n log n), O(1) space
    • No → Intro sort (C++ std::sort) or Quick sort
  5. Data naturally partially sorted?
    • Yes → Tim sort (O(n) best case)
    • No → Quick sort (O(n log n) average, cache-friendly)
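
The tree above could be encoded as a dispatch helper. This is purely illustrative; the function and parameter names are hypothetical, not from any library:

```python
def choose_sort(fits_in_memory, small_int_range=False,
                extra_space_ok=True, worst_case_critical=False,
                partially_sorted=False):
    """Hypothetical dispatcher mirroring the decision tree above."""
    if not fits_in_memory:
        return "external merge sort"
    if small_int_range:
        return "counting/radix sort"
    if extra_space_ok:
        return "tim sort" if partially_sorted else "merge sort / tim sort"
    if worst_case_critical:
        return "heap sort"
    return "intro sort / quick sort"

print(choose_sort(fits_in_memory=False))        # external merge sort
print(choose_sort(True, small_int_range=True))  # counting/radix sort
print(choose_sort(True, extra_space_ok=False,
                  worst_case_critical=True))    # heap sort
```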


This post is licensed under CC BY 4.0 by the author.