Database Internals -- Key Data Structures

The 8 fundamental data structures that power how modern databases store, index, retrieve, and compact data -- from in-memory indexes (skip lists, hash tables) to disk structures (B+ trees, SSTables, LSM trees). Each trades off read speed vs write speed, and understanding these trade-offs is critical for system design.

The 8 Core Structures at a Glance

# | Structure      | Complexity                 | Best For                | Used In
1 | Skip List      | O(log n) search            | Sorted in-memory index  | Redis sorted sets, LevelDB MemTable
2 | Hash Index     | O(1) exact match           | In-memory lookups       | Hash maps in every language, in-memory KV stores
3 | B+ Tree        | O(log n) seek              | Balanced disk reads     | PostgreSQL, MySQL, SQLite, MongoDB
4 | SSTable        | O(log n) via index         | Immutable sorted file   | Cassandra, LevelDB, RocksDB write path
5 | LSM Tree       | O(1) write, O(log n) read  | Write-heavy workloads   | RocksDB, Cassandra, InfluxDB, LevelDB
6 | Inverted Index | O(k) term lookup           | Full-text search        | Elasticsearch, Lucene, PostgreSQL FTS
7 | Suffix Tree    | O(m) pattern match         | Substring search        | DNA sequencing, Aho-Corasick, grep
8 | R-Tree         | O(log n) spatial           | Multi-dimensional range | PostGIS, SQLite, location search

Structure 1: Skip List

What it is: Probabilistic data structure that mirrors a sorted linked list but with “express lanes” (higher levels) that skip ahead. Provides O(log n) search by skipping roughly n/2, n/4, n/8… elements.

Key insight: Skip lists are simpler than balanced trees (no rotations) but offer the same expected complexity. Randomization keeps the levels balanced without complex rebalancing logic.

Real systems:

  • Redis Sorted Sets: Primary implementation for ZADD, ZRANGE, ZRANK operations. Each sorted set member is a (score, value) pair indexed by both score (skip list) and value (hash table).
  • LevelDB MemTable: Writes go into an in-memory skip list (MemTable) before flushing to disk as SSTable. Provides fast insertion and range iteration.
import random

class SkipNode:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level
        self.level = level

class SkipList:
    def __init__(self, max_level=16, p=0.5):
        self.max_level = max_level
        self.p = p
        self.header = SkipNode(None, max_level)
        self.level = 1

    def _random_level(self):
        """Generate random level for new node."""
        level = 1
        while random.random() < self.p and level < self.max_level:
            level += 1
        return level

    def insert(self, value):
        """Insert value into skip list."""
        new_level = self._random_level()
        if new_level > self.level:
            self.level = new_level

        new_node = SkipNode(value, new_level)
        update = [self.header] * self.max_level

        # Find insertion position
        x = self.header
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].value < value:
                x = x.forward[i]
            update[i] = x

        # Insert at all levels
        for i in range(new_level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def search(self, value):
        """Search for value — O(log n)."""
        x = self.header
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].value < value:
                x = x.forward[i]

        x = x.forward[0]
        return x is not None and x.value == value

Structure 2: Hash Index

What it is: Standard hash table (dictionary/map). O(1) average-case lookup, insertion, deletion. Used for exact-key lookups, not range queries.

Key trade-off: Fast exact-match, but no range capability (unlike tree indexes). Hash collisions degrade to O(n) in pathological cases (mitigated by good hash functions and load factor management).

Real systems:

  • All hash maps/dicts — Python dict, Java HashMap, Go map, Rust HashMap
  • PostgreSQL hash indexes — For equality lookups only (not range queries)
  • Redis Hashes (HSET/HGET) — Each hash field is a hash table entry
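A plain Python dict can stand in for a hash index to make the trade-off concrete (a minimal sketch; the sample rows are made up):

```python
# A dict stands in for a hash index: O(1) average exact match,
# but range predicates degrade to a full scan.
users_by_id = {}
for uid, age in [(1, 25), (2, 41), (3, 33)]:
    users_by_id[uid] = {"id": uid, "age": age}

# Exact-key lookup: one hash probe
row = users_by_id.get(2)

# "WHERE age BETWEEN 30 AND 40": the hash index cannot help -- O(n) scan
matches = [r for r in users_by_id.values() if 30 <= r["age"] <= 40]
```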

Structure 3: B+ Tree

What it is: Balanced disk-based tree where all keys are stored in leaves, and internal nodes are index pointers. Fanout of 100-200 means a 4-level tree holds millions of keys. All reads access the same number of disk pages (O(log n) pages).

Key property: Leaf pages are linked — allows efficient range queries (e.g., SELECT * FROM users WHERE age BETWEEN 30 AND 40). Internal nodes store only index boundaries, not data.

Page structure (PostgreSQL example):

  • Page size: 8 KB (on disk)
  • Fanout: ~200 keys per internal node, ~100 data rows per leaf
  • Fill factor: Default 90% (leaves filled to 90% to reduce fragmentation)
  • Example: A 4-level B+ tree with fanout 200 holds 200 × 200 × 200 × 100 = 800M rows, requiring only 4 disk seeks in worst case
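The capacity figure above can be verified directly; a quick sketch using the example's numbers:

```python
fanout = 200      # keys per internal node
leaf_rows = 100   # rows per leaf page
levels = 4

# Three internal levels of fanout 200, then one leaf level of 100 rows:
# 200 * 200 * 200 * 100 = 800M rows, reachable in at most 4 page reads
capacity = fanout ** (levels - 1) * leaf_rows
```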

Real systems:

  • PostgreSQL: B+ tree for all primary and secondary indexes
  • MySQL InnoDB: B+ tree for clustered index (table data lives in leaves)
  • SQLite: B+ tree implementation with 4KB pages
  • MongoDB: B+ tree for indexes (document IDs are sorted)
class BTreeNode:
    """
    A node in a B-Tree. Each node has keys (sorted) and children (if internal).
    Order is the branching factor: each internal node has at most `order` children.
    Each node is written to a single disk page for locality.
    """
    def __init__(self, leaf=True, order=3):
        self.keys = []
        self.children = []  # Only for internal nodes
        self.leaf = leaf
        self.order = order
        self.max_keys = order - 1

    def is_full(self):
        """Check if node has reached maximum capacity."""
        return len(self.keys) >= self.max_keys

    def split_child(self, i):
        """
        Split the full child at index i.
        Move the middle key up to parent, split child into two nodes.
        Why: Keeps tree balanced — prevents chains of single keys.
        """
        child = self.children[i]
        mid_idx = self.max_keys // 2

        # Create new right sibling
        right = BTreeNode(leaf=child.leaf, order=self.order)
        right.keys = child.keys[mid_idx + 1:]

        if not child.leaf:
            right.children = child.children[mid_idx + 1:]

        # Trim left child and move middle key up
        mid_key = child.keys[mid_idx]
        child.keys = child.keys[:mid_idx]
        if not child.leaf:
            child.children = child.children[:mid_idx + 1]

        # Insert middle key into parent
        self.keys.insert(i, mid_key)
        self.children.insert(i + 1, right)

    def insert_non_full(self, key):
        """
        Insert key into this non-full node.
        If leaf, insert directly. If internal, recurse into appropriate child.
        Why: Insertion always happens in leaves first; propagates upward on splits.
        """
        idx = len(self.keys) - 1

        if self.leaf:
            # Insert directly into leaf (maintaining sort order)
            self.keys.append(None)
            while idx >= 0 and self.keys[idx] > key:
                self.keys[idx + 1] = self.keys[idx]
                idx -= 1
            self.keys[idx + 1] = key
        else:
            # Find child to recurse into
            while idx >= 0 and self.keys[idx] > key:
                idx -= 1
            idx += 1

            # If child is full, split it first
            if self.children[idx].is_full():
                self.split_child(idx)
                if self.keys[idx] < key:
                    idx += 1

            # Recurse
            self.children[idx].insert_non_full(key)

    def search(self, key):
        """Search for key in this subtree (O(log n) pages)."""
        idx = 0
        while idx < len(self.keys) and key > self.keys[idx]:
            idx += 1

        if idx < len(self.keys) and key == self.keys[idx]:
            return True  # Found

        if self.leaf:
            return False  # Not found

        return self.children[idx].search(key)


class BTree:
    """
    B-Tree database index. All insertions/searches access O(log n) disk pages.
    Order (fanout) of 100-200 ensures that a 4-level tree holds millions of rows.
    Real systems: PostgreSQL, MySQL InnoDB, SQLite all use B-Trees (or B+ variants).
    """
    def __init__(self, order=4):
        self.root = BTreeNode(leaf=True, order=order)
        self.order = order

    def insert(self, key):
        """Insert key into tree (O(log n) disk operations)."""
        if self.root.is_full():
            # Split root if needed
            new_root = BTreeNode(leaf=False, order=self.order)
            new_root.children.append(self.root)
            new_root.split_child(0)
            self.root = new_root

        self.root.insert_non_full(key)

    def search(self, key):
        """Search for key (O(log n) disk operations)."""
        return self.root.search(key)

B-Tree Properties in Production:

  • Page size: 8 KB (PostgreSQL) or 4 KB (SQLite) — determines fanout and depth
  • Fanout: ~100-200 keys per page (determines tree depth: O(log n))
  • Durability: Pages are fsync’d to disk on commit; survives crashes
  • Write amplification: Each insert may trigger page splits (small overhead)

Structure 4: SSTable (Sorted String Table)

What it is: Immutable, sorted key-value file on disk. Keys are in ascending order; values are arbitrary. Typically 2-64 MB in size. Used as the output of flushing in-memory data to disk (e.g., RocksDB flush, Cassandra flush).

Structure on disk:

Data blocks:   [Key1,V1][Key2,V2]... | [KeyN+1,VN+1]... | [Key2N+1,V2N+1]...
Sparse index:  Key1 -> offset 0,  KeyN+1 -> offset B,  Key2N+1 -> offset 2B, ...

An SSTable has an index block (sparse index) listing every Nth key and its file offset. To find a key, binary search the index, then scan the block on disk.
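That lookup path (bisect the sparse index, then scan one block) can be sketched as a toy in-memory model; the class name and `block_size` are illustrative, and a real SSTable seeks to file offsets rather than list positions:

```python
import bisect

class SparseIndexedSSTable:
    """Toy in-memory model of an SSTable with a sparse index:
    every block_size-th key is indexed with the position of its block."""
    def __init__(self, sorted_items, block_size=4):
        self.items = sorted_items        # [(key, value)] in ascending key order
        self.block_size = block_size
        self.index = [
            (sorted_items[i][0], i)      # (first key of block, block start)
            for i in range(0, len(sorted_items), block_size)
        ]

    def get(self, key):
        index_keys = [k for k, _ in self.index]
        # Rightmost block whose first key is <= the search key
        pos = bisect.bisect_right(index_keys, key) - 1
        if pos < 0:
            return None                  # key sorts before the first block
        start = self.index[pos][1]
        for k, v in self.items[start:start + self.block_size]:
            if k == key:
                return v
        return None                      # not in this SSTable

sst = SparseIndexedSSTable([(f"k{i:02d}", i) for i in range(10)], block_size=4)
value = sst.get("k07")   # bisect the sparse index, then scan one block
```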

Bloom filter optimization: Each SSTable has an optional Bloom filter. Negative lookup in Bloom filter = key is definitely not in this SSTable (skip I/O).
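A toy Bloom filter makes the guarantee concrete (a sketch; real engines use fast non-cryptographic hashes rather than SHA-256, and size m and k from the expected key count):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per key over an m-bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0                    # m-bit array packed into an int

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # All k bits set: key *may* be present. Any bit clear: definitely absent.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
bf.might_contain("user:42")   # True, guaranteed (no false negatives)
```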

Real systems:

  • Cassandra: Each compaction output is an SSTable. A table typically has multiple SSTables on disk; reads may consult several, pruned by Bloom filters.
  • LevelDB/RocksDB: Flush outputs SSTable. Levels contain multiple SSTables.
  • BigTable: GFS-based SSTable format with Bloom filters and index blocks.

Structure 5: LSM Tree (Log-Structured Merge Tree)

What it is: Write-optimized structure combining an in-memory index (MemTable, a skip list or balanced tree) with disk layers (L0, L1, L2…). Writes go to the MemTable plus a sequential WAL append, so the write path does no random disk I/O. When the MemTable fills, it is flushed to an L0 SSTable; compaction merges overlapping layers in the background.

Write path (simplified):

  1. Write to WAL (write-ahead log) — ensures durability
  2. Insert into MemTable (in-memory skip list)
  3. Once MemTable reaches size limit (default 64 MB in RocksDB), flush to L0 SSTable
  4. Compaction merges L0 → L1 → L2 in background

Read path:

  1. Check MemTable (O(log n))
  2. Check L0 SSTables via Bloom filter (O(k log n) where k = number of L0 tables)
  3. Check L1, L2… (each level has higher compression, fewer files)

Key metrics (RocksDB defaults):

  • MemTable size: 64 MB
  • L0 file size: 64 MB
  • Level multiplier: 10 (L1 = 10 × L0, L2 = 10 × L1, etc.)
  • Compaction strategy: Leveled (each level stores 10× previous level size, ensuring bounded read amplification)

Compaction strategies:

Size-tiered compaction: Compact tables when count exceeds threshold. Simpler but unpredictable write amplification (50-100x). Used in older Cassandra.

Leveled compaction: Each level has fixed size; compact only overlapping tables. More deterministic write amplification (~10x) but higher CPU. Used in RocksDB, modern Cassandra.

Real systems:

  • RocksDB: Leveled compaction by default (universal/size-tiered also available), 7 levels by default
  • Cassandra: LSM with SSTables, tunable compaction
  • InfluxDB: LSM-style for time-series writes
  • LevelDB: Google’s original LSM implementation
class LSMTree:
    """
    Illustrative sketch: assumes SkipList (above) plus WAL and SSTable helper
    classes not shown here. A self-contained version appears later in this post.
    """
    def __init__(self):
        self.memtable = SkipList()  # In-memory (MemTable)
        self.levels = [[], [], [], [], []]  # L0, L1, L2, L3, L4 SSTables
        self.wal = WAL()

    def write(self, key, value):
        """
        Write path: WAL -> MemTable -> SSTable (flush) -> Compaction.
        """
        # 1. Write to WAL
        self.wal.append((key, value))

        # 2. Insert into MemTable
        self.memtable.insert((key, value))

        # 3. If MemTable exceeds 64 MB, flush to L0 SSTable
        if self.memtable.size_mb > 64:
            self._flush_memtable_to_l0()

        # 4. If L0 has too many SSTables, compact to L1
        if len(self.levels[0]) > 4:
            self._compact_level(0, 1)

    def read(self, key):
        """
        Read path: MemTable -> L0 -> L1 -> L2 ...
        """
        # Check MemTable first
        if value := self.memtable.search(key):
            return value

        # Check L0 SSTables, newest first (later flushes shadow older ones)
        for sstable in reversed(self.levels[0]):
            if key in sstable.bloom_filter and key in sstable:
                return sstable[key]

        # Check L1, L2, ... (fewer tables per level, better compression)
        for level_idx in range(1, len(self.levels)):
            for sstable in self.levels[level_idx]:
                if key in sstable:
                    return sstable[key]

        return None

    def _compact_level(self, src_level, dst_level):
        """
        Compact SSTables from src_level into dst_level.
        Merge all SSTables from src_level, eliminate duplicates (keep latest), recompress.
        Why: Compaction reduces the number of files per level, enabling faster reads
        (fewer files to check) and deduplication (later writes shadow older ones).
        """
        if src_level >= len(self.levels) or not self.levels[src_level]:
            return  # Nothing to compact

        # Collect all key-value pairs from src_level SSTables
        merged_items = []
        for sstable in self.levels[src_level]:
            merged_items.extend(sstable.items)

        # Sort and deduplicate (keep last occurrence of each key)
        merged_items.sort(key=lambda x: x[0])
        deduplicated = {}
        for key, value in merged_items:
            deduplicated[key] = value  # Later writes override

        # Create new SSTable in dst_level
        new_sstable = SSTable(f"sstable_l{dst_level}_{len(self.levels[dst_level])}.db",
                               sorted(deduplicated.items()))
        self.levels[dst_level].append(new_sstable)

        # Clear src_level
        self.levels[src_level] = []

Structure 6: Inverted Index

What it is: Mapping from term → list of documents containing that term. Core of full-text search. When you search “machine learning”, the index looks up docs containing both “machine” AND “learning”.

Structure:

Term          | Postings List (compressed)
"machine"     | [Doc1, Doc3, Doc5, Doc7, ...]
"learning"    | [Doc1, Doc2, Doc4, Doc6, ...]
"algorithm"   | [Doc2, Doc5, Doc8, ...]

Compression: Postings lists are usually compressed (delta encoding, variable-byte encoding) because they’re large (billions of entries in web-scale systems).
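Both techniques fit in a few lines; a sketch (real postings formats, e.g. Lucene's, add block-level framing and faster codecs on top):

```python
def delta_encode(doc_ids):
    """Replace sorted doc IDs with the gaps between them;
    gaps are small and compress far better than absolute IDs."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def varbyte_encode(numbers):
    """Variable-byte encoding: 7 payload bits per byte, high bit
    set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

postings = [1000, 1003, 1010, 1100]
encoded = varbyte_encode(delta_encode(postings))
# Gaps [1000, 3, 7, 90] fit in 5 bytes vs 16 bytes for four 32-bit ints
```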

Real systems:

  • Elasticsearch: Lucene inverted index — term dictionary (trie of terms) + compressed postings lists per segment
  • PostgreSQL full-text search: Inverted index on tsvector columns
  • Lucene: Java library, many search engines built on it
class InvertedIndex:
    def __init__(self):
        self.index = {}  # term -> [doc_ids]

    def add_document(self, doc_id, terms):
        """Add document with given terms."""
        for term in terms:
            if term not in self.index:
                self.index[term] = []
            self.index[term].append(doc_id)

    def search(self, query_terms):
        """
        Intersect postings lists for AND query.
        E.g., search("machine learning") returns docs with BOTH terms.
        """
        if not query_terms:
            return []

        # Convert each term's postings to a set; intersect starting from the
        # smallest set so intermediate results stay small
        postings_lists = sorted(
            (set(self.index.get(term, [])) for term in query_terms),
            key=len,
        )

        result = postings_lists[0]
        for postings in postings_lists[1:]:
            result &= postings
            if not result:
                break

        return sorted(result)

Structure 7: Suffix Tree

What it is: Compressed trie of all suffixes of a string. Enables O(m) substring search where m = pattern length (independent of text length). Requires O(n) space.

Use case: Find all occurrences of pattern “GCTA” in DNA sequence (millions of bases). Suffix tree makes this instant.

Trade-off: High space overhead (10-20x text size) limits practical use. Suffix arrays + binary search often used instead.
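The suffix-array alternative mentioned above is compact enough to sketch here (naive construction for clarity; production systems use linear-time algorithms such as SA-IS, and the function names are illustrative):

```python
def build_suffix_array(text):
    """Naive O(n^2 log n) construction; production code uses SA-IS or DC3."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search the sorted suffixes for the range sharing the
    pattern as a prefix; O(m log n) with m = pattern length."""
    m = len(pattern)
    # Lower bound: first suffix whose m-prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(first, lo))

dna = "GCTAGCTAGG"
sa = build_suffix_array(dna)
positions = find_occurrences(dna, sa, "GCTA")   # [0, 4]
```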

Real systems:

  • DNA sequence search — bioinformatics
  • Aho-Corasick automaton — multiple pattern matching
  • String algorithms — pattern matching in competitive programming

Note: Suffix Trees are complex enough to merit their own deep-dive topic. A full implementation requires careful handling of node compression, edge labels, failure functions, and suffix links. For this reference, the focus remains on their role in the 8-structure ecosystem.

Structure 8: R-Tree

What it is: Balanced tree for spatial indexing. Each node is a bounding box (rectangle). Child bounding boxes fit inside parent. Enables range queries like “find all restaurants within 5km of (lat=40.7, lng=-74.0)”.
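The core pruning step is plain rectangle intersection; a sketch over a hypothetical nested-dict node layout (real R-trees also handle node splits and overlap minimization on insert):

```python
def boxes_intersect(a, b):
    """Boxes are (min_x, min_y, max_x, max_y) tuples."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def rtree_search(node, query_box, results):
    """Descend only into subtrees whose bounding box intersects the query."""
    if not boxes_intersect(node["box"], query_box):
        return
    if "item" in node:
        results.append(node["item"])
    else:
        for child in node["children"]:
            rtree_search(child, query_box, results)

leaf1 = {"box": (0, 0, 1, 1), "item": "cafe"}
leaf2 = {"box": (5, 5, 6, 6), "item": "museum"}
root = {"box": (0, 0, 6, 6), "children": [leaf1, leaf2]}

results = []
rtree_search(root, (0, 0, 2, 2), results)   # leaf2's subtree is pruned
```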

Real systems:

  • PostGIS (PostgreSQL spatial extension): R-Tree indexes on geometry columns
  • SQLite: R*Tree module for spatial queries
  • Google Maps: Spatial indexing for location search

Note: R-Trees are complex to implement correctly (managing overlapping bounding boxes, balancing insertion/deletion, minimizing overlap). Like Suffix Trees, they merit their own deep-dive. The key insight is: when indexing multi-dimensional spatial data (not 1D keys), use spatial data structures instead of B+ trees.

When to Use / Avoid Each Structure

✅ Use Hash Index When

  • Exact-key lookups only (primary key lookups)
  • No range queries needed
  • Memory efficient for small tables

❌ Avoid Hash Index When

  • Need range queries (WHERE age > 30)
  • Key ordering matters

✅ Use B+ Tree When

  • Balanced read/write workload
  • Range queries are common (WHERE id BETWEEN 100 AND 200)
  • Read latency is critical
  • Data fits in available RAM for caching

❌ Avoid B+ Tree When

  • Write-heavy workload (> 50% writes) — LSM is more efficient
  • Need to avoid random I/O completely

✅ Use LSM Tree When

  • Write-heavy workload (> 50% writes)
  • Sequential I/O is cheaper than random I/O (on SSDs and HDDs alike)
  • High-volume time-series data
  • Can tolerate higher read latency (acceptable if read volume is low)

❌ Avoid LSM Tree When

  • Read latency is critical and read volume is high
  • Compaction I/O will interfere with application I/O

✅ Use Inverted Index When

  • Full-text search (keyword search across documents)
  • Relevance ranking needed
  • Scale to billions of documents

❌ Avoid Inverted Index When

  • Only structured queries (use B+ tree indexes instead)

Implementation Example: Simplified LSM Tree Write Path

class SimpleWAL:
    def __init__(self, path):
        self.file = open(path, 'ab')

    def append(self, key, value):
        """Append log entry to WAL."""
        self.file.write(f"{key}:{value}\n".encode())
        self.file.flush()

    def close(self):
        self.file.close()

class SimpleSSTable:
    def __init__(self, path, sorted_items):
        """Create immutable SSTable from sorted key-value pairs."""
        self.path = path
        self.items = sorted_items
        self.write_to_disk()

    def write_to_disk(self):
        """Write sorted items to disk in compressed format."""
        with open(self.path, 'wb') as f:
            for key, value in self.items:
                f.write(f"{key}:{value}\n".encode())

    def __contains__(self, key):
        return any(k == key for k, _ in self.items)

    def __getitem__(self, key):
        for k, v in self.items:
            if k == key:
                return v
        raise KeyError(key)

class SimpleLSMTree:
    def __init__(self, memtable_limit_mb=64):
        self.memtable = {}  # In-memory dict (production: use SkipList)
        self.memtable_limit = memtable_limit_mb * 1024 * 1024
        self.levels = [[], [], [], []]  # L0, L1, L2, L3
        self.wal = SimpleWAL("wal.log")
        self.table_count = 0

    def write(self, key, value):
        """Write key-value pair: WAL -> MemTable -> Flush -> Compact."""
        # Write to WAL for durability
        self.wal.append(key, value)

        # Write to MemTable
        self.memtable[key] = value

        # Check if MemTable exceeds limit
        if self._memtable_size() > self.memtable_limit:
            self._flush_memtable_to_l0()

    def _memtable_size(self):
        """Estimate MemTable size in bytes."""
        return sum(len(k) + len(str(v)) for k, v in self.memtable.items())

    def _flush_memtable_to_l0(self):
        """Flush sorted MemTable to L0 SSTable."""
        sorted_items = sorted(self.memtable.items())
        table_path = f"sstable_l0_{self.table_count}.db"
        self.levels[0].append(SimpleSSTable(table_path, sorted_items))
        self.memtable = {}
        self.table_count += 1

        # Check if L0 exceeds threshold, compact to L1
        if len(self.levels[0]) > 4:
            self._compact_level(0, 1)

    def _compact_level(self, src_level, dst_level):
        """
        Merge all SSTables from src_level into a single SSTable in dst_level.
        Deduplicates keys (later values shadow older ones) and re-sorts.
        Why: LSM compaction is the core of write optimization. It consolidates
        fragmented SSTables, reduces read amplification, and enables space reclamation.
        """
        if src_level >= len(self.levels) or not self.levels[src_level]:
            return  # Nothing to compact

        # Merge all items from all SSTables in src_level
        merged_items = []
        for sstable in self.levels[src_level]:
            merged_items.extend(sstable.items)

        # Sort and deduplicate (keep only the latest value for each key)
        merged_items.sort(key=lambda x: x[0])
        deduplicated = {}
        for key, value in merged_items:
            deduplicated[key] = value

        # Write deduplicated, sorted items to new SSTable in dst_level
        new_table_path = f"sstable_l{dst_level}_{len(self.levels[dst_level])}.db"
        new_sstable = SimpleSSTable(new_table_path, sorted(deduplicated.items()))
        self.levels[dst_level].append(new_sstable)

        # Clear src_level (old SSTables are now replaced)
        self.levels[src_level] = []

    def read(self, key):
        """Read key-value pair: MemTable -> L0 -> L1 -> L2 -> L3."""
        # Check MemTable first
        if key in self.memtable:
            return self.memtable[key]

        # Check each level; within a level, newest SSTable first
        # (later flushes shadow older ones)
        for tables in self.levels:
            for table in reversed(tables):
                if key in table:
                    return table[key]

        raise KeyError(f"Key {key} not found")

This post is licensed under CC BY 4.0 by the author.