Database Internals -- Key Data Structures
The 8 fundamental data structures that power how modern databases store, index, retrieve, and compact data -- from in-memory indexes (skip lists, hash tables) to disk structures (B+ trees, SSTables, LSM trees). Each trades off read speed vs write speed, and understanding these trade-offs is critical for system design.
The 8 Core Structures at a Glance
| # | Structure | In Memory? | Disk? | Complexity | Best For | Used In |
|---|---|---|---|---|---|---|
| 1 | Skip List | ✅ | ❌ | O(log n) search | Sorted in-memory index | Redis sorted sets, LevelDB MemTable |
| 2 | Hash Index | ✅ | ❌ | O(1) exact match | In-memory lookups | Python dict, Java HashMap, Redis hashes |
| 3 | B+ Tree | ❌ | ✅ | O(log n) seek | Balanced disk reads | PostgreSQL, MySQL, SQLite, MongoDB |
| 4 | SSTable | ❌ | ✅ | O(log n) via index | Immutable sorted file | Cassandra, LevelDB, RocksDB write path |
| 5 | LSM Tree | ✅+❌ | ✅ | O(1) write, O(log n) read | Write-heavy workloads | RocksDB, Cassandra, InfluxDB, LevelDB |
| 6 | Inverted Index | ❌ | ✅ | O(k) term lookup | Full-text search | Elasticsearch, Lucene, PostgreSQL FTS |
| 7 | Suffix Tree | ❌ | ✅ | O(m) pattern match | Substring search | DNA sequencing, Aho-Corasick, grep |
| 8 | R-Tree | ❌ | ✅ | O(log n) spatial | Multi-dimensional range | PostGIS, SQLite, location search |
Structure 1: Skip List
What it is: Probabilistic data structure that mirrors a sorted linked list but with “express lanes” (higher levels) that skip ahead. Provides O(log n) search by skipping roughly n/2, n/4, n/8… elements.
Key insight: Skip lists are simpler than balanced trees (no rotations) yet offer the same expected O(log n) complexity. Randomization keeps the levels balanced without complex rebalancing logic.
Real systems:
- Redis Sorted Sets: Primary implementation for ZADD, ZRANGE, ZRANK operations. Each sorted set member is a (score, value) pair indexed by both score (skip list) and value (hash table).
- LevelDB MemTable: Writes go into an in-memory skip list (MemTable) before flushing to disk as SSTable. Provides fast insertion and range iteration.
```python
import random

class SkipNode:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level  # forward[i] = next node at level i
        self.level = level

class SkipList:
    def __init__(self, max_level=16, p=0.5):
        self.max_level = max_level
        self.p = p
        self.header = SkipNode(None, max_level)  # sentinel head node
        self.level = 1

    def _random_level(self):
        """Generate random level for new node (coin flips)."""
        level = 1
        while random.random() < self.p and level < self.max_level:
            level += 1
        return level

    def insert(self, value):
        """Insert value into skip list."""
        new_level = self._random_level()
        if new_level > self.level:
            self.level = new_level
        new_node = SkipNode(value, new_level)
        update = [self.header] * self.max_level
        # Find the insertion position at each level, top-down
        x = self.header
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].value < value:
                x = x.forward[i]
            update[i] = x
        # Splice the new node in at all of its levels
        for i in range(new_level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def search(self, value):
        """Search for value -- O(log n) expected."""
        x = self.header
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].value < value:
                x = x.forward[i]
        x = x.forward[0]
        return x is not None and x.value == value
```
Structure 2: Hash Index
What it is: Standard hash table (dictionary/map). O(1) average-case lookup, insertion, deletion. Used for exact-key lookups, not range queries.
Key trade-off: Fast exact-match, but no range capability (unlike tree indexes). Hash collisions degrade to O(n) in pathological cases (mitigated by good hash functions and load factor management).
Real systems:
- All hash maps/dicts — Python dict, Java HashMap, Go map, Rust HashMap
- PostgreSQL hash indexes — For equality lookups only (not range queries)
- Redis Hashes (HSET/HGET) — Each hash field is a hash table entry
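The exact-match-only trade-off is easy to see in a minimal separate-chaining sketch (illustrative only; `HashIndex` and its method names are made up here):

```python
class HashIndex:
    """Minimal separate-chaining hash index -- a sketch, not production code."""

    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))

    def get(self, key):
        # O(1) average: hash straight to one bucket, scan its short chain
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None

idx = HashIndex()
idx.put("user:42", "alice")
print(idx.get("user:42"))  # exact match works
# A query like "all keys between user:10 and user:50" would require
# scanning every bucket -- hashing destroys key ordering.
```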
Structure 3: B+ Tree
What it is: Balanced disk-based tree where all keys are stored in leaves, and internal nodes are index pointers. Fanout of 100-200 means a 4-level tree holds millions of keys. All reads access the same number of disk pages (O(log n) pages).
Key property: Leaf pages are linked — allows efficient range queries (e.g., SELECT * FROM users WHERE age BETWEEN 30 AND 40). Internal nodes store only index boundaries, not data.
Page structure (PostgreSQL example):
- Page size: 8 KB (on disk)
- Fanout: ~200 keys per internal node, ~100 data rows per leaf
- Fill factor: Default 90% (leaves filled to 90% to reduce fragmentation)
- Example: A 4-level B+ tree with fanout 200 holds 200 × 200 × 200 × 100 = 800M rows, requiring only 4 disk seeks in worst case
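The 800M-row figure is just the fanout product; a quick sanity check of the arithmetic above:

```python
fanout, rows_per_leaf, levels = 200, 100, 4
# 3 internal levels of fanout 200, then leaf pages holding ~100 rows each
capacity = fanout ** (levels - 1) * rows_per_leaf
print(f"{capacity:,} rows reachable in {levels} page reads")  # 800,000,000
```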
Real systems:
- PostgreSQL: B+ tree for all primary and secondary indexes
- MySQL InnoDB: B+ tree for clustered index (table data lives in leaves)
- SQLite: B+ tree implementation with 4KB pages
- MongoDB: B+ tree for indexes (document IDs are sorted)
```python
class BTreeNode:
    """
    A node in a B-Tree. Each node has keys (sorted) and children (if internal).
    Order is the branching factor: each internal node has at most `order` children.
    Each node is written to a single disk page for locality.
    """
    def __init__(self, leaf=True, order=4):
        self.keys = []
        self.children = []  # Only for internal nodes
        self.leaf = leaf
        self.order = order
        # Use an even order so max_keys is odd and a full node splits into
        # two non-empty halves around its middle key
        self.max_keys = order - 1

    def is_full(self):
        """Check if node has reached maximum capacity."""
        return len(self.keys) >= self.max_keys

    def split_child(self, i):
        """
        Split the full child at index i.
        Move the middle key up to parent, split child into two nodes.
        Why: keeps the tree balanced -- prevents chains of single keys.
        """
        child = self.children[i]
        mid_idx = self.max_keys // 2
        # Create new right sibling
        right = BTreeNode(leaf=child.leaf, order=self.order)
        right.keys = child.keys[mid_idx + 1:]
        if not child.leaf:
            right.children = child.children[mid_idx + 1:]
        # Trim left child and move middle key up
        mid_key = child.keys[mid_idx]
        child.keys = child.keys[:mid_idx]
        if not child.leaf:
            child.children = child.children[:mid_idx + 1]
        # Insert middle key into parent
        self.keys.insert(i, mid_key)
        self.children.insert(i + 1, right)

    def insert_non_full(self, key):
        """
        Insert key into this non-full node.
        If leaf, insert directly. If internal, recurse into appropriate child.
        Why: insertion always happens in leaves; splits propagate upward.
        """
        idx = len(self.keys) - 1
        if self.leaf:
            # Insert directly into leaf (maintaining sort order)
            self.keys.append(None)
            while idx >= 0 and self.keys[idx] > key:
                self.keys[idx + 1] = self.keys[idx]
                idx -= 1
            self.keys[idx + 1] = key
        else:
            # Find child to recurse into
            while idx >= 0 and self.keys[idx] > key:
                idx -= 1
            idx += 1
            # If child is full, split it first
            if self.children[idx].is_full():
                self.split_child(idx)
                if self.keys[idx] < key:
                    idx += 1
            # Recurse
            self.children[idx].insert_non_full(key)

    def search(self, key):
        """Search for key in this subtree (O(log n) pages)."""
        idx = 0
        while idx < len(self.keys) and key > self.keys[idx]:
            idx += 1
        if idx < len(self.keys) and key == self.keys[idx]:
            return True  # Found
        if self.leaf:
            return False  # Not found
        return self.children[idx].search(key)

class BTree:
    """
    B-Tree database index. All insertions/searches access O(log n) disk pages.
    An order (fanout) of 100-200 means a 4-level tree holds millions of rows;
    order=4 here keeps examples small.
    Real systems: PostgreSQL, MySQL InnoDB, SQLite all use B-Trees (or B+ variants).
    """
    def __init__(self, order=4):
        self.root = BTreeNode(leaf=True, order=order)
        self.order = order

    def insert(self, key):
        """Insert key into tree (O(log n) disk operations)."""
        if self.root.is_full():
            # Split root if needed (the only way the tree grows in height)
            new_root = BTreeNode(leaf=False, order=self.order)
            new_root.children.append(self.root)
            new_root.split_child(0)
            self.root = new_root
        self.root.insert_non_full(key)

    def search(self, key):
        """Search for key (O(log n) disk operations)."""
        return self.root.search(key)
```
B-Tree Properties in Production:
- Page size: 8 KB (PostgreSQL) or 4 KB (SQLite) — determines fanout and depth
- Fanout: ~100-200 keys per page (determines tree depth: O(log n))
- Durability: Pages are fsync’d to disk on commit; survives crashes
- Write amplification: Each insert may trigger page splits (small overhead)
Structure 4: SSTable (Sorted String Table)
What it is: Immutable, sorted key-value file on disk. Keys are in ascending order; values are arbitrary. Typically 2-64 MB in size. Used as the output of flushing in-memory data to disk (e.g., RocksDB flush, Cassandra flush).
Structure on disk:
[Key1, Value1] [Key2, Value2] ... [KeyN, ValueN]   <- sorted data blocks
[every Nth key -> its file offset]                 <- sparse index block
[Bloom filter bits]                                <- optional filter block
An SSTable ends with an index block (sparse index) listing every Nth key and its file offset. To find a key, binary search the index, then scan the matching block on disk.
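The sparse-index lookup is two steps: binary search the in-memory index, then read one block. A sketch (the keys, offsets, and 4 KB block size below are hypothetical):

```python
import bisect

# Sparse index: every Nth key with its byte offset in the SSTable file
index_keys    = ["apple", "grape", "mango", "peach"]
index_offsets = [0, 4096, 8192, 12288]

def block_for(key):
    """Return the file offset of the only block that could contain `key`."""
    # Rightmost index entry whose key is <= the search key
    i = bisect.bisect_right(index_keys, key) - 1
    return index_offsets[max(i, 0)]

print(block_for("kiwi"))  # falls in the block starting at "grape" -> 4096
```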
Bloom filter optimization: Each SSTable has an optional Bloom filter. Negative lookup in Bloom filter = key is definitely not in this SSTable (skip I/O).
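A minimal Bloom filter sketch, deriving k positions from SHA-256 (an assumption for brevity; real SSTable filters use faster non-cryptographic hashes and tuned m/k):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions set in an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False => key is definitely absent (safe to skip this SSTable's I/O)
        # True  => key is *possibly* present (must still read the file)
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True
```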
Real systems:
- Cassandra: Each compaction output is an SSTable. Multiple SSTables per table, read queries check all.
- LevelDB/RocksDB: Flush outputs SSTable. Levels contain multiple SSTables.
- BigTable: GFS-based SSTable format with Bloom filters and index blocks.
Structure 5: LSM Tree (Log-Structured Merge Tree)
What it is: Write-optimized structure combining in-memory index (MemTable, skip list or B+ tree) + disk layers (L0, L1, L2…). Writes are always appended to MemTable (O(1)), avoiding random disk I/O. Periodically, MemTable is flushed to L0 SSTables; compaction merges overlapping layers.
Write path (simplified):
- Write to WAL (write-ahead log) — ensures durability
- Insert into MemTable (in-memory skip list)
- Once MemTable reaches size limit (default 64 MB in RocksDB), flush to L0 SSTable
- Compaction merges L0 → L1 → L2 in background
Read path:
- Check MemTable (O(log n))
- Check L0 SSTables via Bloom filter (O(k log n) where k = number of L0 tables)
- Check L1, L2… (each level has higher compression, fewer files)
Key metrics (RocksDB defaults):
- MemTable size: 64 MB
- L0 file size: 64 MB
- Level multiplier: 10 (L1 = 10 × L0, L2 = 10 × L1, etc.)
- Compaction strategy: Leveled (each level stores 10× previous level size, ensuring bounded read amplification)
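A back-of-envelope for the leveled layout above (treating L0 as a single 64 MB budget, which is a simplification; real RocksDB sizes L1+ by separate byte targets):

```python
l0_mb, multiplier, num_levels = 64, 10, 5
sizes = [l0_mb * multiplier ** i for i in range(num_levels)]  # L0..L4 in MB
print(sizes)  # [64, 640, 6400, 64000, 640000]
# Geometric growth is why only a handful of levels hold terabytes,
# and why a read touches at most one table per level below L0.
```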
Compaction strategies:
Size-tiered compaction: Compact tables when count exceeds threshold. Simpler but unpredictable write amplification (50-100x). Used in older Cassandra.
Leveled compaction: Each level has fixed size; compact only overlapping tables. More deterministic write amplification (~10x) but higher CPU. Used in RocksDB, modern Cassandra.
Real systems:
- RocksDB: Leveled + Size-tiered hybrid, 10+ levels
- Cassandra: LSM with SSTables, tunable compaction
- InfluxDB: LSM-style for time-series writes
- LevelDB: Google’s original LSM implementation
```python
class LSMTree:
    """
    Sketch of the LSM read/write paths. `WAL`, `SkipList`, and `SSTable` are
    assumed helper classes (a runnable version appears in the implementation
    example at the end of this article); the MemTable is assumed to return
    the stored value or None on lookup.
    """
    def __init__(self):
        self.memtable = SkipList()          # in-memory MemTable
        self.levels = [[], [], [], [], []]  # L0..L4 SSTables
        self.wal = WAL()

    def write(self, key, value):
        """Write path: WAL -> MemTable -> SSTable (flush) -> compaction."""
        self.wal.append((key, value))       # 1. durability first
        self.memtable.insert((key, value))  # 2. O(log n) in-memory insert
        if self.memtable.size_mb > 64:      # 3. flush full MemTable to L0
            self._flush_memtable_to_l0()
        if len(self.levels[0]) > 4:         # 4. compact L0 when it gets crowded
            self._compact_level(0, 1)

    def read(self, key):
        """Read path: MemTable -> L0 -> L1 -> L2 ..."""
        if value := self.memtable.search(key):
            return value
        # L0 SSTables may overlap, so each must be checked -- the Bloom
        # filter lets most of them be skipped without any disk I/O
        for sstable in self.levels[0]:
            if key in sstable.bloom_filter and key in sstable:
                return sstable[key]
        # L1+ levels: fewer tables per level, better compression
        for level in self.levels[1:]:
            for sstable in level:
                if key in sstable:
                    return sstable[key]
        return None

    def _compact_level(self, src_level, dst_level):
        """
        Compact SSTables from src_level into dst_level: merge them,
        eliminate duplicates (keep latest), rewrite as one sorted table.
        Why: fewer files per level means faster reads, and shadowed
        writes are dropped, reclaiming space.
        """
        if src_level >= len(self.levels) or not self.levels[src_level]:
            return  # nothing to compact
        # Collect all key-value pairs from src_level SSTables
        merged_items = []
        for sstable in self.levels[src_level]:
            merged_items.extend(sstable.items)
        # Sort and deduplicate (keep last occurrence of each key)
        merged_items.sort(key=lambda x: x[0])
        deduplicated = {}
        for key, value in merged_items:
            deduplicated[key] = value  # later writes shadow earlier ones
        # Create new SSTable in dst_level
        new_sstable = SSTable(
            f"sstable_l{dst_level}_{len(self.levels[dst_level])}.db",
            sorted(deduplicated.items()),
        )
        self.levels[dst_level].append(new_sstable)
        self.levels[src_level] = []  # old tables are now replaced
```
Structure 6: Inverted Index
What it is: Mapping from term → list of documents containing that term. Core of full-text search. When you search “machine learning”, the index looks up docs containing both “machine” AND “learning”.
Structure:
Term | Postings List (compressed)
"machine" | [Doc1, Doc3, Doc5, Doc7, ...]
"learning" | [Doc1, Doc2, Doc4, Doc6, ...]
"algorithm" | [Doc2, Doc5, Doc8, ...]
Compression: Postings lists are usually compressed (delta encoding, variable-byte encoding) because they’re large (billions of entries in web-scale systems).
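A minimal sketch of the delta + variable-byte scheme described above (function names are my own):

```python
def varint_encode(n):
    """Variable-byte encoding: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += varint_encode(doc_id - prev)  # gaps are small -> few bytes
        prev = doc_id
    return bytes(out)

raw = [1000000, 1000003, 1000007]  # three large, close-together doc IDs
compressed = compress_postings(raw)
print(len(compressed))  # gaps 1000000, 3, 4 -> 3 + 1 + 1 = 5 bytes
```

Three 8-byte integers shrink to 5 bytes because only the first gap is large; dense postings lists compress dramatically.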
Real systems:
- Elasticsearch: Lucene inverted index — term dictionary (a trie/FST of terms) + compressed postings lists per segment
- PostgreSQL full-text search: Inverted index on tsvector columns
- Lucene: Java library, many search engines built on it
```python
class InvertedIndex:
    def __init__(self):
        self.index = {}  # term -> [doc_ids]

    def add_document(self, doc_id, terms):
        """Add document with given terms."""
        for term in terms:
            self.index.setdefault(term, []).append(doc_id)

    def search(self, query_terms):
        """
        Intersect postings lists for an AND query.
        E.g., search(["machine", "learning"]) returns docs with BOTH terms.
        """
        if not query_terms:
            return []
        postings_lists = [
            set(self.index.get(term, []))
            for term in query_terms
        ]
        # Start with the smallest postings list -- each intersection can
        # only shrink the result, so this minimizes comparisons
        postings_lists.sort(key=len)
        result = postings_lists[0]
        for postings in postings_lists[1:]:
            result &= postings
        return sorted(result)
```
Structure 7: Suffix Tree
What it is: Compressed trie of all suffixes of a string. Enables O(m) substring search where m = pattern length (independent of text length). Requires O(n) space.
Use case: Find all occurrences of pattern “GCTA” in DNA sequence (millions of bases). Suffix tree makes this instant.
Trade-off: High space overhead (10-20x text size) limits practical use. Suffix arrays + binary search often used instead.
Real systems:
- DNA sequence search — bioinformatics
- Aho-Corasick automaton — multiple pattern matching
- String algorithms — pattern matching in competitive programming
Note: Suffix Trees are complex enough to merit their own deep-dive topic. A full implementation (e.g., Ukkonen's algorithm) requires careful handling of node compression, edge labels, and suffix links. For this reference, the focus remains on their role in the 8-structure ecosystem.
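The suffix-array alternative mentioned above is simple enough to sketch (naive O(n² log n) construction; a production version would use a linear-time build and binary-search with per-comparison slicing rather than materializing prefixes):

```python
import bisect

def suffix_array(text):
    """Sorted start positions of all suffixes (naive build for illustration)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All positions where `pattern` occurs, via binary search on the array."""
    m = len(pattern)
    # Length-m prefixes of each suffix, in suffix-sorted order -- also
    # sorted, so the matching range is found with two binary searches
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

dna = "GCTAGCTAGG"
sa = suffix_array(dna)
print(find_occurrences(dna, sa, "GCTA"))  # [0, 4]
```

Space cost is one integer per character instead of the 10-20x overhead of a full suffix tree, which is why suffix arrays dominate in practice.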
Structure 8: R-Tree
What it is: Balanced tree for spatial indexing. Each node is a bounding box (rectangle). Child bounding boxes fit inside parent. Enables range queries like “find all restaurants within 5km of (lat=40.7, lng=-74.0)”.
Real systems:
- PostGIS (PostgreSQL spatial extension): R-Tree indexes on geometry columns
- SQLite: R*Tree module for spatial queries
- Google Maps: Spatial indexing for location search
Note: R-Trees are complex to implement correctly (managing overlapping bounding boxes, balancing insertion/deletion, minimizing overlap). Like Suffix Trees, they merit their own deep-dive. The key insight is: when indexing multi-dimensional spatial data (not 1D keys), use spatial data structures instead of B+ trees.
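That key insight — prune whole subtrees by one bounding-box test — can still be sketched with a hand-built two-level tree (`RNode`, `search`, and the coordinates are made up for illustration; a real R-Tree also handles insertion and balancing):

```python
def intersects(a, b):
    """Axis-aligned boxes as (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class RNode:
    def __init__(self, box, children=None, points=None):
        self.box = box              # bounding box covering everything below
        self.children = children or []
        self.points = points or []  # only populated on leaves

def search(node, query):
    """Range query: descend only into children whose box overlaps `query`."""
    if not intersects(node.box, query):
        return []                   # whole subtree pruned by one comparison
    hits = [p for p in node.points
            if intersects((p[0], p[1], p[0], p[1]), query)]
    for child in node.children:
        hits += search(child, query)
    return hits

leaf1 = RNode((0, 0, 5, 5), points=[(1, 1), (4, 4)])
leaf2 = RNode((10, 10, 20, 20), points=[(12, 15)])
root = RNode((0, 0, 20, 20), children=[leaf1, leaf2])
print(search(root, (0, 0, 6, 6)))  # [(1, 1), (4, 4)] -- leaf2 never visited
```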
When to Use / Avoid Each Structure
✅ Use Hash Index When
- Exact-key lookups only (primary key lookups)
- No range queries needed
- Memory efficient for small tables
❌ Avoid Hash Index When
- Need range queries (WHERE age > 30)
- Key ordering matters
✅ Use B+ Tree When
- Balanced read/write workload
- Range queries are common (WHERE id BETWEEN 100 AND 200)
- Read latency is critical
- Data fits in available RAM for caching
❌ Avoid B+ Tree When
- Write-heavy workload (> 50% writes) — LSM is more efficient
- Need to avoid random I/O completely
✅ Use LSM Tree When
- Write-heavy workload (> 50% writes)
- Sequential I/O is cheaper than random I/O (SSDs, HDD)
- High-volume time-series data
- Can tolerate higher read latency (acceptable if read volume is low)
❌ Avoid LSM Tree When
- Read latency is critical and read volume is high
- Compaction I/O will interfere with application I/O
✅ Use Inverted Index When
- Full-text search (keyword search across documents)
- Relevance ranking needed
- Scale to billions of documents
❌ Avoid Inverted Index When
- Only structured queries (use B+ tree indexes instead)
Implementation Example: Simplified LSM Tree Write Path
```python
class SimpleWAL:
    def __init__(self, path):
        self.file = open(path, 'ab')

    def append(self, key, value):
        """Append log entry to WAL and flush for durability."""
        self.file.write(f"{key}:{value}\n".encode())
        self.file.flush()

    def close(self):
        self.file.close()

class SimpleSSTable:
    def __init__(self, path, sorted_items):
        """Create immutable SSTable from sorted key-value pairs."""
        self.path = path
        self.items = sorted_items
        self.write_to_disk()

    def write_to_disk(self):
        """Write sorted items to disk (production formats also compress blocks)."""
        with open(self.path, 'wb') as f:
            for key, value in self.items:
                f.write(f"{key}:{value}\n".encode())

    def __contains__(self, key):
        return any(k == key for k, _ in self.items)

    def __getitem__(self, key):
        for k, v in self.items:
            if k == key:
                return v
        raise KeyError(key)

class SimpleLSMTree:
    def __init__(self, memtable_limit_mb=64):
        self.memtable = {}  # In-memory dict (production: use a skip list)
        self.memtable_limit = memtable_limit_mb * 1024 * 1024
        self.levels = [[], [], [], []]  # L0, L1, L2, L3
        self.wal = SimpleWAL("wal.log")
        self.table_count = 0

    def write(self, key, value):
        """Write key-value pair: WAL -> MemTable -> Flush -> Compact."""
        # Write to WAL for durability, then to the MemTable
        self.wal.append(key, value)
        self.memtable[key] = value
        # Flush once the MemTable exceeds its limit
        if self._memtable_size() > self.memtable_limit:
            self._flush_memtable_to_l0()

    def _memtable_size(self):
        """Estimate MemTable size in bytes."""
        return sum(len(str(k)) + len(str(v)) for k, v in self.memtable.items())

    def _flush_memtable_to_l0(self):
        """Flush sorted MemTable to an L0 SSTable."""
        sorted_items = sorted(self.memtable.items())
        table_path = f"sstable_l0_{self.table_count}.db"
        self.levels[0].append(SimpleSSTable(table_path, sorted_items))
        self.memtable = {}
        self.table_count += 1
        # If L0 exceeds its file-count threshold, compact into L1
        if len(self.levels[0]) > 4:
            self._compact_level(0, 1)

    def _compact_level(self, src_level, dst_level):
        """
        Merge all SSTables from src_level into a single SSTable in dst_level.
        Deduplicates keys (later values shadow older ones) and re-sorts.
        Why: LSM compaction is the core of write optimization. It consolidates
        fragmented SSTables, reduces read amplification, and reclaims space.
        """
        if src_level >= len(self.levels) or not self.levels[src_level]:
            return  # nothing to compact
        # Merge all items from all SSTables in src_level
        merged_items = []
        for sstable in self.levels[src_level]:
            merged_items.extend(sstable.items)
        # Sort and deduplicate (keep only the latest value for each key)
        merged_items.sort(key=lambda x: x[0])
        deduplicated = {}
        for key, value in merged_items:
            deduplicated[key] = value
        # Write deduplicated, sorted items to a new SSTable in dst_level
        new_table_path = f"sstable_l{dst_level}_{len(self.levels[dst_level])}.db"
        new_sstable = SimpleSSTable(new_table_path, sorted(deduplicated.items()))
        self.levels[dst_level].append(new_sstable)
        # Clear src_level (old SSTables are now replaced)
        self.levels[src_level] = []

    def read(self, key):
        """Read key-value pair: MemTable -> L0 -> L1 -> L2 -> L3."""
        # Check MemTable first, then each level in order
        if key in self.memtable:
            return self.memtable[key]
        for tables in self.levels:
            for table in tables:
                if key in table:
                    return table[key]
        raise KeyError(f"Key {key} not found")
```
References
- The Log-Structured Merge-Tree — O’Neil et al. (1996) — Foundational LSM paper; explains why sequential writes beat random I/O.
- RocksDB Architecture and Design — Practical LSM with compaction strategies.
- PostgreSQL Index Types — Official Docs — B+ tree, hash, GiST, BRIN indexes.
- ByteByteGo — 8 Key Data Structures That Power Modern Databases — Visual walkthrough of all 8 structures.
- Cassandra SSTable Format — DataStax — SSTable structure, Bloom filters, compaction.
- Lucene Inverted Index — Elasticsearch index architecture.
- Wikipedia: B+ Tree — Balanced tree properties, fanout, fill factor.
- Wikipedia: Skip List — Probabilistic data structure, complexity analysis.