Database Indexing
An index trades write overhead and storage for dramatically faster reads by organizing data in a searchable structure -- choose the index type and key columns based on your query patterns.
Key Properties
| Property | B-Tree Index | LSM-Tree Index | Hash Index | Inverted Index |
|---|---|---|---|---|
| Point lookup | O(log n) | O(log n) | O(1) | O(n) |
| Range query | O(log n + k) | O(log n + k) | N/A | O(k) |
| Insert | O(log n) | O(1) amortized | O(1) | O(1) |
| Delete | O(log n) | O(1) amortized (tombstone write) | O(1) | O(1) |
| Space overhead | ~10–20% | 1.5–3x (compaction) | ~1.2x | 2–5x |
| Best for | OLTP, ranges | Write-heavy | Exact matches | Full-text search |
| Read latency | ~0.01–0.1ms | ~0.1–1ms | ~0.001ms | ~0.5–5ms |
B-Tree vs LSM-Tree Index Trade-offs
B-Tree Index (Most Common — PostgreSQL, MySQL, MongoDB)
Mechanism: Maintains a balanced tree structure where each internal node branches to ~300 children (PostgreSQL uses 8KB page size). Keys are sorted; leaf pages contain pointers to row data. When you insert a key into a full node, the node splits into two nodes, pushing the median key upward.
Page splits cost: When a leaf page overflows (e.g., adding the 367th key to a 366-key page), PostgreSQL writes: (1) the original leaf (now ~50% full), (2) new sibling leaf (~50% full), (3) parent internal node update, (4) potential cascading parent splits. This is why B-Tree insert is “expensive” (~100 microseconds per insert with disk I/O), but read performance dominates because a 10-million-row table needs only ~3 disk seeks.
Real-world B-Tree shape: PostgreSQL B-Tree on a 1-billion row table (8 bytes per key + 6 bytes overhead per entry = 14 bytes/entry). A page holds ~570 entries. Internal nodes fan out to ~300 children. Depth: log₃₀₀(1B) ≈ 3–4 levels. A point lookup touches ~3 pages (root→internal→leaf).
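The depth estimate above is quick to verify with a few lines of arithmetic, using the approximate figures quoted in the text (8KB pages, 14-byte entries, ~300-way fan-out):

```python
import math

# Approximate figures from the text above
PAGE_SIZE = 8 * 1024        # PostgreSQL default page size in bytes
ENTRY_SIZE = 14             # 8-byte key + 6-byte per-entry overhead
ROWS = 1_000_000_000
FANOUT = 300                # internal-node branching factor

entries_per_leaf = PAGE_SIZE // ENTRY_SIZE          # ~585 entries per leaf page
depth = math.ceil(math.log(ROWS, FANOUT))           # levels so FANOUT^depth >= ROWS

print(entries_per_leaf, depth)  # ~585 entries/leaf, depth 4
```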
When it dominates: OLTP databases where you read far more often than write. A 100:1 read/write ratio makes B-Tree’s slower writes worth the faster reads. Range queries (WHERE timestamp BETWEEN '2024-01-01' AND '2024-12-31') are efficient because consecutive keys are physically adjacent in leaf pages.
LSM-Tree Index (Cassandra, RocksDB, LevelDB, HBase)
Mechanism: Writes don’t touch the index structure directly. Instead, writes go to a fast in-memory structure (MemTable), then periodically flush to immutable SSTable files on disk organized in levels. Each level contains non-overlapping key ranges. On read, you check MemTable first, then L0 (newest SSTable files), then L1, L2, etc. until you find the key.
Write amplification: A write that goes through all levels incurs “amplification.” With 10 levels, a single write might be rewritten 10 times as it moves from MemTable→L0→L1→…→L9 during compaction. However, writes are sequential (append to MemTable, then SSTable files), which is 10–100x faster than random B-Tree page updates.
Read latency variance: LSM reads are slower and have higher variance. Finding a key in L9 requires checking and potentially reading from multiple files. With Bloom filters on each SSTable, you skip most files, but worst-case reads can touch 10+ files. B-Tree always touches ~3 pages, making latency predictable.
Real-world LSM usage: RocksDB at Meta/Facebook handles 1000x more writes than reads (logs, event streams). A write that would take 100 microseconds in B-Tree takes 10 microseconds in RocksDB because it’s just appending to MemTable. The cost is deferred: background compaction threads rewrite old data, but this happens off the critical path.
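The MemTable-plus-SSTables mechanism can be sketched in a few dozen lines. This is a hypothetical toy, not RocksDB's design: real engines add a write-ahead log, leveled compaction, Bloom filters, and binary search within SSTables.

```python
class TinyLSM:
    """Toy LSM-tree: in-memory MemTable that flushes to immutable sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # fast in-memory writes
        self.sstables = []           # flushed immutable segments, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never touch on-disk structures directly -- just the MemTable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush as an immutable, key-sorted segment (an "SSTable").
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: MemTable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:       # a real SSTable would binary-search here
                if k == key:
                    return v
        return None

lsm = TinyLSM()
for i in range(10):
    lsm.put(i, f"v{i}")
print(lsm.get(3), len(lsm.sstables))  # -> v3 2
```

Note how `put` is always an in-memory dict update: that is the write-path advantage the section describes, with the sorting cost deferred to `_flush` (and, in real systems, to background compaction).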
Trade-off Summary
| Scenario | Winner | Why |
|---|---|---|
| 10:1 read/write ratio | B-Tree | Amortize index maintenance cost over many reads. |
| 1:10 read/write ratio | LSM-Tree | Sequential writes dominate; read latency is acceptable. |
| Latency-sensitive reads (p99 < 10ms) | B-Tree | Predictable read path; no variance. |
| Throughput-sensitive writes (1M+ ops/sec) | LSM-Tree | Sequential I/O is 10x faster. |
| Point lookups only | Hash Index | O(1) is unbeatable; don’t need range queries. |
| Full-text search | Inverted Index | Only viable option; structured for term→document mapping. |
Index Types by Access Pattern
1. B-Tree Index (Default — Use First)
What: Sorted tree structure on disk; supports range queries and prefix matching.
When to use:
- ✅ Read-heavy workloads (OLTP)
- ✅ Range queries (`WHERE created_at > '2024-01-01'`)
- ✅ Prefix queries (`WHERE name LIKE 'Al%'`)
- ✅ Ordering without sort (`ORDER BY` uses the index)
Real systems: PostgreSQL primary index, MySQL InnoDB, MongoDB (WiredTiger engine)
2. Hash Index
What: Direct hash of key to byte offset; no ordering.
When to use:
- ✅ Exact equality lookups only (`WHERE id = 42`)
- ✅ In-memory caches (Redis HSET)
- ✅ High cardinality columns
Avoid:
- ❌ Range queries
- ❌ LIKE/prefix matching
- ❌ ORDER BY
Real systems: Redis hash tables, Memcached, in-memory hash maps
3. Inverted Index
What: Maps each unique term/word to a list of document IDs (or positions) containing that term.
Structure:
```
Term "distributed" → [doc_1, doc_5, doc_12, ...]
Term "systems"     → [doc_1, doc_3, doc_7, ...]
```
When to use:
- ✅ Full-text search (`WHERE body @@ 'distributed systems'`)
- ✅ Multi-word queries with boolean operators (AND, OR, NOT)
- ✅ Phrase queries (`"exact phrase"`)
Real systems: Elasticsearch (Lucene), PostgreSQL tsvector, MySQL FULLTEXT
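A minimal sketch of the term→document mapping and an AND query as a posting-list intersection (toy documents, stdlib only; real engines like Lucene add tokenization, positions, compression, and scoring):

```python
from collections import defaultdict

# Toy corpus: doc_id -> text
docs = {
    1: "distributed systems are hard",
    3: "systems programming in rust",
    5: "distributed consensus algorithms",
}

# Build the inverted index: term -> sorted posting list of doc IDs
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

def search_and(*terms):
    """Documents containing ALL terms: intersect the posting lists."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("distributed", "systems"))  # -> [1]
```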
4. Bitmap Index
What: For each distinct value, maintain a bitmap of rows containing that value. Useful for low-cardinality columns.
Example (status column with 3 values):
```
'active'   → 1011010101... (bit i is 1 if row i has status='active')
'inactive' → 0100101010...
'pending'  → 0000000001...
```
When to use:
- ✅ Low-cardinality columns (< 100 distinct values)
- ✅ Columns that are frequently filtered (status, country, month)
- ✅ Complex boolean filters (`status='active' AND country='DE'`)
Space efficiency: each bitmap on a 100-million row table is 100M bits ÷ 8 bits per byte = 12.5 MB, so 5 status values take ~62.5 MB uncompressed (run-length encoding shrinks this much further on skewed data), vs ~400 MB for a B-Tree index.
Real systems: Oracle (supports bitmap indexes natively), Vertica, Druid
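Python integers work as arbitrary-length bit vectors, which makes the mechanism easy to sketch: bit i of the bitmap for value v is 1 iff row i has that value, and boolean filters become single bitwise operations.

```python
# Toy bitmap index over a low-cardinality "status" column
rows = ["active", "inactive", "active", "pending", "active"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set bit i for this value

def matching_rows(bitmap, n_rows):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(n_rows) if bitmap & (1 << i)]

# WHERE status = 'active'
print(matching_rows(bitmaps["active"], len(rows)))                       # -> [0, 2, 4]

# WHERE status = 'active' OR status = 'pending'  (one bitwise OR)
print(matching_rows(bitmaps["active"] | bitmaps["pending"], len(rows)))  # -> [0, 2, 3, 4]
```

The AND/OR of two filters touches only the bitmaps, never the rows -- that is why bitmap indexes excel at complex boolean predicates over low-cardinality columns.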
5. Geospatial Index (R-Tree, Quadtree, Geohash)
What: Organizes spatial data (lat/lng points) into hierarchical bounding boxes or quadrants.
R-Tree structure: Each internal node is a bounding rectangle containing child rectangles. A point query finds the bounding box, then recursively searches children.
Geohash variant (Uber, Redis GEO): Encodes lat/lng as interleaved bits, creating a Z-order (Morton) curve. Close geographic points hash to nearby strings, enabling prefix-based range queries.
When to use:
- ✅ Geographic queries (`ST_Distance(location, '(0,0)') < 1km`)
- ✅ Bounding box queries (`ST_Within(location, bbox)`)
- ✅ Nearest-neighbor search
Real systems: PostGIS (PostgreSQL), MongoDB (2dsphere), Redis (GEOADD), Google S2
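The bit-interleaving idea behind geohash can be shown in a short sketch (illustration only -- real geohash then base32-encodes the bits 5 at a time; the coordinates below are arbitrary example points):

```python
def geobits(lat, lon, precision=16):
    """Interleave lon/lat bisection bits, geohash-style (even bits = longitude)."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    for i in range(precision):
        if i % 2 == 0:                                  # even bits encode longitude
            mid = (lon_range[0] + lon_range[1]) / 2
            bits.append("1" if lon >= mid else "0")
            lon_range[0 if lon >= mid else 1] = mid     # keep the matching half
        else:                                           # odd bits encode latitude
            mid = (lat_range[0] + lat_range[1]) / 2
            bits.append("1" if lat >= mid else "0")
            lat_range[0 if lat >= mid else 1] = mid
    return "".join(bits)

berlin = geobits(52.52, 13.40)
potsdam = geobits(52.40, 13.05)    # nearby: shares a long bit prefix
sydney = geobits(-33.87, 151.21)   # far away: diverges within the first bits
print(berlin, potsdam, sydney)
```

Because nearby points share long prefixes, a "points within this cell" query becomes an ordinary prefix range scan on an ordinary sorted index -- no special spatial structure required.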
6. Covering Index (Nonkey Index)
What: A B-Tree index that includes non-key columns, allowing the entire result set to be read from the index leaf pages without touching the main table.
Example:
```sql
CREATE INDEX idx_user_name_email ON users(user_id, email) INCLUDE (name, created_at);
-- Query: SELECT name, email FROM users WHERE user_id = 42
-- Can be satisfied entirely from index — no table lookup needed
```
Cost: Covering indexes are larger (include extra columns), so they increase write cost and memory usage, but eliminate table I/O for qualifying queries.
Real systems: PostgreSQL INCLUDE clause (v11+), SQL Server, MySQL covering indexes
7. Partial Index (Filtered Index)
What: Index only rows matching a WHERE condition, reducing index size and write cost.
Example:
```sql
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Indexes only 10% of users (most are inactive)
-- 90% smaller index size, 90% fewer index updates
```
When to use:
- ✅ Most queries filter by the same condition
- ✅ Skewed data (99% inactive, 1% active)
Real systems: PostgreSQL, SQL Server (filtered indexes)
8. Composite Index (Multi-Column)
What: Index on multiple columns in a specific order. Supports leftmost-prefix matching.
Left-prefix rule example:
```sql
-- Index on (country, city, name)
-- ✅ USES index:  WHERE country = 'DE'
-- ✅ USES index:  WHERE country = 'DE' AND city = 'Berlin'
-- ✅ USES index:  WHERE country = 'DE' AND city = 'Berlin' AND name = 'Alice'
-- ❌ DOESN'T USE: WHERE city = 'Berlin' (no leftmost prefix 'country')
-- ❌ DOESN'T USE: WHERE name = 'Alice' (skips both prefixes)
```
Why: B-Tree keeps keys sorted left-to-right. The first column defines the major sort order; you can’t efficiently search by the 2nd column if you don’t know the 1st.
Cost: Each added column increases index size and write overhead. A common guideline is to keep composite indexes to 3–4 columns.
How Real Systems Use Indexing
PostgreSQL B-Tree Internals
PostgreSQL uses a B+ Tree (separator keys in internal nodes, data pointers in leaf nodes) with a configurable page size (default 8KB). A leaf page holds a few hundred entries, and internal pages fan out to ~300 children. When you insert into a full leaf page, PostgreSQL splits it: entries are divided between the old page and a new sibling, and a separator key is pushed into the parent (splits can cascade upward). On a billion-row table, the tree has depth 3–4, so a point lookup performs 3–4 page reads (root→internal→leaf), each costing a random seek (~10 milliseconds on spinning disk, ~0.01–0.1ms on SSD) when uncached. Range queries are efficient because keys within a leaf page are sorted and leaf pages are linked, so scanning 1000 index entries after the first lookup costs only a handful of additional page reads. This is why PostgreSQL dominates OLTP: predictable read latency and excellent range query performance.
MySQL InnoDB Page Splits
MySQL InnoDB organizes its B-Tree into 16KB pages (larger than PostgreSQL's 8KB). When a page splits, InnoDB doesn't always split 50/50; it biases the split toward the direction of insertion. If you insert monotonically increasing keys (e.g., timestamps), InnoDB leaves the old page nearly full and directs new keys into a mostly-empty new page, reducing future splits on the hot right edge of the index. This optimization reduces write amplification for time-series data. A production MySQL server might sustain 10,000 INSERTs/sec on an auto-increment primary key; without this bias, far more insertions would trigger expensive page reorganization, whereas with it only a small fraction of inserts actually split a page, deferring the cost.
Cassandra SSTable Indexes
Cassandra uses LSM-Tree with per-SSTable Bloom filters and partition indexes. Each SSTable file contains a summary of which partition keys it holds (a sparse index — 1 entry per 128KB of SSTable). When reading a key, Cassandra checks the partition summary to filter SSTables, then checks the Bloom filter to avoid disk I/O if the key is definitely not in that SSTable. For a write-heavy table with thousands of SSTables, this combination allows negative lookups (checking 1000 SSTables for a missing key) to be answered without touching disk. The Bloom filter is sized for ~1% false positive rate (10 bits per key), costing only ~1.25 bytes per entry, yet saving 99% of disk seeks for reads that miss.
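The Bloom filter mechanism is small enough to sketch directly. This is a toy with assumed parameters (1024 bits, 7 hashes derived from SHA-256), not Cassandra's implementation; the key property is that a clear bit proves absence, so negative lookups skip disk entirely.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per key over an m-bit vector."""

    def __init__(self, m_bits=1024, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0                     # Python int as an m-bit vector

    def _positions(self, key: str):
        # Derive k independent positions by salting the key (illustrative scheme)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False -> definitely absent; True -> probably present (false positives possible)
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bf.add(key)
print(bf.might_contain("user:2"), bf.might_contain("user:999"))
```

An LSM read checks each SSTable's filter first and only opens files whose filter answers "probably present" -- the ~1% of false positives is the price of never missing a real key.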
Elasticsearch Inverted Index Structure
Elasticsearch (built on Lucene) creates a term dictionary mapping each unique word to a posting list (a list of documents plus term positions). For the term “distributed,” the posting list might be [doc_1@pos_5, doc_5@pos_12, doc_12@pos_2, …]. The dictionary itself is an FST (Finite State Transducer) — a compressed structure supporting fast in-memory prefix lookups. For a corpus with 50GB of text, the inverted index is typically 1–2GB (2–4% of the source data). A full-text query (“distributed systems”) looks each term up in the dictionary, intersects the posting lists (documents appearing in both “distributed” AND “systems”), then scores the results for relevance (TF-IDF or BM25). A query that would require sequentially scanning all 50GB of source data completes in <100ms using the inverted index.
PostGIS R-Tree for Geospatial
PostGIS maintains an R-Tree index on geographic columns, organizing spatial objects into hierarchical bounding rectangles. For a table of 10-million geographic points, the R-Tree has depth ~5–6 levels. A nearest-neighbor query (SELECT * FROM locations ORDER BY location <-> point(0,0) LIMIT 10) uses the R-Tree to traverse the tree in distance order, finding the 10 closest points without checking all 10 million. On real production data, this query completes in 1–2ms, whereas a full-table scan would require 100+ seconds. Bounding box queries (SELECT * FROM polygons WHERE ST_Contains(polygon, point(0,0))) use the R-Tree to eliminate 99%+ of polygons before performing expensive geometric calculations.
Redis Sorted Set (Skip List Index)
Redis Sorted Set is implemented as a skip list (a probabilistic balanced structure), not a B-Tree. Each element has a score, and the skip list keeps elements sorted by score with multiple levels of forward shortcuts; the expected height is O(log n) — roughly 10 levels for a million elements with Redis's level probability of 0.25. Range queries (ZRANGEBYSCORE myzset 10 20) are O(log n) to find the start, then O(k) to return k elements. Skip lists suit in-memory workloads: they are simple to implement, cheap to update, and support range scans via the bottom-level linked list. Redis uses sorted sets for leaderboards (score = player ranking), rate-limiting windows (score = timestamp), and delayed job queues (score = scheduled execution time).
Implementation — B-Tree Index Simulation
```python
from typing import List, Tuple, Optional


class BTreeNode:
    """
    Represents a node in a B-Tree.

    Each node contains:
    - keys: sorted list of key values
    - values: data associated with each key (in a classic B-Tree,
      internal nodes store values too, not just leaves)
    - children: list of child nodes (empty if leaf)
    - is_leaf: boolean indicating if this is a leaf node
    """

    def __init__(self, is_leaf=True, max_keys=3):
        """
        Initialize a B-Tree node.

        Args:
            is_leaf: whether this is a leaf node
            max_keys: maximum keys per node (controls branching factor)
        """
        self.keys: List[int] = []
        self.values: List[str] = []
        self.children: List["BTreeNode"] = []
        self.is_leaf = is_leaf
        self.max_keys = max_keys

    def is_full(self) -> bool:
        """Check if node has reached max capacity."""
        return len(self.keys) >= self.max_keys

    def insert_non_full(self, key: int, value: str):
        """
        Insert a key-value pair into a non-full node.
        This is the core insertion logic.
        """
        i = len(self.keys) - 1
        if self.is_leaf:
            # Insert into leaf directly: shift larger keys right to make room
            self.keys.append(None)
            self.values.append(None)
            while i >= 0 and self.keys[i] > key:
                self.keys[i + 1] = self.keys[i]
                self.values[i + 1] = self.values[i]
                i -= 1
            self.keys[i + 1] = key
            self.values[i + 1] = value
        else:
            # Find child to insert into
            while i >= 0 and self.keys[i] > key:
                i -= 1
            i += 1
            # If child is full, split it first
            if self.children[i].is_full():
                self.split_child(i)
                # After the split, decide which of the two halves to descend into
                if self.keys[i] < key:
                    i += 1
            self.children[i].insert_non_full(key, value)

    def split_child(self, i: int):
        """
        Split a full child node. This is what happens when a node overflows.

        Example: child with keys [3, 6, 9] and max_keys=3 is full. Splitting it:
        - Left half keeps [3], new sibling gets [9]
        - The median key (6) and its value are promoted into this parent
        """
        full_child = self.children[i]
        mid_index = len(full_child.keys) // 2
        mid_key = full_child.keys[mid_index]
        mid_value = full_child.values[mid_index]

        # Create new sibling node for the upper half
        new_node = BTreeNode(is_leaf=full_child.is_leaf, max_keys=self.max_keys)
        new_node.keys = full_child.keys[mid_index + 1:]
        new_node.values = full_child.values[mid_index + 1:]
        full_child.keys = full_child.keys[:mid_index]
        full_child.values = full_child.values[:mid_index]

        # Split children (if not leaf)
        if not full_child.is_leaf:
            new_node.children = full_child.children[mid_index + 1:]
            full_child.children = full_child.children[:mid_index + 1]

        # Promote the median key (and its value) into the parent
        self.keys.insert(i, mid_key)
        self.values.insert(i, mid_value)
        self.children.insert(i + 1, new_node)

    def search(self, key: int) -> Optional[str]:
        """
        Search for a key in the B-Tree.
        Returns the associated value or None if not found.
        """
        i = 0
        # Find the first key >= search key
        while i < len(self.keys) and key > self.keys[i]:
            i += 1
        # Found in this node (leaf or internal — both hold values)
        if i < len(self.keys) and key == self.keys[i]:
            return self.values[i]
        # If leaf, key not found
        if self.is_leaf:
            return None
        # Recurse to child
        return self.children[i].search(key)


class BTreeIndex:
    """
    A simple B-Tree index implementation demonstrating core operations.
    Real databases use much larger branching factors (300+ keys per node).
    This example uses small max_keys (3-4) to illustrate splits.
    """

    def __init__(self, max_keys=3):
        """Initialize B-Tree with given maximum keys per node."""
        self.root = BTreeNode(is_leaf=True, max_keys=max_keys)
        self.max_keys = max_keys

    def insert(self, key: int, value: str):
        """
        Insert a key-value pair into the B-Tree.
        If root is full, create a new root (increases tree height).
        """
        if self.root.is_full():
            new_root = BTreeNode(is_leaf=False, max_keys=self.max_keys)
            new_root.children.append(self.root)
            new_root.split_child(0)
            self.root = new_root
        self.root.insert_non_full(key, value)

    def search(self, key: int) -> Optional[str]:
        """Search for a key in the B-Tree."""
        return self.root.search(key)

    def range_query(self, key_min: int, key_max: int) -> List[Tuple[int, str]]:
        """
        Return all (key, value) pairs where key_min <= key <= key_max.
        A real B+ Tree does this even more cheaply: leaf pages are linked,
        allowing sequential traversal after the first lookup.
        """
        result: List[Tuple[int, str]] = []
        self._range_query_helper(self.root, key_min, key_max, result)
        return sorted(result)

    def _range_query_helper(self, node: BTreeNode, key_min: int, key_max: int, result: List):
        """Recursively collect in-range keys (internal nodes hold keys too)."""
        i = 0
        while i < len(node.keys) and node.keys[i] < key_min:
            i += 1
        # The subtree left of the first in-range key may still contain matches
        if not node.is_leaf:
            self._range_query_helper(node.children[i], key_min, key_max, result)
        while i < len(node.keys) and node.keys[i] <= key_max:
            result.append((node.keys[i], node.values[i]))
            if not node.is_leaf:
                self._range_query_helper(node.children[i + 1], key_min, key_max, result)
            i += 1


# Example usage demonstrating B-Tree splits
if __name__ == "__main__":
    # Max 3 keys per node forces splits to happen quickly
    btree = BTreeIndex(max_keys=3)

    # Insert keys; insertion order determines when splits occur
    keys_values = [
        (10, "alice"), (20, "bob"), (5, "charlie"),
        (6, "diana"), (12, "eve"), (30, "frank"),
        (7, "grace"), (17, "henry")
    ]
    print("Inserting keys (order matters for splits):")
    for key, value in keys_values:
        btree.insert(key, value)
        print(f"  Inserted {key} → {value}")

    print("\nPoint lookups:")
    for key in [10, 20, 5, 99]:
        result = btree.search(key)
        print(f"  Search {key}: {result if result else 'NOT FOUND'}")

    print("\nRange queries:")
    print(f"  Keys in range [6, 20]: {btree.range_query(6, 20)}")
    print(f"  Keys in range [5, 30]: {btree.range_query(5, 30)}")
```
Composite Index — Left-Prefix Rule in Depth
When you create a composite index on (country, city, name), the B-Tree is structured like this:
```
Level 0 (Root):
  country='DE' | country='FR' | country='US'

Level 1:
  DE → city='Berlin' | city='Munich'
  FR → city='Paris' | city='Lyon'
  ...

Level 2:
  Berlin → name='Alice' | name='Bob' | name='Charlie'
  ...
```
The keys are sorted first by country, then by city within each country, then by name within each city. This structure allows these queries to use the index:
- `WHERE country = 'DE'` — Jump to the DE subtree; all results are there.
- `WHERE country = 'DE' AND city = 'Berlin'` — Jump to DE, then the Berlin subtree.
- `WHERE country = 'DE' AND city = 'Berlin' AND name = 'Alice'` — Jump to the leaf containing Alice.
But these queries CANNOT use the index efficiently:
- `WHERE city = 'Berlin'` — City is the 2nd key; the tree is sorted by country first. You’d have to scan the entire tree looking for ‘Berlin’ entries across all countries.
- `WHERE name = 'Alice'` — Name is the 3rd key; same problem.
This is why the left-prefix rule exists: you must provide the leftmost columns in order for the index to help. Violations of this rule force a full index scan (or table scan) instead of a tree traversal.
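The same behavior can be demonstrated with a sorted list of key tuples and binary search — the "sorted left-to-right" property is exactly what a composite B-Tree key has (toy data, illustration only):

```python
import bisect

# Composite "index" on (country, city, name): a sorted list of key tuples
index = sorted([
    ("DE", "Berlin", "Alice"), ("DE", "Berlin", "Bob"),
    ("DE", "Munich", "Carol"), ("FR", "Paris", "Dan"),
    ("US", "Austin", "Eve"),
])

def prefix_scan(prefix):
    """Binary search works only for a LEFTMOST prefix of the key tuple."""
    lo = bisect.bisect_left(index, prefix)
    # Pad with the max code point so bisect_right covers the whole prefix range
    hi = bisect.bisect_right(index, prefix + (chr(0x10FFFF),) * (3 - len(prefix)))
    return index[lo:hi]

print(prefix_scan(("DE",)))             # one range scan: all DE rows
print(prefix_scan(("DE", "Berlin")))    # narrows further: DE/Berlin rows

# WHERE city = 'Berlin' has no leftmost prefix -> must scan everything:
print([t for t in index if t[1] == "Berlin"])
```

The last query cannot use `bisect` at all — it degenerates to a full scan, which is precisely what happens to a composite index when the leading column is missing from the predicate.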
When to Use / Avoid
Use Index When
- ✅ Query selects < 5% of rows (fetching 5% or less benefits from index vs full scan)
- ✅ Column is frequently filtered in WHERE clauses
- ✅ Workload is read-heavy (reads » writes)
- ✅ Queries must return results in sorted order (index provides ORDER BY for free)
- ✅ Join predicates on foreign keys (speeds up join operations)
Avoid Index When
- ❌ Workload is write-heavy (insert/update cost dominates)
- ❌ Column has very low cardinality (< 10 distinct values; bitmap index might be better)
- ❌ Column is frequently NULL (many indexes skip NULLs or handle them inefficiently)
- ❌ Query selects > 20% of rows (full table scan becomes cheaper)
- ❌ Database is memory-constrained (index consumes valuable cache/memory)
Deciding Between Index Types
| Use Case | Index Type | Why |
|---|---|---|
| Primary key / unique constraint | B-Tree | Default; supports all query types. |
| Point lookups on unordered data | Hash | O(1) if range queries aren’t needed. |
| Full-text search (content column) | Inverted | Only option for word-based searches. |
| Country / status / bool columns | Bitmap | 10x smaller than B-Tree; fast boolean queries. |
| Geographic location queries | R-Tree / Geohash | Essential for spatial data. |
| Denormalized INCLUDE columns | Covering Index | Eliminates table I/O for SELECT-only queries. |
| Mostly-inactive rows | Partial Index | Reduce index size by 50–90%. |
References
- 📄 The Ubiquitous B-Tree — Comer (1979) — Foundational paper explaining B-Tree structure and why it dominates disk-based indexes.
- 📄 Database Cracking — Schuhknecht et al. (2016) — Explores adaptive indexing that learns access patterns.
- 📄 Column Stores for Wide and Sparse Data — Abadi et al. (2009) — Discusses bitmap indexes and columnar storage benefits.
- 🎥 ByteByteGo — Database Indexing — Excellent visual explanation of B-Tree splits and index selection.
- 🔗 PostgreSQL B-Tree Documentation — Technical details on PostgreSQL’s B-Tree implementation.
- 🔗 MySQL InnoDB Index Documentation — InnoDB index types and usage guidelines.
- 🔗 Elasticsearch Inverted Index — How Elasticsearch implements inverted indexes.
- 📖 Wikipedia: B-Tree — Clear overview with diagrams.
- 📖 Wikipedia: Inverted Index — Quick reference for full-text index structure.