Database Indexing
An index trades write overhead and storage for dramatically faster reads by organizing data in a searchable structure -- choose the index type and key columns based on your query patterns.
Key Properties
| Property | B-Tree Index | LSM-Tree Index | Hash Index | Inverted Index |
|---|---|---|---|---|
| Point lookup | O(log n) | O(log n) | O(1) | O(n) |
| Range query | O(log n + k) | O(log n + k) | N/A | O(k) |
| Insert | O(log n) | O(1) amortized | O(1) | O(1) |
| Delete | O(log n) | O(1) amortized (tombstone write) | O(1) | O(1) |
| Space overhead | ~10–20% | 1.5–3x (compaction) | ~1.2x | 2–5x |
| Best for | OLTP, ranges | Write-heavy | Exact matches | Full-text search |
| Read latency | ~0.01–0.1ms | ~0.1–1ms | ~0.001ms | ~0.5–5ms |
B-Tree vs LSM-Tree Index Trade-offs
B-Tree Index (Most Common — PostgreSQL, MySQL, MongoDB)
Mechanism: Maintains a balanced tree structure where each internal node branches to ~300 children (PostgreSQL uses 8KB page size). Keys are sorted; leaf pages contain pointers to row data. When you insert a key into a full node, the node splits into two nodes, pushing the median key upward.
Page splits cost: When a leaf page overflows (e.g., adding the 367th key to a 366-key page), PostgreSQL writes: (1) the original leaf (now ~50% full), (2) new sibling leaf (~50% full), (3) parent internal node update, (4) potential cascading parent splits. This is why B-Tree insert is “expensive” (~100 microseconds per insert with disk I/O), but read performance dominates because a 10-million-row table needs only ~3 disk seeks.
Real-world B-Tree shape: PostgreSQL B-Tree on a 1-billion row table (8 bytes per key + 6 bytes overhead per entry = 14 bytes/entry). A page holds ~570 entries. Internal nodes fan out to ~300 children. Depth: log₃₀₀(1B) ≈ 3–4 levels. A point lookup touches ~3 pages (root→internal→leaf).
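The depth estimate above is quick to verify with a few lines of arithmetic, using the approximate figures quoted in the text (8KB pages, 14-byte entries, ~300-way fan-out):

```python
import math

# Approximate figures from the text above
PAGE_SIZE = 8 * 1024        # PostgreSQL default page size in bytes
ENTRY_SIZE = 14             # 8-byte key + 6-byte per-entry overhead
ROWS = 1_000_000_000
FANOUT = 300                # internal-node branching factor

entries_per_leaf = PAGE_SIZE // ENTRY_SIZE          # ~585 entries per leaf page
depth = math.ceil(math.log(ROWS, FANOUT))           # levels so FANOUT^depth >= ROWS

print(entries_per_leaf, depth)  # ~585 entries/leaf, depth 4
```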
When it dominates: OLTP databases where you read far more often than write. A 100:1 read/write ratio makes B-Tree’s slower writes worth the faster reads. Range queries (WHERE timestamp BETWEEN '2024-01-01' AND '2024-12-31') are efficient because consecutive keys are physically adjacent in leaf pages.
LSM-Tree Index (Cassandra, RocksDB, LevelDB, HBase)
Mechanism: Writes don’t touch the index structure directly. Instead, writes go to a fast in-memory structure (MemTable), then periodically flush to immutable SSTable files on disk organized in levels. Each level contains non-overlapping key ranges. On read, you check MemTable first, then L0 (newest SSTable files), then L1, L2, etc. until you find the key.
Write amplification: A write that goes through all levels incurs “amplification.” With 10 levels, a single write might be rewritten 10 times as it moves from MemTable→L0→L1→…→L9 during compaction. However, writes are sequential (append to MemTable, then SSTable files), which is 10–100x faster than random B-Tree page updates.
Read latency variance: LSM reads are slower and have higher variance. Finding a key in L9 requires checking and potentially reading from multiple files. With Bloom filters on each SSTable, you skip most files, but worst-case reads can touch 10+ files. B-Tree always touches ~3 pages, making latency predictable.
Real-world LSM usage: RocksDB at Meta/Facebook handles 1000x more writes than reads (logs, event streams). A write that would take 100 microseconds in B-Tree takes 10 microseconds in RocksDB because it’s just appending to MemTable. The cost is deferred: background compaction threads rewrite old data, but this happens off the critical path.
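The MemTable-plus-SSTables mechanism can be sketched in a few dozen lines. This is a hypothetical toy, not RocksDB's design: real engines add a write-ahead log, leveled compaction, Bloom filters, and binary search within SSTables.

```python
class TinyLSM:
    """Toy LSM-tree: in-memory MemTable that flushes to immutable sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # fast in-memory writes
        self.sstables = []           # flushed immutable segments, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never touch on-disk structures directly -- just the MemTable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush as an immutable, key-sorted segment (an "SSTable").
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: MemTable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:       # a real SSTable would binary-search here
                if k == key:
                    return v
        return None

lsm = TinyLSM()
for i in range(10):
    lsm.put(i, f"v{i}")
print(lsm.get(3), len(lsm.sstables))  # -> v3 2
```

Note how `put` is always an in-memory dict update: that is the write-path advantage the section describes, with the sorting cost deferred to `_flush` (and, in real systems, to background compaction).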
Trade-off Summary
| Scenario | Winner | Why |
|---|---|---|
| 10:1 read/write ratio | B-Tree | Amortize index maintenance cost over many reads. |
| 1:10 read/write ratio | LSM-Tree | Sequential writes dominate; read latency is acceptable. |
| Latency-sensitive reads (p99 < 10ms) | B-Tree | Predictable read path; no variance. |
| Throughput-sensitive writes (1M+ ops/sec) | LSM-Tree | Sequential I/O is 10x faster. |
| Point lookups only | Hash Index | O(1) is unbeatable; don’t need range queries. |
| Full-text search | Inverted Index | Only viable option; structured for term→document mapping. |
Index Types by Access Pattern
1. B-Tree Index (Default — Use First)
What: Sorted tree structure on disk; supports range queries and prefix matching.
When to use:
- ✅ Read-heavy workloads (OLTP)
- ✅ Range queries (`WHERE created_at > '2024-01-01'`)
- ✅ Prefix queries (`WHERE name LIKE 'Al%'`)
- ✅ Ordering without sort (`ORDER BY` uses the index)
Real systems: PostgreSQL primary index, MySQL InnoDB, MongoDB (WiredTiger engine)
2. Hash Index
What: Direct hash of key to byte offset; no ordering.
When to use:
- ✅ Exact equality lookups only (`WHERE id = 42`)
- ✅ In-memory caches (Redis HSET)
- ✅ High cardinality columns
Avoid:
- ❌ Range queries
- ❌ LIKE/prefix matching
- ❌ ORDER BY
Real systems: Redis hash tables, Memcached, in-memory hash maps
3. Inverted Index
What: Maps each unique term/word to a list of document IDs (or positions) containing that term.
Structure:
```
Term "distributed" → [doc_1, doc_5, doc_12, ...]
Term "systems"     → [doc_1, doc_3, doc_7, ...]
```
When to use:
- ✅ Full-text search (`WHERE body @@ 'distributed systems'`)
- ✅ Multi-word queries with boolean operators (AND, OR, NOT)
- ✅ Phrase queries (`"exact phrase"`)
Real systems: Elasticsearch (Lucene), PostgreSQL tsvector, MySQL FULLTEXT
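A minimal sketch of the term→document mapping and an AND query as a posting-list intersection (toy documents, stdlib only; real engines like Lucene add tokenization, positions, compression, and scoring):

```python
from collections import defaultdict

# Toy corpus: doc_id -> text
docs = {
    1: "distributed systems are hard",
    3: "systems programming in rust",
    5: "distributed consensus algorithms",
}

# Build the inverted index: term -> sorted posting list of doc IDs
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

def search_and(*terms):
    """Documents containing ALL terms: intersect the posting lists."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("distributed", "systems"))  # -> [1]
```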
4. Bitmap Index
What: For each distinct value, maintain a bitmap of rows containing that value. Useful for low-cardinality columns.
Example (status column with 3 values):
```
'active'   → 1011010101... (bit i is 1 if row i has status='active')
'inactive' → 0100101010...
'pending'  → 0000000001...
```
When to use:
- ✅ Low-cardinality columns (< 100 distinct values)
- ✅ Columns that are frequently filtered (status, country, month)
- ✅ Complex boolean filters (`status='active' AND country='DE'`)
Space efficiency: each bitmap on a 100-million row table is 100M bits ÷ 8 bits per byte = 12.5 MB, so 5 status values take ~62.5 MB uncompressed (run-length encoding shrinks this much further on skewed data), vs ~400 MB for a B-Tree index.
Real systems: Oracle (supports bitmap indexes natively), Vertica, Druid
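Python integers work as arbitrary-length bit vectors, which makes the mechanism easy to sketch: bit i of the bitmap for value v is 1 iff row i has that value, and boolean filters become single bitwise operations.

```python
# Toy bitmap index over a low-cardinality "status" column
rows = ["active", "inactive", "active", "pending", "active"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set bit i for this value

def matching_rows(bitmap, n_rows):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(n_rows) if bitmap & (1 << i)]

# WHERE status = 'active'
print(matching_rows(bitmaps["active"], len(rows)))                       # -> [0, 2, 4]

# WHERE status = 'active' OR status = 'pending'  (one bitwise OR)
print(matching_rows(bitmaps["active"] | bitmaps["pending"], len(rows)))  # -> [0, 2, 3, 4]
```

The AND/OR of two filters touches only the bitmaps, never the rows -- that is why bitmap indexes excel at complex boolean predicates over low-cardinality columns.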
5. Geospatial Index (R-Tree, Quadtree, Geohash)
What: Organizes spatial data (lat/lng points) into hierarchical bounding boxes or quadrants.
R-Tree structure: Each internal node is a bounding rectangle containing child rectangles. A point query finds the bounding box, then recursively searches children.
Geohash variant (Uber, Redis GEO): Encodes lat/lng as interleaved bits, creating a Z-order (Morton) curve. Close geographic points hash to nearby strings, enabling prefix-based range queries.
When to use:
- ✅ Geographic queries (`ST_Distance(location, '(0,0)') < 1km`)
- ✅ Bounding box queries (`ST_Within(location, bbox)`)
- ✅ Nearest-neighbor search
Real systems: PostGIS (PostgreSQL), MongoDB (2dsphere), Redis (GEOADD), Google S2
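The bit-interleaving idea behind geohash can be shown in a short sketch (illustration only -- real geohash then base32-encodes the bits 5 at a time; the coordinates below are arbitrary example points):

```python
def geobits(lat, lon, precision=16):
    """Interleave lon/lat bisection bits, geohash-style (even bits = longitude)."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    for i in range(precision):
        if i % 2 == 0:                                  # even bits encode longitude
            mid = (lon_range[0] + lon_range[1]) / 2
            bits.append("1" if lon >= mid else "0")
            lon_range[0 if lon >= mid else 1] = mid     # keep the matching half
        else:                                           # odd bits encode latitude
            mid = (lat_range[0] + lat_range[1]) / 2
            bits.append("1" if lat >= mid else "0")
            lat_range[0 if lat >= mid else 1] = mid
    return "".join(bits)

berlin = geobits(52.52, 13.40)
potsdam = geobits(52.40, 13.05)    # nearby: shares a long bit prefix
sydney = geobits(-33.87, 151.21)   # far away: diverges within the first bits
print(berlin, potsdam, sydney)
```

Because nearby points share long prefixes, a "points within this cell" query becomes an ordinary prefix range scan on an ordinary sorted index -- no special spatial structure required.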
6. Covering Index (Nonkey Index)
What: A B-Tree index that includes non-key columns, allowing the entire result set to be read from the index leaf pages without touching the main table.
Example:
```sql
CREATE INDEX idx_user_name_email ON users(user_id, email) INCLUDE (name, created_at);
-- Query: SELECT name, email FROM users WHERE user_id = 42
-- Can be satisfied entirely from index — no table lookup needed
```
Cost: Covering indexes are larger (include extra columns), so they increase write cost and memory usage, but eliminate table I/O for qualifying queries.
Real systems: PostgreSQL INCLUDE clause (v11+), SQL Server, MySQL covering indexes
7. Partial Index (Filtered Index)
What: Index only rows matching a WHERE condition, reducing index size and write cost.
Example:
```sql
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Indexes only 10% of users (most are inactive)
-- 90% smaller index size, 90% fewer index updates
```
When to use:
- ✅ Most queries filter by the same condition
- ✅ Skewed data (99% inactive, 1% active)
Real systems: PostgreSQL, SQL Server (filtered indexes)
8. Composite Index (Multi-Column)
What: Index on multiple columns in a specific order. Supports leftmost-prefix matching.
Left-prefix rule example:
```sql
-- Index on (country, city, name)
-- ✅ USES index:  WHERE country = 'DE'
-- ✅ USES index:  WHERE country = 'DE' AND city = 'Berlin'
-- ✅ USES index:  WHERE country = 'DE' AND city = 'Berlin' AND name = 'Alice'
-- ❌ DOESN'T USE: WHERE city = 'Berlin' (no leftmost prefix 'country')
-- ❌ DOESN'T USE: WHERE name = 'Alice' (skips both prefixes)
```
Why: B-Tree keeps keys sorted left-to-right. The first column defines the major sort order; you can’t efficiently search by the 2nd column if you don’t know the 1st.
Cost: Each added column increases index size and write overhead. A common guideline is to keep composite indexes to 3–4 columns.
How Real Systems Use Indexing
PostgreSQL B-Tree Internals
PostgreSQL uses a B+ Tree (separator keys in internal nodes, data pointers in leaf nodes) with a configurable page size (default 8KB). A leaf page holds a few hundred entries, and internal pages fan out to ~300 children. When you insert into a full leaf page, PostgreSQL splits it: entries are divided between the old page and a new sibling, and a separator key is pushed into the parent (splits can cascade upward). On a billion-row table, the tree has depth 3–4, so a point lookup performs 3–4 page reads (root→internal→leaf), each costing a random seek (~10 milliseconds on spinning disk, ~0.01–0.1ms on SSD) when uncached. Range queries are efficient because keys within a leaf page are sorted and leaf pages are linked, so scanning 1000 index entries after the first lookup costs only a handful of additional page reads. This is why PostgreSQL dominates OLTP: predictable read latency and excellent range query performance.
MySQL InnoDB Page Splits
MySQL InnoDB organizes its B-Tree into 16KB pages (larger than PostgreSQL's 8KB). When a page splits, InnoDB doesn't always split 50/50; it biases the split toward the direction of insertion. If you insert monotonically increasing keys (e.g., timestamps), InnoDB leaves the old page nearly full and directs new keys into a mostly-empty new page, reducing future splits on the hot right edge of the index. This optimization reduces write amplification for time-series data. A production MySQL server might sustain 10,000 INSERTs/sec on an auto-increment primary key; without this bias, far more insertions would trigger expensive page reorganization, whereas with it only a small fraction of inserts actually split a page, deferring the cost.
Cassandra SSTable Indexes
Cassandra uses LSM-Tree with per-SSTable Bloom filters and partition indexes. Each SSTable file contains a summary of which partition keys it holds (a sparse index — 1 entry per 128KB of SSTable). When reading a key, Cassandra checks the partition summary to filter SSTables, then checks the Bloom filter to avoid disk I/O if the key is definitely not in that SSTable. For a write-heavy table with thousands of SSTables, this combination allows negative lookups (checking 1000 SSTables for a missing key) to be answered without touching disk. The Bloom filter is sized for ~1% false positive rate (10 bits per key), costing only ~1.25 bytes per entry, yet saving 99% of disk seeks for reads that miss.
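The Bloom filter mechanism is small enough to sketch directly. This is a toy with assumed parameters (1024 bits, 7 hashes derived from SHA-256), not Cassandra's implementation; the key property is that a clear bit proves absence, so negative lookups skip disk entirely.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per key over an m-bit vector."""

    def __init__(self, m_bits=1024, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0                     # Python int as an m-bit vector

    def _positions(self, key: str):
        # Derive k independent positions by salting the key (illustrative scheme)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False -> definitely absent; True -> probably present (false positives possible)
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bf.add(key)
print(bf.might_contain("user:2"), bf.might_contain("user:999"))
```

An LSM read checks each SSTable's filter first and only opens files whose filter answers "probably present" -- the ~1% of false positives is the price of never missing a real key.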
Elasticsearch Inverted Index Structure
Elasticsearch (built on Lucene) creates a term dictionary mapping each unique word to a posting list (a list of documents plus term positions). For the term “distributed,” the posting list might be [doc_1@pos_5, doc_5@pos_12, doc_12@pos_2, …]. The dictionary itself is an FST (Finite State Transducer) — a compressed structure supporting fast in-memory prefix lookups. For a corpus with 50GB of text, the inverted index is typically 1–2GB (2–4% of the source data). A full-text query (“distributed systems”) looks each term up in the dictionary, intersects the posting lists (documents appearing in both “distributed” AND “systems”), then scores the results for relevance (TF-IDF or BM25). A query that would require sequentially scanning all 50GB of source data completes in <100ms using the inverted index.
PostGIS R-Tree for Geospatial
PostGIS maintains an R-Tree index on geographic columns, organizing spatial objects into hierarchical bounding rectangles. For a table of 10-million geographic points, the R-Tree has depth ~5–6 levels. A nearest-neighbor query (SELECT * FROM locations ORDER BY location <-> point(0,0) LIMIT 10) uses the R-Tree to traverse the tree in distance order, finding the 10 closest points without checking all 10 million. On real production data, this query completes in 1–2ms, whereas a full-table scan would require 100+ seconds. Bounding box queries (SELECT * FROM polygons WHERE ST_Contains(polygon, point(0,0))) use the R-Tree to eliminate 99%+ of polygons before performing expensive geometric calculations.
Redis Sorted Set (Skip List Index)
Redis Sorted Set is implemented as a skip list (a probabilistic balanced structure), not a B-Tree. Each element has a score, and the skip list keeps elements sorted by score with multiple levels of forward shortcuts; the expected height is O(log n) — roughly 10 levels for a million elements with Redis's level probability of 0.25. Range queries (ZRANGEBYSCORE myzset 10 20) are O(log n) to find the start, then O(k) to return k elements. Skip lists suit in-memory workloads: they are simple to implement, cheap to update, and support range scans via the bottom-level linked list. Redis uses sorted sets for leaderboards (score = player ranking), rate-limiting windows (score = timestamp), and delayed job queues (score = scheduled execution time).
Implementation — B-Tree Index Simulation
```python
from typing import List, Tuple, Optional


class BTreeNode:
    """
    Represents a node in a B-Tree.

    Each node contains:
    - keys: sorted list of key values
    - values: data associated with each key (in a classic B-Tree,
      internal nodes store values too, not just leaves)
    - children: list of child nodes (empty if leaf)
    - is_leaf: boolean indicating if this is a leaf node
    """

    def __init__(self, is_leaf=True, max_keys=3):
        """
        Initialize a B-Tree node.

        Args:
            is_leaf: whether this is a leaf node
            max_keys: maximum keys per node (controls branching factor)
        """
        self.keys: List[int] = []
        self.values: List[str] = []
        self.children: List["BTreeNode"] = []
        self.is_leaf = is_leaf
        self.max_keys = max_keys

    def is_full(self) -> bool:
        """Check if node has reached max capacity."""
        return len(self.keys) >= self.max_keys

    def insert_non_full(self, key: int, value: str):
        """
        Insert a key-value pair into a non-full node.
        This is the core insertion logic.
        """
        i = len(self.keys) - 1
        if self.is_leaf:
            # Insert into leaf directly: shift larger keys right to make room
            self.keys.append(None)
            self.values.append(None)
            while i >= 0 and self.keys[i] > key:
                self.keys[i + 1] = self.keys[i]
                self.values[i + 1] = self.values[i]
                i -= 1
            self.keys[i + 1] = key
            self.values[i + 1] = value
        else:
            # Find child to insert into
            while i >= 0 and self.keys[i] > key:
                i -= 1
            i += 1
            # If child is full, split it first
            if self.children[i].is_full():
                self.split_child(i)
                # After the split, decide which of the two halves to descend into
                if self.keys[i] < key:
                    i += 1
            self.children[i].insert_non_full(key, value)

    def split_child(self, i: int):
        """
        Split a full child node. This is what happens when a node overflows.

        Example: child with keys [3, 6, 9] and max_keys=3 is full. Splitting it:
        - Left half keeps [3], new sibling gets [9]
        - The median key (6) and its value are promoted into this parent
        """
        full_child = self.children[i]
        mid_index = len(full_child.keys) // 2
        mid_key = full_child.keys[mid_index]
        mid_value = full_child.values[mid_index]

        # Create new sibling node for the upper half
        new_node = BTreeNode(is_leaf=full_child.is_leaf, max_keys=self.max_keys)
        new_node.keys = full_child.keys[mid_index + 1:]
        new_node.values = full_child.values[mid_index + 1:]
        full_child.keys = full_child.keys[:mid_index]
        full_child.values = full_child.values[:mid_index]

        # Split children (if not leaf)
        if not full_child.is_leaf:
            new_node.children = full_child.children[mid_index + 1:]
            full_child.children = full_child.children[:mid_index + 1]

        # Promote the median key (and its value) into the parent
        self.keys.insert(i, mid_key)
        self.values.insert(i, mid_value)
        self.children.insert(i + 1, new_node)

    def search(self, key: int) -> Optional[str]:
        """
        Search for a key in the B-Tree.
        Returns the associated value or None if not found.
        """
        i = 0
        # Find the first key >= search key
        while i < len(self.keys) and key > self.keys[i]:
            i += 1
        # Found in this node (leaf or internal — both hold values)
        if i < len(self.keys) and key == self.keys[i]:
            return self.values[i]
        # If leaf, key not found
        if self.is_leaf:
            return None
        # Recurse to child
        return self.children[i].search(key)


class BTreeIndex:
    """
    A simple B-Tree index implementation demonstrating core operations.
    Real databases use much larger branching factors (300+ keys per node).
    This example uses small max_keys (3-4) to illustrate splits.
    """

    def __init__(self, max_keys=3):
        """Initialize B-Tree with given maximum keys per node."""
        self.root = BTreeNode(is_leaf=True, max_keys=max_keys)
        self.max_keys = max_keys

    def insert(self, key: int, value: str):
        """
        Insert a key-value pair into the B-Tree.
        If root is full, create a new root (increases tree height).
        """
        if self.root.is_full():
            new_root = BTreeNode(is_leaf=False, max_keys=self.max_keys)
            new_root.children.append(self.root)
            new_root.split_child(0)
            self.root = new_root
        self.root.insert_non_full(key, value)

    def search(self, key: int) -> Optional[str]:
        """Search for a key in the B-Tree."""
        return self.root.search(key)

    def range_query(self, key_min: int, key_max: int) -> List[Tuple[int, str]]:
        """
        Return all (key, value) pairs where key_min <= key <= key_max.
        A real B+ Tree does this even more cheaply: leaf pages are linked,
        allowing sequential traversal after the first lookup.
        """
        result: List[Tuple[int, str]] = []
        self._range_query_helper(self.root, key_min, key_max, result)
        return sorted(result)

    def _range_query_helper(self, node: BTreeNode, key_min: int, key_max: int, result: List):
        """Recursively collect in-range keys (internal nodes hold keys too)."""
        i = 0
        while i < len(node.keys) and node.keys[i] < key_min:
            i += 1
        # The subtree left of the first in-range key may still contain matches
        if not node.is_leaf:
            self._range_query_helper(node.children[i], key_min, key_max, result)
        while i < len(node.keys) and node.keys[i] <= key_max:
            result.append((node.keys[i], node.values[i]))
            if not node.is_leaf:
                self._range_query_helper(node.children[i + 1], key_min, key_max, result)
            i += 1


# Example usage demonstrating B-Tree splits
if __name__ == "__main__":
    # Max 3 keys per node forces splits to happen quickly
    btree = BTreeIndex(max_keys=3)

    # Insert keys; insertion order determines when splits occur
    keys_values = [
        (10, "alice"), (20, "bob"), (5, "charlie"),
        (6, "diana"), (12, "eve"), (30, "frank"),
        (7, "grace"), (17, "henry")
    ]
    print("Inserting keys (order matters for splits):")
    for key, value in keys_values:
        btree.insert(key, value)
        print(f"  Inserted {key} → {value}")

    print("\nPoint lookups:")
    for key in [10, 20, 5, 99]:
        result = btree.search(key)
        print(f"  Search {key}: {result if result else 'NOT FOUND'}")

    print("\nRange queries:")
    print(f"  Keys in range [6, 20]: {btree.range_query(6, 20)}")
    print(f"  Keys in range [5, 30]: {btree.range_query(5, 30)}")
```
Composite Index — Left-Prefix Rule in Depth
When you create a composite index on (country, city, name), the B-Tree is structured like this:
```
Level 0 (Root):
  country='DE' | country='FR' | country='US'

Level 1:
  DE → city='Berlin' | city='Munich'
  FR → city='Paris' | city='Lyon'
  ...

Level 2:
  Berlin → name='Alice' | name='Bob' | name='Charlie'
  ...
```
The keys are sorted first by country, then by city within each country, then by name within each city. This structure allows these queries to use the index:
- `WHERE country = 'DE'` — Jump to the DE subtree; all results are there.
- `WHERE country = 'DE' AND city = 'Berlin'` — Jump to DE, then the Berlin subtree.
- `WHERE country = 'DE' AND city = 'Berlin' AND name = 'Alice'` — Jump to the leaf containing Alice.
But these queries CANNOT use the index efficiently:
- `WHERE city = 'Berlin'` — City is the 2nd key; the tree is sorted by country first. You’d have to scan the entire tree looking for ‘Berlin’ entries across all countries.
- `WHERE name = 'Alice'` — Name is the 3rd key; same problem.
This is why the left-prefix rule exists: you must provide the leftmost columns in order for the index to help. Violations of this rule force a full index scan (or table scan) instead of a tree traversal.
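The same behavior can be demonstrated with a sorted list of key tuples and binary search — the "sorted left-to-right" property is exactly what a composite B-Tree key has (toy data, illustration only):

```python
import bisect

# Composite "index" on (country, city, name): a sorted list of key tuples
index = sorted([
    ("DE", "Berlin", "Alice"), ("DE", "Berlin", "Bob"),
    ("DE", "Munich", "Carol"), ("FR", "Paris", "Dan"),
    ("US", "Austin", "Eve"),
])

def prefix_scan(prefix):
    """Binary search works only for a LEFTMOST prefix of the key tuple."""
    lo = bisect.bisect_left(index, prefix)
    # Pad with the max code point so bisect_right covers the whole prefix range
    hi = bisect.bisect_right(index, prefix + (chr(0x10FFFF),) * (3 - len(prefix)))
    return index[lo:hi]

print(prefix_scan(("DE",)))             # one range scan: all DE rows
print(prefix_scan(("DE", "Berlin")))    # narrows further: DE/Berlin rows

# WHERE city = 'Berlin' has no leftmost prefix -> must scan everything:
print([t for t in index if t[1] == "Berlin"])
```

The last query cannot use `bisect` at all — it degenerates to a full scan, which is precisely what happens to a composite index when the leading column is missing from the predicate.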
When to Use / Avoid
Use Index When
- ✅ Query selects < 5% of rows (fetching 5% or less benefits from index vs full scan)
- ✅ Column is frequently filtered in WHERE clauses
- ✅ Workload is read-heavy (reads » writes)
- ✅ Queries must return results in sorted order (index provides ORDER BY for free)
- ✅ Join predicates on foreign keys (speeds up join operations)
Avoid Index When
- ❌ Workload is write-heavy (insert/update cost dominates)
- ❌ Column has very low cardinality (< 10 distinct values; bitmap index might be better)
- ❌ Column is frequently NULL (many indexes skip NULLs or handle them inefficiently)
- ❌ Query selects > 20% of rows (full table scan becomes cheaper)
- ❌ Database is memory-constrained (index consumes valuable cache/memory)
Deciding Between Index Types
| Use Case | Index Type | Why |
|---|---|---|
| Primary key / unique constraint | B-Tree | Default; supports all query types. |
| Point lookups on unordered data | Hash | O(1) if range queries aren’t needed. |
| Full-text search (content column) | Inverted | Only option for word-based searches. |
| Country / status / bool columns | Bitmap | 10x smaller than B-Tree; fast boolean queries. |
| Geographic location queries | R-Tree / Geohash | Essential for spatial data. |
| Denormalized INCLUDE columns | Covering Index | Eliminates table I/O for SELECT-only queries. |
| Mostly-inactive rows | Partial Index | Reduce index size by 50–90%. |
References
- 📄 The Ubiquitous B-Tree — Comer (1979) — Foundational paper explaining B-Tree structure and why it dominates disk-based indexes.
- 📄 Database Cracking — Schuhknecht et al. (2016) — Explores adaptive indexing that learns access patterns.
- 📄 Column Stores for Wide and Sparse Data — Abadi et al. (2009) — Discusses bitmap indexes and columnar storage benefits.
- 🎥 ByteByteGo — Database Indexing — Excellent visual explanation of B-Tree splits and index selection.
- 🔗 PostgreSQL B-Tree Documentation — Technical details on PostgreSQL’s B-Tree implementation.
- 🔗 MySQL InnoDB Index Documentation — InnoDB index types and usage guidelines.
- 🔗 Elasticsearch Inverted Index — How Elasticsearch implements inverted indexes.
- 📖 Wikipedia: B-Tree — Clear overview with diagrams.
- 📖 Wikipedia: Inverted Index — Quick reference for full-text index structure.