
LSM Tree & SSTable

Log-Structured Merge Tree -- optimizes for write-heavy workloads by batching writes in memory, then flushing them to disk as sorted runs. Reads pay the cost of checking multiple levels.


SSTable Structure

| Component | Purpose |
|-----------|---------|
| Data blocks | Sorted key-value pairs |
| Index block | Sparse index (every Nth key -> byte offset) |
| Bloom filter | Fast "key not here" check |
| Footer | Pointers to index & filter blocks |
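The read path through these components can be sketched as follows. This is a minimal in-memory model, not any engine's real on-disk format; the toy "Bloom filter" is a plain set, whereas real filters are probabilistic:

```python
import bisect

class SSTableReader:
    """Illustrative SSTable read: Bloom filter first, then the sparse
    index to locate a data block, then a scan within that one block."""

    def __init__(self, sorted_items, block_size=4):
        # Data blocks: fixed-count groups of sorted key-value pairs
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Sparse index: first key of each block, in block order
        self.index_keys = [block[0][0] for block in self.blocks]
        # Toy "Bloom filter": a set (real ones allow false positives)
        self.bloom = set(k for k, _ in sorted_items)

    def get(self, key):
        if key not in self.bloom:          # fast "key not here" check
            return None
        # Last block whose first key <= key holds the key, if anywhere
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:        # scan one block only
            if k == key:
                return v
        return None

sst = SSTableReader([(f'key{i:02}', i) for i in range(10)], block_size=4)
print(sst.get('key07'))  # -> 7
print(sst.get('zzz'))    # -> None (rejected by the filter)
```

The sparse index is the key trade: only one key per block is kept in memory, so lookups cost one binary search plus one block scan instead of indexing every key.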

Compaction Strategies

| Strategy | How | Best For |
|----------|-----|----------|
| Size-tiered | Merge SSTables of similar size | Write-heavy |
| Leveled | Bounded levels; lower read amplification, higher write amplification | Read-balanced |
| FIFO | Discard oldest SSTables | Time-series / append-only |
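Size-tiered selection can be sketched as bucketing SSTables of similar size and merging a bucket once it holds enough tables. The thresholds below mirror the idea, not any engine's exact defaults:

```python
def pick_size_tiered_candidates(sstable_sizes, bucket_ratio=2.0, min_group=4):
    """Group SSTables of similar size; return one group ready to merge.

    sstable_sizes: list of (name, size_bytes). A table joins the current
    bucket when its size is within bucket_ratio of the bucket's smallest
    member; otherwise it starts a new bucket.
    """
    buckets = []
    for name, size in sorted(sstable_sizes, key=lambda t: t[1]):
        if buckets and size <= buckets[-1][0][1] * bucket_ratio:
            buckets[-1].append((name, size))
        else:
            buckets.append([(name, size)])
    for bucket in buckets:
        if len(bucket) >= min_group:   # enough similar tables: merge them
            return [name for name, _ in bucket]
    return []

sizes = [('a', 10), ('b', 12), ('c', 11), ('d', 13), ('e', 500)]
print(pick_size_tiered_candidates(sizes))  # -> ['a', 'c', 'b', 'd']
```

Note that the 500-byte outlier is left alone: size-tiered compaction only merges peers, which is what keeps its write amplification low.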

Key Trade-offs

| | LSM Tree | B-Tree |
|---|---------|--------|
| Write speed | Fast (sequential) | Slower (random I/O) |
| Read speed | Slower (multi-level) | Fast (single lookup) |
| Space | Temporary amplification during compaction | Compact |
| Best for | Write-heavy (Cassandra, Kafka) | Read-heavy (MySQL, Postgres) |

When to Use / Avoid

  • Use for high write throughput (logs, IoT, time-series)
  • Use for append-heavy workloads
  • Use for SSD-friendly sequential I/O
  • Avoid for read-heavy with point lookups
  • Avoid when you need low read latency guarantees

Real Systems

| System | How Used |
|--------|----------|
| Cassandra | Primary storage engine |
| LevelDB / RocksDB | Foundation for many NoSQL DBs |
| HBase | Built on a Bigtable-style LSM (HFiles) |
| InfluxDB | Time-series storage |
| Kafka | Log segments (LSM-inspired) |

Implementation

from collections import OrderedDict

class LSMTree:
    """Simplified LSM Tree with MemTable, SSTable, and basic compaction."""

    def __init__(self, memtable_size=5, l0_threshold=2):
        self.memtable = OrderedDict()  # In-memory, sorted dict
        self.memtable_max_size = memtable_size
        self.wal = []  # Write-ahead log
        self.sstables = []  # List of sorted string tables
        self.l0_threshold = l0_threshold

    def write(self, key, value):
        """
        Write path: append to WAL -> insert into MemTable ->
        if full, freeze & flush to disk.
        """
        # Step 1: Write to Write-Ahead Log (crash safety)
        self.wal.append(('SET', key, value))

        # Step 2: Insert into MemTable (sorted by key)
        self.memtable[key] = value

        # Step 3: If MemTable exceeds threshold, flush to SSTable
        if len(self.memtable) >= self.memtable_max_size:
            self._flush_memtable()
            self._compact_if_needed()

    def read(self, key):
        """
        Read path: check MemTable first (newest) ->
        check SSTables newest to oldest ->
        check Bloom filter (optional) -> return value or None.
        """
        # Step 1: Check MemTable (hot data)
        if key in self.memtable:
            return self.memtable[key]

        # Step 2: Check SSTables in reverse order (newest first)
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]

        # Step 3: Not found
        return None

    def _flush_memtable(self):
        """
        Freeze MemTable and write to SSTable (sorted file).
        In production: serialize to disk; here: keep in-memory dict.
        """
        if self.memtable:
            # Create immutable SSTable (sorted by key)
            sstable = dict(sorted(self.memtable.items()))
            self.sstables.append(sstable)

            # Clear MemTable and WAL
            self.memtable = OrderedDict()
            self.wal = []

    def _compact_if_needed(self):
        """
        Trigger compaction if too many L0 (newest) SSTables.
        Simple strategy: merge two oldest SSTables.
        """
        if len(self.sstables) >= self.l0_threshold:
            # Merge two oldest SSTables
            sstable1 = self.sstables.pop(0)
            sstable2 = self.sstables.pop(0)
            merged = dict(sorted(list(sstable1.items()) + list(sstable2.items())))
            self.sstables.insert(0, merged)

    def scan_range(self, start_key, end_key):
        """
        Example range query: merge sorted results from all levels.
        """
        results = {}

        # Scan MemTable
        for k, v in self.memtable.items():
            if start_key <= k <= end_key:
                results[k] = v

        # Scan SSTables (newest to oldest — first value wins)
        for sstable in reversed(self.sstables):
            for k, v in sstable.items():
                if start_key <= k <= end_key and k not in results:
                    results[k] = v

        return sorted(results.items())

# Example usage
lsm = LSMTree(memtable_size=3, l0_threshold=2)
lsm.write('key1', 'value1')
lsm.write('key2', 'value2')
lsm.write('key3', 'value3')  # MemTable flush triggered
lsm.write('key4', 'value4')
print(lsm.read('key2'))  # Returns 'value2' from SSTable
print(lsm.scan_range('key1', 'key3'))  # Range query example
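The WAL kept by the write path exists so that an unflushed MemTable can be rebuilt after a crash. A standalone replay sketch (the DEL tombstone record is an assumption here; the class above only writes SET records):

```python
from collections import OrderedDict

def recover_memtable(wal_records):
    """Rebuild the MemTable by replaying WAL records in order.
    Later records for a key overwrite earlier ones; a DEL record
    (tombstone) removes the key entirely."""
    memtable = OrderedDict()
    for record in wal_records:
        if record[0] == 'SET':
            _, key, value = record
            memtable[key] = value
        elif record[0] == 'DEL':
            memtable.pop(record[1], None)
    return memtable

wal = [('SET', 'k1', 'v1'), ('SET', 'k2', 'v2'),
       ('SET', 'k1', 'v1-updated'), ('DEL', 'k2', None)]
print(recover_memtable(wal))  # only k1 survives, with its latest value
```

This is also why the WAL can be truncated after a flush: once the MemTable's contents are durable in an SSTable, the records are redundant.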


Cassandra

Cassandra uses an LSM tree as its primary storage engine, with SizeTieredCompactionStrategy as the default, optimized for write-heavy workloads. It accumulates writes in MemTables (skip lists) and flushes them to immutable SSTables on disk. Behavior is tunable: memtable_flush_writers controls flush parallelism, and compaction_throughput_mb_per_sec caps compaction I/O during peak traffic. For read-heavy scenarios, administrators can switch to LeveledCompactionStrategy (lower read amplification) at the cost of higher write amplification.
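The two settings named above live in cassandra.yaml; a fragment with illustrative values (not tuning recommendations):

```yaml
# cassandra.yaml -- values are illustrative only
memtable_flush_writers: 2            # parallel MemTable flush threads
compaction_throughput_mb_per_sec: 64 # throttle compaction I/O
```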

RocksDB

Facebook’s RocksDB is a fork of LevelDB that adds Universal Compaction (an additional strategy), multiple column families, and block-cache optimizations. It powers MyRocks (a MySQL storage engine) and served as CockroachDB’s storage layer. RocksDB supports both block-based and plain-table formats; block-based is the default and trades CPU for I/O efficiency. Many cloud systems embed RocksDB to get high write throughput without the operational burden of a separate storage service.

LevelDB

Google’s LevelDB popularized the LSM design in open source, introducing level-based compaction: each level k is ~10x larger than level k-1, and compaction moves data down into higher-numbered levels in a controlled way. It is used in Chrome (IndexedDB) and many embedded databases, and it influenced the entire LSM ecosystem. LevelDB’s simplicity makes it ideal for education and embedded systems.
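The ~10x rule means capacity grows geometrically with level number; a quick sketch of the implied sizes, assuming a 10 MB base level in the spirit of LevelDB's classic defaults:

```python
def level_capacities(base_mb=10, growth=10, levels=6):
    """Capacity of each level when level k is `growth`x larger than
    level k-1. base_mb is an assumed base-level size."""
    return [base_mb * growth ** k for k in range(levels)]

for i, cap in enumerate(level_capacities(), start=1):
    print(f'L{i}: {cap:,} MB')
# Six levels already span 10 MB up to ~1 TB, which is why a handful
# of levels suffices and why each lookup touches only a few files.
```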

InfluxDB (TSM — Time-Structured Merge Tree)

InfluxDB extends the LSM idea with TSM, a time-series variant where data is columnar and time-bucketed: separate MemTables per shard (an hour or a day), flushed to a read-only cache and then disk, with compaction merging files by time range. This columnar layout is more cache-efficient for time-series queries (e.g., “all temperature readings”) than a key-value LSM. Configuration: cache-max-memory-size, compact-threshold-ratio, compact-min-file-count.
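Time-bucketing can be sketched as routing each point into a per-hour shard, so flush and compaction always operate on a bounded time range. This is a toy model; TSM's real on-disk layout is columnar and considerably more involved:

```python
from collections import defaultdict

def bucket_points(points, bucket_seconds=3600):
    """Route (timestamp, series, value) points into hourly shards."""
    shards = defaultdict(list)
    for ts, series, value in points:
        shard_start = ts - (ts % bucket_seconds)   # floor to the hour
        shards[shard_start].append((ts, series, value))
    return dict(shards)

points = [(100, 'temp', 21.5), (3700, 'temp', 22.0), (3650, 'humidity', 40)]
shards = bucket_points(points)
print(sorted(shards))  # two hourly shards: one starting at 0, one at 3600
```

A query for "last hour of temperature readings" then touches only one shard's files, which is the cache-efficiency win the paragraph describes.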

Kafka

Kafka uses log-structured storage inspired by LSM principles: append-only segment files (not exactly SSTables), with periodic compaction that removes superseded keys and keeps only the latest value per key (log compaction mode). Each partition is an append-only log divided into segments; compaction consolidates segments and purges old records. Broker settings log.retention.hours and log.cleanup.policy control this behavior. The design lets Kafka handle massive append throughput (billions of messages) with minimal random I/O.
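The log-compaction behavior described above can be sketched as a single pass that keeps the newest record per key. Treating a None value as a delete marker mirrors Kafka's null-value tombstones; everything else here is a simplification:

```python
def compact_log(segment):
    """Keep only the newest record per key, preserving log order.
    A record whose value is None acts as a tombstone: the key is
    dropped from the compacted output entirely."""
    latest = {}
    for offset, (key, value) in enumerate(segment):
        latest[key] = (offset, value)          # later offsets win
    in_log_order = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(k, v) for k, (_, v) in in_log_order if v is not None]

segment = [('user1', 'a'), ('user2', 'b'), ('user1', 'c'), ('user2', None)]
print(compact_log(segment))  # -> [('user1', 'c')]
```

This is exactly the guarantee log.cleanup.policy=compact provides: the log shrinks to at least one latest record per key, rather than being truncated by age.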


This post is licensed under CC BY 4.0 by the author.