LSM Tree & SSTable
Log-Structured Merge Tree -- optimises for write-heavy workloads by batching writes in memory (the MemTable), then flushing them to disk as sorted runs (SSTables). Reads pay the cost of checking multiple levels.
SSTable Structure
| Component | Purpose |
|---|---|
| Data blocks | Sorted key-value pairs |
| Index block | Sparse index (every Nth key -> byte offset) |
| Bloom filter | Fast “key not here” check |
| Footer | Pointers to index & filter blocks |
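The Bloom filter and sparse index together let a lookup skip most of the file: the filter rules the key out cheaply, and the index narrows the search to one data block. A toy sketch (block interval, bit count, and hash count are illustrative assumptions, not any engine's defaults):

```python
import bisect
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely not present' or 'maybe present'."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # No false negatives; small chance of false positives
        return all(self.bits[p] for p in self._positions(key))

# Sparse index: every Nth key -> offset of the data block it starts
keys = [f"key{i:03d}" for i in range(100)]
N = 16
sparse_index = [(keys[i], i) for i in range(0, len(keys), N)]

def find_block(key):
    """Binary-search the sparse index for the block that may hold `key`."""
    i = bisect.bisect_right([k for k, _ in sparse_index], key) - 1
    return sparse_index[max(i, 0)][1]  # block start offset

bf = BloomFilter()
for k in keys:
    bf.add(k)

print(bf.might_contain("key042"))  # True
print(bf.might_contain("absent"))  # probably False
print(find_block("key042"))        # 32 (the block starting at key032)
```

Only the one block identified by `find_block` needs to be read from disk, which is why a sparse index (rather than one entry per key) is enough.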
Compaction Strategies
| Strategy | How | Best For |
|---|---|---|
| Size-tiered | Merge SSTables of similar size | Write-heavy |
| Leveled | Merge into bounded, non-overlapping levels; lower read & space amplification at the cost of higher write amplification | Read-balanced |
| FIFO | Discard oldest SSTables | Time-series / append-only |
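Whatever the trigger policy, the core of compaction is a k-way merge of sorted runs in which the newest value for each key wins. A minimal sketch:

```python
import heapq

def compact(runs):
    """Merge sorted runs (listed newest first) into one sorted run.
    Each run is a list of (key, value) pairs sorted by key."""
    heap = []
    for age, run in enumerate(runs):  # age 0 = newest run
        if run:
            heap.append((run[0][0], age, 0, run))
    heapq.heapify(heap)
    merged, last_key = [], None
    while heap:
        key, age, idx, run = heapq.heappop(heap)
        if key != last_key:  # first (newest) occurrence of a key wins
            merged.append((key, run[idx][1]))
            last_key = key
        if idx + 1 < len(run):
            heapq.heappush(heap, (run[idx + 1][0], age, idx + 1, run))
    return merged

new = [("a", 2), ("c", 9)]
old = [("a", 1), ("b", 5)]
print(compact([new, old]))  # [('a', 2), ('b', 5), ('c', 9)]
```

Because equal keys are ordered by run age in the heap, the newest version is always popped first, and stale versions are simply dropped -- this is where obsolete data is reclaimed.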
Key Trade-offs
| | LSM Tree | B-Tree |
|---|---|---|
| Write speed | Fast (sequential) | Slower (random I/O) |
| Read speed | Slower (multi-level) | Fast (single lookup) |
| Space | Temporary amplification (until compaction reclaims stale versions) | Compact |
| Best for | Write-heavy (Cassandra, Kafka) | Read-heavy (MySQL, Postgres) |
When to Use / Avoid
- Use for high write throughput (logs, IoT, time-series)
- Use for append-heavy workloads
- Use for SSD-friendly sequential I/O
- Avoid for read-heavy with point lookups
- Avoid when you need low read latency guarantees
Real Systems
| System | How Used |
|---|---|
| Cassandra | Primary storage engine |
| LevelDB / RocksDB | Foundation for many NoSQL DBs |
| HBase | Bigtable-style LSM (MemStore + HFiles) |
| InfluxDB | Time-series storage (TSM) |
| Kafka | Log segments (LSM-inspired) |
Implementation
```python
from collections import OrderedDict


class LSMTree:
    """Simplified LSM Tree with MemTable, SSTable, and basic compaction."""

    def __init__(self, memtable_size=5, l0_threshold=2):
        self.memtable = OrderedDict()  # In-memory buffer (sorted at flush time)
        self.memtable_max_size = memtable_size
        self.wal = []                  # Write-ahead log
        self.sstables = []             # List of sorted string tables
        self.l0_threshold = l0_threshold

    def write(self, key, value):
        """
        Write path: append to WAL -> insert into MemTable ->
        if full, freeze & flush to disk.
        """
        # Step 1: Write to Write-Ahead Log (crash safety)
        self.wal.append(('SET', key, value))
        # Step 2: Insert into MemTable
        self.memtable[key] = value
        # Step 3: If MemTable exceeds threshold, flush to SSTable
        if len(self.memtable) >= self.memtable_max_size:
            self._flush_memtable()
            self._compact_if_needed()

    def read(self, key):
        """
        Read path: check MemTable first (newest) ->
        check SSTables newest to oldest (a real engine would consult each
        SSTable's Bloom filter before probing it) -> return value or None.
        """
        # Step 1: Check MemTable (hot data)
        if key in self.memtable:
            return self.memtable[key]
        # Step 2: Check SSTables in reverse order (newest first)
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        # Step 3: Not found
        return None

    def _flush_memtable(self):
        """
        Freeze MemTable and write to SSTable (sorted file).
        In production: serialize to disk; here: keep an in-memory dict.
        """
        if self.memtable:
            # Create immutable SSTable (sorted by key)
            sstable = dict(sorted(self.memtable.items()))
            self.sstables.append(sstable)
            # Clear MemTable and WAL
            self.memtable = OrderedDict()
            self.wal = []

    def _compact_if_needed(self):
        """
        Trigger compaction if too many L0 (newest) SSTables.
        Simple strategy: merge the two oldest SSTables.
        """
        if len(self.sstables) >= self.l0_threshold:
            # Merge two oldest SSTables; newer entries overwrite older ones
            sstable1 = self.sstables.pop(0)
            sstable2 = self.sstables.pop(0)
            merged = dict(sorted(list(sstable1.items()) + list(sstable2.items())))
            self.sstables.insert(0, merged)

    def scan_range(self, start_key, end_key):
        """
        Example range query: merge sorted results from all levels.
        """
        results = {}
        # Scan MemTable (newest data wins)
        for k, v in self.memtable.items():
            if start_key <= k <= end_key:
                results[k] = v
        # Scan SSTables newest to oldest -- first value seen wins
        for sstable in reversed(self.sstables):
            for k, v in sstable.items():
                if start_key <= k <= end_key and k not in results:
                    results[k] = v
        return sorted(results.items())


# Example usage
lsm = LSMTree(memtable_size=3, l0_threshold=2)
lsm.write('key1', 'value1')
lsm.write('key2', 'value2')
lsm.write('key3', 'value3')  # MemTable flush triggered
lsm.write('key4', 'value4')
print(lsm.read('key2'))                # 'value2', served from an SSTable
print(lsm.scan_range('key1', 'key3'))  # Range query example
```
Cassandra
Cassandra uses an LSM tree as its primary storage engine, defaulting to SizeTieredCompactionStrategy, which is optimized for write-heavy workloads. It accumulates MemTables (skip lists) and flushes them to immutable SSTables on disk. Configuration is tunable: memtable_flush_writers controls flush parallelism, and compaction_throughput_mb_per_sec limits compaction overhead during peak traffic. For read-heavy scenarios, administrators can switch a table to LeveledCompactionStrategy (lower read amplification) at the cost of higher write amplification.
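As a sketch, the knobs mentioned above live in cassandra.yaml (the values below are illustrative, not recommendations):

```yaml
# cassandra.yaml (illustrative values)
memtable_flush_writers: 2              # parallel MemTable flush threads
compaction_throughput_mb_per_sec: 64   # throttle on background compaction I/O
```

The compaction strategy itself is chosen per table (via the table's compaction options in CQL), so write-heavy and read-heavy tables can coexist in one cluster.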
RocksDB
Facebook’s RocksDB is a modern fork of LevelDB that adds universal compaction (another strategy), multiple column families, and block cache optimizations. It powers MyRocks (a MySQL storage engine) and CockroachDB’s storage layer. RocksDB supports both block-based and plain table formats; block-based is the default and trades CPU for I/O efficiency. Many cloud systems embed RocksDB to get high write throughput without the operational burden of a separate storage service.
LevelDB
Google’s LevelDB popularized the leveled LSM design in open source: each level k is ~10x larger than level k-1, and compaction moves data into higher-numbered levels in a controlled way. It’s a foundational piece used in Chrome (IndexedDB) and many embedded databases, and it influenced the entire LSM ecosystem. LevelDB’s simplicity makes it ideal for education and embedded systems.
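The ~10x fanout means level capacities grow geometrically, so a handful of levels covers a huge key space. Assuming a 10 MB first level (a commonly cited LevelDB-style default, used here only for illustration):

```python
# Per-level capacity with a 10x fanout, assuming L1 = 10 MB
fanout, l1_mb = 10, 10
for level in range(1, 6):
    print(f"L{level}: ~{l1_mb * fanout ** (level - 1):,} MB")
# L1: ~10 MB, L2: ~100 MB, ..., L5: ~100,000 MB (~100 GB)
```

Five levels already reach ~100 GB, which is why lookups only ever have to consult a small, bounded number of levels.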
InfluxDB (TSM — Time-Structured Merge Tree)
InfluxDB extends the LSM idea with TSM, a time-series variant where data is columnar and time-bucketed: separate MemTables per shard (e.g. an hour or a day of data), flushed to read-only files on disk, with compaction merging shards by time range. This columnar layout is more cache-efficient for time-series queries (e.g., “all temperature readings”) than a generic key-value LSM. Tuning lives in the storage engine’s configuration (e.g. cache-max-memory-size and the compaction thresholds).
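The per-shard MemTable idea can be sketched as routing each point to a time bucket with its own in-memory buffer; a range query then only touches the shards that overlap the queried interval. (Hourly buckets and the row layout here are simplifications, not InfluxDB's actual format.)

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # one shard per hour (simplified)

shards = defaultdict(list)  # bucket start time -> list of points

def write_point(measurement, value, ts):
    """Route the point to the shard covering its timestamp."""
    bucket = ts - (ts % BUCKET_SECONDS)
    shards[bucket].append((ts, measurement, value))

def query_range(measurement, start, end):
    """Only scan shards overlapping [start, end] -- the time-bucketing payoff."""
    out = []
    for bucket, points in shards.items():
        if bucket + BUCKET_SECONDS <= start or bucket > end:
            continue  # shard entirely outside the queried window
        out.extend((ts, v) for ts, m, v in points
                   if m == measurement and start <= ts <= end)
    return sorted(out)

write_point("temperature", 21.5, 1000)
write_point("temperature", 22.0, 4600)   # lands in a different shard
print(query_range("temperature", 0, 5000))  # [(1000, 21.5), (4600, 22.0)]
```

Dropping old data is equally cheap: expiring a retention window means deleting whole shard buckets rather than rewriting files.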
Kafka
Kafka uses log-structured storage inspired by LSM principles: append-only segment files (not exactly SSTables), with periodic compaction that removes old keys and keeps only the latest versions (log compaction mode). Each partition is an append-only log divided into segments; compaction consolidates segments and purges stale records. The broker settings log.retention.hours and log.cleanup.policy control this behavior. This design lets Kafka handle massive append throughput (billions of messages) with minimal random I/O.
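The essence of log compaction -- keep only the latest record per key, in offset order -- can be sketched over a list of segment records:

```python
def compact_log(records):
    """Keep only the last record for each key, preserving offset order.
    records: list of (offset, key, value) tuples in append order."""
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    return sorted((off, k, v) for k, (off, v) in latest.items())

log = [(0, "user1", "a"), (1, "user2", "b"), (2, "user1", "c")]
print(compact_log(log))  # [(1, 'user2', 'b'), (2, 'user1', 'c')]
```

Note that offsets of surviving records are unchanged: compaction removes superseded entries but never renumbers the log, so consumers can resume from stored offsets.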
References
- O’Neil et al. (1996) — The Log-Structured Merge-Tree — Original LSM research paper introducing write batching and level-based compaction
- Google Bigtable Paper — Describes SSTable format and compaction strategies used across Google’s infrastructure
- RocksDB Wiki & Documentation — Comprehensive guide to modern LSM implementation details, tuning, and compaction strategies
- ByteByteGo — The Secret Sauce Behind NoSQL: LSM Tree — Visual walkthrough of write and read paths, compaction trade-offs
- Wikipedia: Log-Structured Merge-Tree — Quick reference and historical context