
LSM Tree & SSTable

Log-Structured Merge Tree -- optimizes for write-heavy workloads by batching writes in memory, then flushing them to disk as sorted runs. Reads pay the cost of checking multiple levels.


SSTable Structure

| Component | Purpose |
|-----------|---------|
| Data blocks | Sorted key-value pairs |
| Index block | Sparse index (every Nth key -> byte offset) |
| Bloom filter | Fast "key not here" check |
| Footer | Pointers to index & filter blocks |
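The read path through these components can be sketched as follows. This is a minimal in-memory model, not any engine's real on-disk format; the toy "Bloom filter" is a plain set, whereas real filters are probabilistic:

```python
import bisect

class SSTableReader:
    """Illustrative SSTable read: Bloom filter first, then the sparse
    index to locate a data block, then a scan within that one block."""

    def __init__(self, sorted_items, block_size=4):
        # Data blocks: fixed-count groups of sorted key-value pairs
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Sparse index: first key of each block, in block order
        self.index_keys = [block[0][0] for block in self.blocks]
        # Toy "Bloom filter": a set (real ones allow false positives)
        self.bloom = set(k for k, _ in sorted_items)

    def get(self, key):
        if key not in self.bloom:          # fast "key not here" check
            return None
        # Last block whose first key <= key holds the key, if anywhere
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:        # scan one block only
            if k == key:
                return v
        return None

sst = SSTableReader([(f'key{i:02}', i) for i in range(10)], block_size=4)
print(sst.get('key07'))  # -> 7
print(sst.get('zzz'))    # -> None (rejected by the filter)
```

The sparse index is the key trade: only one key per block is kept in memory, so lookups cost one binary search plus one block scan instead of indexing every key.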

Compaction Strategies

| Strategy | How | Best For |
|----------|-----|----------|
| Size-tiered | Merge SSTables of similar size | Write-heavy |
| Leveled | Bounded levels; lower read amplification, higher write amplification | Read-balanced |
| FIFO | Discard oldest SSTables | Time-series / append-only |
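Size-tiered selection can be sketched as bucketing SSTables of similar size and merging a bucket once it holds enough tables. The thresholds below mirror the idea, not any engine's exact defaults:

```python
def pick_size_tiered_candidates(sstable_sizes, bucket_ratio=2.0, min_group=4):
    """Group SSTables of similar size; return one group ready to merge.

    sstable_sizes: list of (name, size_bytes). A table joins the current
    bucket when its size is within bucket_ratio of the bucket's smallest
    member; otherwise it starts a new bucket.
    """
    buckets = []
    for name, size in sorted(sstable_sizes, key=lambda t: t[1]):
        if buckets and size <= buckets[-1][0][1] * bucket_ratio:
            buckets[-1].append((name, size))
        else:
            buckets.append([(name, size)])
    for bucket in buckets:
        if len(bucket) >= min_group:   # enough similar tables: merge them
            return [name for name, _ in bucket]
    return []

sizes = [('a', 10), ('b', 12), ('c', 11), ('d', 13), ('e', 500)]
print(pick_size_tiered_candidates(sizes))  # -> ['a', 'c', 'b', 'd']
```

Note that the 500-byte outlier is left alone: size-tiered compaction only merges peers, which is what keeps its write amplification low.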

Key Trade-offs

| | LSM Tree | B-Tree |
|---|---------|--------|
| Write speed | Fast (sequential) | Slower (random I/O) |
| Read speed | Slower (multi-level) | Fast (single lookup) |
| Space | Temporary amplification during compaction | Compact |
| Best for | Write-heavy (Cassandra, Kafka) | Read-heavy (MySQL, Postgres) |

When to Use / Avoid

  • Use for high write throughput (logs, IoT, time-series)
  • Use for append-heavy workloads
  • Use for SSD-friendly sequential I/O
  • Avoid for read-heavy with point lookups
  • Avoid when you need low read latency guarantees

Real Systems

| System | How Used |
|--------|----------|
| Cassandra | Primary storage engine |
| LevelDB / RocksDB | Foundation for many NoSQL DBs |
| HBase | Built on a Bigtable-style LSM (HFiles) |
| InfluxDB | Time-series storage |
| Kafka | Log segments (LSM-inspired) |

Implementation

from collections import OrderedDict

class LSMTree:
    """Simplified LSM Tree with MemTable, SSTable, and basic compaction."""

    def __init__(self, memtable_size=5, l0_threshold=2):
        self.memtable = OrderedDict()  # In-memory, sorted dict
        self.memtable_max_size = memtable_size
        self.wal = []  # Write-ahead log
        self.sstables = []  # List of sorted string tables
        self.l0_threshold = l0_threshold

    def write(self, key, value):
        """
        Write path: append to WAL -> insert into MemTable ->
        if full, freeze & flush to disk.
        """
        # Step 1: Write to Write-Ahead Log (crash safety)
        self.wal.append(('SET', key, value))

        # Step 2: Insert into MemTable (sorted by key)
        self.memtable[key] = value

        # Step 3: If MemTable exceeds threshold, flush to SSTable
        if len(self.memtable) >= self.memtable_max_size:
            self._flush_memtable()
            self._compact_if_needed()

    def read(self, key):
        """
        Read path: check MemTable first (newest) ->
        check SSTables newest to oldest ->
        check Bloom filter (optional) -> return value or None.
        """
        # Step 1: Check MemTable (hot data)
        if key in self.memtable:
            return self.memtable[key]

        # Step 2: Check SSTables in reverse order (newest first)
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]

        # Step 3: Not found
        return None

    def _flush_memtable(self):
        """
        Freeze MemTable and write to SSTable (sorted file).
        In production: serialize to disk; here: keep in-memory dict.
        """
        if self.memtable:
            # Create immutable SSTable (sorted by key)
            sstable = dict(sorted(self.memtable.items()))
            self.sstables.append(sstable)

            # Clear MemTable and WAL
            self.memtable = OrderedDict()
            self.wal = []

    def _compact_if_needed(self):
        """
        Trigger compaction if too many L0 (newest) SSTables.
        Simple strategy: merge two oldest SSTables.
        """
        if len(self.sstables) >= self.l0_threshold:
            # Merge two oldest SSTables
            sstable1 = self.sstables.pop(0)
            sstable2 = self.sstables.pop(0)
            merged = dict(sorted(list(sstable1.items()) + list(sstable2.items())))
            self.sstables.insert(0, merged)

    def scan_range(self, start_key, end_key):
        """
        Example range query: merge sorted results from all levels.
        """
        results = {}

        # Scan MemTable
        for k, v in self.memtable.items():
            if start_key <= k <= end_key:
                results[k] = v

        # Scan SSTables (newest to oldest — first value wins)
        for sstable in reversed(self.sstables):
            for k, v in sstable.items():
                if start_key <= k <= end_key and k not in results:
                    results[k] = v

        return sorted(results.items())

# Example usage
lsm = LSMTree(memtable_size=3, l0_threshold=2)
lsm.write('key1', 'value1')
lsm.write('key2', 'value2')
lsm.write('key3', 'value3')  # MemTable flush triggered
lsm.write('key4', 'value4')
print(lsm.read('key2'))  # Returns 'value2' from SSTable
print(lsm.scan_range('key1', 'key3'))  # Range query example
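The WAL kept by the write path exists so that an unflushed MemTable can be rebuilt after a crash. A standalone replay sketch (the DEL tombstone record is an assumption here; the class above only writes SET records):

```python
from collections import OrderedDict

def recover_memtable(wal_records):
    """Rebuild the MemTable by replaying WAL records in order.
    Later records for a key overwrite earlier ones; a DEL record
    (tombstone) removes the key entirely."""
    memtable = OrderedDict()
    for record in wal_records:
        if record[0] == 'SET':
            _, key, value = record
            memtable[key] = value
        elif record[0] == 'DEL':
            memtable.pop(record[1], None)
    return memtable

wal = [('SET', 'k1', 'v1'), ('SET', 'k2', 'v2'),
       ('SET', 'k1', 'v1-updated'), ('DEL', 'k2', None)]
print(recover_memtable(wal))  # only k1 survives, with its latest value
```

This is also why the WAL can be truncated after a flush: once the MemTable's contents are durable in an SSTable, the records are redundant.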


Cassandra

Cassandra uses an LSM tree as its primary storage engine, with SizeTieredCompactionStrategy as the default, optimized for write-heavy workloads. It accumulates writes in MemTables (skip lists) and flushes them to immutable SSTables on disk. Behavior is tunable: memtable_flush_writers controls flush parallelism, and compaction_throughput_mb_per_sec caps compaction I/O during peak traffic. For read-heavy scenarios, administrators can switch to LeveledCompactionStrategy (lower read amplification) at the cost of higher write amplification.
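The two settings named above live in cassandra.yaml; a fragment with illustrative values (not tuning recommendations):

```yaml
# cassandra.yaml -- values are illustrative only
memtable_flush_writers: 2            # parallel MemTable flush threads
compaction_throughput_mb_per_sec: 64 # throttle compaction I/O
```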

RocksDB

Facebook’s RocksDB is a fork of LevelDB that adds Universal Compaction (an additional strategy), multiple column families, and block-cache optimizations. It powers MyRocks (a MySQL storage engine) and served as CockroachDB’s storage layer. RocksDB supports both block-based and plain-table formats; block-based is the default and trades CPU for I/O efficiency. Many cloud systems embed RocksDB to get high write throughput without the operational burden of a separate storage service.

LevelDB

Google’s LevelDB popularized the LSM design in open source, introducing level-based compaction: each level k is ~10x larger than level k-1, and compaction moves data down into higher-numbered levels in a controlled way. It is used in Chrome (IndexedDB) and many embedded databases, and it influenced the entire LSM ecosystem. LevelDB’s simplicity makes it ideal for education and embedded systems.
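The ~10x rule means capacity grows geometrically with level number; a quick sketch of the implied sizes, assuming a 10 MB base level in the spirit of LevelDB's classic defaults:

```python
def level_capacities(base_mb=10, growth=10, levels=6):
    """Capacity of each level when level k is `growth`x larger than
    level k-1. base_mb is an assumed base-level size."""
    return [base_mb * growth ** k for k in range(levels)]

for i, cap in enumerate(level_capacities(), start=1):
    print(f'L{i}: {cap:,} MB')
# Six levels already span 10 MB up to ~1 TB, which is why a handful
# of levels suffices and why each lookup touches only a few files.
```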

InfluxDB (TSM — Time-Structured Merge Tree)

InfluxDB extends the LSM idea with TSM, a time-series variant where data is columnar and time-bucketed: separate MemTables per shard (an hour or a day), flushed to a read-only cache and then disk, with compaction merging files by time range. This columnar layout is more cache-efficient for time-series queries (e.g., “all temperature readings”) than a key-value LSM. Configuration: cache-max-memory-size, compact-threshold-ratio, compact-min-file-count.
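Time-bucketing can be sketched as routing each point into a per-hour shard, so flush and compaction always operate on a bounded time range. This is a toy model; TSM's real on-disk layout is columnar and considerably more involved:

```python
from collections import defaultdict

def bucket_points(points, bucket_seconds=3600):
    """Route (timestamp, series, value) points into hourly shards."""
    shards = defaultdict(list)
    for ts, series, value in points:
        shard_start = ts - (ts % bucket_seconds)   # floor to the hour
        shards[shard_start].append((ts, series, value))
    return dict(shards)

points = [(100, 'temp', 21.5), (3700, 'temp', 22.0), (3650, 'humidity', 40)]
shards = bucket_points(points)
print(sorted(shards))  # two hourly shards: one starting at 0, one at 3600
```

A query for "last hour of temperature readings" then touches only one shard's files, which is the cache-efficiency win the paragraph describes.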

Kafka

Kafka uses log-structured storage inspired by LSM principles: append-only segment files (not exactly SSTables), with periodic compaction that removes superseded keys and keeps only the latest value per key (log compaction mode). Each partition is an append-only log divided into segments; compaction consolidates segments and purges old records. Broker settings log.retention.hours and log.cleanup.policy control this behavior. The design lets Kafka handle massive append throughput (billions of messages) with minimal random I/O.
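The log-compaction behavior described above can be sketched as a single pass that keeps the newest record per key. Treating a None value as a delete marker mirrors Kafka's null-value tombstones; everything else here is a simplification:

```python
def compact_log(segment):
    """Keep only the newest record per key, preserving log order.
    A record whose value is None acts as a tombstone: the key is
    dropped from the compacted output entirely."""
    latest = {}
    for offset, (key, value) in enumerate(segment):
        latest[key] = (offset, value)          # later offsets win
    in_log_order = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(k, v) for k, (_, v) in in_log_order if v is not None]

segment = [('user1', 'a'), ('user2', 'b'), ('user1', 'c'), ('user2', None)]
print(compact_log(segment))  # -> [('user1', 'c')]
```

This is exactly the guarantee log.cleanup.policy=compact provides: the log shrinks to at least one latest record per key, rather than being truncated by age.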


This post is licensed under CC BY 4.0 by the author.