What is an LSM Tree?
A Log-Structured Merge Tree (LSM Tree) is a write-optimized data structure that converts random writes into sequential writes, achieving extremely high write throughput. It is the storage engine behind LevelDB, RocksDB, Cassandra, HBase, CockroachDB, and many other modern databases.
Write Path
- Append to WAL: Every write is first appended to a Write-Ahead Log on disk for durability. This is a sequential write, which is fast even on spinning disks.
- Insert into Memtable: The write is then inserted into an in-memory sorted data structure (typically a skip list or red-black tree) called the memtable.
- Flush to SSTable: When the memtable reaches a size threshold, it is written out to disk as a sorted, immutable file called an SSTable (Sorted String Table) at Level 0 (L0). A new empty memtable is created.
Read Path
Reading is more complex because data may be spread across the memtable and multiple levels of SSTables:
- Check the memtable first (fastest, in memory).
- Check L0 SSTables — these may have overlapping key ranges, so all must be checked.
- Check L1, L2, ... SSTables — at each level, SSTables have non-overlapping key ranges, so binary search finds the right file, then search within it.
- Bloom filters attached to each SSTable quickly rule out files that definitely do not contain the key, avoiding unnecessary disk reads.
Compaction
Over time, SSTables accumulate. Compaction merges SSTables from one level into the next, combining, sorting, and discarding deleted/overwritten entries. This keeps read amplification bounded and reclaims space from tombstones (deletion markers). Two common strategies are size-tiered compaction (merge similarly-sized SSTables) and leveled compaction (merge L_i into L_{i+1} maintaining non-overlapping ranges).
Trade-offs
| Aspect | LSM Tree | B-Tree |
|---|---|---|
| Write throughput | Very high | Moderate |
| Read latency | Higher | Lower |
| Space amplification | Higher | Lower |
| Write amplification | Higher (compaction) | Lower |
LSM trees trade read performance and space for superior write throughput. They excel in write-heavy workloads like time-series data, logging, and messaging systems.
Real-Life: How RocksDB Powers Modern Systems
RocksDB, developed by Facebook, is the most widely-used LSM tree implementation. It powers an enormous range of systems:
Write path in action:
- A social media platform ingests millions of events per second (likes, comments, views). Each event is appended to the WAL and inserted into the memtable. Memtables flush to L0 SSTables every few seconds. Background compaction continuously merges levels.
Read path in action:
- When you load a user's profile, RocksDB checks the memtable, then L0, then L1, etc. Bloom filters skip most SSTables immediately. For a recently-written key, it is often found in the memtable or L0 — very fast.
Systems built on LSM trees:
- CockroachDB / TiKV: use RocksDB (or its fork Pebble) as the storage engine for distributed SQL
- Cassandra: uses an LSM tree for each node's local storage, handling massive write volumes for time-series and IoT data
- HBase: Hadoop's LSM-based column store for petabyte-scale analytics
- LevelDB: Google's lightweight LSM implementation, used in Chrome's IndexedDB