What is an LSM Tree?

Prerequisites(5)

MemTable SSTable Compaction (leveled vs tiered)Write-Ahead Log (WAL)Bloom Filter

Builds on to(1)

A Log-Structured Merge Tree (LSM Tree) is a write-optimized data structure that converts random writes into sequential writes, achieving extremely high write throughput. It is the storage engine behind LevelDB, RocksDB, Cassandra, HBase, CockroachDB, and many other modern databases.

Write Path

Append to WAL: Every write is first appended to a Write-Ahead Log on disk for durability. This is a sequential write, which is fast even on spinning disks.
Insert into Memtable: The write is then inserted into an in-memory sorted data structure (typically a skip list or red-black tree) called the memtable.
Flush to SSTable: When the memtable reaches a size threshold, it is written out to disk as a sorted, immutable file called an SSTable (Sorted String Table) at Level 0 (L0). A new empty memtable is created.

Read Path

Reading is more complex because data may be spread across the memtable and multiple levels of SSTables:

Check the memtable first (fastest, in memory).
Check L0 SSTables — these may have overlapping key ranges, so all must be checked.
Check L1, L2, ... SSTables — at each level, SSTables have non-overlapping key ranges, so binary search finds the right file, then search within it.
Bloom filters attached to each SSTable quickly rule out files that definitely do not contain the key, avoiding unnecessary disk reads.

Compaction

Over time, SSTables accumulate. Compaction merges SSTables from one level into the next, combining, sorting, and discarding deleted/overwritten entries. This keeps read amplification bounded and reclaims space from tombstones (deletion markers). Two common strategies are size-tiered compaction (merge similarly-sized SSTables) and leveled compaction (merge L_i into L_{i+1} maintaining non-overlapping ranges).

Trade-offs

Aspect	LSM Tree	B-Tree
Write throughput	Very high	Moderate
Read latency	Higher	Lower
Space amplification	Higher	Lower
Write amplification	Higher (compaction)	Lower

LSM trees trade read performance and space for superior write throughput. They excel in write-heavy workloads like time-series data, logging, and messaging systems.

Real-Life: How RocksDB Powers Modern Systems

Real-World Example

RocksDB, developed by Facebook, is the most widely-used LSM tree implementation. It powers an enormous range of systems:

Write path in action:

A social media platform ingests millions of events per second (likes, comments, views). Each event is appended to the WAL and inserted into the memtable. Memtables flush to L0 SSTables every few seconds. Background compaction continuously merges levels.

Read path in action:

When you load a user's profile, RocksDB checks the memtable, then L0, then L1, etc. Bloom filters skip most SSTables immediately. For a recently-written key, it is often found in the memtable or L0 — very fast.

Systems built on LSM trees:

CockroachDB / TiKV: use RocksDB (or its fork Pebble) as the storage engine for distributed SQL
Cassandra: uses an LSM tree for each node's local storage, handling massive write volumes for time-series and IoT data
HBase: Hadoop's LSM-based column store for petabyte-scale analytics
LevelDB: Google's lightweight LSM implementation, used in Chrome's IndexedDB

LSM Tree: Write and Read Paths

Interactive LSM Tree

InteractiveWrite keys to a memtable, flush to SSTables, compact levels

Memtable (0/4)

empty

Level 0 SSTables (0)empty

Level 1

empty

Loading demo...

Step 1 of 2