Back to DAG

LSM Tree

databases

What is an LSM Tree?

A Log-Structured Merge Tree (LSM Tree) is a write-optimized data structure that converts random writes into sequential writes, achieving extremely high write throughput. It is the storage engine behind LevelDB, RocksDB, Cassandra, HBase, CockroachDB, and many other modern databases.

Write Path

  1. Append to WAL: Every write is first appended to a Write-Ahead Log on disk for durability. This is a sequential write, which is fast even on spinning disks.
  2. Insert into Memtable: The write is then inserted into an in-memory sorted data structure (typically a skip list or red-black tree) called the memtable.
  3. Flush to SSTable: When the memtable reaches a size threshold, it is written out to disk as a sorted, immutable file called an SSTable (Sorted String Table) at Level 0 (L0). A new empty memtable is created.

Read Path

Reading is more complex because data may be spread across the memtable and multiple levels of SSTables:

  1. Check the memtable first (fastest, in memory).
  2. Check L0 SSTables — these may have overlapping key ranges, so all must be checked.
  3. Check L1, L2, ... SSTables — at each level, SSTables have non-overlapping key ranges, so binary search finds the right file, then search within it.
  4. Bloom filters attached to each SSTable quickly rule out files that definitely do not contain the key, avoiding unnecessary disk reads.

Compaction

Over time, SSTables accumulate. Compaction merges SSTables from one level into the next, combining, sorting, and discarding deleted/overwritten entries. This keeps read amplification bounded and reclaims space from tombstones (deletion markers). Two common strategies are size-tiered compaction (merge similarly-sized SSTables) and leveled compaction (merge L_i into L_{i+1} maintaining non-overlapping ranges).

Trade-offs

AspectLSM TreeB-Tree
Write throughputVery highModerate
Read latencyHigherLower
Space amplificationHigherLower
Write amplificationHigher (compaction)Lower

LSM trees trade read performance and space for superior write throughput. They excel in write-heavy workloads like time-series data, logging, and messaging systems.

Real-Life: How RocksDB Powers Modern Systems

Real-World Example

RocksDB, developed by Facebook, is the most widely-used LSM tree implementation. It powers an enormous range of systems:

Write path in action:

  • A social media platform ingests millions of events per second (likes, comments, views). Each event is appended to the WAL and inserted into the memtable. Memtables flush to L0 SSTables every few seconds. Background compaction continuously merges levels.

Read path in action:

  • When you load a user's profile, RocksDB checks the memtable, then L0, then L1, etc. Bloom filters skip most SSTables immediately. For a recently-written key, it is often found in the memtable or L0 — very fast.

Systems built on LSM trees:

  • CockroachDB / TiKV: use RocksDB (or its fork Pebble) as the storage engine for distributed SQL
  • Cassandra: uses an LSM tree for each node's local storage, handling massive write volumes for time-series and IoT data
  • HBase: Hadoop's LSM-based column store for petabyte-scale analytics
  • LevelDB: Google's lightweight LSM implementation, used in Chrome's IndexedDB

LSM Tree: Write and Read Paths

Write Path WAL sequential Memtable in-memory flush SST SST SST L0 (overlapping) compact SST [a-m] SST [n-z] L1 (non-overlapping) [a-f] [g-p] [q-z] L2 (larger) Read Path: get("key") 1. Check memtable 2. Check L0 SSTables (all, with bloom filters) 3. Check L1, L2, ... (binary search + bloom filter) 4. Return first match found (most recent version) Bloom Filter Bloom BF

Interactive LSM Tree

Loading demo...
Step 1 of 2