What is Compaction?
Compaction is the process of merging and reorganizing SSTables (Sorted String Tables) in an LSM-tree storage engine. It is essential for maintaining read performance, reclaiming disk space from deleted or overwritten keys, and bounding the number of files the database must search.
Why compaction is necessary
In an LSM-tree, writes go to an in-memory buffer (memtable), which is periodically flushed to disk as an immutable SSTable. Over time, this creates many SSTables. A point query might need to search all of them (from newest to oldest) to find the latest version of a key. Without compaction, read performance degrades linearly as SSTables accumulate.
Compaction merges multiple SSTables into fewer, larger ones. During the merge, it:
- Deduplicates: keeps only the latest version of each key.
- Removes tombstones: deletion markers can be discarded once the compaction has seen every older version of the key, so no stale copy can resurface. In leveled schemes this typically means tombstones are only dropped when compacting into the bottommost level.
- Sorts and consolidates: the output SSTables are sorted, enabling efficient binary search.
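The merge steps above can be sketched in a few lines. This is a minimal in-memory model, not a real engine (production compaction streams SSTables from disk with a k-way merge); `TOMBSTONE` is a hypothetical deletion marker:

```python
TOMBSTONE = object()  # hypothetical deletion marker

def compact(sstables):
    """Merge SSTables into one deduplicated, sorted run.

    Each SSTable is a sorted list of (key, value) pairs; sstables[0]
    is the newest. Toy sketch: real engines stream a k-way merge.
    """
    merged = {}
    # Walk oldest -> newest so newer versions overwrite older ones.
    for table in reversed(sstables):
        for key, value in table:
            merged[key] = value
    # Drop tombstones: safe here because every older version was merged.
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("b", 20), ("c", TOMBSTONE)]
print(compact([new, old]))  # [('a', 1), ('b', 20)]
```

Note how `b` keeps only its newest value and `c` disappears entirely: the tombstone cancels the old version, and then the tombstone itself is dropped.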
The amplification tradeoff
Every compaction strategy makes a tradeoff between three costs:
- Write amplification: total bytes written to disk / bytes written by the user. Compaction rewrites data, so the same key may be written many times across its lifetime.
- Read amplification: number of SSTables that must be checked for a point query. Fewer, well-organized SSTables = lower read amplification.
- Space amplification: total disk space used / actual live data size. Temporary duplicates during compaction increase space usage.
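To make the three definitions concrete, here is a toy calculation. All numbers are made-up assumptions for illustration, not measurements from any real system:

```python
# Hypothetical workload: the user logically wrote 1 GB of data.
user_bytes = 1.0                 # GB written by the application
disk_bytes_written = 8.0         # GB actually written (WAL + flush + compactions)
sstables_checked_per_read = 4    # files a point query must probe
live_data = 1.0                  # GB of current (latest-version) data
disk_space_used = 1.5            # GB on disk, including stale duplicates

write_amp = disk_bytes_written / user_bytes        # 8.0x
read_amp = sstables_checked_per_read               # 4 files per lookup
space_amp = disk_space_used / live_data            # 1.5x

print(write_amp, read_amp, space_amp)
```

Every compaction strategy discussed below can be scored on these three numbers; improving one generally worsens at least one of the others.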
Size-Tiered Compaction (STCS)
SSTables of similar size are grouped into tiers. When a tier accumulates enough SSTables (e.g., 4), they are merged into a single larger SSTable in the next tier.
- Good write amplification: each key's data is rewritten only about O(log n) times as it moves up through the tiers.
- Bad space amplification: stale versions of a key can linger across tiers until a merge happens to cover them, and merging a tier temporarily requires space for both the inputs and the output (up to ~2x).
- Bad read amplification: multiple SSTables in the same tier may cover the same key range, so all must be searched.
- Used by: Apache Cassandra (default), HBase.
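The tier-selection logic can be sketched as follows. This is a simplification under stated assumptions (tiers bucketed by power of two, fixed threshold of 4); real STCS implementations such as Cassandra's use average-size buckets with tolerance bands:

```python
import math

def pick_compaction(sstable_sizes, threshold=4):
    """Group SSTables into size tiers and return the members of one
    tier that has reached the merge threshold, or None.

    Hypothetical sketch: tiers are keyed by the size's power of two.
    """
    tiers = {}
    for size in sstable_sizes:
        tiers.setdefault(int(math.log2(size)), []).append(size)
    for members in tiers.values():
        if len(members) >= threshold:
            return members  # merge these into one SSTable in the next tier
    return None

# Four ~1.5MB tables trigger a merge; the lone 8MB table is left alone.
print(pick_compaction([1_200_000, 1_500_000, 1_300_000, 1_900_000, 8_000_000]))
```

Once merged, the output lands in a larger tier and the process repeats, which is how an SSTable ends up rewritten O(log n) times over its lifetime.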
Leveled Compaction (LCS)
Data is organized into levels (L0, L1, L2, ...) where each level is roughly 10x larger than the previous. Within each level (except L0), key ranges do not overlap — at most one SSTable covers any given key range.
- Good read amplification: for any key, at most one SSTable per level needs to be checked (plus all L0 tables). With a Bloom filter, even non-existent keys are quickly ruled out.
- Bad write amplification: merging a single SSTable from level L into level L+1 requires rewriting all SSTables in L+1 that overlap with it, roughly as many as the size ratio (~10). Write amplification is therefore ~10 per level, or ~10 * num_levels total.
- Good space amplification: ~10% overhead. Roughly 90% of the data lives in the bottom level, so stale duplicates in upper levels are bounded, and only the compaction-in-progress tables are temporarily duplicated.
- Used by: LevelDB, RocksDB (default), Cassandra (option).
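The read-path advantage of leveling can be sketched directly. This is a hypothetical in-memory model (real SSTables live on disk behind block indexes and Bloom filters), but it shows why only one file per level needs checking:

```python
from bisect import bisect_right

def get(key, l0_tables, levels):
    """Point lookup in a leveled layout (toy in-memory sketch).

    l0_tables: list of dicts, newest first (L0 files may overlap).
    levels: one list per level L1+; each is a sorted list of
            (min_key, max_key, table_dict) with disjoint key ranges.
    """
    # L0 key ranges overlap, so every L0 file is checked, newest first.
    for table in l0_tables:
        if key in table:
            return table[key]
    # In L1 and below, ranges are disjoint: binary-search the single
    # candidate file per level instead of scanning them all.
    for level in levels:
        min_keys = [t[0] for t in level]
        i = bisect_right(min_keys, key) - 1
        if i >= 0:
            lo, hi, table = level[i]
            if lo <= key <= hi and key in table:
                return table[key]
    return None

l0 = [{"b": 2}]
l1 = [("a", "c", {"a": 1, "c": 3}), ("m", "z", {"m": 13})]
print(get("m", l0, [l1]))  # 13
```

With n levels, a lookup touches at most (number of L0 files) + n SSTables, versus potentially every SSTable under size-tiered compaction.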
Universal Compaction
RocksDB also offers universal compaction, a size-tiered-style strategy that dynamically chooses merge candidates to trade some space amplification for lower write amplification. It is particularly effective for write-heavy workloads with large datasets.
Real-Life: RocksDB Leveled Compaction
RocksDB (used by Meta's MySQL variant MyRocks, CockroachDB, TiDB, and many others) uses leveled compaction by default. Here is how it works in practice:
Level sizes (with 10x multiplier and 64MB L1 target):
- L0: small flushed SSTables (say, 64MB total)
- L1: 64MB (10 SSTables of ~6.4MB each)
- L2: 640MB
- L3: 6.4GB
- L4: 64GB
- L5: 640GB
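The level targets above are just a geometric series in the multiplier, which can be computed directly (a sketch using the document's assumed 64MB L1 target and 10x ratio, not RocksDB's API):

```python
def level_targets(l1_mb=64, fanout=10, levels=5):
    """Target size in MB for each level L1..Ln, growing by `fanout`."""
    return [l1_mb * fanout**i for i in range(levels)]

print(level_targets())  # [64, 640, 6400, 64000, 640000] MB, i.e. 64MB..640GB
```

In RocksDB these correspond to the `max_bytes_for_level_base` and `max_bytes_for_level_multiplier` options.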
A typical compaction sequence:
- A memtable flush produces a new SSTable in L0.
- When L0 has 4 SSTables, they are merged into L1. Because L0 tables can have overlapping key ranges, all 4 must be merged simultaneously.
- When L1 exceeds 64MB, one SSTable from L1 is picked and merged with all overlapping SSTables in L2.
- The merge produces new SSTables in L2. The old input SSTables are deleted.
- This cascades: if L2 exceeds its target, an SSTable is compacted down to L3, and so on.
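The "merge with all overlapping SSTables" step above is where write amplification comes from, and the candidate selection can be sketched in one line. Key ranges here are hypothetical single-letter bounds for illustration:

```python
def overlapping(target_level, lo, hi):
    """Return the SSTables in the target level whose key range
    intersects [lo, hi]; all of them must be rewritten together
    with the input file from the level above.

    target_level: sorted list of (min_key, max_key) ranges (sketch).
    """
    return [(a, b) for (a, b) in target_level if a <= hi and b >= lo]

l2 = [("a", "d"), ("e", "k"), ("l", "p"), ("q", "z")]
# Compacting an L1 file covering keys "f".."m" drags in two L2 files:
print(overlapping(l2, "f", "m"))  # [('e', 'k'), ('l', 'p')]
```

Because L2 is ~10x larger than L1, an L1 file typically overlaps about 10 L2 files, which is the ~10x per-level rewrite cost noted earlier.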
Write amplification example: A key written by the user is first written to the WAL, then to the memtable, then flushed to L0, then compacted to L1, L2, L3, and so on. With a 10x size ratio and 5 levels, the write amplification is roughly 1 (WAL) + 1 (flush) + 10 * 4 (one ~10x rewrite for each of the four level transitions) = ~42x. This means for every 1 GB the user writes, ~42 GB of disk I/O occurs.
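That arithmetic generalizes to any fanout and level count. A small sketch of the back-of-the-envelope formula used above (an estimate, not a RocksDB measurement):

```python
def leveled_write_amp(fanout, levels):
    """Rough write amplification for leveled compaction:
    1x for the WAL, 1x for the memtable flush to L0, then about
    `fanout` rewrites for each of the (levels - 1) level transitions."""
    return 1 + 1 + fanout * (levels - 1)

print(leveled_write_amp(fanout=10, levels=5))  # 42
```

The formula also shows the lever operators actually have: lowering the fanout or the level count reduces write amplification, at the cost of more levels to read or more space overhead.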