Back to DAG

SSTable

databases

What is an SSTable?

A Sorted String Table (SSTable) is an immutable, on-disk file containing key-value pairs stored in sorted order by key. SSTables are the persistent storage layer of LSM-tree databases. When a MemTable is flushed, it produces one SSTable.

Structure of an SSTable

An SSTable file is divided into several sections:

  1. Data blocks — the bulk of the file. Each data block contains a sequence of sorted key-value entries. Blocks are typically 4-64 KB. Within a block, entries are stored sequentially and may use prefix compression (common key prefixes are stored once).
  2. Index block — a sparse index with one entry per data block. Each entry records the last key (or first key) in the block and the block's byte offset in the file. This enables binary search over blocks.
  3. Metadata block — stores statistics like the number of entries, min/max keys, compression type, and a Bloom filter for the entire SSTable.
  4. Footer — fixed-size record at the end of the file pointing to the index block and metadata block offsets.

Lookup in an SSTable

To find a key in a single SSTable:

  1. Bloom filter check — consult the SSTable's Bloom filter. If it says "definitely not here", skip this entire file. This avoids any I/O for SSTables that do not contain the key.
  2. Binary search the sparse index — find the data block that could contain the key by binary searching the index block (which is usually cached in memory).
  3. Scan the data block — read the identified data block from disk and scan it sequentially (or binary search within it) for the target key.

Immutability

SSTables are never modified after creation. This design has several advantages:

  • No concurrency control needed for reads — multiple threads can read the same SSTable without locks.
  • Simple caching — an SSTable's contents never change, so cached blocks are always valid.
  • Easy to reason about correctness — the file is a snapshot in time.

Merging SSTables (compaction)

Over time, multiple SSTables accumulate with overlapping key ranges. Compaction merges multiple SSTables into a new, larger SSTable using an N-way merge of sorted sequences. During the merge, duplicate keys are resolved (the newest version wins), and tombstones remove deleted entries. The old SSTables are deleted after the merge completes.

Why sorted order?

Sorted order enables:

  • Efficient range queries — scan consecutive keys without jumping around.
  • Fast merging — N-way merge of sorted files is O(total entries).
  • Binary search — O(log n) point lookups within the sparse index.
  • Sequential disk writes during flush — writing a sorted stream is maximally efficient.
OperationTime complexityNotes
Point lookupO(log B) + disk readB = number of data blocks
Range scanO(k) per key in rangeSequential read
Merge (compaction)O(n)n = total entries across input SSTables
Write (flush)O(n)Sequential write from sorted MemTable

Real-Life: LevelDB SSTable Format

Real-World Example

LevelDB (created by Google, the foundation of RocksDB) uses SSTables as its on-disk format. Here is what a real SSTable looks like:

File layout:

[data block 0] [data block 1] ... [data block N]
[meta block (bloom filter)]
[meta index block]
[index block]
[footer (48 bytes)]

Data blocks: each block is ~4 KB. Keys use prefix compression — if consecutive keys share a prefix, the shared prefix is stored once. For example, storing user:1000, user:1001, user:1002 only writes the differing suffix for the last two keys.

Point lookup example:

  • Query: get("user:5000")
  • Step 1: check the Bloom filter → "maybe present" (all bits set).
  • Step 2: binary search the index block → data block 12 contains keys in range [user:4990, user:5050].
  • Step 3: read data block 12 from disk (4 KB read), scan for "user:5000" → found, return the value.

Compaction example:

  • Level 0 has four SSTables from recent flushes, with overlapping key ranges.
  • LevelDB merges them into Level 1 SSTables with non-overlapping key ranges.
  • During the merge, if "user:100" appears in two SSTables, the newer version wins.
  • If "user:200" has a tombstone in a newer SSTable, the key-value pair is dropped entirely.

SSTable File Structure

SSTable File Layout Block 0 a:1 b:2 c:3 d:4 e:5 Block 1 f:6 g:7 h:8 i:9 j:10 Block 2 k:11 m:12 n:13 p:14 Block 3 q:15 r:16 s:17 z:99 Bloom Filter (meta block) Sparse Index Block Footer Sparse Index (one entry per block) e → offset 0 j → offset 4096 p → offset 8192 z → offset 12288 Lookup: get("h") 1. Bloom filter → "maybe present" 2. Binary search index: "h" ≤ "j" → Block 1 (offset 4096) 3. Read Block 1, scan → found h:8 SSTables are immutable — never modified after creation. Concurrent reads need no locks.
Step 1 of 3