Back to DAG

MemTable

databases

What is a MemTable?

A MemTable is an in-memory sorted data structure that serves as the write buffer in an LSM-tree (Log-Structured Merge-tree) storage engine. All writes — inserts, updates, and deletes — go to the MemTable first, making writes extremely fast because they never touch disk directly.

How it works

  1. Write path: a client writes key-value pair (k, v). The entry is appended to the write-ahead log (WAL) on disk for durability, then inserted into the MemTable. The WAL append is sequential I/O (fast), and the MemTable insert is an in-memory operation.
  2. Read path: to read key k, first check the active MemTable (most recent data). If not found, check any immutable MemTables being flushed. If still not found, search on-disk SSTables.
  3. Flush: when the MemTable exceeds a size threshold (e.g., 64 MB), it is marked immutable and a background thread begins writing it to disk as a new SSTable (Sorted String Table). A fresh, empty MemTable immediately takes its place so writes are never blocked.

Internal data structure

The MemTable is typically implemented as a skip list or a red-black tree — both provide O(log n) insert and lookup while maintaining sorted order. Sorted order is critical because the flush operation must produce a sorted SSTable, and an already-sorted MemTable enables a simple sequential write during flush.

Skip lists are the most popular choice (used by LevelDB, RocksDB, and Cassandra) because they offer better concurrent write performance than balanced trees — concurrent inserts only need local locking rather than tree rebalancing.

Immutable MemTable during flush

When a MemTable is full, it becomes immutable — no new writes go to it. However, concurrent reads can still access it. This is important because the flush to disk takes time (hundreds of milliseconds to seconds), and queries during that window might need data that has not yet been written to the SSTable. The system may hold multiple immutable MemTables if flushes fall behind.

Deletes are writes

In an LSM-tree, a delete does not remove data from the MemTable. Instead, it writes a tombstone — a special marker indicating the key has been deleted. The actual data is removed later during compaction when SSTables are merged.

WAL protection

The WAL ensures no data is lost if the system crashes before a MemTable is flushed. On recovery, the database replays the WAL to rebuild any MemTable that was in memory at crash time. Once a MemTable is successfully flushed to an SSTable, its WAL segment can be discarded.

OperationTime complexityNotes
Put(k,v)O(log n)In-memory insert
Get(k)O(log n)In-memory lookup
Delete(k)O(log n)Writes a tombstone
FlushO(n)Sequential write to disk

Real-Life: RocksDB Write Path

Real-World Example

RocksDB (used by Meta, Netflix, LinkedIn, and many others) uses MemTables as the core of its write path:

  1. A client calls db.put("user:123", "{name: Alice}").
  2. RocksDB appends the entry to the WAL file on disk (sequential write, ~10 microseconds on SSD).
  3. The entry is inserted into the active MemTable (skip list insert, ~1 microsecond).
  4. The put() returns success. The entire write took about 11 microseconds — no random disk I/O at all.

Flush trigger: RocksDB's default MemTable size is 64 MB. When the active MemTable reaches this size:

  • It is sealed and becomes immutable.
  • A new empty MemTable is created.
  • A background thread writes the immutable MemTable to a Level-0 SSTable.
  • The WAL segment for the flushed MemTable is deleted.

Concurrency: RocksDB allows one writer at a time to the active MemTable (using a write batch lock), but multiple readers can concurrently read from both the active and immutable MemTables without blocking.

Why this is fast: traditional B-tree databases (like PostgreSQL or MySQL InnoDB) must perform random I/O to update pages on disk. An LSM-tree converts all writes to sequential I/O (WAL append + SSTable flush), which is 100x faster on HDDs and 10x faster on SSDs.

MemTable Write and Flush Flow

Write Path put("d", 4) WAL (append) sequential disk write Active MemTable "a" → 1 "b" → 2 "c" → 3 "d" → 4 ← new "f" → 6 sorted by key (skip list) When MemTable reaches size threshold: Immutable MemTable "a"→1 "b"→2 "c"→3 "d"→4 "f"→6 reads still served flush SSTable (on disk) [a:1][b:2][c:3][d:4][f:6] sorted, immutable file New Active MemTable (empty, accepting writes) On crash: replay WAL to rebuild any unflushed MemTable. Once flush completes, WAL segment is discarded.
Step 1 of 2