What is a MemTable?
A MemTable is an in-memory sorted data structure that serves as the write buffer in an LSM-tree (Log-Structured Merge-tree) storage engine. All writes — inserts, updates, and deletes — go to the MemTable first, making writes extremely fast because they never touch disk directly.
How it works
- Write path: a client writes key-value pair (k, v). The entry is appended to the write-ahead log (WAL) on disk for durability, then inserted into the MemTable. The WAL append is sequential I/O (fast), and the MemTable insert is an in-memory operation.
- Read path: to read key k, first check the active MemTable (most recent data). If not found, check any immutable MemTables being flushed. If still not found, search on-disk SSTables.
- Flush: when the MemTable exceeds a size threshold (e.g., 64 MB), it is marked immutable and a background thread begins writing it to disk as a new SSTable (Sorted String Table). A fresh, empty MemTable immediately takes its place so writes are never blocked.
Internal data structure
The MemTable is typically implemented as a skip list or a red-black tree — both provide O(log n) insert and lookup while maintaining sorted order. Sorted order is critical because the flush operation must produce a sorted SSTable, and an already-sorted MemTable enables a simple sequential write during flush.
Skip lists are the most popular choice (used by LevelDB, RocksDB, and Cassandra) because they offer better concurrent write performance than balanced trees — concurrent inserts only need local locking rather than tree rebalancing.
Immutable MemTable during flush
When a MemTable is full, it becomes immutable — no new writes go to it. However, concurrent reads can still access it. This is important because the flush to disk takes time (hundreds of milliseconds to seconds), and queries during that window might need data that has not yet been written to the SSTable. The system may hold multiple immutable MemTables if flushes fall behind.
Deletes are writes
In an LSM-tree, a delete does not remove data from the MemTable. Instead, it writes a tombstone — a special marker indicating the key has been deleted. The actual data is removed later during compaction when SSTables are merged.
WAL protection
The WAL ensures no data is lost if the system crashes before a MemTable is flushed. On recovery, the database replays the WAL to rebuild any MemTable that was in memory at crash time. Once a MemTable is successfully flushed to an SSTable, its WAL segment can be discarded.
| Operation | Time complexity | Notes |
|---|---|---|
| Put(k,v) | O(log n) | In-memory insert |
| Get(k) | O(log n) | In-memory lookup |
| Delete(k) | O(log n) | Writes a tombstone |
| Flush | O(n) | Sequential write to disk |
Real-Life: RocksDB Write Path
RocksDB (used by Meta, Netflix, LinkedIn, and many others) uses MemTables as the core of its write path:
- A client calls
db.put("user:123", "{name: Alice}"). - RocksDB appends the entry to the WAL file on disk (sequential write, ~10 microseconds on SSD).
- The entry is inserted into the active MemTable (skip list insert, ~1 microsecond).
- The put() returns success. The entire write took about 11 microseconds — no random disk I/O at all.
Flush trigger: RocksDB's default MemTable size is 64 MB. When the active MemTable reaches this size:
- It is sealed and becomes immutable.
- A new empty MemTable is created.
- A background thread writes the immutable MemTable to a Level-0 SSTable.
- The WAL segment for the flushed MemTable is deleted.
Concurrency: RocksDB allows one writer at a time to the active MemTable (using a write batch lock), but multiple readers can concurrently read from both the active and immutable MemTables without blocking.
Why this is fast: traditional B-tree databases (like PostgreSQL or MySQL InnoDB) must perform random I/O to update pages on disk. An LSM-tree converts all writes to sequential I/O (WAL append + SSTable flush), which is 100x faster on HDDs and 10x faster on SSDs.