How Sparse Indexes Work
A sparse index stores one entry per data block rather than one entry per record. Each entry records a key (typically the last or first key in the block) and the byte offset of that block on disk. This dramatically reduces index size — instead of indexing millions of records, the index only tracks thousands of blocks.
Sparse vs. dense index
A dense index has one entry for every key in the data file. It allows direct lookup of any key but can become very large. A sparse index has one entry per block (or page), requiring the reader to scan within a block after locating it. The tradeoff:
- Dense index: O(log n) lookup where n = number of records. Large index, may not fit in memory.
- Sparse index: O(log B) lookup where B = number of blocks, then O(block_size) scan within the block. Much smaller index, easily fits in memory.
For SSTables, the sparse approach is preferred because data blocks are small (4-64 KB) and scanning a single block is fast. Meanwhile, the index itself stays tiny enough to be cached entirely in memory.
Lookup algorithm
- Bloom filter — first, check whether the key might exist in this SSTable at all. If the Bloom filter says "definitely not", skip the file entirely.
- Binary search the sparse index — find the largest index entry whose key is less than or equal to the target key (or the smallest entry whose key is greater than or equal to the target). This identifies the data block.
- Read and scan the data block — load the block from disk (one I/O operation) and scan sequentially for the target key. Since the block is sorted, you could also binary search within it, though sequential scan is often faster for small blocks due to CPU cache friendliness.
Two-level indexing
For very large SSTables (gigabytes), even the sparse index can become large. A two-level index solves this:
- Top-level index — a small index (fits in memory) that points to index blocks.
- Index blocks — each index block covers a range of data blocks and is itself stored on disk.
- Data blocks — the actual key-value data.
Lookup: binary search the top-level index to find the right index block, read that index block, binary search it to find the right data block, then read and scan the data block. This adds one extra I/O but enables indexing terabyte-scale SSTables.
Index block caching
In practice, frequently accessed index blocks are cached in the block cache (a separate LRU cache from the buffer pool). Since the sparse index is read on nearly every lookup, it has high temporal locality and stays resident in cache. The combination of Bloom filter + cached sparse index means most point lookups require at most one disk read (for the data block).
| Index type | Entries | Memory usage | Lookups to find block |
|---|---|---|---|
| Dense | n (per record) | Large | O(log n) |
| Sparse (1-level) | B (per block) | Small | O(log B) |
| Sparse (2-level) | B + top-level | Very small (top in memory) | O(log T) + 1 I/O + O(log B') |
Real-Life: RocksDB Block-Based Table Index
RocksDB stores SSTables in its "block-based table format". The index works as follows:
One-level index (default):
- Each data block is ~4 KB. An SSTable might be 64 MB, containing ~16,000 data blocks.
- The sparse index has ~16,000 entries, each being a key + 8-byte offset. If average key size is 20 bytes, the index is ~450 KB — easily fits in the block cache.
- On a lookup, RocksDB binary searches this index in memory (zero disk I/O), then reads exactly one 4 KB data block from disk.
Two-level index (partitioned index):
- For very large SSTables (256 MB+), RocksDB supports partitioned indexes.
- The top-level index is ~10 KB and is always cached.
- Each index partition covers ~1,000 data blocks and is ~28 KB.
- On a lookup: binary search the top-level index (in memory), read one index partition (one I/O or cache hit), binary search it (in memory), read one data block (one I/O).
Performance impact:
- Without index caching: 2 disk reads per lookup (index block + data block).
- With cached index: 1 disk read per lookup (data block only).
- With Bloom filter rejecting 99% of misses: average 0.01 disk reads for keys not in the SSTable.
Example numbers for a 100 GB database with 1 billion keys:
- ~1,600 SSTables at 64 MB each.
- Per SSTable: ~16,000 index entries, ~450 KB index.
- Total index memory: ~700 MB. With block cache, this fits comfortably in memory.
- Point lookup: check ~5-10 Bloom filters (one per level), then 1 data block read. Total: ~1 disk I/O per lookup.