Sorted Arrays and Binary Search
A sorted array is the simplest data structure that supports efficient search. By maintaining elements in sorted order, we can use binary search to find any element in O(log n) time — far better than the O(n) linear scan required for unsorted data.
Binary search algorithm
- Set low = 0, high = length - 1.
- Compute mid = low + (high - low) / 2 (avoids integer overflow compared to (low + high) / 2).
- If array[mid] == target, return mid.
- If array[mid] < target, search the right half: low = mid + 1.
- If array[mid] > target, search the left half: high = mid - 1.
- If low > high, the element is not present.
Binary search variants
Binary search is not just about exact matches. Two critical variants are used throughout database internals:
-
Lower bound — find the first index where array[i] >= key. This is used to find the insertion point, or the start of a range query. When the exact key is not present, it returns where the key would go.
-
Upper bound — find the first index where array[i] > key. Combined with lower bound, this defines a range: all elements equal to key live in [lower_bound, upper_bound).
Insertion cost
The downside of sorted arrays is insertion: O(n). To insert a new element, you must shift all larger elements one position to the right to make room. This makes sorted arrays impractical as a mutable data structure for large datasets — but perfect for immutable, write-once structures.
Merging two sorted arrays
Given two sorted arrays of sizes n and m, they can be merged into a single sorted array in O(n + m) time using the classic merge algorithm: maintain two pointers, always take the smaller element, advance that pointer.
Where sorted arrays appear in databases
- SSTable data blocks: each block in an SSTable (Sorted String Table) is a sorted array of key-value pairs. Lookups within a block use binary search.
- B-tree node keys: the keys within a single B-tree node are a sorted array. The node uses binary search to route queries to the correct child.
- Sorted runs in merge sort: external merge sort produces sorted runs that are merged — the same sorted-array merge algorithm.
- Interpolation search: for uniformly distributed data, you can estimate the position as low + ((target - array[low]) * (high - low)) / (array[high] - array[low]), achieving expected O(log log n).
Real-Life: Binary Search in SSTable Blocks
In an LSM-tree database like LevelDB or RocksDB, data is stored in SSTables — immutable files of sorted key-value pairs. Each SSTable is divided into blocks (typically 4 KB).
Reading a key from an SSTable block:
- The block contains a sorted array of keys with their values.
- The database performs binary search over the keys to find the target.
- If found, it returns the value. If not, it knows the key is not in this block.
Building an SSTable (flush from memtable):
- The memtable (a sorted in-memory structure like a skip list) is iterated in order.
- Keys are written sequentially into SSTable blocks — already sorted.
- No insertion shifting needed because the file is written once, sequentially.
Compaction (merging SSTables):
- Two sorted SSTables are merged into one, using the two-pointer merge algorithm.
- This is O(n + m) — the same merge algorithm used in merge sort.
This pattern — build sorted arrays in bulk, search with binary search, merge when needed — is the foundation of LSM-tree storage engines used in Cassandra, RocksDB, LevelDB, and ScyllaDB.