Sorted Arrays and Binary Search

Builds on to(7)

Binary Search Tree SSTable Inverted Index IVF + PQ (vector quantization)Relational Model (Codd)Join Algorithms (NLJ, Hash, Merge)Column Statistics & Histograms

A sorted array is the simplest data structure that supports efficient search. By maintaining elements in sorted order, we can use binary search to find any element in O(log n) time — far better than the O(n) linear scan required for unsorted data.

Binary search algorithm

Set low = 0, high = length - 1.
Compute mid = low + (high - low) / 2 (avoids integer overflow compared to (low + high) / 2).
If array[mid] == target, return mid.
If array[mid] < target, search the right half: low = mid + 1.
If array[mid] > target, search the left half: high = mid - 1.
If low > high, the element is not present.

Binary search variants

Binary search is not just about exact matches. Two critical variants are used throughout database internals:

Lower bound — find the first index where array[i] >= key. This is used to find the insertion point, or the start of a range query. When the exact key is not present, it returns where the key would go.
Upper bound — find the first index where array[i] > key. Combined with lower bound, this defines a range: all elements equal to key live in [lower_bound, upper_bound).

Insertion cost

The downside of sorted arrays is insertion: O(n). To insert a new element, you must shift all larger elements one position to the right to make room. This makes sorted arrays impractical as a mutable data structure for large datasets — but perfect for immutable, write-once structures.

Merging two sorted arrays

Given two sorted arrays of sizes n and m, they can be merged into a single sorted array in O(n + m) time using the classic merge algorithm: maintain two pointers, always take the smaller element, advance that pointer.

Where sorted arrays appear in databases

SSTable data blocks: each block in an SSTable (Sorted String Table) is a sorted array of key-value pairs. Lookups within a block use binary search.
B-tree node keys: the keys within a single B-tree node are a sorted array. The node uses binary search to route queries to the correct child.
Sorted runs in merge sort: external merge sort produces sorted runs that are merged — the same sorted-array merge algorithm.
Interpolation search: for uniformly distributed data, you can estimate the position as low + ((target - array[low]) * (high - low)) / (array[high] - array[low]), achieving expected O(log log n).

Real-Life: Binary Search in SSTable Blocks

Real-World Example

In an LSM-tree database like LevelDB or RocksDB, data is stored in SSTables — immutable files of sorted key-value pairs. Each SSTable is divided into blocks (typically 4 KB).

Reading a key from an SSTable block:

The block contains a sorted array of keys with their values.
The database performs binary search over the keys to find the target.
If found, it returns the value. If not, it knows the key is not in this block.

Building an SSTable (flush from memtable):

The memtable (a sorted in-memory structure like a skip list) is iterated in order.
Keys are written sequentially into SSTable blocks — already sorted.
No insertion shifting needed because the file is written once, sequentially.

Compaction (merging SSTables):

Two sorted SSTables are merged into one, using the two-pointer merge algorithm.
This is O(n + m) — the same merge algorithm used in merge sort.

This pattern — build sorted arrays in bulk, search with binary search, merge when needed — is the foundation of LSM-tree storage engines used in Cassandra, RocksDB, LevelDB, and ScyllaDB.

Binary Search: Finding Key 42

Step 1 of 3

Sorted Array & Binary Search