
Vectorized Execution


From Tuple-at-a-Time to Vectorized Processing

Traditional database engines use the Volcano (iterator) model: each operator in the query plan implements a next() method that returns one tuple at a time. The top operator calls next() on its child, which calls next() on its child, and so on down the tree. While elegant and composable, this model has a critical performance problem: every single tuple incurs a virtual function call at every operator boundary. For a query scanning 100 million rows through 5 operators, that is 500 million virtual function calls, each one a potential branch misprediction and instruction-cache miss.
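A minimal Python sketch of the iterator model makes the per-tuple overhead concrete. The operator classes and the dict-per-row shape here are illustrative, not any particular engine's API:

```python
# Volcano-style plan: SUM(price) over rows where quantity > 10.
# Every tuple crosses every operator boundary via a next() call.

class Scan:
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)          # one tuple per call

class Filter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        row = self.child.next()
        while row is not None and not self.pred(row):
            row = self.child.next()
        return row

class Sum:
    def __init__(self, child, key):
        self.child, self.key = child, key
    def run(self):
        total = 0.0
        row = self.child.next()
        while row is not None:              # one next() round-trip per tuple
            total += row[self.key]
            row = self.child.next()
        return total

rows = [{"price": p, "quantity": q}
        for p, q in [(9.99, 12), (4.50, 3), (12.0, 40)]]
plan = Sum(Filter(Scan(rows), lambda r: r["quantity"] > 10), "price")
print(plan.run())  # 21.99
```

With three rows the call overhead is invisible; at 100 million rows, those per-tuple next() hops dominate the runtime.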

The Vectorized Approach

Vectorized execution replaces the one-tuple-at-a-time model with a batch-at-a-time model. Instead of next() returning a single row, it returns a vector (array) of values — typically 1024 tuples at once. Each operator processes the entire batch before passing it to the next operator. This dramatically reduces per-tuple overhead: the function call cost is amortized across 1024 tuples instead of 1.
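The batch contract can be sketched the same way: next() hands back a slice of up to 1024 values instead of a single row. The names (VectorScan, BATCH) are made up for illustration:

```python
BATCH = 1024

class VectorScan:
    """Returns column slices of up to BATCH values per next() call."""
    def __init__(self, prices):
        self.prices, self.pos = prices, 0
    def next(self):
        if self.pos >= len(self.prices):
            return None
        batch = self.prices[self.pos:self.pos + BATCH]
        self.pos += BATCH
        return batch

scan = VectorScan(list(range(3000)))
sizes = []
batch = scan.next()
while batch is not None:        # one call per 1024 rows, not per row
    sizes.append(len(batch))
    batch = scan.next()
print(sizes)  # [1024, 1024, 952]
```

Three calls move 3000 values; the tuple-at-a-time model would have needed 3000.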

Columnar In-Memory Format

Vectorized engines store each column of the batch in a contiguous array (columnar layout). When evaluating WHERE price > 100, the engine iterates over a tight array of price values rather than jumping between row records. This layout is SIMD-friendly: modern CPUs can compare 4 or 8 values simultaneously using SIMD instructions (SSE, AVX2, AVX-512). A single AVX-512 instruction can compare eight 64-bit integers at once.
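NumPy is a convenient stand-in for this access pattern: its elementwise operations run as tight C loops over contiguous arrays (SIMD-accelerated where the build supports it), which is exactly the columnar evaluation described above:

```python
import numpy as np

# Columnar layout: all prices live in one contiguous array.
price = np.array([9.99, 150.0, 42.0, 101.5, 88.0])

# One vectorized comparison over the whole column, touching nothing
# but price values -- no per-row branching in Python.
mask = price > 100
print(mask.tolist())          # [False, True, False, True, False]
print(price[mask].tolist())   # [150.0, 101.5]
```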

Predicate Evaluation on Vectors

Instead of evaluating price > 100 AND quantity < 50 row by row, the vectorized engine:

  1. Evaluates price > 100 on the entire price column vector, producing a selection vector (bitmask) of qualifying positions.
  2. Evaluates quantity < 50 only on positions that passed the first filter.
  3. Passes the selection vector to subsequent operators, avoiding materialization of intermediate results.
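The steps above can be sketched with NumPy, using an index-based selection vector (the column names and data are illustrative):

```python
import numpy as np

price    = np.array([120.0, 50.0, 200.0, 300.0, 10.0])
quantity = np.array([   10,   20,    60,    30,    5])

# Step 1: selection vector of positions where price > 100.
sel = np.flatnonzero(price > 100)     # [0, 2, 3]

# Step 2: evaluate the second predicate only at surviving positions.
sel = sel[quantity[sel] < 50]         # [0, 3]

# Step 3: downstream operators receive sel, not a copied-out table.
print(sel.tolist())  # [0, 3]
```

Only positions, never whole intermediate rows, flow between the predicates.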

Why Batch Size 1024?

The batch size is chosen to fit in L1/L2 cache while being large enough to amortize function-call overhead. A vector of 1024 64-bit values is 8 KB — well within a 32 KB L1 data cache. Larger batches would spill to L2 or L3, losing the cache locality benefit. Smaller batches do not amortize overhead enough.
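The sizing arithmetic is easy to check directly (32 KB is the typical L1 data cache size quoted above):

```python
# 8 bytes per 64-bit value; compare batch footprint to a 32 KB L1d.
for batch_size in (128, 1024, 8192):
    kb = batch_size * 8 / 1024
    verdict = "fits in L1" if kb <= 32 else "spills past L1"
    print(batch_size, f"{kb:.0f} KB", verdict)
# 128   -> 1 KB,  fits in L1 (but amortizes call overhead poorly)
# 1024  -> 8 KB,  fits in L1
# 8192  -> 64 KB, spills past L1
```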

Systems Using Vectorized Execution

DuckDB is a fully vectorized analytical database. ClickHouse uses vectorized execution over its column-oriented storage. Meta's Velox is a vectorized execution library used by Presto and Spark. MonetDB/X100 (later commercialized as VectorWise) pioneered the approach in the academic paper that started it all.

Volcano vs. Vectorized: Performance Comparison

Real-World Example

Consider a simple query: SELECT SUM(price) FROM orders WHERE quantity > 10. The table has 100 million rows.

Volcano Model (tuple-at-a-time):

SUM operator calls next() → Filter calls next() → Scan calls next()
For EACH of 100M rows:
  - Scan fetches 1 row from storage (function call #1)
  - Filter checks quantity > 10 (function call #2)
  - If passes, SUM adds price (function call #3)
Total: ~300M virtual function calls

Vectorized Model (batch-at-a-time):

SUM calls next() → Filter calls next() → Scan calls next()
For EACH batch of 1024 rows (~97,657 batches):
  - Scan fetches 1024 rows into column vectors (1 function call)
  - Filter evaluates quantity > 10 on entire vector using SIMD (1 function call)
  - SUM aggregates matching prices (1 function call)
Total: ~293K function calls (1000x fewer)
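A toy, runnable version of this comparison (scaled down to 100,000 rows so it finishes instantly; the three-calls-per-row accounting mirrors the tallies above):

```python
import numpy as np

# SELECT SUM(price) FROM orders WHERE quantity > 10, both ways.
N = 100_000
rng = np.random.default_rng(42)
price = rng.uniform(1.0, 100.0, N)
quantity = rng.integers(0, 20, N)

# Tuple-at-a-time: scan, filter check, and aggregate step per row.
calls = 0
total_volcano = 0.0
for i in range(N):
    calls += 3                      # scan.next + filter + sum
    if quantity[i] > 10:
        total_volcano += price[i]

# Batch-at-a-time: three operator calls per 1024-row batch.
batch_calls = 0
total_vec = 0.0
for start in range(0, N, 1024):
    batch_calls += 3
    q = quantity[start:start + 1024]
    p = price[start:start + 1024]
    total_vec += p[q > 10].sum()    # vectorized filter + sum

print(calls, batch_calls)  # 300000 294 -- roughly 1000x fewer
```

Both plans compute the same sum; only the number of operator-boundary calls changes.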

Real-world benchmarks show 5-10x speedup from vectorized execution alone, even without changing the storage format. The speedup comes from:

  1. Fewer function calls: 1000x reduction
  2. Better branch prediction: tight loops have predictable branches
  3. SIMD utilization: comparing/summing 4-8 values per instruction
  4. Cache locality: column vectors fit in L1 cache

DuckDB, itself an embedded database like SQLite, consistently outperforms SQLite and even PostgreSQL on analytical queries, in large part because of its vectorized engine.

Volcano vs. Vectorized Execution

[Diagram: side by side, a Volcano plan (SUM ← Filter ← Scan, next() returns 1 row, ~300M function calls) and a vectorized plan (SUM ← Filter ← Scan, next() returns 1024 rows, ~293K function calls). Inset: a column vector of prices (9.99, 4.50, 12.0, 7.25, ... × 1024) with an AVX2 SIMD compare handling 4 values per instruction.]