What is Columnar Storage?
Columnar storage (also called the Decomposition Storage Model or DSM) stores each column of a table in a separate file or segment. Instead of grouping all columns of one row together, it groups all values of one column together.
Structure
For a table with columns (id, name, salary, department) and 1 million rows:
- Row storage: 1 million tuples, each containing all 4 fields, packed onto pages.
- Column storage: 4 separate column files. The `salary` file contains 1 million salary values back-to-back; the `name` file contains 1 million name values.
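The two layouts can be sketched in a few lines of Python. This is a toy illustration with made-up rows, not an actual on-disk page format:

```python
# Row vs. column layout for a tiny (id, name, salary, department) table.
# Toy data; real engines pack these values into fixed-size pages.
rows = [
    (1, "Ada", 120000, "Engineering"),
    (2, "Ben", 95000, "Sales"),
    (3, "Cyd", 130000, "Engineering"),
]

# Row storage (NSM): each tuple keeps all of its fields together.
row_store = rows

# Column storage (DSM): one sequence per column, aligned by position.
column_store = {
    "id":         [r[0] for r in rows],
    "name":       [r[1] for r in rows],
    "salary":     [r[2] for r in rows],
    "department": [r[3] for r in rows],
}

# An aggregate touches only one column's values.
avg_salary = sum(column_store["salary"]) / len(column_store["salary"])
print(avg_salary)  # 115000.0
```

Note that `row_store` and `column_store` hold exactly the same data; only the grouping differs, and that grouping determines what a query must read.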
Why columnar storage dominates analytics
1. I/O efficiency: An analytical query SELECT AVG(salary) FROM employees reads only the salary column file. If each row has 50 columns totaling 200 bytes, but salary is 8 bytes, column storage reads 25x less data from disk. For large tables, this difference is measured in gigabytes of saved I/O.
2. Excellent compression: Values in the same column share the same data type and often have similar distributions. A column of US state codes (50 distinct values) compresses far better than mixed-type row data. Common techniques:
- Dictionary encoding: replace strings with small integer codes, storing each distinct string once in a dictionary. "California" (10 bytes) becomes a 1-byte code.
- Run-Length Encoding (RLE): if sorted by department, "Engineering" repeated 10,000 times becomes (Engineering, 10000).
- Delta encoding: for timestamps or sequential IDs, store differences between consecutive values.
- Bit-packing: a column with values 0-15 needs only 4 bits per value instead of 32.
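The four encodings above can be sketched as toy Python functions. These are illustrations of the ideas, not real codecs (a production format would pack results into bytes and handle edge cases):

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct string to a small integer code, assigned in order seen."""
    codes, encoded = {}, []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return codes, encoded

def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    """Store the first value, then differences between consecutive values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def bit_pack(values, bits=4):
    """Pack small integers (each < 2**bits) into one big int, `bits` per value."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    return packed

states = ["CA", "CA", "NY", "CA", "NY", "NY"]
print(dictionary_encode(states))           # ({'CA': 0, 'NY': 1}, [0, 0, 1, 0, 1, 1])
print(rle_encode(sorted(states)))          # [('CA', 3), ('NY', 3)]
print(delta_encode([100, 103, 107, 110]))  # [100, 3, 4, 3]
```

Notice how RLE only pays off after sorting: the unsorted `states` list has five runs, the sorted one has two.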
3. SIMD-friendly: Modern CPUs have SIMD (Single Instruction, Multiple Data) instructions that process 4, 8, or 16 values simultaneously. A column of integers laid out contiguously in memory is a perfect SIMD target. Operations like "filter where salary > 100000" can evaluate 8 values per CPU cycle using AVX2.
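A rough illustration using NumPy: its vectorized comparisons run over a contiguous array in compiled loops, which compilers typically auto-vectorize with SIMD instructions (such as AVX2) on CPUs that support them, instead of evaluating one value at a time in Python:

```python
import numpy as np

# One million contiguous 64-bit salaries. The range and seed are arbitrary.
rng = np.random.default_rng(seed=0)
salaries = rng.integers(30_000, 250_000, size=1_000_000)

# "filter where salary > 100000": one vectorized pass over the column.
mask = salaries > 100_000
high_earners = salaries[mask]   # selection by boolean mask
print(mask.sum(), "rows match")
```

The same predicate over a row-oriented layout would have to stride past the other columns of every record, defeating both the cache and the vector units.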
4. Late materialization: The query engine can operate on compressed column data as long as possible, only reconstructing full rows at the very end when producing results. This avoids creating intermediate tuples.
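A small sketch of late materialization with hypothetical data: the filter runs on compact dictionary codes, and strings are decoded only for the rows that survive:

```python
# Hypothetical dictionary-encoded country column plus a numeric column.
countries = ["US", "DE", "US", "FR", "DE", "US"]
dictionary = {"US": 0, "DE": 1, "FR": 2}
codes = [dictionary[c] for c in countries]   # stored column: [0, 1, 0, 2, 1, 0]
response_ms = [120, 80, 95, 300, 60, 110]

# WHERE country = 'DE': compare tiny integer codes, never strings.
target = dictionary["DE"]
hits = [i for i, code in enumerate(codes) if code == target]

# Materialize (decode) only at the end, and only for matching rows.
reverse = {v: k for k, v in dictionary.items()}
result = [(reverse[codes[i]], response_ms[i]) for i in hits]
print(result)  # [('DE', 80), ('DE', 60)]
```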
Tuple reconstruction
When a query needs multiple columns (SELECT name, salary WHERE salary > 100000), the database must reconstruct tuples by matching positions across column files. Row 42's name is at position 42 in the name file, and its salary is at position 42 in the salary file. This position-based alignment is fast but means column files must stay perfectly synchronized.
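Positional reconstruction can be sketched as follows, with hypothetical column data standing in for the column files:

```python
# Two column "files" loaded as lists; row i is the i-th entry of each.
name_col = ["Ada", "Ben", "Cyd", "Dee"]
salary_col = [120000, 95000, 130000, 101000]

# SELECT name, salary WHERE salary > 100000:
# 1. Scan only the salary column to find matching positions.
positions = [i for i, s in enumerate(salary_col) if s > 100000]

# 2. Fetch those same positions from the name column to rebuild tuples.
result = [(name_col[i], salary_col[i]) for i in positions]
print(result)  # [('Ada', 120000), ('Cyd', 130000), ('Dee', 101000)]
```

If the two lists ever fell out of alignment, the reconstructed tuples would silently pair the wrong name with the wrong salary, which is why column files must stay perfectly synchronized.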
When columnar storage struggles
- Point queries: `SELECT * FROM users WHERE id = 42` must read from every column file to reconstruct one row, turning a single read into many small I/O operations.
- Single-row inserts/updates: modifying one row requires updating every column file. Most columnar databases batch writes and use an in-memory delta store for recent inserts.
- Row-level transactions: OLTP-style workloads with concurrent single-row operations are a poor fit.
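The batched-write pattern mentioned above can be sketched as a minimal class. The class name, flush threshold, and API are hypothetical, not any particular database's design:

```python
class ColumnTable:
    """Toy columnar table with an in-memory, row-oriented delta store."""

    def __init__(self, columns, flush_threshold=1000):
        self.columns = {c: [] for c in columns}  # the "column files"
        self.delta = []                          # recent rows, row-oriented
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.delta.append(row)                   # cheap single-row write
        if len(self.delta) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One batched append per column instead of one write per row.
        for row in self.delta:
            for col, value in zip(self.columns, row):
                self.columns[col].append(value)
        self.delta.clear()

    def scan(self, col):
        # Reads must merge the column file with the unflushed delta.
        idx = list(self.columns).index(col)
        return self.columns[col] + [r[idx] for r in self.delta]
```

A real engine would also have to merge the delta store into query results with proper visibility rules; this sketch only shows why single-row inserts stay cheap while the column files are updated in batches.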
Columnar databases in practice
| Database | Type | Notes |
|---|---|---|
| ClickHouse | Column-native | Open-source, extremely fast analytics |
| DuckDB | Column-native | Embedded analytical database (SQLite for analytics) |
| Amazon Redshift | Column-native | Cloud data warehouse |
| BigQuery | Column-native | Serverless analytics (Google) |
| Parquet | Column file format | Used by Spark, Hive, Presto |
| Apache Arrow | Column in-memory format | Zero-copy interchange between systems |
Real-Life: ClickHouse Analytics
ClickHouse is a columnar database that powers real-time analytics at Cloudflare, Uber, and eBay. It demonstrates why column storage matters.
Benchmark scenario: A table of 1 billion web analytics events with 30 columns (timestamp, url, user_agent, country, response_time, status_code, ...). Total uncompressed size: ~300 GB.
Query: "Average response time by country for the last hour":

```sql
SELECT country, AVG(response_time)
FROM events
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY country;
```
Row storage (PostgreSQL): must read all 30 columns, a ~300 GB scan. Even at SSD speeds (3 GB/s), that is 100 seconds.
Column storage (ClickHouse):
- Reads only 3 columns: timestamp (8 bytes), country (2 bytes dict-encoded), response_time (4 bytes).
- Total data: 14 bytes * 1B rows = 14 GB uncompressed.
- With compression (typically 5-10x): reads ~2 GB from disk.
- At SSD speeds: < 1 second.
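The arithmetic above as a quick back-of-envelope script. The 7x compression factor is an assumption picked from the 5-10x range stated; the other numbers come from the scenario:

```python
# Back-of-envelope I/O for the query above.
rows = 1_000_000_000
row_width = 300          # bytes per row across all 30 columns (~300 GB total)
needed = 8 + 2 + 4       # timestamp + dict-encoded country + response_time

row_scan_gb = rows * row_width / 1e9   # full-table scan (row store)
col_scan_gb = rows * needed / 1e9      # three columns only (column store)
compressed_gb = col_scan_gb / 7        # assumed ~7x compression (5-10x typical)

ssd_gbps = 3.0
print(f"row store:    {row_scan_gb:.0f} GB -> {row_scan_gb / ssd_gbps:.0f} s")
print(f"column store: {compressed_gb:.0f} GB -> {compressed_gb / ssd_gbps:.1f} s")
```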
Compression in action: The country column has ~200 distinct values for 1 billion rows. Dictionary encoding reduces each value to 1 byte. RLE after sorting by timestamp compresses further. The column goes from 2 GB to ~200 MB.
Parquet files bring the same benefits to data lakes. A Spark query over 10 TB of Parquet data with 100 columns, reading only 3 of them, scans roughly 300 GB instead of 10 TB, and Parquet's compression shrinks the actual I/O further still, making "big data" queries feasible on modest hardware.