What is Columnar Storage?
Columnar storage (also called the Decomposition Storage Model or DSM) stores each column of a table in a separate file or segment. Instead of grouping all columns of one row together, it groups all values of one column together.
Structure
For a table with columns (id, name, salary, department) and 1 million rows:
- Row storage: 1 million tuples, each containing all 4 fields, packed onto pages.
- Column storage: 4 separate column files. The `salary` file contains 1 million salary values back-to-back; the `name` file contains 1 million name values.
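The two layouts can be sketched in a few lines of Python. This is a toy illustration with made-up rows, not an actual on-disk page format:

```python
# Row vs. column layout for a tiny (id, name, salary, department) table.
# Toy data; real engines pack these values into fixed-size pages.
rows = [
    (1, "Ada", 120000, "Engineering"),
    (2, "Ben", 95000, "Sales"),
    (3, "Cyd", 130000, "Engineering"),
]

# Row storage (NSM): each tuple keeps all of its fields together.
row_store = rows

# Column storage (DSM): one sequence per column, aligned by position.
column_store = {
    "id":         [r[0] for r in rows],
    "name":       [r[1] for r in rows],
    "salary":     [r[2] for r in rows],
    "department": [r[3] for r in rows],
}

# An aggregate touches only one column's values.
avg_salary = sum(column_store["salary"]) / len(column_store["salary"])
print(avg_salary)  # 115000.0
```

Note that `row_store` and `column_store` hold exactly the same data; only the grouping differs, and that grouping determines what a query must read.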
Why columnar storage dominates analytics
1. I/O efficiency: An analytical query SELECT AVG(salary) FROM employees reads only the salary column file. If each row has 50 columns totaling 200 bytes, but salary is 8 bytes, column storage reads 25x less data from disk. For large tables, this difference is measured in gigabytes of saved I/O.
2. Excellent compression: Values in the same column share the same data type and often have similar distributions. A column of US state codes (50 distinct values) compresses far better than mixed-type row data. Common techniques:
- Dictionary encoding: replace strings with small integer codes, storing each distinct string once in a dictionary. "California" (10 bytes) becomes a 1-byte code.
- Run-Length Encoding (RLE): if sorted by department, "Engineering" repeated 10,000 times becomes (Engineering, 10000).
- Delta encoding: for timestamps or sequential IDs, store differences between consecutive values.
- Bit-packing: a column with values 0-15 needs only 4 bits per value instead of 32.
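The four encodings above can be sketched as toy Python functions. These are illustrations of the ideas, not real codecs (a production format would pack results into bytes and handle edge cases):

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct string to a small integer code, assigned in order seen."""
    codes, encoded = {}, []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return codes, encoded

def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    """Store the first value, then differences between consecutive values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def bit_pack(values, bits=4):
    """Pack small integers (each < 2**bits) into one big int, `bits` per value."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    return packed

states = ["CA", "CA", "NY", "CA", "NY", "NY"]
print(dictionary_encode(states))           # ({'CA': 0, 'NY': 1}, [0, 0, 1, 0, 1, 1])
print(rle_encode(sorted(states)))          # [('CA', 3), ('NY', 3)]
print(delta_encode([100, 103, 107, 110]))  # [100, 3, 4, 3]
```

Notice how RLE only pays off after sorting: the unsorted `states` list has five runs, the sorted one has two.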
3. SIMD-friendly: Modern CPUs have SIMD (Single Instruction, Multiple Data) instructions that process 4, 8, or 16 values simultaneously. A column of integers laid out contiguously in memory is a perfect SIMD target. Operations like "filter where salary > 100000" can evaluate 8 values per CPU cycle using AVX2.
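A rough illustration using NumPy: its vectorized comparisons run over a contiguous array in compiled loops, which compilers typically auto-vectorize with SIMD instructions (such as AVX2) on CPUs that support them, instead of evaluating one value at a time in Python:

```python
import numpy as np

# One million contiguous 64-bit salaries. The range and seed are arbitrary.
rng = np.random.default_rng(seed=0)
salaries = rng.integers(30_000, 250_000, size=1_000_000)

# "filter where salary > 100000": one vectorized pass over the column.
mask = salaries > 100_000
high_earners = salaries[mask]   # selection by boolean mask
print(mask.sum(), "rows match")
```

The same predicate over a row-oriented layout would have to stride past the other columns of every record, defeating both the cache and the vector units.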
4. Late materialization: The query engine can operate on compressed column data as long as possible, only reconstructing full rows at the very end when producing results. This avoids creating intermediate tuples.
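A small sketch of late materialization with hypothetical data: the filter runs on compact dictionary codes, and strings are decoded only for the rows that survive:

```python
# Hypothetical dictionary-encoded country column plus a numeric column.
countries = ["US", "DE", "US", "FR", "DE", "US"]
dictionary = {"US": 0, "DE": 1, "FR": 2}
codes = [dictionary[c] for c in countries]   # stored column: [0, 1, 0, 2, 1, 0]
response_ms = [120, 80, 95, 300, 60, 110]

# WHERE country = 'DE': compare tiny integer codes, never strings.
target = dictionary["DE"]
hits = [i for i, code in enumerate(codes) if code == target]

# Materialize (decode) only at the end, and only for matching rows.
reverse = {v: k for k, v in dictionary.items()}
result = [(reverse[codes[i]], response_ms[i]) for i in hits]
print(result)  # [('DE', 80), ('DE', 60)]
```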
Tuple reconstruction
When a query needs multiple columns (SELECT name, salary WHERE salary > 100000), the database must reconstruct tuples by matching positions across column files. Row 42's name is at position 42 in the name file, and its salary is at position 42 in the salary file. This position-based alignment is fast but means column files must stay perfectly synchronized.
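Positional reconstruction can be sketched as follows, with hypothetical column data standing in for the column files:

```python
# Two column "files" loaded as lists; row i is the i-th entry of each.
name_col = ["Ada", "Ben", "Cyd", "Dee"]
salary_col = [120000, 95000, 130000, 101000]

# SELECT name, salary WHERE salary > 100000:
# 1. Scan only the salary column to find matching positions.
positions = [i for i, s in enumerate(salary_col) if s > 100000]

# 2. Fetch those same positions from the name column to rebuild tuples.
result = [(name_col[i], salary_col[i]) for i in positions]
print(result)  # [('Ada', 120000), ('Cyd', 130000), ('Dee', 101000)]
```

If the two lists ever fell out of alignment, the reconstructed tuples would silently pair the wrong name with the wrong salary, which is why column files must stay perfectly synchronized.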
When columnar storage struggles
- Point queries: `SELECT * FROM users WHERE id = 42` must read from every column file to reconstruct one row, turning a single read into many small I/O operations.
- Single-row inserts/updates: modifying one row requires updating every column file. Most columnar databases batch writes and use an in-memory delta store for recent inserts.
- Row-level transactions: OLTP-style workloads with concurrent single-row operations are a poor fit.
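The batched-write pattern mentioned above can be sketched as a minimal class. The class name, flush threshold, and API are hypothetical, not any particular database's design:

```python
class ColumnTable:
    """Toy columnar table with an in-memory, row-oriented delta store."""

    def __init__(self, columns, flush_threshold=1000):
        self.columns = {c: [] for c in columns}  # the "column files"
        self.delta = []                          # recent rows, row-oriented
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.delta.append(row)                   # cheap single-row write
        if len(self.delta) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One batched append per column instead of one write per row.
        for row in self.delta:
            for col, value in zip(self.columns, row):
                self.columns[col].append(value)
        self.delta.clear()

    def scan(self, col):
        # Reads must merge the column file with the unflushed delta.
        idx = list(self.columns).index(col)
        return self.columns[col] + [r[idx] for r in self.delta]
```

A real engine would also have to merge the delta store into query results with proper visibility rules; this sketch only shows why single-row inserts stay cheap while the column files are updated in batches.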
Columnar databases in practice
| Database | Type | Notes |
|---|---|---|
| ClickHouse | Column-native | Open-source, extremely fast analytics |
| DuckDB | Column-native | Embedded analytical database (SQLite for analytics) |
| Amazon Redshift | Column-native | Cloud data warehouse |
| BigQuery | Column-native | Serverless analytics (Google) |
| Parquet | Column file format | Used by Spark, Hive, Presto |
| Apache Arrow | Column in-memory format | Zero-copy interchange between systems |
Real-Life: ClickHouse Analytics
ClickHouse is a columnar database that powers real-time analytics at Cloudflare, Uber, and eBay. It demonstrates why column storage matters.
Benchmark scenario: A table of 1 billion web analytics events with 30 columns (timestamp, url, user_agent, country, response_time, status_code, ...). Total uncompressed size: ~300 GB.
Query: "Average response time by country for the last hour":

```sql
SELECT country, AVG(response_time)
FROM events
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY country;
```
Row storage (PostgreSQL): must read all 30 columns, a ~300 GB scan. Even at SSD speeds (3 GB/s), that is 100 seconds.
Column storage (ClickHouse):
- Reads only 3 columns: timestamp (8 bytes), country (2 bytes dict-encoded), response_time (4 bytes).
- Total data: 14 bytes * 1B rows = 14 GB uncompressed.
- With compression (typically 5-10x): reads ~2 GB from disk.
- At SSD speeds: < 1 second.
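The arithmetic above as a quick back-of-envelope script. The 7x compression factor is an assumption picked from the 5-10x range stated; the other numbers come from the scenario:

```python
# Back-of-envelope I/O for the query above.
rows = 1_000_000_000
row_width = 300          # bytes per row across all 30 columns (~300 GB total)
needed = 8 + 2 + 4       # timestamp + dict-encoded country + response_time

row_scan_gb = rows * row_width / 1e9   # full-table scan (row store)
col_scan_gb = rows * needed / 1e9      # three columns only (column store)
compressed_gb = col_scan_gb / 7        # assumed ~7x compression (5-10x typical)

ssd_gbps = 3.0
print(f"row store:    {row_scan_gb:.0f} GB -> {row_scan_gb / ssd_gbps:.0f} s")
print(f"column store: {compressed_gb:.0f} GB -> {compressed_gb / ssd_gbps:.1f} s")
```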
Compression in action: The country column has ~200 distinct values for 1 billion rows. Dictionary encoding reduces each value to 1 byte. RLE after sorting by timestamp compresses further. The column goes from 2 GB to ~200 MB.
Parquet files bring the same benefits to data lakes. A Spark query over 10 TB of Parquet data with 100 columns, reading only 3 of them, scans roughly 300 GB instead of 10 TB, and Parquet's compression shrinks the actual I/O further still, making "big data" queries feasible on modest hardware.