Columnar File Formats for Analytics
Apache Parquet and Apache ORC are columnar file formats designed for analytical workloads. Unlike row-oriented formats (CSV, JSON, traditional database pages) where all columns of a row are stored together, columnar formats store all values of a single column contiguously. This is transformative for analytics queries that typically read a small number of columns from tables with dozens or hundreds of columns.
File Structure: Row Groups, Column Chunks, Pages
A Parquet file is organized hierarchically:
- Row Groups (~128 MB each): The file is divided into horizontal partitions. Each row group contains a chunk of rows (e.g., 1 million rows). Row groups enable parallel processing — different threads or nodes can process different row groups independently.
- Column Chunks: Within each row group, data for each column is stored in a separate column chunk. A row group with 50 columns has 50 column chunks. Each column chunk is a contiguous block of bytes on disk.
- Pages (~1 MB each): Each column chunk is further divided into pages. A page is the unit of compression and encoding. Pages also store statistics (minimum value, maximum value, null count) that enable predicate pushdown.
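The hierarchy above can be sketched with back-of-the-envelope arithmetic. This is an illustrative estimate, not a real Parquet reader; it assumes equal-width columns and uses the rough sizes quoted above (~128 MB row groups, ~1 MB pages).

```python
# Illustrative sketch: how the row-group / column-chunk / page hierarchy
# partitions a file, assuming equal-width columns (a simplification).
def layout_estimate(file_bytes, num_columns,
                    row_group_bytes=128 * 2**20, page_bytes=2**20):
    row_groups = -(-file_bytes // row_group_bytes)    # ceiling division
    column_chunks = row_groups * num_columns          # one chunk per column per row group
    chunk_bytes = row_group_bytes // num_columns      # bytes per column chunk
    pages_per_chunk = max(1, -(-chunk_bytes // page_bytes))
    return row_groups, column_chunks, column_chunks * pages_per_chunk

# A 1 GiB file with 50 columns:
rg, chunks, pages = layout_estimate(2**30, 50)
```

For this hypothetical file the estimate gives 8 row groups and 400 column chunks, with a few pages per chunk.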
Encoding Techniques
Parquet uses specialized encodings to achieve high compression ratios:
- Dictionary Encoding: For columns with low cardinality (e.g., country codes), replace each value with an integer ID. A dictionary page maps IDs back to values. Instead of storing "United States" a million times, store integer 42 a million times.
- Run-Length Encoding (RLE): When the same value repeats consecutively (common after sorting), store the value once along with the count. "AAAAAABBB" becomes "(A,6)(B,3)".
- Delta Encoding: For sorted or incrementing values (timestamps, auto-increment IDs), store the difference between consecutive values. Timestamps increasing by 1 second each become a sequence of 1s, which RLE compresses further.
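Toy versions of the three encodings make the ideas concrete. Real Parquet encoders are bit-packed and considerably more elaborate, but the transformations are the same:

```python
# Toy implementations of the three encodings described above.

def dictionary_encode(values):
    """Replace each value with a small integer ID plus a dictionary page."""
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def rle_encode(values):
    """Collapse consecutive runs into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def delta_encode(values):
    """Store the first value, then differences between neighbors."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]
```

For example, rle_encode("AAAAAABBB") collapses to [("A", 6), ("B", 3)], matching the example above, and delta_encode on one-second timestamps yields the first timestamp followed by a run of 1s.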
Predicate Pushdown and Page Skipping
Each page stores min/max statistics. When executing WHERE timestamp > '2024-01-01', the reader checks the max value of each page. If the max is below the threshold, the entire page is skipped without decompression. For sorted data, this can skip 90%+ of pages.
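A minimal sketch of the skipping logic: given per-page (min, max) statistics, a reader can rule out any page whose max cannot satisfy the predicate, without decompressing it.

```python
# Predicate pushdown sketch: decide which pages of a column chunk must
# be read for WHERE value > threshold, using only page statistics.
def pages_to_read(page_stats, threshold):
    """page_stats: list of (min_value, max_value) per page."""
    return [i for i, (lo, hi) in enumerate(page_stats) if hi > threshold]

# Four pages of sorted data; the first two are skipped entirely.
stats = [(0, 10), (10, 20), (20, 30), (30, 40)]
surviving = pages_to_read(stats, 25)
```

Because the data in this example is sorted, only the last two pages survive, mirroring the "skip 90%+ of pages" effect described above.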
Dremel Encoding for Nested Data
Parquet supports complex nested structures (arrays, maps, structs) using the Dremel encoding from Google's Dremel paper. Each value gets two integers: a repetition level (which repeated field this value belongs to) and a definition level (which optional fields are null vs. present). This flattens arbitrarily nested data into columnar format without losing structure.
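A minimal sketch of the idea for the simplest nested case: a column whose type is an optional list of required integers. The level numbering here follows Parquet's convention for that shape (definition level 0 = list is null, 1 = list is empty, 2 = value present; repetition level 0 starts a new record, 1 continues the current list) but is simplified relative to the general multi-level scheme.

```python
# Simplified Dremel-style shredding for one column: optional list of
# required ints. Each emitted triple is (value, repetition, definition).
def shred(rows):
    triples = []
    for row in rows:
        if row is None:
            triples.append((None, 0, 0))   # the list itself is null
        elif not row:
            triples.append((None, 0, 1))   # list present but empty
        else:
            for i, v in enumerate(row):
                # rep=0 begins a new record; rep=1 continues its list
                triples.append((v, 0 if i == 0 else 1, 2))
    return triples

encoded = shred([[1, 2], [], None, [3]])
```

The triples preserve enough information to reassemble the original rows, which is how nested data survives the flattening into columnar form.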
Ecosystem
Parquet is the de facto standard in data lake architectures. Apache Spark, Apache Hive, Presto/Trino, DuckDB, Snowflake, and BigQuery all read and write Parquet natively. ORC is similar but originated in the Hive ecosystem and uses a slightly different structure (stripes instead of row groups).
Column Pruning and Page Skipping in Action
Consider a table events with 20 columns and 1 billion rows stored as Parquet:
Query: SELECT user_id, event_type FROM events WHERE timestamp > '2024-06-01'
Row-oriented format (CSV/JSON):
- Must read ALL 20 columns for every row: ~200 GB of I/O
- Must decompress and parse every field, even the 18 unused columns
Parquet columnar format:
- Column Pruning: Only reads 3 column chunks per row group: user_id, event_type, and timestamp. The other 17 columns are never touched. I/O drops to ~30 GB.
- Page Skipping: The timestamp column chunk has page statistics. Pages where max(timestamp) < '2024-06-01' are skipped entirely. If the data is sorted by timestamp, only the last ~20% of pages need reading. I/O drops to ~6 GB.
- Dictionary Encoding: The event_type column has 50 distinct values. Dictionary encoding compresses it from ~15 bytes/row (string) to ~1 byte/row (integer ID). Column size drops from ~15 GB to ~1 GB.
Net result: Instead of reading 200 GB, Parquet reads ~3-6 GB. With compression, actual I/O may be under 1 GB. This is why analytical queries on Parquet are 10-100x faster than on CSV.
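The arithmetic behind those figures is simple enough to check directly, using the example's rough assumption of ~10 bytes per column per row:

```python
# Back-of-the-envelope check of the I/O figures in the example above.
# All numbers are the example's hypothetical inputs, not measurements.
rows, cols, bytes_per_col = 1_000_000_000, 20, 10

row_oriented_io = rows * cols * bytes_per_col   # CSV/JSON must scan everything
pruned_io = rows * 3 * bytes_per_col            # read 3 of 20 columns
skipped_io = int(pruned_io * 0.20)              # only last 20% of pages survive
```

This reproduces the ~200 GB, ~30 GB, and ~6 GB figures before compression is even considered.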
Real-world example: A data lake on S3 stores 5 TB of click data in Parquet format. A query reading 2 columns with a date filter reads only 50 GB from S3, saving both time and cloud egress costs.