
Data Lake & Lakehouse (Iceberg, Delta)


From Data Lakes to Lakehouses

A data lake stores raw data in its original format on cheap, scalable object storage (Amazon S3, Google Cloud Storage, Azure Blob). Unlike a data warehouse, which requires data to conform to a schema before loading (schema-on-write), a data lake stores data first and applies schema when reading (schema-on-read). This flexibility enables storing structured (CSV, Parquet), semi-structured (JSON, Avro), and unstructured data (images, logs) in one place.
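The schema-on-read idea can be sketched in a few lines of plain Python: raw records land in the lake exactly as produced, and a schema is applied only when a reader consumes them. The data and field names here are purely illustrative.

```python
import json

# Hypothetical raw events, landed in the lake exactly as produced.
raw_records = [
    '{"user_id": "42", "event": "play", "ts": "2024-01-01T00:00:00"}',
    '{"user_id": "7", "event": "pause"}',  # missing ts -- the lake accepts it anyway
]

def read_with_schema(lines, schema):
    """Schema-on-read: the schema is applied only when data is consumed."""
    for line in lines:
        rec = json.loads(line)
        # Cast each field per the reader's schema; missing fields become None.
        yield {col: cast(rec.get(col)) for col, cast in schema.items()}

schema = {
    "user_id": lambda v: int(v) if v is not None else None,
    "event": lambda v: v,
    "ts": lambda v: v,  # left as a string in this sketch
}

rows = list(read_with_schema(raw_records, schema))
```

A warehouse would have rejected the second record at load time (schema-on-write); the lake stores it and leaves the interpretation to each reader.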

The Data Lake Problem

Early data lakes stored files as plain Parquet or CSV in S3 directories. This worked for append-only batch processing but lacked critical features:

  • No ACID transactions: Concurrent writers could corrupt data. A failed write could leave partial files.
  • No schema enforcement: Anyone could write data in any format, leading to "data swamps" with inconsistent, untrustworthy data.
  • No time travel: Once you overwrite data, the old version is gone.
  • Expensive metadata operations: Listing files in an S3 directory with millions of files is slow. There was no efficient way to answer "which files contain data for partition X?"
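The first problem above, a writer failing mid-batch, can be made concrete with a toy model: readers that discover data by listing a directory will happily see a half-written table. All names and data here are illustrative.

```python
# Toy model of the classic data-lake failure mode. A dict stands in for
# an S3 prefix: path -> file contents.
lake = {}

def write_batch(files, crash_after=None):
    """Write files one by one; optionally crash partway through."""
    for i, (path, data) in enumerate(files):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("writer crashed mid-batch")
        lake[path] = data

def read_table():
    # Plain data lakes find data files by listing the directory,
    # so whatever files exist right now ARE the table.
    return [lake[p] for p in sorted(lake)]

try:
    write_batch([("part-0.parquet", [1, 2]), ("part-1.parquet", [3, 4])],
                crash_after=1)
except RuntimeError:
    pass

# Readers now see a partial, inconsistent table: only part-0 landed.
```

Table formats fix this by making the commit a single atomic metadata operation, so readers never observe the intermediate file states.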

The Lakehouse: Best of Both Worlds

A lakehouse adds a table format layer on top of object storage that provides ACID transactions, schema enforcement, and efficient metadata management—while keeping the data in open formats on cheap storage. Three major table formats have emerged:

Apache Iceberg

Developed at Netflix, now an Apache project. Key features:

  • Snapshot isolation: Each write creates a new snapshot. Readers always see a consistent snapshot. No partial reads.
  • Schema evolution: Add, rename, drop, or reorder columns without rewriting data files.
  • Hidden partitioning: Define partition transforms (e.g., day(timestamp)) in metadata. Queries automatically prune partitions without users needing to know the partition scheme.
  • Time travel: Query any historical snapshot by ID or timestamp. Roll back to a previous state.
  • Manifest files: A metadata tree (manifest list -> manifests -> data files) enables efficient partition pruning without listing S3 directories.
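The snapshot mechanism above can be sketched as a toy model (this is not the real Iceberg metadata layout, just the idea): each commit adds data files plus a new snapshot referencing the full set of live files, and the "commit" itself is an atomic pointer swap. Readers pin a snapshot, so they never see a half-finished write, and old snapshots remain queryable.

```python
# Minimal sketch of Iceberg-style snapshot isolation (toy model).
data_files = {}   # filename -> rows (stands in for Parquet on S3)
snapshots = []    # append-only snapshot log
current = None    # pointer to the current snapshot id

def commit(new_files):
    """Write new data files and publish a new snapshot atomically."""
    global current
    data_files.update(new_files)
    live = (snapshots[-1]["files"] if snapshots else []) + list(new_files)
    snap_id = len(snapshots) + 1
    snapshots.append({"id": snap_id, "files": live})
    current = snap_id  # the atomic pointer swap IS the commit

def scan(snapshot_id=None):
    """Read a consistent snapshot; pass an id for time travel."""
    snap = snapshots[(snapshot_id or current) - 1]
    return [row for f in snap["files"] for row in data_files[f]]

commit({"f1.parquet": [1, 2]})
commit({"f2.parquet": [3, 4]})
assert scan() == [1, 2, 3, 4]
assert scan(snapshot_id=1) == [1, 2]  # time travel to snapshot 1
current = 1                           # rollback = move the pointer back
```

In real Iceberg the snapshot points at a manifest list, which points at manifests, which list data files with partition statistics; that tree is what lets the planner prune files without ever listing S3.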

Delta Lake

Created by Databricks. Key features:

  • Transaction log: A JSON-based log in the _delta_log/ directory records every change as "add" and "remove" file actions. Optimistic concurrency control resolves conflicts between concurrent writers.
  • MERGE support: SQL MERGE (upsert) operations on data lake tables, critical for CDC ingestion.
  • Z-ordering: Co-locates related data within files so queries can skip files along multiple dimensions at once.
  • Liquid clustering: Automatic data layout optimization without manual partitioning or Z-order tuning.
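The transaction-log idea can be illustrated with a toy log replay (illustrative only; real _delta_log entries carry many more fields, plus periodic checkpoints): the table's current state is simply the result of replaying add/remove actions in order.

```python
import json

# Sketch of replaying a Delta-style transaction log. Each entry is a
# JSON action; the set of live data files falls out of the replay.
delta_log = [
    json.dumps({"add": {"path": "part-0.parquet"}}),
    json.dumps({"add": {"path": "part-1.parquet"}}),
    json.dumps({"remove": {"path": "part-0.parquet"}}),  # e.g. compaction or MERGE
    json.dumps({"add": {"path": "part-2.parquet"}}),
]

def live_files(log):
    """Replay the log in order to compute the current set of data files."""
    files = set()
    for entry in log:
        action = json.loads(entry)
        if "add" in action:
            files.add(action["add"]["path"])
        if "remove" in action:
            files.discard(action["remove"]["path"])
    return sorted(files)
```

Because writers append numbered log entries and readers replay them, two writers that both try to commit version N conflict visibly, which is what makes optimistic concurrency control possible.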

Apache Hudi

Created at Uber. Key features:

  • Upsert-optimized: Designed for incremental data ingestion (e.g., CDC streams).
  • Copy-on-Write vs Merge-on-Read: Two table types trading write cost against read performance. Copy-on-Write rewrites affected files on each update (slower writes, fast reads); Merge-on-Read appends delta files and merges them at query time (fast writes, slower reads).
  • Incremental queries: Read only the data that changed since a given commit, enabling efficient incremental ETL.
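The incremental-query idea can be sketched with plain Python (this is not Hudi's API, just the concept): every record carries the commit that wrote it, so downstream ETL can ask for "everything since commit N" instead of rescanning the whole table.

```python
# Toy sketch of Hudi-style incremental reads. Data is illustrative.
table = [
    {"key": "ride-1", "fare": 10.0, "commit": 1},
    {"key": "ride-2", "fare": 7.5,  "commit": 1},
    {"key": "ride-1", "fare": 12.0, "commit": 2},  # upsert: fare corrected
]

def incremental_read(records, since_commit):
    """Return only records written after the given commit."""
    return [r for r in records if r["commit"] > since_commit]

def snapshot_read(records):
    """Latest version of each key -- what a normal query sees."""
    latest = {}
    for r in sorted(records, key=lambda r: r["commit"]):
        latest[r["key"]] = r  # later commits overwrite earlier ones
    return list(latest.values())
```

An incremental ETL job checkpoints the last commit it processed and, on the next run, reads only the delta, which is far cheaper than a full table scan.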

Object Storage as the New "Disk"

The lakehouse paradigm treats S3/GCS as the storage layer (replacing HDFS), with compute engines (Spark, Trino, Flink) accessing data through the table format. This separates storage from compute, enabling independent scaling and cost optimization.

Real-Life: Table Formats at Netflix, Apple, Databricks, and Uber


Netflix (Iceberg's origin):

  • Netflix processes petabytes of data daily across tens of thousands of tables.
  • Before Iceberg, their Hive-based data lake suffered from consistency issues (concurrent writes corrupted tables), slow metadata operations (listing millions of S3 files), and no schema evolution (any column change required rewriting the entire table).
  • Iceberg solved all three: snapshot isolation for consistency, manifest files for fast metadata, and schema evolution for agility.
  • Netflix reports that Iceberg reduced planning time for large queries from minutes to seconds.

Apple:

  • Apple is one of the largest Iceberg deployments, reportedly managing exabyte-scale data.
  • They contributed significantly to Iceberg's development, especially around metadata management and performance optimization.

Databricks + Delta Lake:

  • Databricks' default table format. Powers millions of queries per day across their customer base.
  • Unity Catalog provides governance (access control, lineage, auditing) on top of Delta tables.
  • Delta Sharing: open protocol for securely sharing Delta tables across organizations without copying data.

Uber + Hudi:

  • Uber created Hudi to handle their massive CDC workload: hundreds of microservices writing to a shared data lake.
  • Hudi's upsert capability enables Uber to incrementally update their analytics tables as rides complete, payments process, and driver locations update—all in near real-time.

Lakehouse Architecture with Iceberg

[Diagram: lakehouse architecture] Three layers: compute engines (Spark, Trino, Flink, Snowflake), separated from storage; a table format layer (Iceberg, Delta, Hudi) providing ACID, schema, and time travel, with metadata organized as snapshot -> manifest list -> manifests -> data files; and object storage (S3 / GCS / Azure Blob) holding Parquet data files and Avro metadata in open, cheap, scalable formats.

[Diagram: Iceberg snapshot isolation] Snapshot 1 -> Snapshot 2 -> Snapshot 3 (current). Time travel queries any snapshot; rollback reverts to Snapshot 1.

Data lake: schema-on-read (flexible, raw data). Lakehouse: schema-on-write with evolution (reliable + flexible).