What is a Data Warehouse?
A data warehouse is a system optimized for analytical queries on large volumes of historical data. Unlike OLTP databases that handle individual transactions (insert one order, update one user), data warehouses process complex queries that scan millions or billions of rows to compute aggregations, trends, and reports.
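The contrast can be sketched with two queries against the same table — a point lookup (OLTP-style) versus a full-scan aggregation (warehouse-style). This is a minimal illustration using Python's built-in `sqlite3`; the `orders` table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical orders table, used to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "EU" if i % 2 else "US", 10.0 * i) for i in range(1000)],
)

# OLTP-style: touch a single row by key.
one = conn.execute("SELECT revenue FROM orders WHERE order_id = 42").fetchone()

# Warehouse-style: scan every row to compute an aggregate.
totals = dict(conn.execute("SELECT region, SUM(revenue) FROM orders GROUP BY region"))
print(one, totals)
```

A real warehouse runs the second kind of query over billions of rows, which is why its storage layout and execution engine look so different from an OLTP database's.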
Star Schema
The most common data warehouse schema is the star schema:
- Fact table: The central table containing measurable events or transactions. Each row is an event: a sale, a click, a shipment. Fact tables are typically very large (billions of rows) and contain numeric measures (revenue, quantity, duration) plus foreign keys to dimension tables.
- Dimension tables: Surrounding tables that describe the "who, what, when, where" of each event. Examples:
dim_customer (customer name, address, segment), dim_product (product name, category, brand), dim_date (day, month, quarter, holiday flag).
Queries join the fact table with dimension tables: "total revenue by product category by quarter." The star schema denormalizes dimensions for query simplicity.
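The "total revenue by product category by quarter" query above can be sketched end to end with `sqlite3`. The table and column names follow the examples in this section but the rows are made up:

```python
import sqlite3

# Minimal star schema: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, revenue REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(10, 2024, 1), (11, 2024, 2)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 1200.0), (1, 11, 800.0), (2, 10, 300.0)])

# Total revenue by product category by quarter: fact joined to two dimensions.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id    = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
print(rows)
```

Note that every analytical question follows the same shape: join the fact table to whichever dimensions the question mentions, then group and aggregate.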
Snowflake Schema
A variation where dimension tables are further normalized. For example, dim_product has a foreign key to dim_category, which has a foreign key to dim_department. This reduces data redundancy but makes queries more complex (more joins).
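The extra joins are easy to see in code. A sketch of the snowflaked dim_product → dim_category → dim_department chain described above, with illustrative rows:

```python
import sqlite3

# Snowflaked dimension: category and department split into their own tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_department (department_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category   (category_id INTEGER PRIMARY KEY, name TEXT,
                             department_id INTEGER REFERENCES dim_department);
CREATE TABLE dim_product    (product_id INTEGER PRIMARY KEY, name TEXT,
                             category_id INTEGER REFERENCES dim_category);
""")
conn.execute("INSERT INTO dim_department VALUES (1, 'Home')")
conn.execute("INSERT INTO dim_category VALUES (7, 'Furniture', 1)")
conn.execute("INSERT INTO dim_product VALUES (2, 'Desk', 7)")

# Reading a product with its category and department now costs two extra joins,
# which a star schema would avoid by storing all three names in dim_product.
row = conn.execute("""
    SELECT p.name, c.name, dep.name
    FROM dim_product p
    JOIN dim_category c     ON c.category_id = p.category_id
    JOIN dim_department dep ON dep.department_id = c.department_id
""").fetchone()
print(row)
```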
ETL: Extract, Transform, Load
Data flows from OLTP systems to the warehouse via ETL:
- Extract: Pull data from source systems (databases, APIs, files).
- Transform: Clean, deduplicate, conform to warehouse schema, compute derived fields.
- Load: Write transformed data into the warehouse tables.
ETL runs periodically (hourly, daily) or continuously (streaming ETL / ELT). Modern data warehouses often prefer ELT: load raw data first, then transform inside the warehouse using SQL, leveraging the warehouse's massive compute power.
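The ELT pattern can be sketched in a few lines: land raw extracted rows in a staging table untouched, then do the transform with SQL inside the warehouse. Table names, the sample rows, and the cleaning rules are all illustrative:

```python
import sqlite3

# ELT sketch: load raw data first, transform inside the warehouse with SQL.
warehouse = sqlite3.connect(":memory:")

# Extract: rows pulled from a source system, duplicates and messy values included.
extracted = [
    ("o-1", "alice@example.com ", 100.0),
    ("o-1", "alice@example.com ", 100.0),   # duplicate from a retried export
    ("o-2", "BOB@EXAMPLE.COM", 250.0),
]

# Load: land the raw rows untouched in a staging table.
warehouse.execute("CREATE TABLE stg_orders (order_id TEXT, email TEXT, amount REAL)")
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", extracted)

# Transform: dedupe and conform (trim and lowercase emails) using the
# warehouse's own SQL engine rather than an external processing tier.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT DISTINCT order_id, LOWER(TRIM(email)) AS email, amount
    FROM stg_orders
""")
clean = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(clean)
```

In classic ETL the dedupe/conform step would run outside the warehouse before loading; ELT simply reorders the last two steps to exploit the warehouse's compute.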
Modern Cloud Data Warehouses
Snowflake: Separates storage (on S3/GCS/Azure Blob) from compute (virtual warehouses). Multiple compute clusters can query the same data independently. Auto-scales compute up and down. Pay per second of compute used.
BigQuery: Serverless, no cluster management. Uses a slot-based execution model: each query is assigned a number of "slots" (units of parallel computation). Columnar storage (Capacitor format) with automatic partitioning. Supports nested/repeated fields natively (STRUCT, ARRAY).
Amazon Redshift: Based on PostgreSQL, uses a shared-nothing architecture. Data is distributed across compute nodes using distribution keys. Columnar storage with automatic compression.
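Key-based distribution in a shared-nothing cluster amounts to hashing the distribution key to pick a node, so that rows sharing a key are co-located and can be joined without moving data. A simplified sketch (the hash choice and node count are illustrative, not Redshift's actual scheme):

```python
import hashlib

# Sketch of key-based distribution: rows with the same distribution key
# always land on the same node, enabling local (shuffle-free) joins.
NUM_NODES = 4

def node_for(dist_key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a distribution key deterministically to a node index."""
    digest = hashlib.sha256(dist_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# The same key always maps to the same node...
assert node_for("customer-42") == node_for("customer-42")

# ...and many distinct keys spread across all nodes.
nodes = {node_for(f"customer-{i}") for i in range(1000)}
print(sorted(nodes))
```

Distributing both a fact table and a dimension table on the same key lets each node join its local slices independently, which is why choosing distribution keys well matters so much for join-heavy workloads.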
Real-Life: Snowflake Architecture
Snowflake's three-layer architecture is a landmark in data warehouse design:
1. Storage Layer (Cloud Object Storage)
- Data is stored in a proprietary columnar format on S3/GCS/Azure Blob.
- Immutable micro-partitions: each micro-partition holds 50–500 MB of uncompressed data, stored compressed in columnar form.
- Metadata catalog tracks min/max values per column per micro-partition, enabling partition pruning.
2. Compute Layer (Virtual Warehouses)
- Virtual warehouses are clusters of compute nodes. Each warehouse caches data locally in SSD.
- Multiple warehouses can read the same data simultaneously (e.g., one for the dashboard team, one for the data science team).
- Warehouses auto-suspend after inactivity and auto-resume on query arrival.
3. Cloud Services Layer
- Handles authentication, query optimization, metadata management, and transaction coordination.
- The query optimizer uses micro-partition metadata to prune irrelevant partitions before scanning.
Example query flow:
- Analyst submits:
SELECT region, SUM(revenue) FROM fact_sales JOIN dim_date USING (date_id) WHERE year = 2024 GROUP BY region
- Cloud services layer parses, optimizes, and prunes micro-partitions to only those containing year = 2024 data.
- Virtual warehouse nodes read only the region and revenue columns (columnar skip).
- Parallel aggregation runs across nodes; results are returned in seconds.
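The pruning step in this flow relies only on the per-partition min/max metadata from the storage layer. A toy sketch (partition contents and ranges are made up):

```python
# Sketch of min/max-based micro-partition pruning. The metadata catalog stores
# a [min, max] range per column per partition; illustrative values below.
partitions = [
    {"id": 0, "year_min": 2022, "year_max": 2022},
    {"id": 1, "year_min": 2022, "year_max": 2023},
    {"id": 2, "year_min": 2023, "year_max": 2024},
    {"id": 3, "year_min": 2024, "year_max": 2024},
]

def prune(parts, lo, hi):
    """Keep only partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return [p["id"] for p in parts if p["year_max"] >= lo and p["year_min"] <= hi]

# WHERE year = 2024: partitions 0 and 1 are skipped without reading any data.
survivors = prune(partitions, 2024, 2024)
print(survivors)
```

Because pruning is a pure metadata operation, it happens in the cloud services layer before any virtual warehouse touches storage.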