
Aggregation & GROUP BY


How GROUP BY and Aggregation Work

Aggregation collapses multiple rows into a single summary value. SELECT COUNT(*) FROM orders counts all rows. SELECT SUM(amount) FROM orders adds up every amount. These are scalar aggregations — they produce one output row from the entire table.

GROUP BY partitions rows into groups based on one or more columns, then applies aggregate functions independently to each group:

SELECT country, COUNT(*), SUM(amount)
FROM orders
GROUP BY country;

This produces one output row per distinct country, with the count and total amount for that country.
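
The same query can be run end to end with Python's built-in sqlite3 module; the table and rows here are made-up sample data:

```python
import sqlite3

# In-memory database with a toy orders table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("US", 50), ("UK", 30), ("US", 70), ("DE", 40)])

# One output row per distinct country.
rows = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 1, 40), ('UK', 1, 30), ('US', 2, 120)]
```

Four input rows collapse into three output rows, one per distinct country.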

Aggregate Functions

Function                 Description
COUNT(*)                 Number of rows in each group
COUNT(column)            Number of non-NULL values
COUNT(DISTINCT column)   Number of unique non-NULL values
SUM(column)              Total of all non-NULL values
AVG(column)              Average (SUM / COUNT of non-NULL values)
MIN(column)              Smallest value
MAX(column)              Largest value
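
The three COUNT variants differ only in how they treat NULLs and duplicates, which a small sqlite3 experiment (with made-up data) makes concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
# Four rows: two duplicates, one distinct value, one NULL.
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,), (None,)])

star, col, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(v), COUNT(DISTINCT v) FROM t"
).fetchone()
print(star, col, distinct)  # 4 3 2
```

COUNT(*) sees all four rows, COUNT(v) skips the NULL, and COUNT(DISTINCT v) additionally collapses the duplicate 1s.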

WHERE vs HAVING

  • WHERE filters individual rows before grouping. Rows that do not pass the WHERE clause are never assigned to any group.
  • HAVING filters groups after aggregation. It operates on the aggregate results, not individual rows.
SELECT country, SUM(amount) AS total
FROM orders
WHERE status = 'completed'        -- filters rows BEFORE grouping
GROUP BY country
HAVING SUM(amount) > 10000;       -- filters groups AFTER aggregation

Only completed orders are grouped, and only countries whose completed-order total exceeds 10,000 appear in the result.
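
A small sqlite3 script (with invented rows) shows both filters at work: the refunded row never reaches a group, and the under-threshold group is discarded after aggregation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, status TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("US", "completed", 8000),
    ("US", "completed", 5000),
    ("US", "refunded", 99999),   # removed by WHERE before grouping
    ("UK", "completed", 4000),   # grouped, but total 4000 fails HAVING
])

rows = conn.execute("""
    SELECT country, SUM(amount) AS total
    FROM orders
    WHERE status = 'completed'
    GROUP BY country
    HAVING SUM(amount) > 10000
""").fetchall()
print(rows)  # [('US', 13000)]
```

The refunded 99,999 never inflates the US total because WHERE runs first; the UK group exists but is filtered out by HAVING.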

Execution: Hash Aggregate vs Sort Aggregate

Databases use two internal strategies to execute GROUP BY:

Hash Aggregate: build a hash table keyed by the GROUP BY columns. For each input row, hash the group key, find or create the group entry, and update the running aggregates (increment count, add to sum, etc.). After processing all rows, emit each hash table entry as an output row.

  • Time: O(n), one pass over the data.
  • Memory: O(g) where g = number of distinct groups. If groups do not fit in memory, the engine spills to disk.
  • Used when: groups are not pre-sorted, and the number of groups is manageable.
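
As a rough illustration, the hash-aggregate loop can be sketched in Python with a plain dict standing in for the engine's hash table (the function name and data are invented):

```python
# Minimal hash-aggregate sketch: one pass over the input,
# a dict keyed by the GROUP BY column.
def hash_aggregate(rows, key_idx, val_idx):
    groups = {}  # group key -> [count, sum]
    for row in rows:
        entry = groups.setdefault(row[key_idx], [0, 0])
        entry[0] += 1             # update running COUNT(*)
        entry[1] += row[val_idx]  # update running SUM(amount)
    # After processing all rows, emit each hash-table entry.
    return {k: tuple(v) for k, v in groups.items()}

orders = [("US", 50), ("UK", 30), ("US", 70), ("DE", 40),
          ("UK", 60), ("US", 20), ("DE", 10)]
print(hash_aggregate(orders, 0, 1))
# {'US': (3, 140), 'UK': (2, 90), 'DE': (2, 50)}
```

Memory is proportional to the number of distinct groups (three here), not the number of input rows (seven).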

Sort Aggregate: sort the input by the GROUP BY columns (or exploit an existing sort order from an index). Then scan sequentially — as long as the group key is unchanged, accumulate aggregates. When the key changes, emit the completed group and start a new one.

  • Time: O(n log n) for sorting plus O(n) for the scan. If the input already arrives sorted (e.g. from an index), the sort is skipped and the total is O(n).
  • Memory: O(1) for the scan (it only tracks the current group's running aggregates). The sort itself, if needed, may require O(n) memory or a disk-based external sort.
  • Used when: data is already sorted by the group columns, or the result must be sorted.
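
A minimal Python sketch of the same idea, with sorted() standing in for the engine's sort and invented sample rows:

```python
# Sort-aggregate sketch: sort by group key, then one streaming scan
# that only ever holds the current group's running aggregates.
def sort_aggregate(rows):
    out = []
    cur_key, cnt, total = None, 0, 0
    for key, amount in sorted(rows):  # O(n log n); skipped if pre-sorted
        if key != cur_key:
            if cur_key is not None:
                out.append((cur_key, cnt, total))  # key changed: emit group
            cur_key, cnt, total = key, 0, 0
        cnt += 1          # accumulate COUNT(*) for the current group
        total += amount   # accumulate SUM(amount) for the current group
    if cur_key is not None:
        out.append((cur_key, cnt, total))  # emit the final group
    return out

orders = [("US", 50), ("UK", 30), ("US", 70), ("DE", 40)]
print(sort_aggregate(orders))
# [('DE', 1, 40), ('UK', 1, 30), ('US', 2, 120)]
```

Because groups are emitted as soon as their key run ends, the output arrives already sorted by the group columns.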

Partial Aggregation in Parallel Execution

Modern databases parallelize aggregation: each worker thread performs a partial aggregate on its partition of data, producing a smaller intermediate result. A final merge aggregate step combines partial results. This is the same pattern MapReduce uses — "map" does partial aggregation, "reduce" merges.

For example, with 4 threads and 100 million rows: each thread aggregates 25 million rows locally (fast, cache-friendly), producing maybe 1000 partial groups each. The merge step combines 4000 partial groups into the final result — orders of magnitude less work than aggregating 100 million rows single-threaded.
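
The partial-plus-merge pattern can be sketched in Python (names and data are invented); note that SUM and COUNT merge by simple addition, while AVG must be carried through the partial step as a (sum, count) pair and divided only at the end:

```python
from collections import Counter

# Each "worker" aggregates only its own partition of the input.
def partial_agg(partition):
    sums, counts = Counter(), Counter()
    for key, amount in partition:
        sums[key] += amount
        counts[key] += 1
    return sums, counts

# The merge step combines the much smaller partial results.
def merge(partials):
    total_sums, total_counts = Counter(), Counter()
    for sums, counts in partials:
        total_sums += sums       # SUM merges by adding partial sums
        total_counts += counts   # COUNT merges by adding partial counts
    return total_sums, total_counts

thread1 = [("US", 50), ("US", 40), ("UK", 30)]
thread2 = [("US", 50), ("UK", 60), ("DE", 50)]
sums, counts = merge([partial_agg(thread1), partial_agg(thread2)])
print(dict(sums))  # {'US': 140, 'UK': 90, 'DE': 50}
```

In a real engine the two partial_agg calls would run concurrently on separate threads; the sequential version here only shows why the merge is cheap: it touches partial groups, not input rows.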

Real-Life: Analytics Dashboard Query

Real-World Example

An e-commerce dashboard shows daily revenue by product category for the last 30 days, excluding refunded orders:

SELECT category, DATE(order_time) AS day, SUM(amount) AS revenue, COUNT(*) AS orders
FROM orders
WHERE order_time >= NOW() - INTERVAL '30 days' AND status != 'refunded'
GROUP BY category, DATE(order_time)
HAVING SUM(amount) > 100
ORDER BY day DESC, revenue DESC;

Execution plan (PostgreSQL with EXPLAIN ANALYZE):

  1. Index scan on orders(order_time) to filter the last 30 days.
  2. Filter to remove refunded orders.
  3. Hash Aggregate on (category, date) — builds a hash table with ~(30 days * 50 categories) = 1500 groups. Each entry tracks running SUM and COUNT.
  4. Filter (HAVING) to remove groups with revenue <= 100.
  5. Sort by day DESC, revenue DESC.

The Hash Aggregate step processes millions of qualifying rows but produces only ~1500 output rows. This dramatic reduction is what makes aggregation so powerful for analytics.

COUNT(DISTINCT) caveat: COUNT(DISTINCT customer_id) per category is expensive because the engine must track every distinct value per group. For approximate results, databases offer HyperLogLog (PostgreSQL's hll extension) which estimates distinct counts with ~2% error using just 1.2 KB per group.

Hash Aggregate Execution

Worked example: SELECT country, SUM(amount), COUNT(*) FROM orders GROUP BY country.

  • Input rows (country | amount): US|50, UK|30, US|70, DE|40, UK|60, US|20, DE|10.
  • For each row: hash the group key, find or create its entry, and update the running SUM and COUNT.
  • Hash table (group → aggregates): h("US") → sum=140, cnt=3; h("UK") → sum=90, cnt=2; h("DE") → sum=50, cnt=2.
  • Output: US | 140 | 3, UK | 90 | 2, DE | 50 | 2. Seven input rows collapse into three output rows.

Parallel aggregation (2 threads): thread 1 produces partials US:90, UK:30; thread 2 produces US:50, UK:60, DE:50. The merge step combines them into US:140, UK:90, DE:50, giving the same result roughly 2x faster.