Back to DAG

Subquery & CTE

databases

Subqueries and Common Table Expressions

Real-world SQL queries often need intermediate results — "find orders whose amount is above the average" or "show employees whose department has more than 10 people." Subqueries and CTEs let you compose complex queries from simpler building blocks.

Scalar Subquery

A subquery that returns a single value (one row, one column). It can appear in SELECT, WHERE, or HAVING:

SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

The inner query computes the average salary once, and the outer query uses it as a threshold. The database evaluates the scalar subquery once and substitutes the result.

Correlated Subquery

A subquery that references columns from the outer query. Unlike a scalar subquery, it is conceptually re-executed for every row of the outer query:

SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
  SELECT AVG(e2.salary) FROM employees e2 WHERE e2.department = e.department
);

This finds employees earning above their department's average. The inner query depends on e.department from the outer row, so it produces a different result per outer row. Correlated subqueries can be expensive — O(n * cost_of_inner) — but the optimizer often rewrites them.

EXISTS vs IN

Both test membership, but behave differently:

  • IN: WHERE id IN (SELECT customer_id FROM orders) — the subquery returns a list of values, and the outer row matches if its value is in that list. Computes the full subquery result first.
  • EXISTS: WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id) — a correlated subquery that returns TRUE as soon as any matching row is found (short-circuits). Often faster than IN when the subquery would return many rows, because EXISTS can stop at the first match.

Subquery Flattening / Unnesting

A naive engine would execute correlated subqueries row by row. Modern optimizers flatten (unnest) subqueries into joins:

WHERE id IN (SELECT customer_id FROM orders)

becomes internally:

JOIN (SELECT DISTINCT customer_id FROM orders) sub ON id = sub.customer_id

This converts an O(n * m) correlated scan into an O(n + m) hash join. PostgreSQL calls this "subquery scan → hash semi join."

CTE (Common Table Expression)

A CTE defines a named temporary result using the WITH clause:

WITH high_value_orders AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 10000
)
SELECT c.name, h.total
FROM customers c
JOIN high_value_orders h ON c.id = h.customer_id;

CTEs improve readability by breaking complex queries into logical steps. Multiple CTEs can be chained, and later CTEs can reference earlier ones.

CTE Materialization vs Inlining

The optimizer must decide: materialize the CTE (compute it once, store the result temporarily) or inline it (substitute the CTE definition into each reference, optimizing as if it were a subquery).

  • Materialization: good when the CTE is referenced multiple times or the result set is small.
  • Inlining: good when the optimizer can push predicates from the outer query into the CTE for better index usage.

PostgreSQL 12+ defaults to inlining CTEs that are referenced only once, and materializing those referenced multiple times. You can force behavior with AS MATERIALIZED or AS NOT MATERIALIZED.

Recursive CTE

WITH RECURSIVE enables queries over hierarchical or graph data (org charts, category trees, bill of materials):

WITH RECURSIVE org_tree AS (
  -- Base case: the CEO
  SELECT id, name, manager_id, 1 AS depth
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  -- Recursive case: employees reporting to someone in org_tree
  SELECT e.id, e.name, e.manager_id, t.depth + 1
  FROM employees e
  JOIN org_tree t ON e.manager_id = t.id
)
SELECT * FROM org_tree ORDER BY depth, name;

The engine starts with the base case, then repeatedly executes the recursive step until no new rows are produced. This replaces what would otherwise require application-level loops or multiple queries.

Real-Life: Building a Report with CTEs

Real-World Example

An analytics team needs a report: "For each product category, show the category name, total revenue, the number of unique customers, and the percentage of total company revenue."

Without CTEs — one massive query with repeated subqueries:

SELECT category,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers,
  ROUND(100.0 * SUM(amount) / (SELECT SUM(amount) FROM orders), 2) AS pct_of_total
FROM orders GROUP BY category;

The scalar subquery (SELECT SUM(amount) FROM orders) is re-evaluated conceptually per group (though the optimizer likely caches it).

With CTEs — clean, readable, reusable:

WITH total AS (
  SELECT SUM(amount) AS company_total FROM orders
),
by_category AS (
  SELECT category, SUM(amount) AS revenue, COUNT(DISTINCT customer_id) AS unique_customers
  FROM orders GROUP BY category
)
SELECT bc.category, bc.revenue, bc.unique_customers,
  ROUND(100.0 * bc.revenue / t.company_total, 2) AS pct_of_total
FROM by_category bc CROSS JOIN total t
ORDER BY bc.revenue DESC;

Recursive CTE example — Bill of Materials: a product is assembled from sub-components, which are assembled from smaller parts. "What are all the parts needed to build product X?"

WITH RECURSIVE bom AS (
  SELECT part_id, component_id, quantity FROM assemblies WHERE part_id = 'X'
  UNION ALL
  SELECT a.part_id, a.component_id, a.quantity * bom.quantity
  FROM assemblies a JOIN bom ON a.part_id = bom.component_id
)
SELECT component_id, SUM(quantity) AS total_needed FROM bom GROUP BY component_id;

This recursively expands the assembly tree and computes the total quantity of each leaf component needed.

Correlated Subquery vs Flattened Join

Naive: Correlated Subquery WHERE id IN (SELECT cid FROM orders WHERE orders.cid = c.id) customers row 1 → row 2 → row 3 → ... n rows scan orders (m rows) scan orders (m rows) scan orders (m rows) Cost: O(n * m) Inner query re-executes for each outer row Optimized: Flattened to Hash Semi Join customers JOIN (SELECT DISTINCT cid FROM orders) ON id = cid orders (m rows, one pass) hash table distinct cids customers (n rows, one pass) probe Cost: O(n + m) Each table scanned exactly once Recursive CTE: org chart traversal CEO depth 1 (base case) VP Eng VP Sales depth 2 (iteration 1) Dev1 Dev2 depth 3 (iteration 2) ... until empty WITH RECURSIVE expands level by level until no new rows
Step 1 of 2