Query Planner & Cost Model

How the Query Planner Chooses an Execution Plan

When you submit a SQL query, the database does not execute it literally. The query planner (also called the optimizer) transforms the SQL into an execution plan: a tree of physical operators that specifies exactly how to retrieve and process the data. The planner enumerates many possible plans and selects the one with the lowest estimated cost.

Plan Node Types

The execution plan is a tree of plan nodes, each representing a physical operation:

Scan Operators (leaf nodes — read data from tables):

  • SeqScan: Sequential scan reads every page of the table. Cheapest when the query needs a large fraction of the table.
  • IndexScan: Uses a B-tree index to look up specific rows. Fetches heap tuples one by one. Best for highly selective queries.
  • IndexOnlyScan: Satisfies the query entirely from the index, never touching the heap. Only works if all needed columns are in the index.
  • BitmapScan: Two-phase scan. First, a BitmapIndexScan builds a bitmap of matching page locations. Then, a BitmapHeapScan fetches those pages in physical order. Ideal for medium selectivity — too many rows for IndexScan, too few for SeqScan.
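
These trade-offs can be made concrete with a toy cost model. The sketch below is loosely patterned on how PostgreSQL charges for page fetches, using the default cost constants introduced later in "The Cost Model"; the interpolation formula and the 100,000-row table are assumptions for illustration, not the planner's exact math. The idea: a plain IndexScan pays the full random-I/O price per matching row, while a BitmapHeapScan fetches each needed page once, in physical order, so its per-page price slides from random toward sequential as more of the table is touched.

```python
import math

# PostgreSQL's default cost constants (see "The Cost Model" below).
SEQ_PAGE_COST, RANDOM_PAGE_COST, CPU_TUPLE_COST = 1.0, 4.0, 0.01

def seqscan_cost(n_pages, n_tuples):
    # Read every page sequentially, then pay CPU cost per tuple.
    return n_pages * SEQ_PAGE_COST + n_tuples * CPU_TUPLE_COST

def indexscan_cost(matching_rows):
    # Simplified: one random heap-page fetch per matching row.
    return matching_rows * RANDOM_PAGE_COST + matching_rows * CPU_TUPLE_COST

def bitmapscan_cost(matching_rows, n_pages):
    # Each needed page is fetched once, in physical order; charge a
    # per-page price that slides from random toward sequential as the
    # fraction of the table touched grows (toy interpolation).
    pages = min(matching_rows, n_pages)
    frac = pages / n_pages
    per_page = RANDOM_PAGE_COST - (RANDOM_PAGE_COST - SEQ_PAGE_COST) * math.sqrt(frac)
    return pages * per_page + matching_rows * CPU_TUPLE_COST

# 1000-page table with 100,000 rows; 5,000 rows match the predicate:
print(indexscan_cost(5_000))          # ~20,050: random I/O dominates
print(seqscan_cost(1_000, 100_000))   # 2,000: reads everything
print(bitmapscan_cost(5_000, 1_000))  # ~1,050: the medium-selectivity winner
```

With only a handful of matching rows, the bitmap's startup work (scanning the index and building the bitmap, which this sketch omits) erodes its advantage, which is why a plain IndexScan wins at high selectivity.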

Join Operators (combine two input streams):

  • NestedLoop: For each row from the outer table, scan the inner table. O(n*m) worst case, but efficient when the inner side uses an index. Best for small outer sets.
  • HashJoin: Build a hash table on the smaller relation, then probe it with the larger relation. O(n+m) but requires memory for the hash table.
  • MergeJoin: Merge two pre-sorted inputs. O(n+m) and does not require extra memory, but both inputs must be sorted (or an explicit Sort node is added).
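
The same back-of-the-envelope arithmetic shows when each join wins. These formulas are illustrative assumptions, not the planner's real ones (which have more terms); they use the cpu_* cost constants listed below plus a hypothetical 4.0-unit charge per index probe:

```python
import math

# cpu_* cost constants from PostgreSQL's defaults (see "The Cost Model").
CPU_TUPLE_COST, CPU_OPERATOR_COST = 0.01, 0.0025

def nested_loop_cost(outer_rows, inner_rows, inner_indexed=False):
    if inner_indexed:
        # One index probe into the inner table per outer row
        # (4.0 units per probe is an assumed figure).
        return outer_rows * 4.0
    # Otherwise rescan the whole inner input per outer row: O(n*m).
    return outer_rows * inner_rows * CPU_OPERATOR_COST

def hash_join_cost(build_rows, probe_rows):
    # Build a hash table on the smaller side (startup cost),
    # then one cheap probe per row on the other side: O(n+m).
    build = build_rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    return build + probe_rows * CPU_OPERATOR_COST

def merge_join_cost(n, m, presorted=False):
    # O(n+m) merge; if the inputs are not sorted, an explicit Sort
    # node adds roughly O(n log n + m log m) comparisons.
    sort = 0.0 if presorted else (n * math.log2(n) + m * math.log2(m)) * CPU_OPERATOR_COST
    return sort + (n + m) * CPU_OPERATOR_COST

# A 1,000-row outer input joined against a 100,000-row inner input:
print(nested_loop_cost(1_000, 100_000))                      # 250,000: quadratic blowup
print(nested_loop_cost(1_000, 100_000, inner_indexed=True))  # 4,000: the index saves it
print(hash_join_cost(1_000, 100_000))                        # ~262.5: linear
print(merge_join_cost(1_000, 100_000, presorted=True))       # ~252.5 if already sorted
```

Note how the hash join's cost splits into a startup part (the build) and a per-probe part, exactly the startup_cost/total_cost distinction described in the next section.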

Other Operators:

  • Sort: Sorts input tuples (for ORDER BY, MergeJoin, or DISTINCT).
  • HashAggregate: Groups rows using a hash table for GROUP BY.
  • Materialize: Stores intermediate results in memory/disk for reuse.

The Cost Model

PostgreSQL estimates cost in arbitrary cost units (not seconds). Each plan node has:

  • startup_cost: Cost before the first tuple can be emitted (e.g., building a hash table).
  • total_cost: Cost to emit all tuples.

The cost factors are configurable:

  • seq_page_cost = 1.0: Cost of reading one sequential disk page.
  • random_page_cost = 4.0: Cost of one random I/O (seeks are expensive on HDDs, less so on SSDs).
  • cpu_tuple_cost = 0.01: CPU cost to process one tuple.
  • cpu_operator_cost = 0.0025: CPU cost to apply one operator (e.g., a comparison).

A SeqScan of a 1000-page table costs roughly 1000 * seq_page_cost + n_tuples * cpu_tuple_cost. An IndexScan returning 10 rows costs roughly 10 * random_page_cost + 10 * cpu_tuple_cost, plus the cost of descending the index. The index is far cheaper when only 10 rows match, but far more expensive when thousands match, because each matching row can trigger a random heap fetch.
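
Plugging numbers into those formulas makes the crossover visible. The 100,000-tuple count and the 5,000-row case below are assumed for illustration; the real model also charges for index descent and operator evaluation:

```python
# Default cost constants from the list above.
seq_page_cost, random_page_cost, cpu_tuple_cost = 1.0, 4.0, 0.01

# SeqScan of a 1000-page table holding 100,000 tuples (assumed count):
seqscan = 1000 * seq_page_cost + 100_000 * cpu_tuple_cost       # 2000.0

# IndexScan returning 10 rows vs. 5,000 rows:
index_10   = 10 * random_page_cost + 10 * cpu_tuple_cost        # ~40.1
index_5000 = 5_000 * random_page_cost + 5_000 * cpu_tuple_cost  # ~20,050

print(seqscan, index_10, index_5000)
# The index wins by ~50x for 10 rows but loses by ~10x for 5,000 rows:
# random I/O is what makes an IndexScan expensive once many rows match.
```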

EXPLAIN and EXPLAIN ANALYZE

EXPLAIN shows the planner's estimated plan without executing it. EXPLAIN ANALYZE actually executes the query and shows actual row counts and timing alongside the estimates. Comparing estimated vs. actual rows reveals whether the planner's statistics are accurate.
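
A common heuristic when comparing the two columns (an informal rule of thumb, not a PostgreSQL feature) is to flag any node whose estimated and actual row counts differ by more than some factor, since large mismatches point at stale or missing statistics:

```python
def misestimate_factor(estimated_rows, actual_rows):
    """How far off the planner's estimate was, as a ratio >= 1.0."""
    hi = max(estimated_rows, actual_rows)
    lo = max(1, min(estimated_rows, actual_rows))  # guard against zero rows
    return hi / lo

# From the sample plan below: the planner estimated 5000 rows, got 4823.
print(round(misestimate_factor(5000, 4823), 2))   # 1.04: statistics are healthy
# A mismatch of 10x or more usually means stale statistics: run ANALYZE
# (or raise the column's statistics target) and re-check.
print(misestimate_factor(100, 10_000))            # 100.0
```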

Plan Caching and JIT

Prepared statements can cache execution plans across repeated executions, avoiding re-planning overhead. For complex queries with many expressions, PostgreSQL can use JIT compilation (via LLVM) to compile expression evaluation into native machine code, avoiding the overhead of interpreted expression evaluation.

Reading an EXPLAIN ANALYZE Output

Real-World Example

Consider the query:

EXPLAIN ANALYZE
SELECT o.id, c.name, SUM(o.amount)
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01'
GROUP BY o.id, c.name;

Sample output:

HashAggregate  (cost=1250.00..1350.00 rows=5000)
               (actual time=45.2..48.1 rows=4823 loops=1)
  Group Key: o.id, c.name
  -> Hash Join  (cost=85.00..1100.00 rows=5000)
                (actual time=2.1..38.5 rows=4823 loops=1)
     Hash Cond: (o.customer_id = c.id)
     -> Bitmap Heap Scan on orders o  (cost=42.00..950.00 rows=5000)
                                      (actual time=1.2..25.3 rows=4823 loops=1)
        Recheck Cond: (order_date > '2024-01-01')
        -> Bitmap Index Scan on idx_orders_date  (cost=0.00..41.00 rows=5000)
                                                 (actual time=0.8..0.8 rows=4823 loops=1)
     -> Hash  (cost=30.00..30.00 rows=1000)
              (actual time=0.6..0.6 rows=1000 loops=1)
        -> Seq Scan on customers c  (cost=0.00..30.00 rows=1000)
                                    (actual time=0.01..0.3 rows=1000 loops=1)
Planning Time: 0.5 ms
Execution Time: 50.2 ms

Reading the plan bottom-up:

  1. Seq Scan on customers: Reads all 1000 customers (small table, full scan is fine).
  2. Hash: Builds a hash table from the 1000 customer rows (startup cost).
  3. Bitmap Index Scan on idx_orders_date: Uses the date index to find pages with orders after 2024-01-01. Found 4823 matching rows.
  4. Bitmap Heap Scan: Fetches those pages in order. The estimate (5000) was close to the actual (4823) — good statistics!
  5. Hash Join: Probes the customer hash table for each order row. Very fast with 1000 customers.
  6. HashAggregate: Groups by o.id and c.name using a hash table.

Key insight: The planner chose BitmapScan over IndexScan because 4823 rows is too many for individual random I/Os. It chose HashJoin over NestedLoop because building a hash table on 1000 customers and probing 4823 times is faster than 4823 individual index lookups.
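
That choice can be sanity-checked with rough arithmetic using the cost constants from earlier. The formulas are simplified assumptions (the planner's real arithmetic includes more terms, such as index descent and caching effects), but the gap is so large that the conclusion holds anyway:

```python
# Default cost constants, as in "The Cost Model" above.
RANDOM_PAGE_COST, CPU_TUPLE_COST, CPU_OPERATOR_COST = 4.0, 0.01, 0.0025

orders_rows, customer_rows = 4_823, 1_000

# NestedLoop with an index on customers.id: roughly one random
# heap fetch plus one comparison per order row.
nested_loop = orders_rows * (RANDOM_PAGE_COST + CPU_OPERATOR_COST)

# HashJoin: hash all 1,000 customers once (the startup cost),
# then one cheap probe per order row.
hash_join = (customer_rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
             + orders_rows * CPU_OPERATOR_COST)

print(round(nested_loop, 1))  # ~19304.1 units
print(round(hash_join, 1))    # ~24.6 units: a ~3 orders-of-magnitude gap
```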

Query Execution Plan Tree

Execution plan tree (read bottom-up; data flows upward):

HashAggregate  (cost=1250..1350, rows=5000)
└─ Hash Join  (cost=85..1100, rows=5000)
   ├─ Bitmap Heap Scan on orders  (cost=42..950)
   │  └─ Bitmap Index Scan on idx_orders_date  (cost=0..41)
   └─ Hash  (build hash table)
      └─ Seq Scan on customers  (cost=0..30)

Cost factors: seq_page_cost=1.0, random_page_cost=4.0, cpu_tuple_cost=0.01, cpu_operator_cost=0.0025
