Why the Optimizer Needs Statistics
The query optimizer must choose the best execution plan for every SQL query. Should it use an index scan or a sequential scan? Which join algorithm is fastest? To answer these questions, it needs to estimate how many rows will pass through each step of the plan. These estimates come from column statistics — metadata about the distribution of values in each column.
n_distinct
The simplest statistic is n_distinct: the number of distinct values in a column. If a column has 10 distinct values and the table has 10,000 rows, the optimizer estimates that a filter like WHERE status = 'active' returns about 1,000 rows (assuming uniform distribution). PostgreSQL stores this in pg_statistic and exposes it via pg_stats.n_distinct (a negative value there is a ratio of the row count: -1 means every row is distinct, -0.5 means about half the rows are).
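As a minimal sketch (the function name and numbers are illustrative, not PostgreSQL internals), the uniform-distribution estimate is simply the row count divided by n_distinct:

```python
def estimate_equality_rows(table_rows: int, n_distinct: int) -> float:
    """Uniform-distribution estimate for WHERE col = constant:
    each distinct value is assumed to cover an equal share of rows."""
    return table_rows / n_distinct

# 10,000 rows, 10 distinct values -> about 1,000 rows per value
print(estimate_equality_rows(10_000, 10))  # 1000.0
```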
Most Common Values (MCV)
Real data is rarely uniformly distributed. The most_common_vals array stores the most frequent values (typically top 100), and most_common_freqs stores their frequencies as fractions. For WHERE country = 'US', the optimizer checks the MCV list first — if 'US' appears with frequency 0.35, it estimates 35% of rows match. This is far more accurate than assuming uniform distribution.
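The lookup logic can be sketched like this (a simplified model, not the planner's actual code; the country frequencies are hypothetical):

```python
def mcv_selectivity(value, mcv_vals, mcv_freqs):
    """Return the measured frequency if value is in the MCV list,
    else None — the optimizer then falls back to other estimates."""
    for v, f in zip(mcv_vals, mcv_freqs):
        if v == value:
            return f
    return None

mcv_vals = ['US', 'DE', 'FR']    # hypothetical most_common_vals
mcv_freqs = [0.35, 0.10, 0.08]   # hypothetical most_common_freqs

print(mcv_selectivity('US', mcv_vals, mcv_freqs))  # 0.35 -> ~35% of rows
print(mcv_selectivity('JP', mcv_vals, mcv_freqs))  # None -> not in MCV list
```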
Equi-Depth Histograms
For values not in the MCV list, PostgreSQL uses an equi-depth (equi-height) histogram. The histogram divides the remaining values (after removing MCVs) into buckets holding approximately equal numbers of rows; only the bucket boundaries are stored, in pg_stats.histogram_bounds. To estimate selectivity for WHERE age > 50, the optimizer finds which pair of boundaries 50 falls between and interpolates linearly within that bucket.
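The interpolation can be sketched as follows (a simplified model of the idea, assuming hypothetical age boundaries; PostgreSQL's actual estimator handles more edge cases):

```python
from bisect import bisect_left

def histogram_selectivity_gt(bounds, x):
    """Estimate the fraction of (non-MCV) rows with value > x, given
    equi-depth bucket boundaries like pg_stats.histogram_bounds.
    Each of the len(bounds)-1 buckets holds an equal share of rows;
    within the bucket containing x we interpolate linearly."""
    n_buckets = len(bounds) - 1
    if x <= bounds[0]:
        return 1.0
    if x >= bounds[-1]:
        return 0.0
    i = bisect_left(bounds, x) - 1           # bucket containing x
    lo, hi = bounds[i], bounds[i + 1]
    frac_below = (x - lo) / (hi - lo)        # part of bucket i below x
    buckets_above = n_buckets - i - 1        # buckets entirely above x
    return (buckets_above + (1 - frac_below)) / n_buckets

# Hypothetical boundaries for age: 5 buckets, each ~20% of rows
bounds = [18, 30, 42, 55, 70, 95]
print(round(histogram_selectivity_gt(bounds, 50), 2))  # 0.48
```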
The ANALYZE Command
Statistics are collected by running ANALYZE tablename (or the autovacuum daemon runs it automatically). ANALYZE reads a random sample of the table — 300 rows per unit of statistics target, so 30,000 rows at the default target — and computes n_distinct, MCVs, and histograms. The default_statistics_target setting (default 100) controls the number of histogram buckets and MCV entries.
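The sample-size arithmetic is simple enough to state directly (the function name is illustrative; the 300-rows-per-target-unit factor is PostgreSQL's documented behavior):

```python
def analyze_sample_rows(statistics_target: int = 100) -> int:
    """ANALYZE samples 300 rows per unit of statistics target,
    so the default target of 100 yields a 30,000-row sample."""
    return 300 * statistics_target

print(analyze_sample_rows())      # 30000
print(analyze_sample_rows(500))   # 150000
```

Raising the target (per column via ALTER TABLE ... SET STATISTICS, or globally) buys more accurate histograms at the cost of a larger sample and slower planning.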
Stale Statistics Lead to Bad Plans
If the data changes significantly but ANALYZE has not run, the optimizer works with outdated statistics. It might estimate 100 rows when the real answer is 100,000, choosing a nested-loop join when a hash join would be orders of magnitude faster. This is one of the most common causes of sudden query performance degradation in production databases. Monitoring pg_stat_user_tables.last_autoanalyze helps detect staleness.
Reading PostgreSQL Statistics
Suppose you have a table orders with 1 million rows and a status column with values: 'pending', 'shipped', 'delivered', 'cancelled'.
Checking statistics after ANALYZE:
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
Result:
- n_distinct: 4
- most_common_vals: {delivered, shipped, pending, cancelled}
- most_common_freqs: {0.65, 0.20, 0.10, 0.05}
Now the optimizer processes WHERE status = 'shipped':
- Check MCV list — 'shipped' found with frequency 0.20.
- Estimated rows = 1,000,000 × 0.20 = 200,000 rows.
- With 200K rows, a sequential scan is likely cheaper than an index scan (which would require 200K random I/Os).
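Putting the numbers together (frequencies repeat the pg_stats output above; the sequential-scan remark is a rule of thumb, not the planner's actual cost model):

```python
table_rows = 1_000_000
# most_common_vals / most_common_freqs from the pg_stats output above
mcv = {'delivered': 0.65, 'shipped': 0.20, 'pending': 0.10, 'cancelled': 0.05}

# Row estimates the optimizer derives for each status value
for status, freq in mcv.items():
    print(f"{status}: {int(table_rows * freq):,} rows")
# 'shipped' -> 200,000 rows: far too many for an index scan's
# random I/O to pay off, so a sequential scan wins
```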
For a numeric column like amount:
- MCV list captures the most common amounts.
- The histogram covers the rest: buckets might be [0-50], [50-120], [120-300], [300-800], [800-5000].
WHERE amount > 300 → the optimizer estimates ~40% of non-MCV rows fall in the last two buckets.
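Because 300 lands exactly on a bucket boundary, no interpolation is needed: the estimate is just the share of buckets above the cutoff (boundaries repeat the hypothetical buckets above):

```python
# Hypothetical histogram_bounds for amount, matching the buckets above
bounds = [0, 50, 120, 300, 800, 5000]
cutoff = 300
n_buckets = len(bounds) - 1

# Buckets whose entire range lies above the cutoff: [300-800], [800-5000]
buckets_above = sum(1 for lo in bounds[:-1] if lo >= cutoff)
selectivity = buckets_above / n_buckets
print(selectivity)  # 0.4 -> ~40% of non-MCV rows
```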
When statistics go stale: After a massive bulk import that changes the distribution, queries may suddenly slow down. Running ANALYZE orders immediately, or lowering autovacuum_analyze_scale_factor so autoanalyze triggers sooner after large changes, resolves the problem.