Back to DAG

Hash Index

databases

What is a Hash Index?

A hash index maps keys to records using a hash function: hash(key) → bucket page. Each bucket is a disk page (or chain of pages) that holds all records whose keys hash to that bucket. Hash indexes provide O(1) average-case equality lookups — even faster than B-trees for exact-match queries.

Bucket pages

The hash index maintains an array of bucket pointers, one per bucket. Each bucket is a page that stores key-value pairs (or key-TID pairs, where TID is a pointer to the heap record). When a bucket overflows, an overflow page is chained to it via a linked-list pointer.

Static hashing

In the simplest scheme, the number of buckets is fixed at creation time:

  • bucket = hash(key) % num_buckets
  • As the table grows, overflow chains get longer.
  • Performance degrades because long chains require multiple page reads (pointer chasing).

Static hashing works well when the data size is known in advance and does not change much.

Extendible hashing

Extendible hashing avoids long overflow chains by dynamically growing the bucket count:

  • A global directory maps the first d bits of hash(key) to a bucket page. d is the global depth.
  • Each bucket has a local depth l indicating how many hash bits it uses.
  • When a bucket overflows: if l < d, the bucket splits and entries are redistributed using one more bit. If l == d, the directory doubles (increment global depth), then split the bucket.
  • Only the overflowing bucket splits — no full rehash. The directory doubles, but directory entries are just pointers (small).

Linear hashing

Linear hashing avoids a directory entirely:

  • Buckets are split in a round-robin order, not on demand.
  • A split pointer tracks which bucket to split next.
  • When load factor exceeds a threshold, the bucket at the split pointer is split and the pointer advances.
  • After all original buckets are split, the pointer resets and the hash range doubles.

Linear hashing provides smooth, incremental growth without a central directory.

Limitation: equality only

Hash indexes can only answer equality queries (WHERE key = 42). They cannot support:

  • Range queries (WHERE key BETWEEN 10 AND 50)
  • Prefix queries (WHERE name LIKE 'Al%')
  • Ordering (ORDER BY key)

This is because hashing destroys key ordering. For range queries, B-trees are necessary.

Real-world usage

  • PostgreSQL hash index: supports only equality lookups. Since PostgreSQL 10, hash indexes are WAL-logged and crash-safe. Still, B-tree indexes are preferred because they handle both equality and range queries.
  • Bitcask (Riak): a log-structured storage engine that keeps a full hash index in memory, mapping every key to a file offset. Extremely fast reads (one disk seek), but the entire keyspace must fit in RAM.
  • Hash join: the database builds a hash table (hash index) on the smaller relation during a hash join, then probes it with each row from the larger relation.

Real-Life: Bitcask and PostgreSQL Hash Index

Real-World Example

Bitcask (Riak's default storage engine):

Bitcask uses the simplest possible hash index design:

  1. All writes are appended to a log file (no overwrites).
  2. A hash table in memory maps every key to (file_id, offset, size).
  3. To read a key: look up the hash table (O(1)), seek to the file position, read the value.
  4. On restart, scan all log files to rebuild the hash table.

Limitation: the hash table must fit in RAM. If you have 1 billion keys with 100-byte average key size, that is ~100 GB of RAM just for the index. This is why Bitcask is suited for workloads with a moderate number of keys but high read/write throughput.

PostgreSQL hash index:

CREATE INDEX idx_email ON users USING hash (email);

This creates a hash index on the email column. Queries like SELECT * FROM users WHERE email = 'alice@example.com' will use it. But WHERE email LIKE 'alice%' or WHERE email > 'a' cannot use the hash index — those need a B-tree.

PostgreSQL's hash index uses a form of linear hashing: buckets split incrementally as the table grows. Each bucket is a page, with overflow pages chained when needed.

When to use hash indexes: Only when you exclusively need equality lookups on a column and never range queries. In practice, most developers use B-tree indexes because they handle both cases. Hash indexes shine for very large tables with only point lookups — the O(1) vs O(log n) difference matters at scale.

Hash Index: Static Hashing with Overflow

Hash Index: 4 Buckets, Static Hashing bucket = hash(key) % 4 B0 B1 B2 B3 alice→p1 bob→p5 carol→p2 dave→p8 eve→p3 frank→p9 greg→p11 henry→p4 iris→p6 jack→p7 Bucket 1: overflow chain (3 pages!) Bucket 2: mostly empty (wasted space) Hash Index vs B-Tree Index Hash Index B-Tree Index Equality (key = 42) O(1) average O(log n) Range (key > 10) NOT SUPPORTED O(log n + k) Order (ORDER BY) NOT SUPPORTED Built-in (sorted)
Step 1 of 2