What is a Buffer Pool?
A buffer pool is a fixed-size region of memory that a database engine uses to cache disk pages. Instead of reading from disk on every query, the database keeps frequently accessed pages in the buffer pool, dramatically reducing I/O.
Core structure
The buffer pool is divided into frames — fixed-size slots, each capable of holding one disk page (typically 4 KB or 8 KB). A page table (implemented as a hash map) maps each page_id to the frame that currently holds it. This allows O(1) lookup to check whether a page is already in memory.
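The frame/page-table layout above can be sketched in a few lines. Everything here (names, sizes, the module-level `frames` list) is illustrative, not any real engine's API:

```python
# Minimal sketch of a buffer pool's core structure: a fixed array of
# frames plus a hash map from page_id to the frame index holding it.

PAGE_SIZE = 4096   # bytes per frame; 4 KB, as mentioned above
NUM_FRAMES = 4     # tiny pool for illustration

class Frame:
    def __init__(self):
        self.page_id = None               # which disk page this frame holds
        self.data = bytearray(PAGE_SIZE)  # the cached page contents
        self.pin_count = 0
        self.dirty = False

frames = [Frame() for _ in range(NUM_FRAMES)]
page_table = {}   # page_id -> frame index; dict lookup is O(1)

def lookup(page_id):
    """Return the frame currently holding page_id, or None if not cached."""
    idx = page_table.get(page_id)
    return frames[idx] if idx is not None else None
```

The hash map is what makes the hit check constant-time: scanning the frame array instead would be O(number of frames).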
Fetch operation
When a query needs a page:
- Check the page table — if the page_id maps to a frame, it is a buffer pool hit. Return a pointer to that frame.
- Buffer pool miss — the page is not in memory. Pick a victim frame (using a replacement policy), write the victim back to disk if its dirty flag is set, then read the requested page from disk into that frame.
- Update the page table — remove the old mapping, add the new one.
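The three steps above can be sketched as one fetch routine. This is a toy version with assumed names: the "disk" is a plain dict, and the replacement policy is deliberately naive (evict the first unpinned frame) rather than a real policy like LRU or clock-sweep:

```python
# Runnable sketch of the fetch path: hit, miss with victim selection,
# write-back of dirty victims, and page-table update.

class BufferPool:
    def __init__(self, num_frames, disk):
        self.disk = disk                   # page_id -> contents (stand-in for disk)
        self.frames = [None] * num_frames  # each entry: dict of page state, or None
        self.page_table = {}               # page_id -> frame index

    def fetch(self, page_id):
        # 1. Hit: the page table already maps page_id to a frame.
        if page_id in self.page_table:
            frame = self.frames[self.page_table[page_id]]
            frame["pin_count"] += 1
            return frame
        # 2. Miss: pick a victim frame (empty, or unpinned).
        for idx, frame in enumerate(self.frames):
            if frame is None or frame["pin_count"] == 0:
                if frame is not None:
                    if frame["dirty"]:                     # flush victim first
                        self.disk[frame["page_id"]] = frame["data"]
                    del self.page_table[frame["page_id"]]  # 3. drop old mapping
                new = {"page_id": page_id,
                       "data": self.disk[page_id],         # read from "disk"
                       "pin_count": 1, "dirty": False}
                self.frames[idx] = new
                self.page_table[page_id] = idx             # 3. add new mapping
                return new
        raise RuntimeError("all frames are pinned; cannot evict")
```

Note that fetch pins the page it returns; a real caller would unpin it when done, as the next section describes.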
Pin count
Each frame maintains a pin count — the number of threads currently using that page. A pinned page (pin count > 0) must never be evicted. When a thread finishes with a page, it calls unpin, decrementing the count. Only frames with pin count 0 are candidates for eviction.
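The pin/unpin contract can be sketched as a pair of methods (illustrative names only):

```python
# Pin-count bookkeeping: a frame is evictable only when no thread holds it.

class Frame:
    def __init__(self, page_id):
        self.page_id = page_id
        self.pin_count = 0

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        assert self.pin_count > 0, "unpin without a matching pin"
        self.pin_count -= 1

    def evictable(self):
        return self.pin_count == 0
```

Every pin must be balanced by exactly one unpin; a leaked pin makes the frame permanently unevictable, slowly shrinking the usable pool.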
Dirty flag
When a page is modified in memory, its frame is marked dirty. Before evicting a dirty frame, the buffer pool must write the page back to disk. Clean frames can be evicted immediately since the disk already has the latest copy.
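The evict-time check reduces to a single conditional write-back. A toy sketch, again with a dict standing in for the disk and assumed field names:

```python
def evict(frame, disk):
    """Release a frame's page, writing it back only if it was modified.
    'disk' is a stand-in dict mapping page_id -> page contents."""
    if frame["dirty"]:
        disk[frame["page_id"]] = frame["data"]  # dirty: flush before reuse
    # clean: nothing to write; the disk already holds the latest copy
    frame["dirty"] = False
    frame["page_id"] = None
```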
Why not use the OS page cache?
Operating systems provide their own virtual-memory page cache, but databases manage their own buffer pool because:
- The DB knows access patterns (e.g., sequential scans vs. index lookups) and can choose smarter replacement policies.
- The DB needs to control write ordering for crash recovery (the write-ahead logging, or WAL, protocol: log records must reach disk before the data pages they describe).
- The OS uses a generic LRU-style policy that a single large sequential scan can pollute, evicting the hot working set in favor of pages read only once.
- The DB can pin pages and manage concurrency at a finer granularity.
| Operation | Time complexity | Notes |
|---|---|---|
| Fetch (hit) | O(1) | Hash map lookup |
| Fetch (miss) | O(1) amortized | Plus disk I/O |
| Unpin | O(1) | Decrement counter |
| Flush | O(1) | O(1) bookkeeping, plus one disk write |
Real-Life: PostgreSQL Shared Buffers
PostgreSQL's shared_buffers setting controls the size of its buffer pool. A typical production configuration sets this to 25% of available RAM (e.g., 8 GB on a 32 GB machine).
When you run a query like SELECT * FROM users WHERE id = 42:
- The executor determines it needs page 7 of the users table.
- It calls the buffer manager to fetch page 7.
- If page 7 is already in shared buffers (a hit), PostgreSQL returns a pointer immediately — no disk I/O at all.
- If it is a miss, PostgreSQL picks a victim buffer using its clock-sweep algorithm, writes back the victim if dirty, reads page 7 from disk, and pins it.
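Clock-sweep can be sketched in a few lines. This is loosely modeled on PostgreSQL's algorithm, not its actual code: each buffer carries a usage_count bumped on access, and a "clock hand" sweeps the array, skipping pinned buffers and decrementing nonzero counts until it finds an unpinned buffer with usage_count 0 (the real implementation also maintains a free list and caps usage_count):

```python
def clock_sweep(buffers, hand=0):
    """Select a victim buffer index. Pinned buffers are skipped; a buffer
    with usage_count > 0 gets a second chance (count decremented); the
    first unpinned buffer found with usage_count == 0 is the victim.
    Loops forever if every buffer is pinned, so callers must guarantee
    at least one unpinned buffer exists."""
    n = len(buffers)
    while True:
        buf = buffers[hand % n]
        if buf["pin_count"] == 0:
            if buf["usage_count"] == 0:
                return hand % n           # victim found
            buf["usage_count"] -= 1       # second chance
        hand += 1
```

Compared with strict LRU, clock-sweep needs no list reordering on every access, only a counter bump, which matters under heavy concurrency.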
Why this matters for performance:
- A buffer pool hit takes ~100 nanoseconds (memory access).
- A disk read takes ~5-10 milliseconds (HDD) or ~100 microseconds (SSD).
- That is a 1,000x to 100,000x difference. A well-tuned buffer pool keeps the hot working set in memory, making most queries disk-free.
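A back-of-envelope expected-cost calculation, using the SSD estimates above, shows why even a small miss rate dominates:

```python
# Expected access time = hit_rate * hit_cost + miss_rate * miss_cost.
hit_ns = 100        # buffer pool hit: ~100 ns (memory access)
miss_ns = 100_000   # SSD read: ~100 microseconds = 100,000 ns

def expected_ns(hit_rate):
    return hit_rate * hit_ns + (1 - hit_rate) * miss_ns

# Even at a 99% hit rate, the 1% of misses contribute 1,000 ns on average,
# so the expected access (~1,099 ns) is ~11x slower than pure memory access.
```

This is why buffer pool tuning targets hit rates well above 99% for OLTP workloads: the expected cost is effectively a weighted average dominated by the miss term.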
MySQL's InnoDB uses a similar design with its buffer pool, typically configured to 70-80% of available memory on a dedicated database server. It additionally splits its LRU list into young and old sublists to resist scan pollution.