Page Cache

Caching Disk Blocks in RAM

The page cache is a region of main memory (RAM) that the operating system uses to cache the contents of disk blocks. Whenever a file is read from disk, the data is stored in the page cache so that subsequent reads of the same data are served directly from RAM without touching the disk. This is critical for performance because RAM access is roughly 100,000 times faster than a mechanical hard drive and 1,000 times faster than an SSD.

How the Page Cache Works

When a process calls read() on a file:

  1. The VFS and filesystem driver determine which disk blocks contain the requested data.
  2. The kernel checks the page cache: if the blocks are already cached (cache hit), the data is copied directly from RAM to the user's buffer. No disk I/O occurs.
  3. If the blocks are not in the cache (cache miss), the kernel reads them from disk into a free page in the cache, then copies the data to the user's buffer. The page remains in the cache for future reads.
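The lookup logic above can be modeled with a toy in-memory cache. This is a sketch, not kernel code: the real kernel indexes the page cache by (inode, page offset) in a per-file radix tree, and all names below are illustrative.

```python
# Toy model of the read path. A dict stands in for the kernel's
# per-file page-cache index; keys mimic (inode, page_index).
PAGE_SIZE = 4096
page_cache = {}                       # (inode, page_index) -> page bytes
stats = {"hits": 0, "misses": 0}

def read_from_disk(inode, page_index):
    """Stand-in for real disk I/O on a cache miss."""
    stats["misses"] += 1
    return bytes(PAGE_SIZE)           # pretend we read a zero-filled block

def cached_read(inode, offset, length):
    """Serve a read, filling the cache on a miss (steps 2 and 3)."""
    page_index = offset // PAGE_SIZE
    key = (inode, page_index)
    if key in page_cache:
        stats["hits"] += 1            # step 2: cache hit, no disk I/O
    else:
        page_cache[key] = read_from_disk(inode, page_index)  # step 3
    start = offset % PAGE_SIZE
    return page_cache[key][start:start + length]

cached_read(inode=7, offset=0, length=100)   # miss: goes to "disk"
cached_read(inode=7, offset=50, length=100)  # hit: same page, served from RAM
print(stats)  # {'hits': 1, 'misses': 1}
```

Note that both reads land on page 0 of the same file, which is why the second one hits.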

Read-Ahead (Prefetching)

The kernel detects sequential read patterns and proactively reads ahead, loading blocks into the cache before the application requests them. For example, if a process reads blocks 0, 1, 2 in order, the kernel prefetches blocks 3, 4, 5, 6, ... into the cache. When the process gets to block 3, it is already in RAM. The read-ahead window grows dynamically as sequential access continues (on Linux the maximum is tunable per device via read_ahead_kb, 128 KB by default) and resets if the access pattern becomes random. This makes sequential file processing (log parsing, video streaming, data pipelines) extremely fast.
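Applications can also hint the kernel explicitly instead of waiting for pattern detection. A minimal sketch using posix_fadvise (Linux/POSIX only; the kernel is free to ignore the hint):

```python
import os
import tempfile

# Write a 64 KB scratch file, then advise the kernel that we will read
# it sequentially so it can enlarge the read-ahead window.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 65536)
os.fsync(fd)

# POSIX_FADV_SEQUENTIAL: "expect sequential access from this offset";
# len=0 means "to the end of the file". This is a hint, not a command.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

data = os.pread(fd, 65536, 0)   # likely served with aggressive read-ahead
os.close(fd)
os.unlink(path)
print(len(data))  # 65536
```

POSIX_FADV_RANDOM does the opposite, telling the kernel to disable read-ahead for workloads that would only pollute the cache with unwanted prefetches.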

Dirty Pages and Writeback

When a process calls write(), the data is written to the page cache, not directly to disk. The modified page is marked dirty. The kernel periodically flushes dirty pages to disk via writeback threads (historically pdflush, now kworker threads). This batching of writes is far more efficient than individual synchronous writes because:

  • Multiple writes to the same page result in a single disk write.
  • The disk scheduler can reorder writes for optimal I/O patterns (e.g., elevator algorithm).
  • The application is not blocked waiting for slow disk I/O on each write() call.

The writeback interval is typically every 5 seconds (controlled by dirty_writeback_centisecs), or when dirty pages exceed a threshold (e.g., 10% of RAM, controlled by dirty_background_ratio).
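On Linux these thresholds are exposed under /proc/sys/vm and can be inspected directly. A sketch, assuming a Linux system (defaults in the comments are common but distribution-dependent):

```python
import os

# Writeback tunables live in /proc/sys/vm on Linux.
tunables = [
    "dirty_writeback_centisecs",  # writeback-thread wakeup interval (500 = 5 s)
    "dirty_background_ratio",     # % of RAM dirty before background writeback starts
    "dirty_ratio",                # % of RAM dirty before write() itself starts blocking
]

for name in tunables:
    path = f"/proc/sys/vm/{name}"
    if os.path.exists(path):      # guard: /proc/sys is Linux-only
        with open(path) as f:
            print(name, "=", f.read().strip())
```

Writing to these files (as root) changes the policy live, e.g. lowering dirty_background_ratio makes writeback start earlier at the cost of less write batching.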

fsync() -- Forcing Durability

The fsync(fd) system call forces all dirty pages associated with a file descriptor to be written to disk synchronously. It blocks until the disk confirms the data is durable. Databases use fsync() after committing a transaction to guarantee that committed data survives a power failure. Without fsync(), data written to the page cache could be lost if the system crashes before writeback occurs.
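The standard durability pattern is write, flush user-space buffers into the kernel, then fsync. A minimal sketch (the file path here is just an example):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "commit.log")

with open(path, "wb") as f:
    f.write(b"COMMIT txn=42\n")   # lands in the page cache; page marked dirty
    f.flush()                     # drain Python's user-space buffer into the kernel
    os.fsync(f.fileno())          # block until the disk reports the data durable

# Even if the machine crashed right after fsync() returned,
# the record would survive.
with open(path, "rb") as f:
    print(f.read())  # b'COMMIT txn=42\n'
```

Note that f.flush() alone is not enough: it only moves data from the process into the page cache, which is exactly the memory a crash would lose.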

Memory Pressure and Eviction

The page cache grows to consume all available free memory. This is by design -- free memory that is not caching anything is wasted capacity. The Linux philosophy is "free memory is wasted memory." When a process needs to allocate memory and none is free, the kernel evicts the least recently used (LRU) pages from the cache to free up frames. Pages backed by files on disk are cheap to evict because they can be re-read from disk if needed later. Dirty pages must be written back before eviction.
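The eviction policy can be sketched with an ordered dict acting as the LRU list. This is a simplification: the real kernel keeps separate active/inactive lists and only approximates LRU, but the principle, and the extra cost of evicting a dirty page, is the same.

```python
from collections import OrderedDict

# Toy LRU page cache with a fixed frame budget. A 'dirty' flag records
# that an evicted dirty page must be written back first.
class LRUPageCache:
    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()           # key -> (data, dirty)
        self.writebacks = 0

    def access(self, key, data=None):
        if key in self.pages:
            self.pages.move_to_end(key)      # now most recently used
            if data is not None:
                self.pages[key] = (data, True)    # write: mark dirty
            return
        if len(self.pages) >= self.frames:        # memory pressure: evict LRU
            _, (_, dirty) = self.pages.popitem(last=False)
            if dirty:
                self.writebacks += 1              # dirty page flushed before reuse
        self.pages[key] = (data, data is not None)

cache = LRUPageCache(frames=2)
cache.access("A")                  # read A
cache.access("B", data=b"new")     # write B (dirty)
cache.access("C")                  # evicts A: clean, cheap to drop
cache.access("D")                  # evicts B: dirty, needs writeback first
print(cache.writebacks)  # 1
```

Evicting "A" costs nothing because its contents can be re-read from disk; evicting "B" forces a disk write first, which is why heavy memory pressure on a write-heavy system causes I/O spikes.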

You can observe page cache usage with free -h: the "buff/cache" column shows how much RAM the kernel is using for the page cache. Even a system using 14 GB of 16 GB RAM might only have 4 GB used by applications -- the other 10 GB is page cache, instantly reclaimable when applications need it.

Impact on Application Performance

The page cache is why the second run of a command like grep -r pattern /src is dramatically faster than the first. The first run reads all files from disk (populating the cache). The second run finds everything already in RAM. This also explains why database servers benefit from large amounts of RAM even beyond what their data set requires -- the page cache keeps frequently accessed data pages in memory.
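You can observe this effect directly by reading the same file twice and timing both passes. One caveat for a truly cold first read: the file below was just written, so it is probably still cached; dropping caches first (as root: echo 3 > /proc/sys/vm/drop_caches) makes the gap dramatic. Timings vary by machine, so this is illustrative only:

```python
import os
import tempfile
import time

# Create a 64 MB scratch file, then read it twice. The second pass is
# normally served entirely from the page cache.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20) * 64)

def timed_read(p):
    t0 = time.perf_counter()
    with open(p, "rb") as f:
        while f.read(1 << 20):    # stream in 1 MB chunks
            pass
    return time.perf_counter() - t0

first, second = timed_read(path), timed_read(path)
print(f"first: {first:.4f}s  second: {second:.4f}s")
os.unlink(path)
```

On a spinning disk with a cold cache, the first pass can be orders of magnitude slower than the second.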

Page Cache in a Database Server

Real-World Example

A PostgreSQL database server with 32 GB of RAM and a 100 GB dataset on an SSD:

First query after restart:

SELECT * FROM orders WHERE customer_id = 42;
-- Time: 45 ms (data read from SSD, ~100 us per 4 KB page)

The table pages are not in the page cache. The kernel reads them from the SSD (100 microseconds per 4 KB page). The pages are now cached.

Same query, seconds later:

SELECT * FROM orders WHERE customer_id = 42;
-- Time: 0.3 ms (data served from page cache, RAM speed)

The pages are now in the page cache. The query is 150x faster because it never touches the disk.

After running a large batch job that reads a 50 GB file:

SELECT * FROM orders WHERE customer_id = 42;
-- Time: 45 ms again (database pages evicted by batch job)

The batch job filled the page cache with its own data, evicting the database pages. This is called cache thrashing. Mitigation strategies include:

  • Using posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) to tell the kernel not to cache the batch file.
  • Using O_DIRECT to bypass the page cache entirely for the batch job.
  • Increasing RAM so both the database working set and batch data can coexist.
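The first mitigation looks like this in practice. A Linux-only sketch; stream_without_caching is a hypothetical helper written for this example, not a standard library function:

```python
import os
import tempfile

def stream_without_caching(path, chunk=1 << 20):
    """Read a large file sequentially while telling the kernel to drop
    the pages behind us, so this batch job cannot evict hotter data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.pread(fd, chunk, offset)
            if not data:
                break
            # ... process data here ...
            offset += len(data)
            # Drop everything we have consumed so far from the page cache.
            os.posix_fadvise(fd, 0, offset, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return offset

# Exercise it on a 3 MB scratch file.
fd, tmp = tempfile.mkstemp()
os.close(fd)
with open(tmp, "wb") as f:
    f.write(b"x" * 3_000_000)
total = stream_without_caching(tmp)
os.unlink(tmp)
print(total)  # 3000000
```

O_DIRECT is stricter: it bypasses the page cache entirely but imposes alignment requirements on buffers and offsets, so DONTNEED is usually the gentler first step.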

Monitoring page cache:

$ free -h
              total   used   free   shared  buff/cache   available
Mem:           32G    4.2G   1.3G     512M       26.5G       27.1G

Only 4.2 GB is used by processes. 26.5 GB is page cache. 27.1 GB is "available" (application-allocatable), because the page cache can be instantly reclaimed.

Page Cache: Read and Write Paths

[Diagram: read and write paths through the page cache. Application read() and write() calls go through the page cache in RAM (cache hit: ~100 ns). A cache miss reads from disk (SSD: ~100 us, HDD: ~10 ms). Read-ahead prefetches sequential blocks before the application asks. Dirty pages are written back every ~5 s, or immediately on fsync(fd), which blocks until the flush completes.]