TLB

Translation Lookaside Buffer

Every memory access in a modern system uses virtual addresses. Before the CPU can read or write physical memory, the Memory Management Unit (MMU) must translate the virtual page number to a physical frame number by walking the page table -- a multi-level tree structure stored in main memory. On x86-64, this walk traverses four levels (PML4 -> PDPT -> PD -> PT), each requiring a separate memory read. At ~100 ns per read, a full page table walk costs ~400 ns -- far too expensive when the CPU issues billions of memory accesses per second.

The TLB: A Cache for Page Table Entries

The Translation Lookaside Buffer (TLB) is a small, fast hardware cache that stores recent virtual-to-physical page mappings. On a TLB hit, the translation completes in 1-2 cycles (sub-nanosecond), completely bypassing the page table walk.

TLB Structure

First-level TLBs are often fully associative -- any entry can map any virtual page. This maximizes the hit rate but limits size, because a fully associative lookup compares against every entry in parallel, which is expensive at scale; larger second-level TLBs are typically set-associative instead. Typical sizes:

TLB Level          | Entries  | Page Size | Coverage
L1 DTLB            | 64       | 4 KB      | 256 KB
L1 ITLB            | 64       | 4 KB      | 256 KB
L2 STLB (unified)  | 512-2048 | 4 KB      | 2-8 MB
Huge page TLB      | 32       | 2 MB      | 64 MB

With huge pages (2 MB or 1 GB), a single TLB entry covers vastly more memory, dramatically reducing TLB miss rates for large working sets.

TLB Miss Penalty

On a TLB miss, the MMU performs a hardware page table walk. If the page table levels are themselves cached in L1/L2 (which is common), the penalty is 10-30 cycles. If the page table entries must be fetched from main memory, the penalty is 200-400+ cycles. The worst case occurs when the page itself is not resident -- the OS must handle a page fault, which can cost millions of cycles for a disk I/O.

TLB Flush on Context Switch

Each process has its own page table (its own virtual address space). When the OS switches processes, the TLB entries from the old process are invalid for the new one. The naive approach is to flush the entire TLB on every context switch, but this is costly -- the new process starts with a cold TLB.

Modern hardware solves this with ASIDs (Address Space Identifiers) or Intel's PCIDs (Process-Context Identifiers). Each TLB entry is tagged with the ASID of the process that created it. On a context switch, the OS simply sets a new ASID; old entries remain but are not matched. This allows TLB entries to survive context switches, dramatically reducing the warm-up cost.

TLB Shootdown in Multi-Core Systems

When one core updates a page table mapping (e.g., the OS unmaps a page), other cores may still have the stale mapping in their local TLBs. The OS must perform a TLB shootdown: it sends an Inter-Processor Interrupt (IPI) to all affected cores, forcing them to invalidate the stale TLB entry. This is one of the most expensive operations in a multi-core OS -- it stalls all targeted cores until the invalidation completes. Reducing TLB shootdowns is a key optimization in high-performance systems (e.g., using RCU for page table updates, or batching invalidations).

TLB in Practice: Database Buffer Pools

Real-World Example

Why databases use huge pages:

A database buffer pool of 128 GB mapped with 4 KB pages requires 33 million page table entries. Even a 2048-entry L2 TLB can only cover 8 MB -- a tiny fraction. Every random page access likely misses the TLB, adding 200+ cycles of page table walk overhead.

Switching to 2 MB huge pages reduces the entries needed to just 65,536. The same TLB can now cover 4 GB, and the miss rate drops by orders of magnitude. PostgreSQL, Oracle, and MySQL all recommend enabling huge pages for production buffer pools.

Other TLB-aware optimizations:

  • JVM large pages (-XX:+UseLargePages): Reduces GC pause-related TLB misses when scanning a multi-GB heap.
  • Linux transparent huge pages (THP): Automatically promotes 4 KB pages to 2 MB when possible, though THP compaction can cause latency spikes.
  • DPDK and network stacks: Pin packet buffers to huge pages so per-packet TLB misses do not bottleneck 100 Gbps processing.

Address Translation: TLB Hit vs Miss

[Diagram: a virtual address (VPN | offset) is looked up in the TLB (fully associative, 64-2048 entries). Hit: physical address in 1-2 cycles. Miss: 4-level page table walk (PML4 -> PDPT -> PD -> PT, ~200-400 cycles), after which the TLB is refilled with the new mapping. Context switch handling: without ASID, the entire TLB is flushed (expensive -- cold start for the new process); with ASID/PCID, entries are tagged with a process ID, old entries survive, and only the current ASID is changed.]