Multi-Level Cache Hierarchy
Modern CPUs access main memory hundreds of times slower than they execute instructions. To bridge this gap, processors use a multi-level cache hierarchy -- small, fast SRAM buffers that keep frequently accessed data close to the core.
L1 Cache (~32 KB, 1-4 cycles)
L1 is split into two separate caches per core: the I-cache (instructions) and the D-cache (data). This split lets the CPU fetch an instruction and load/store data in the same cycle. L1 is the fastest cache but also the smallest -- typically 32 KB each. Its associativity is usually 8-way, balancing hit rate against lookup speed.
L2 Cache (~256 KB, ~10 cycles)
L2 is a unified cache (holding both instructions and data) private to each core. It acts as a backstop for L1 misses. Because it is larger, it can capture working sets that spill out of L1 while still being much faster than main memory.
L3 Cache (~8-32 MB, 30-40 cycles)
L3 is shared across all cores on the chip. It serves two purposes: it catches misses from any core's L2, and it acts as a coherence point -- when one core writes data that another core has cached, the invalidation is coordinated through the shared L3. Intel's designs have often used an inclusive L3 (every line in L1/L2 also lives in L3), while AMD's Zen architecture uses a non-inclusive, victim-style L3 -- filled largely by lines evicted from L2 -- to maximize effective capacity.
Inclusion Policies
- Inclusive: L3 contains a superset of L1+L2 lines. Simplifies coherence (just check L3 tags) but wastes capacity on duplicates.
- Exclusive: A line exists in exactly one level. Maximizes total effective cache size but complicates coherence lookups.
- NINE (Non-Inclusive, Non-Exclusive): No strict guarantee -- a line may exist at multiple levels. Balances capacity and coherence cost.
Cache Coherence: MESI Protocol
When multiple cores cache the same memory address, hardware must keep them consistent. The MESI protocol assigns each cache line one of four states:
| State | Meaning |
|---|---|
| Modified | This core has the only copy and it is dirty (differs from memory) |
| Exclusive | This core has the only copy and it is clean |
| Shared | Multiple cores may hold clean copies |
| Invalid | This line is not valid -- must fetch on next access |
A write to a Shared line triggers an invalidation broadcast, forcing other cores to mark their copies Invalid before the writer transitions to Modified.
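The transitions above can be sketched as a table-driven state function. This is a deliberately simplified model (the names are hypothetical; real implementations add transient states and variants such as MESIF or MOESI): a read miss fills the line Exclusive if no other core holds it and Shared otherwise, and a snooped remote write invalidates the local copy.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

/* Next state of one core's copy of a line. `other_sharers` only matters
 * when a LOCAL_READ misses (state Invalid): fill Exclusive if no other
 * core holds the line, Shared otherwise. */
mesi_t mesi_next(mesi_t s, event_t e, bool other_sharers) {
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? (other_sharers ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:
        /* From Shared, this is where the invalidation broadcast happens
         * before the writer may transition to Modified. */
        return MODIFIED;
    case REMOTE_READ:
        /* A Modified line is written back (or forwarded) and demoted. */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:
        return INVALID;
    }
    return s;
}
```

For example, a core writing a line it holds Shared moves to Modified (after invalidating others), and a core snooping a remote write drops to Invalid.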
Write-Back vs Write-Through
- Write-back: Dirty data is written to the next level only when the line is evicted. Reduces bus traffic dramatically and is the standard policy in modern CPUs.
- Write-through: Every write propagates immediately to the next level. Simpler but generates far more memory traffic; used mainly in embedded or specialized designs.
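A toy single-line model (purely illustrative, with hypothetical names) makes the traffic difference concrete: under write-back, repeated stores to a hot line just set a dirty bit and the next level sees one write at eviction, while write-through pays for every store.

```c
#include <assert.h>
#include <stdbool.h>

/* One cache line with a dirty bit. `mem_writes` counts traffic that
 * reaches the next level of the hierarchy. */
typedef struct { bool valid, dirty; unsigned tag; int data; } line_t;

static int mem_writes = 0;

void wb_store(line_t *l, unsigned tag, int val) {
    l->valid = true; l->tag = tag; l->data = val;
    l->dirty = true;                 /* defer the write to eviction time */
}

void wb_evict(line_t *l) {
    if (l->valid && l->dirty)
        mem_writes++;                /* write back once, only if dirty */
    l->valid = l->dirty = false;
}

void wt_store(line_t *l, unsigned tag, int val) {
    l->valid = true; l->tag = tag; l->data = val;
    mem_writes++;                    /* write-through: every store goes down */
}
```

Three write-back stores to the same line cost one downstream write at eviction; the same three stores under write-through cost three.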
Why Cache Hierarchy Matters in Practice
Example: Matrix Multiplication Performance
Consider multiplying two 4096x4096 double matrices. The naive triple loop walks down columns of B, and because B is stored row-major, each step down a column jumps 4096 x 8 B = 32 KB in memory. Nearly every access misses L1 and often L2, landing in L3 or main memory.
By tiling (blocking) the computation into sub-matrices, the working set stays cache-resident. Note that a single 64x64 tile of doubles is 64 x 64 x 8 B = 32 KB -- already the size of the whole L1 D-cache -- so practical tiles are smaller (e.g. 32x32 = 8 KB), letting the active tiles of A, B, and C fit in L1 together. This single optimization can yield a 10-20x speedup purely from cache effects.
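The tiling transform can be sketched as follows. The sizes here are illustrative assumptions, shrunk to N=128 with 32x32 tiles so that the three active blocks together occupy about 24 KB, comfortably inside a 32 KB L1:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N    128   /* small for demonstration; the text discusses 4096 */
#define TILE 32    /* 32x32 doubles = 8 KB per tile; three tiles fit in L1 */

/* Naive ijk loop: the inner loop reads B with an 8*N-byte stride. */
void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];  /* column walk of B */
            C[i*N + j] = sum;
        }
}

/* Tiled version: all three innermost loops stay inside TILE x TILE
 * blocks of A, B, and C, so the data they touch stays cache-resident. */
void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, (size_t)N * N * sizeof(double));
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i*N + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}
```

Both functions compute the same product; only the traversal order (and thus the cache behavior) differs. Floating-point results may differ in the last bits because the summation order changes.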
Real-world implications:
- Database buffer pools align pages to cache line boundaries so B-tree traversals stay in L1/L2.
- JVM JIT compilers lay out object fields to minimize cache line straddles.
- Linux's per-CPU data structures avoid cross-core L3 coherence traffic by giving each core its own copy.
- LMAX Disruptor pads ring buffer entries to prevent false sharing -- two cores writing adjacent fields in the same cache line, causing constant MESI invalidation ping-pong.
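The padding idea from the last bullet can be sketched in C11. The 64-byte line size is an assumption -- typical for x86, but worth verifying for the target:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* common x86 line size; an assumption, verify per target */

/* Both hot counters share one cache line: two cores writing them will
 * ping-pong the line between caches via MESI invalidations. */
struct counters_shared {
    long produced;
    long consumed;
};

/* Padded: each counter starts on its own line, in the spirit of the
 * Disruptor's padding of ring-buffer sequence fields. */
struct counters_padded {
    alignas(CACHE_LINE) long produced;
    alignas(CACHE_LINE) long consumed;
};
```

The cost is memory: the padded struct is at least two full cache lines instead of 16 bytes, a trade worth making only for fields written concurrently by different cores.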