The Minimum Unit of Cache Transfer
A cache line (or cache block) is the minimum unit of data transferred between main memory and the CPU cache. On most modern processors a cache line is 64 bytes, though some designs (e.g., Apple's M-series) use 128-byte lines. When the CPU reads even a single byte from memory, the entire 64-byte block containing that byte is fetched into the cache. This design exploits spatial locality: if you access address N, you are very likely to access N+1, N+2, and so on shortly afterward.
Address decomposition
When the CPU accesses a memory address, the cache hardware splits it into three fields:
| Field | Purpose |
|---|---|
| Tag | Identifies which block of memory is stored in this cache line. Compared against the stored tag to determine a hit or miss. |
| Index | Selects which cache set (row) to look in. Acts like a hash to find the candidate location(s). |
| Offset | Selects the specific byte within the 64-byte cache line. For 64-byte lines, this is 6 bits (2^6 = 64). |
For example, in a 32 KB direct-mapped cache with 64-byte lines: 512 lines total, 9 index bits, 6 offset bits, and 17 tag bits for a 32-bit address (32 − 9 − 6 = 17).
Cache associativity
- Direct-mapped — each memory block maps to exactly one cache line (index determines the slot). Fast lookup but prone to conflict misses when different addresses map to the same index.
- Set-associative — each index maps to a set of N lines (N-way). The hardware checks all N tags in parallel. An 8-way set-associative cache balances speed and flexibility. Most modern L1 and L2 caches use 4-way to 16-way associativity.
- Fully-associative — a block can go in any line. Eliminates conflict misses but requires comparing every tag, making it expensive. Only practical for small structures like TLBs.
Cache miss types (the 3 C's)
- Compulsory (cold) misses — the very first access to a block always misses since it has never been cached.
- Capacity misses — the working set exceeds cache size. Even with perfect placement, some blocks must be evicted.
- Conflict misses — in direct-mapped or set-associative caches, multiple blocks compete for the same set. Increasing associativity reduces these.
False sharing
In multi-core systems, each core has its own L1 cache, but the caches must stay coherent. False sharing occurs when two cores write to different variables that happen to reside on the same cache line. Each write invalidates the other core's copy of the entire line, causing expensive coherence traffic — even though the cores are not logically sharing data. The fix is to pad data structures so that per-core variables occupy separate cache lines (often called cache-line padding, or Java's `@Contended` annotation).
Spatial Locality in Action
Consider iterating over a 2D array in C (row-major layout):
Row-major traversal (cache-friendly):
```c
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];   // sequential in memory
```
Each cache line holds 16 consecutive ints (64 bytes / 4 bytes). After the first access in a row, the next 15 are free cache hits. Miss rate: ~1/16 = 6.25%.
Column-major traversal (cache-hostile):
```c
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];   // jumps by N*4 bytes each step
```
Each access lands on a different cache line. For large N, every access is a miss. The column-major version can be 10-20x slower than row-major for large matrices — solely due to cache line behavior.
False sharing example (Java):
```java
class Counters {
    volatile long counter1;  // core 0 writes this
    volatile long counter2;  // core 1 writes this
    // Both in the same 64-byte cache line!
}
```
Every write to counter1 invalidates core 1's cache line, and vice versa, even though they are independent variables.