The Minimum Unit of Cache Transfer
A cache line (or cache block) is the minimum unit of data transferred between main memory and the CPU cache. On most modern processors a cache line is 64 bytes, though some designs (e.g., Apple's M-series) use 128-byte lines. When the CPU reads even a single byte from memory, the entire 64-byte block containing that byte is fetched into the cache. This design exploits spatial locality: if you access address N, you are very likely to access N+1, N+2, and so on shortly afterward.
Address decomposition
When the CPU accesses a memory address, the cache hardware splits it into three fields:
| Field | Purpose |
|---|---|
| Tag | Identifies which block of memory is stored in this cache line. Compared against the stored tag to determine a hit or miss. |
| Index | Selects which cache set (row) to look in. Acts like a hash to find the candidate location(s). |
| Offset | Selects the specific byte within the 64-byte cache line. For 64-byte lines, this is 6 bits (2^6 = 64). |
For example, in a 32 KB direct-mapped cache with 64-byte lines: 512 lines total, 9 index bits, 6 offset bits, and 17 tag bits for a 32-bit address (32 − 9 − 6 = 17).
Cache associativity
- Direct-mapped — each memory block maps to exactly one cache line (index determines the slot). Fast lookup but prone to conflict misses when different addresses map to the same index.
- Set-associative — each index maps to a set of N lines (N-way). The hardware checks all N tags in parallel. An 8-way set-associative cache balances speed and flexibility. Most modern L1 and L2 caches use 4-way to 16-way associativity.
- Fully-associative — a block can go in any line. Eliminates conflict misses but requires comparing every tag, making it expensive. Only practical for small structures like TLBs.
Cache miss types (the 3 C's)
- Compulsory (cold) misses — the very first access to a block always misses since it has never been cached.
- Capacity misses — the working set exceeds cache size. Even with perfect placement, some blocks must be evicted.
- Conflict misses — in direct-mapped or set-associative caches, multiple blocks compete for the same set. Increasing associativity reduces these.
False sharing
In multi-core systems, each core has its own L1 cache, but the caches must stay coherent. False sharing occurs when two cores write to different variables that happen to reside on the same cache line. Each write invalidates the other core's copy of the entire line, causing expensive coherence traffic — even though the cores are not logically sharing data. The fix is to pad data structures so that per-core variables occupy separate cache lines (often called cache-line padding, or Java's `@Contended` annotation).
Spatial Locality in Action
Consider iterating over a 2D array in C (row-major layout):
Row-major traversal (cache-friendly):
```c
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];   // sequential in memory
```
Each cache line holds 16 consecutive ints (64 bytes / 4 bytes). After the first access in a row, the next 15 are free cache hits. Miss rate: ~1/16 = 6.25%.
Column-major traversal (cache-hostile):
```c
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];   // jumps by N*4 bytes each step
```
Each access lands on a different cache line. For large N, every access is a miss. The column-major version can be 10-20x slower than row-major for large matrices — solely due to cache line behavior.
False sharing example (Java):
```java
class Counters {
    volatile long counter1;  // core 0 writes this
    volatile long counter2;  // core 1 writes this
    // Both in the same 64-byte cache line!
}
```
Every write to counter1 invalidates core 1's cache line, and vice versa, even though they are independent variables.