SIMD / Vectorized Instructions

Single Instruction, Multiple Data

SIMD (Single Instruction, Multiple Data) is a parallel execution model where one instruction operates on multiple data elements simultaneously. Instead of adding two numbers at a time, a SIMD instruction can add four, eight, or sixteen pairs in a single clock cycle using vector registers — wide registers that pack multiple values side by side.

Vector register evolution

SIMD capabilities have grown with each CPU generation:

Extension   Register Width   Floats (32-bit) per Op   Platform
SSE         128-bit          4                        x86
AVX         256-bit          8                        x86
AVX-512     512-bit          16                       x86
NEON        128-bit          4                        ARM
SVE/SVE2    128-2048 bit     scalable                 ARM

A 256-bit AVX register can hold eight 32-bit floats or four 64-bit doubles. A single VADDPS instruction adds all eight float pairs simultaneously — an 8x throughput gain over scalar code for that operation.

Data-parallel workloads

SIMD excels when the same operation is applied independently to many elements:

  • Image processing — apply a brightness filter to every pixel. Each pixel's RGB channels can be processed in parallel.
  • Matrix multiplication — the dot product of a row and column involves multiplying and summing corresponding elements, a perfect SIMD pattern.
  • AI inference — neural network layers multiply weight matrices by activation vectors. Libraries like oneDNN and XNNPACK use SIMD heavily.
  • Audio/video codecs — encoding and decoding involve applying transforms (DCT, FFT) across blocks of samples.
  • Physics simulations — updating positions and velocities of many particles with the same equations.

Auto-vectorization

Modern compilers (GCC, Clang, MSVC) can automatically convert scalar loops into SIMD instructions — a process called auto-vectorization. The compiler analyzes loops for independence between iterations (no loop-carried dependencies) and transforms them to operate on vectors. Compiler flags like -O2 or -O3 enable this, and pragmas or attributes can provide hints.

Alignment requirements

SIMD loads and stores are most efficient when data is aligned to the vector width boundary. A 256-bit (32-byte) AVX load from an address divisible by 32 uses an aligned load (VMOVAPS), which is faster than an unaligned load (VMOVUPS). Misaligned accesses may cross cache line boundaries, causing two cache accesses instead of one. Allocation functions like aligned_alloc or compiler attributes (__attribute__((aligned(32)))) ensure proper alignment.

Scalar vs SIMD: Array Addition

Real-World Example

Adding two arrays of 1024 floats element-by-element:

Scalar (one at a time):

for (int i = 0; i < 1024; i++) {
    c[i] = a[i] + b[i];  // 1024 ADD instructions
}

SIMD with AVX (8 at a time):

#include <immintrin.h>  // AVX intrinsics

// a, b, c must be 32-byte aligned for _mm256_load_ps/_mm256_store_ps;
// use _mm256_loadu_ps/_mm256_storeu_ps for unaligned data.
for (int i = 0; i < 1024; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);   // load 8 floats
    __m256 vb = _mm256_load_ps(&b[i]);   // load 8 floats
    __m256 vc = _mm256_add_ps(va, vb);   // add 8 pairs
    _mm256_store_ps(&c[i], vc);          // store 8 results
}

The SIMD version issues only 128 add instructions instead of 1024 — a theoretical 8x speedup. In practice, memory bandwidth and other bottlenecks limit gains to 3-6x, but that is still a massive improvement for compute-heavy workloads.

In Java, the Vector API (incubating since JDK 16) provides portable SIMD:

var species = FloatVector.SPECIES_256;        // 8 float lanes
int upper = species.loopBound(a.length);      // largest lane-multiple <= length
for (int i = 0; i < upper; i += species.length()) {
    var va = FloatVector.fromArray(species, a, i);
    var vb = FloatVector.fromArray(species, b, i);
    va.add(vb).intoArray(c, i);
}
for (int i = upper; i < a.length; i++) {      // scalar tail
    c[i] = a[i] + b[i];
}

Scalar vs SIMD Execution

[Diagram: scalar execution processes one element per cycle, taking 4 cycles for 4 elements; a 128-bit SIMD register with four 32-bit float lanes adds all four pairs in a single cycle — 4x throughput.]