SIMD / Vectorized Instructions

Single Instruction, Multiple Data

SIMD (Single Instruction, Multiple Data) is a parallel execution model where one instruction operates on multiple data elements simultaneously. Instead of adding two numbers at a time, a SIMD instruction can add four, eight, or sixteen pairs in a single clock cycle using vector registers — wide registers that pack multiple values side by side.

Vector register evolution

SIMD capabilities have grown with each CPU generation:

Extension   Register Width   Floats (32-bit) per Op   Platform
SSE         128-bit          4                        x86
AVX         256-bit          8                        x86
AVX-512     512-bit          16                       x86
NEON        128-bit          4                        ARM
SVE/SVE2    128-2048 bit     scalable                 ARM

A 256-bit AVX register can hold eight 32-bit floats or four 64-bit doubles. A single VADDPS instruction adds all eight float pairs simultaneously — an 8x throughput gain over scalar code for that operation.

Data-parallel workloads

SIMD excels when the same operation is applied independently to many elements:

  • Image processing — apply a brightness filter to every pixel. Each pixel's RGB channels can be processed in parallel.
  • Matrix multiplication — the dot product of a row and column involves multiplying and summing corresponding elements, a perfect SIMD pattern.
  • AI inference — neural network layers multiply weight matrices by activation vectors. Libraries like oneDNN and XNNPACK use SIMD heavily.
  • Audio/video codecs — encoding and decoding involve applying transforms (DCT, FFT) across blocks of samples.
  • Physics simulations — updating positions and velocities of many particles with the same equations.

Auto-vectorization

Modern compilers (GCC, Clang, MSVC) can automatically convert scalar loops into SIMD instructions — a process called auto-vectorization. The compiler analyzes loops for independence between iterations (no loop-carried dependencies) and transforms them to operate on vectors. Compiler flags like -O2 or -O3 enable this, and pragmas or attributes can provide hints.

Alignment requirements

SIMD loads and stores are most efficient when data is aligned to the vector width boundary. A 256-bit (32-byte) AVX load from an address divisible by 32 uses an aligned load (VMOVAPS), which is faster than an unaligned load (VMOVUPS). Misaligned accesses may cross cache line boundaries, causing two cache accesses instead of one. Allocation functions like aligned_alloc or compiler attributes (__attribute__((aligned(32)))) ensure proper alignment.

Scalar vs SIMD: Array Addition

Real-World Example

Adding two arrays of 1024 floats element-by-element:

Scalar (one at a time):

for (int i = 0; i < 1024; i++) {
    c[i] = a[i] + b[i];  // 1024 ADD instructions
}

SIMD with AVX (8 at a time):

#include <immintrin.h>  // AVX intrinsics

// a, b, c must be 32-byte aligned for _mm256_load_ps/_mm256_store_ps;
// use _mm256_loadu_ps/_mm256_storeu_ps for unaligned data.
for (int i = 0; i < 1024; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);   // load 8 floats
    __m256 vb = _mm256_load_ps(&b[i]);   // load 8 floats
    __m256 vc = _mm256_add_ps(va, vb);   // add 8 pairs
    _mm256_store_ps(&c[i], vc);          // store 8 results
}

The SIMD version issues only 128 add instructions instead of 1024 — a theoretical 8x speedup. In practice, memory bandwidth and other bottlenecks limit gains to 3-6x, but that is still a massive improvement for compute-heavy workloads.

In Java, the Vector API (incubating since JDK 16) provides portable SIMD:

var species = FloatVector.SPECIES_256;        // 8 float lanes
int upper = species.loopBound(a.length);      // largest lane-multiple <= length
for (int i = 0; i < upper; i += species.length()) {
    var va = FloatVector.fromArray(species, a, i);
    var vb = FloatVector.fromArray(species, b, i);
    va.add(vb).intoArray(c, i);
}
for (int i = upper; i < a.length; i++) {      // scalar tail
    c[i] = a[i] + b[i];
}

Scalar vs SIMD Execution

[Diagram: scalar execution processes one element per cycle, taking 4 cycles for 4 elements; a 128-bit SIMD register with four 32-bit float lanes adds all four pairs in a single cycle — 4x throughput.]