
The Von Neumann Architecture

The Von Neumann architecture is the foundational model for nearly every general-purpose computer. Its defining characteristic is a shared memory that holds both program instructions and data. The CPU fetches instructions from the same memory where it reads and writes data, communicating over a common bus.

Key registers

The CPU maintains several special-purpose registers that orchestrate execution:

  • Program Counter (PC) — holds the memory address of the next instruction to fetch. After each fetch, the PC increments automatically (unless a branch or jump overrides it).
  • Instruction Register (IR) — holds the instruction currently being executed. The control unit decodes the opcode and operand fields from the IR.
  • Memory Address Register (MAR) — holds the address the CPU wants to read from or write to in main memory.
  • Memory Data Register (MDR) — holds the data just fetched from memory or about to be written to memory.
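As a sketch, the four registers and one fetch micro-sequence can be written out directly (the 16-bit word size and word-addressed memory are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

// Sketch: the four special-purpose registers as a struct.
typedef struct {
    uint16_t pc;   // Program Counter: address of the next instruction
    uint16_t ir;   // Instruction Register: instruction being executed
    uint16_t mar;  // Memory Address Register: address for the memory access
    uint16_t mdr;  // Memory Data Register: data read from / written to memory
} Cpu;

// One fetch micro-sequence: PC -> MAR -> memory -> MDR -> IR, then PC++.
uint16_t fetch(Cpu *cpu, const uint16_t *memory) {
    cpu->mar = cpu->pc;           // put the address on the bus
    cpu->mdr = memory[cpu->mar];  // memory responds with the word
    cpu->ir  = cpu->mdr;          // latch the instruction for decode
    cpu->pc += 1;                 // advance to the next instruction
    return cpu->ir;
}
```

Note that the PC only ever feeds the MAR, and memory only ever answers into the MDR: every register has exactly one job.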

The fetch-decode-execute cycle

Every instruction passes through three stages:

  1. Fetch — the address in the PC is copied to the MAR; memory returns the instruction into the MDR; the MDR value is copied to the IR; the PC increments.
  2. Decode — the control unit interprets the opcode in the IR, determines which functional units and registers are involved, and routes operands.
  3. Execute — the ALU or other unit performs the operation (arithmetic, load/store, branch). Results are written back to a register or memory.

Pipelining

Because fetch, decode, and execute use different hardware, they can overlap. While instruction N is executing, instruction N+1 is decoding, and instruction N+2 is fetching. This instruction-level pipelining increases throughput without raising the clock speed. Modern CPUs use 10-20+ stage pipelines, though hazards (data dependencies, branches) can cause stalls that reduce the benefit.

Von Neumann vs Harvard architecture

The Von Neumann bottleneck arises because instructions and data share one bus — the CPU often stalls waiting for memory. The Harvard architecture uses separate memories (and buses) for instructions and data, enabling simultaneous instruction fetch and data access. Most modern CPUs use a hybrid: separate L1 caches for instructions and data (Harvard-style), backed by unified main memory (Von Neumann-style).

Market example: Apple M-series (M1/M2/M3) — each performance core pairs a 192KB L1 instruction cache with a 128KB L1 data cache (Harvard at L1), backed by a unified L2 and main memory (Von Neumann below). This lets the core fetch the next instruction while loading data for the previous one, avoiding the bottleneck.
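A minimal sketch of the split-L1 idea (the cache sizes and flat-array "caches" here are arbitrary placeholders, not Apple's actual design): because instructions and data live in separate memories with separate ports, one fetch and one load can complete in the same cycle.

```c
#include <assert.h>
#include <stdint.h>

// Sketch of a modified-Harvard core: separate L1 instruction and data
// memories. Sizes are illustrative, not a real cache design.
typedef struct {
    uint16_t icache[256];  // instructions only
    uint16_t dcache[256];  // data only
} L1Caches;

// The two caches are independent memories, so an instruction fetch and a
// data load proceed in the same cycle with no shared-bus contention.
void cycle(const L1Caches *l1, uint16_t pc, uint16_t addr,
           uint16_t *insn, uint16_t *data) {
    *insn = l1->icache[pc];    // instruction-fetch port
    *data = l1->dcache[addr];  // data-load port
}
```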

Intuition builder

The fetch-decode-execute cycle in one loop — registers and memory flow:

// Von Neumann CPU: one instruction per iteration
// (16-bit words: top 4 bits = opcode, low 12 bits = operand;
//  Rx/Ra/Rb stand for registers named by fields of the instruction)
while (true) {
    // 1. FETCH — PC points to next instruction; load it into IR
    MAR = PC;               // put the address on the bus
    MDR = memory[MAR];      // memory returns the word
    IR = MDR;               // latch it for decoding
    PC = PC + 1;            // ready for the next fetch

    // 2. DECODE — split instruction into opcode + operands
    opcode = IR >> 12;      // which operation
    operand = IR & 0xFFF;   // address or jump target

    // 3. EXECUTE — dispatch to the right operation
    switch (opcode) {
        case LOAD:  MAR = operand; MDR = memory[MAR]; Rx = MDR; break;
        case ADD:   Rx = Ra + Rb; break;
        case STORE: MAR = operand; MDR = Rx; memory[MAR] = MDR; break;
        case JUMP:  PC = operand; break;
    }
}

Key insight: PC, MAR, MDR, and IR form a pipeline of addresses and values. The CPU never "thinks" — it just moves data between these registers and memory according to the opcode.

Walking Through a Simple Program

Real-World Example

Consider a program that adds two numbers stored in memory:

Address  Instruction
0x00     LOAD R1, [0x10]    ; load value at address 0x10 into R1
0x01     LOAD R2, [0x11]    ; load value at address 0x11 into R2
0x02     ADD  R3, R1, R2    ; R3 = R1 + R2
0x03     STORE R3, [0x12]   ; write R3 to address 0x12

Cycle-by-cycle (without pipelining):

  1. PC=0x00: MAR=0x00 -> MDR=LOAD R1,[0x10] -> IR=LOAD R1,[0x10]. Decode: load instruction. Execute: MAR=0x10 -> MDR=value -> R1=value.
  2. PC=0x01: fetch LOAD R2, decode, execute similarly.
  3. PC=0x02: fetch ADD, decode (needs R1, R2), execute (ALU adds), write-back to R3.
  4. PC=0x03: fetch STORE, decode, execute: MAR=0x12, MDR=R3 value -> memory write.

With pipelining, instructions 0x01 and 0x02 overlap with earlier stages, completing the whole program faster. However, instruction 0x02 (ADD) depends on R1 and R2 from the prior LOADs — the pipeline must detect this data hazard and insert a stall or use forwarding to pass the result directly.
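One way to quantify the cost, assuming a 3-stage pipeline where a load-use dependency costs one bubble even with forwarding (the one-cycle penalty is an assumption typical of simple pipelines, not a fixed rule):

```c
#include <assert.h>

// Sketch: ideal pipelined cycle count plus stall bubbles.
unsigned cycles_with_stalls(unsigned n, unsigned stages, unsigned stalls) {
    return n + stages - 1 + stalls;
}
```

Under that assumption, the four-instruction program with one load-use stall (the ADD immediately follows the LOAD of R2) takes 4 + 3 - 1 + 1 = 7 cycles, versus 12 cycles unpipelined.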

CPU Block Diagram

[Block diagram: inside the CPU, the Control Unit (decode & sequencing), the registers (PC, IR, MAR, MDR, R0-R3), and the ALU (arithmetic & logic) are connected by an internal bus for data-path routing. A single system bus (address + data + control) links the CPU to main memory, which holds both instructions (LOAD R1, ...) and data (42, 58, ...) in the same shared memory.]
