The Von Neumann Architecture
The Von Neumann architecture is the foundational model for nearly every general-purpose computer. Its defining characteristic is a shared memory that holds both program instructions and data. The CPU fetches instructions from the same memory where it reads and writes data, communicating over a common bus.
Key registers
The CPU maintains several special-purpose registers that orchestrate execution:
- Program Counter (PC) — holds the memory address of the next instruction to fetch. After each fetch, the PC increments automatically (unless a branch or jump overrides it).
- Instruction Register (IR) — holds the instruction currently being executed. The control unit decodes the opcode and operand fields from the IR.
- Memory Address Register (MAR) — holds the address the CPU wants to read from or write to in main memory.
- Memory Data Register (MDR) — holds the data just fetched from memory or about to be written to memory.
The fetch-decode-execute cycle
Every instruction passes through three stages:
- Fetch — the address in the PC is copied to the MAR; memory returns the instruction into the MDR; the MDR value is copied to the IR; the PC increments.
- Decode — the control unit interprets the opcode in the IR, determines which functional units and registers are involved, and routes operands.
- Execute — the ALU or other unit performs the operation (arithmetic, load/store, branch). Results are written back to a register or memory.
Pipelining
Because fetch, decode, and execute use different hardware, they can overlap. While instruction N is executing, instruction N+1 is decoding, and instruction N+2 is fetching. This instruction-level pipelining increases throughput without raising the clock speed. Modern CPUs use 10-20+ stage pipelines, though hazards (data dependencies, branches) can cause stalls that reduce the benefit.
Von Neumann vs Harvard architecture
The Von Neumann bottleneck arises because instructions and data share one bus — the CPU often stalls waiting for memory. The Harvard architecture uses separate memories (and buses) for instructions and data, enabling simultaneous instruction fetch and data access. Most modern CPUs use a hybrid: separate L1 caches for instructions and data (Harvard-style), backed by unified main memory (Von Neumann-style).
Market example: Apple M-series (M1/M2/M3) — separate 192KB L1 instruction cache and 128KB L1 data cache per performance core (Harvard at L1), unified L2 and main memory (Von Neumann below). This split lets the core fetch the next instruction while loading data for the previous one, sidestepping the bottleneck at the cache level.
Intuition builder
The fetch-decode-execute cycle in one loop — registers and memory flow:
// Von Neumann CPU: one instruction per iteration
while (true) {
    // 1. FETCH — PC points to next instruction; load it into IR
    MAR = PC;
    MDR = memory[MAR];
    IR = MDR;
    PC = PC + 1;

    // 2. DECODE — split instruction into opcode + operands
    opcode = IR >> 12;
    operand = IR & 0xFFF;

    // 3. EXECUTE — dispatch to the right operation
    switch (opcode) {
        case LOAD:  MAR = operand; MDR = memory[MAR]; Rx = MDR; break;
        case ADD:   Rx = Ra + Rb; break;
        case STORE: MAR = operand; memory[MAR] = Rx; break;
        case JUMP:  PC = operand; break;
    }
}
Key insight: PC, MAR, MDR, and IR form a pipeline of addresses and values. The CPU never "thinks" — it just moves data between these registers and memory according to the opcode.
Walking Through a Simple Program
Consider a program that adds two numbers stored in memory:
Address  Instruction
0x00     LOAD  R1, [0x10]   ; load value at address 0x10 into R1
0x01     LOAD  R2, [0x11]   ; load value at address 0x11 into R2
0x02     ADD   R3, R1, R2   ; R3 = R1 + R2
0x03     STORE R3, [0x12]   ; write R3 to address 0x12
Cycle-by-cycle (without pipelining):
- PC=0x00: MAR=0x00 -> MDR=LOAD R1,[0x10] -> IR=LOAD R1,[0x10]. Decode: load instruction. Execute: MAR=0x10 -> MDR=value -> R1=value.
- PC=0x01: fetch LOAD R2, decode, execute similarly.
- PC=0x02: fetch ADD, decode (needs R1, R2), execute (ALU adds), write-back to R3.
- PC=0x03: fetch STORE, decode, execute: MAR=0x12, MDR=R3 value -> memory write.
With pipelining, instructions 0x01 and 0x02 overlap with earlier stages, completing the whole program faster. However, instruction 0x02 (ADD) depends on R1 and R2 from the prior LOADs — the pipeline must detect this data hazard and insert a stall or use forwarding to pass the result directly.