System Bus Architecture and PCIe

A system bus is the communication backbone that connects the CPU, memory, and I/O devices. Historically this was a single shared bus, but modern systems use a hierarchy of point-to-point interconnects for performance. Understanding bus architecture explains why some I/O operations are fast while others are slow, and how devices interact with the CPU and memory.

The Three Logical Buses

Even in modern point-to-point designs, the three logical functions of a bus remain:

  • Address Bus: Carries the memory or I/O address the CPU wants to access. The width determines the addressable space (e.g., 48 bits on x86-64 = 256 TB).
  • Data Bus: Carries the actual data being read or written. Width determines throughput per transfer (64 bits on modern CPUs = 8 bytes per cycle).
  • Control Bus: Carries signals that coordinate the transfer -- read/write, interrupt requests, bus grant, clock, etc.

PCIe: Point-to-Point Serial Lanes

PCI Express (PCIe) replaced the older shared PCI bus with a switched fabric of point-to-point serial links. Each link consists of one or more lanes, and each lane is two differential signal pairs, one for each direction, so every lane is full duplex.

  Generation   Per-Lane Bandwidth (each direction)   x16 Total
  PCIe 3.0     ~1 GB/s                               ~16 GB/s
  PCIe 4.0     ~2 GB/s                               ~32 GB/s
  PCIe 5.0     ~4 GB/s                               ~64 GB/s
  PCIe 6.0     ~8 GB/s                               ~128 GB/s

A GPU typically uses a x16 slot (16 lanes). An NVMe SSD uses x4 (4 lanes). The PCIe switch/root complex routes packets between devices, unlike the old shared bus where all devices competed for the same wires.

DMA: Direct Memory Access

Without DMA, every byte transferred between a device and memory requires the CPU to execute a load/store instruction -- programmed I/O. This wastes CPU cycles on bulk data movement.

DMA allows devices to read/write main memory independently. The CPU sets up a DMA transfer by writing the source address, destination address, and byte count to the DMA controller's registers. The controller then performs the transfer autonomously, sending an interrupt to the CPU when finished. This frees the CPU to execute other work during the transfer.

Modern devices (NVMe SSDs, NICs, GPUs) all use DMA extensively. NVMe command queues, for example, are memory-mapped ring buffers that the device reads via DMA.

Memory-Mapped I/O vs Port-Mapped I/O

  • Memory-Mapped I/O (MMIO): Device registers are assigned addresses in the physical address space. The CPU reads/writes them using ordinary load/store instructions. This is the dominant approach -- PCIe BARs (Base Address Registers) map device memory into the CPU's address space.
  • Port-Mapped I/O (PMIO): A separate I/O address space accessed with special instructions (IN/OUT on x86). Legacy -- used for things like the PS/2 keyboard controller and PIT timer.

Interrupt-Driven vs Polling

  • Interrupt-driven: The device signals the CPU via an interrupt when work completes. Low CPU overhead when events are infrequent, but interrupt delivery and entry into the ISR add latency (on the order of 1-5 µs).
  • Polling: The CPU repeatedly checks a status register. Near-zero latency when the event arrives, but wasted cycles if the event is rare. High-throughput systems like DPDK and SPDK use busy-polling because per-packet interrupt overhead at millions of packets/sec would be devastating.

Many systems use hybrid approaches: interrupt coalescing (batch multiple events into one interrupt) or adaptive switching between polling and interrupts based on load.

Real-World: NVMe SSD Over PCIe

How an NVMe SSD read works end-to-end:

  1. The application calls read(). The OS constructs an NVMe command and places it into a submission queue in host memory.
  2. The CPU writes the submission queue tail pointer to the NVMe controller's doorbell register (MMIO write over PCIe).
  3. The NVMe controller DMA-reads the command from the submission queue in host memory.
  4. The controller reads data from the flash chips internally.
  5. The controller DMA-writes the data directly into the application's buffer in host memory.
  6. The controller posts a completion entry to the completion queue (DMA write) and sends an MSI-X interrupt.
  7. The OS interrupt handler processes the completion and wakes the waiting thread.

The entire data path (steps 3-5) happens over PCIe without CPU involvement, thanks to DMA. A PCIe 4.0 x4 NVMe SSD can sustain ~7 GB/s sequential reads this way.

Why PCIe matters for GPUs: A modern GPU with PCIe 4.0 x16 has 32 GB/s of bandwidth to host memory. Training a neural network requires shipping gigabytes of weight gradients between GPU and host. PCIe 5.0 (64 GB/s x16) and NVLink (900 GB/s) address this bottleneck.

System Bus Architecture

[Diagram: CPU cores and the memory controller (DDR5) attach to the system interconnect, which feeds the PCIe root complex and DMA controller; the PCIe switch fabric fans out to a GPU (x16 lanes), an NVMe SSD (x4 lanes), and a NIC (x8 lanes). Inset: x16 bandwidth is ~16 GB/s for Gen 3, ~32 GB/s for Gen 4, ~64 GB/s for Gen 5.]