Why Raw Binary Cannot Be Sent Directly
When transmitting bits over a wire or fiber, the sender and receiver must agree on when each bit starts and ends. Simply holding a voltage high for 1 and low for 0 seems obvious, but it fails in practice for two critical reasons:
Problem 1: Clock Recovery
The receiver does not share a clock with the sender. It must extract timing information from the signal itself. If the sender transmits a long run of identical bits — say, fifty consecutive 1s — the signal stays at a constant voltage. The receiver sees a flat line and cannot tell where one bit ends and the next begins. The clocks drift apart, and the receiver misaligns.
Problem 2: DC Bias
If the data contains more 1s than 0s (or vice versa), the average voltage shifts away from zero. This DC bias causes problems on AC-coupled links (transformers, capacitors in the signal path): they block the constant (DC) component, distorting the signal. A good line code keeps the numbers of 1s and 0s on the wire roughly equal, holding the average voltage near zero (DC balance).
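The shift is easy to see numerically. A minimal sketch (the `dc_component` helper is illustrative), mapping 1 to +1 V and 0 to -1 V:

```python
# Sketch: measure the DC component of an NRZ-style waveform where
# 1 -> +1 V and 0 -> -1 V. A biased bitstream shifts the average
# voltage away from zero, which AC-coupled links cannot pass.

def dc_component(bits):
    """Average voltage of a signal mapping 1 -> +1 V, 0 -> -1 V."""
    levels = [1 if b else -1 for b in bits]
    return sum(levels) / len(levels)

balanced = [1, 0] * 50          # equal ones and zeros
biased   = [1] * 75 + [0] * 25  # 75% ones

print(dc_component(balanced))   # 0.0 -- no DC offset
print(dc_component(biased))     # 0.5 -- half-volt DC bias
```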
NRZ (Non-Return-to-Zero)
NRZ is the simplest encoding: high voltage = 1, low voltage = 0. It is efficient (1 bit per symbol) but suffers from both problems above: long runs of the same bit defeat clock recovery and create DC bias. On its own, NRZ is practical mainly where a separate clock line is available (e.g., the SPI bus); high-speed serial links that use NRZ signaling pair it with scrambling or block coding.
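The clock-recovery failure is visible just by counting level changes in the NRZ waveform (the `nrz_transitions` helper is illustrative):

```python
# Sketch: NRZ's clock-recovery failure. Count signal transitions;
# a long run of identical bits produces none, leaving the receiver
# nothing to lock onto.

def nrz_transitions(bits):
    """Number of level changes in an NRZ waveform."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

print(nrz_transitions([1, 0, 1, 0, 1, 0]))  # 5 transitions
print(nrz_transitions([1] * 50))            # 0 -- flat line for 50 bit times
```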
Manchester Encoding
Manchester encoding solves clock recovery by guaranteeing a transition in the middle of every bit period: low-to-high = 1, high-to-low = 0 (IEEE convention). This makes the signal self-clocking — the receiver locks onto the mid-bit transitions. The downside is that the baud rate is double the bit rate — each bit period is split into two signal intervals, halving bandwidth efficiency. Manchester encoding was used by classic 10 Mbps Ethernet.
4b/5b Encoding
Instead of doubling the baud rate, 4b/5b maps every 4 data bits to a 5-bit codeword chosen so that no codeword has more than one leading zero or two trailing zeros — any concatenation therefore contains at most three consecutive zeros. Paired with NRZI signaling, in which every 1 produces a transition, this guarantees enough transitions for clock recovery. The overhead is 25% (5 code bits per 4 data bits). 4b/5b was used in 100 Mbps Fast Ethernet (100BASE-TX).
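The guarantee can be checked mechanically against the standard data-codeword table (the mapping used by FDDI/100BASE-TX):

```python
# The 16 standard 4b/5b data codewords. No codeword has more than one
# leading zero or two trailing zeros, so any concatenation contains at
# most three consecutive zeros -- enough transitions for NRZI clocking.

CODE_4B5B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}

def max_zero_run(s):
    """Longest run of consecutive zeros in a bit string."""
    return max(len(run) for run in s.split("1"))

# Worst case across every pair of adjacent codewords:
worst = max(max_zero_run(a + b)
            for a in CODE_4B5B.values() for b in CODE_4B5B.values())
print(worst)  # 3
```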
8b/10b Encoding
8b/10b maps every 8 data bits to a 10-bit symbol. The codewords are carefully selected to guarantee: (a) DC balance — each codeword has either five 1s and five 0s, or an imbalance of ±2 (six of one, four of the other); a running disparity tracker selects between each imbalanced codeword and its complement so successive imbalances cancel; (b) sufficient transitions — no more than five consecutive identical bits. The overhead is 25% (10 code bits per 8 data bits). 8b/10b is used in PCIe (Gen 1-2), SATA, USB 3.0, and Gigabit Ethernet (1000BASE-X). It also defines special control characters (K-codes) for framing, idle, and link management.
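The running-disparity mechanism can be sketched with a toy table — a made-up 3-bit-to-4-bit code for illustration only, not the real 8b/10b codewords:

```python
# Toy illustration of running disparity (NOT the real 8b/10b tables):
# imbalanced codewords come in complementary pairs, and the encoder
# picks whichever pulls the ones/zeros count back toward balance.

TOY_TABLE = {   # hypothetical 3-bit -> 4-bit codes
    0: "0101",  # neutral (two ones, two zeros)
    1: "1010",  # neutral
    2: "1101",  # imbalanced +2; its complement 0010 is -2
    3: "0100",  # imbalanced -2; its complement 1011 is +2
}

def encode(symbols):
    rd = -1             # running disparity starts negative
    out = []
    for s in symbols:
        cw = TOY_TABLE[s]
        d = cw.count("1") - cw.count("0")
        if d != 0 and (d > 0) == (rd > 0):
            # Same sign as RD: send the complement instead.
            cw = cw.translate(str.maketrans("01", "10"))
            d = -d
        if d != 0:
            rd = -rd    # a +/-2 codeword flips the running disparity
        out.append(cw)
    return "".join(out)

stream = encode([2, 3, 2, 2, 3, 0, 2, 2])
bias = stream.count("1") - stream.count("0")
print(abs(bias) <= 2)   # cumulative imbalance stays within one codeword
```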
64b/66b Encoding
For 10 Gigabit Ethernet and beyond, the 25% overhead of 8b/10b became too costly. 64b/66b maps 64 data bits to a 66-bit block by prepending a 2-bit sync header (01 for data, 10 for control). The overhead drops to roughly 3%. The payload is scrambled (with a linear-feedback shift register) to ensure sufficient transitions; the sync header itself is never scrambled, and its guaranteed transition lets the receiver find block boundaries. 64b/66b is used in 10GBASE-R and in 25G, 40G, and 100G Ethernet.
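The framing itself is trivial — a sketch (helper name is illustrative), treating blocks as bit strings:

```python
# Sketch of 64b/66b framing: a 2-bit sync header (01 = data,
# 10 = control) is prepended to each 64-bit payload.

SYNC_DATA, SYNC_CTRL = "01", "10"

def frame_block(payload_bits, is_control=False):
    assert len(payload_bits) == 64
    return (SYNC_CTRL if is_control else SYNC_DATA) + payload_bits

block = frame_block("1" * 64)
print(len(block))   # 66
print(64 / 66)      # ~0.97 efficiency, i.e. ~3% overhead
```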
Scrambling
Scrambling XORs the data stream with a pseudo-random bit sequence generated by a known polynomial. The receiver applies the same polynomial to recover the original data. Scrambling breaks up long runs and eliminates DC bias without adding overhead bits. It is combined with 64b/66b and used in SONET/SDH and DSL.
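A minimal sketch of a self-synchronizing scrambler using the 64b/66b polynomial x^58 + x^39 + 1 (taps and register width per that polynomial; a bit-list representation is used for clarity, real hardware works on shift registers):

```python
# Self-synchronizing scrambler sketch, polynomial x^58 + x^39 + 1:
# each output bit is the input bit XORed with two taps of a 58-bit
# shift register. The scrambler feeds its OUTPUT back into the
# register; the descrambler feeds the RECEIVED bits in, so it
# recovers alignment by itself after an error.

def scramble(bits, state):
    s = list(state)
    out = []
    for b in bits:
        y = b ^ s[38] ^ s[57]   # taps for x^39 and x^58 (0-indexed)
        out.append(y)
        s = [y] + s[:-1]        # scrambled bit feeds the register
    return out

def descramble(bits, state):
    s = list(state)
    out = []
    for b in bits:
        out.append(b ^ s[38] ^ s[57])
        s = [b] + s[:-1]        # received bit feeds the register
    return out

seed = [1] * 58                 # hardware seeds the register nonzero
flat = [0] * 256                # worst case: a long flat run
line = scramble(flat, seed)

assert descramble(line, seed) == flat   # lossless round trip
assert 0 < sum(line) < 256              # the flat run now has transitions
```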
Real-Life: Why PCIe Gen 3 Switched from 8b/10b to 128b/130b
PCIe Gen 1 and Gen 2 use 8b/10b encoding. Each generation doubled the raw signaling rate: Gen 1 runs at 2.5 GT/s (gigatransfers per second), Gen 2 at 5.0 GT/s. But the effective data rate is only 80% of the raw rate, because only 8 of every 10 transferred bits carry data. Gen 1 delivers 2.0 Gbps per lane; Gen 2 delivers 4.0 Gbps per lane.
When PCIe Gen 3 targeted 8.0 GT/s, keeping 8b/10b would have required a raw rate of 10 GT/s for 8 Gbps effective — a huge analog design challenge. Instead, Gen 3 switched to 128b/130b encoding (a variant of 64b/66b), reducing overhead to ~1.5%. This allowed 8 GT/s raw rate to deliver ~7.88 Gbps effective — nearly doubling Gen 2 without doubling the signaling rate.
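The per-lane arithmetic above, as a quick check (the helper name is illustrative):

```python
# Effective per-lane rate = raw transfer rate x encoding efficiency.

def effective_gbps(raw_gt_s, data_bits, line_bits):
    return raw_gt_s * data_bits / line_bits

print(effective_gbps(2.5, 8, 10))     # Gen 1: 2.0 Gbps
print(effective_gbps(5.0, 8, 10))     # Gen 2: 4.0 Gbps
print(effective_gbps(8.0, 128, 130))  # Gen 3: ~7.88 Gbps

# Keeping 8b/10b at Gen 3 would have required a raw rate of:
print(8.0 * 10 / 8)                   # 10.0 GT/s for 8 Gbps effective
```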
The lesson: Encoding overhead directly impacts real-world throughput. Moving from 25% overhead (8b/10b) to 1.5% overhead (128b/130b) was as impactful as doubling the clock speed.
Quick overhead comparison:
| Encoding | Raw bits per data bit | Overhead | Used by |
|---|---|---|---|
| Manchester | 2 | 100% | 10BASE-T Ethernet |
| 4b/5b | 1.25 | 25% | 100BASE-TX |
| 8b/10b | 1.25 | 25% | PCIe Gen1-2, SATA, USB 3 |
| 64b/66b | 1.03 | ~3% | 10G+ Ethernet |
| 128b/130b | 1.016 | ~1.5% | PCIe Gen 3+ |