Reliable Byte Stream Delivery

Once a TCP connection is established, it provides a reliable, ordered byte stream abstraction to applications. The application writes bytes into TCP, and TCP guarantees that every byte arrives at the receiver, in order, exactly once — regardless of packet loss, reordering, or duplication in the underlying IP network. This section covers the mechanisms that make this guarantee possible.

Sequence Numbers and Acknowledgments

TCP assigns a sequence number to every byte in the stream. The sequence number in a segment's header indicates the position of the first byte of that segment's payload. For example, if the ISN is 1000 and the application writes 500 bytes, the first segment carries Seq=1001 (the byte after the SYN), and the next segment starts at Seq=1501.

The receiver sends back ACK segments with an acknowledgment number indicating the next byte it expects. ACKs are cumulative: ACK=2001 means "I have received all bytes up to 2000; send me byte 2001 next." This is efficient because a single ACK can confirm multiple segments. The drawback is that a plain cumulative ACK cannot report out-of-order data that has arrived beyond a gap; the SACK (Selective Acknowledgment) extension addresses this.
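The cumulative ACK rule can be sketched as a toy receiver. This is an illustration only: the `Receiver` class and its method names are hypothetical, the sequence numbers follow the ISN=1000 example above, and real TCP details such as sequence-number wraparound and partial overlaps are ignored.

```python
class Receiver:
    """Toy receiver that tracks the next expected byte and holds out-of-order segments."""

    def __init__(self, initial_seq):
        self.next_expected = initial_seq  # first byte number we are still waiting for
        self.out_of_order = {}            # seq -> length of segments held for later

    def on_segment(self, seq, length):
        """Process a segment covering bytes [seq, seq + length) and return the ACK number."""
        if seq == self.next_expected:
            self.next_expected += length
            # Drain any buffered segments that have now become contiguous.
            while self.next_expected in self.out_of_order:
                self.next_expected += self.out_of_order.pop(self.next_expected)
        elif seq > self.next_expected:
            # A gap exists: hold the data, the ACK number does not move.
            self.out_of_order[seq] = length
        # Segments below next_expected are duplicates; the ACK is unchanged.
        return self.next_expected


rx = Receiver(initial_seq=1001)
acks = [
    rx.on_segment(1001, 500),  # in order
    rx.on_segment(2001, 500),  # arrives early: bytes 1501-2000 are missing
    rx.on_segment(1501, 500),  # fills the gap
]
print(acks)  # [1501, 1501, 2501]
```

The second ACK repeats 1501 because the receiver can only name the next in-order byte; once the gap is filled, a single ACK jumps to 2501 and confirms both segments.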

Retransmission

TCP uses two mechanisms to detect loss and retransmit:

Timeout-based retransmission (RTO): TCP starts a retransmission timer when it sends a segment. If no ACK arrives before the timer expires, the segment is assumed lost and retransmitted. The Retransmission Timeout (RTO) is computed dynamically from measured round-trip times (RTT): RTO = SRTT + 4 * RTTVAR, where SRTT is the smoothed RTT and RTTVAR is a smoothed estimate of how much the RTT samples deviate from SRTT. Each time the timer expires, the RTO is doubled (exponential backoff) to avoid flooding a congested network.

Triple duplicate ACK (fast retransmit): If the receiver gets an out-of-order segment, it immediately re-sends the ACK for the next byte it still expects (a "duplicate ACK"). When the sender receives three duplicate ACKs carrying the same acknowledgment number, it infers that the segment starting at that number was lost and retransmits it immediately, without waiting for the RTO timer. This is called fast retransmit and recovers from loss much faster than waiting for a timeout.
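Both loss detectors can be sketched in a few lines. The example below is a simplified illustration, not a full implementation: the class and function names are hypothetical, the smoothing gains (1/8 and 1/4) and the 1-second initial RTO are the values recommended in RFC 6298, and clock granularity and the minimum-RTO floor are deliberately left out.

```python
class RtoEstimator:
    """RTO tracking in the style of RFC 6298 (clock granularity and RTO floor omitted)."""

    ALPHA = 1 / 8  # gain for SRTT
    BETA = 1 / 4   # gain for RTTVAR

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # seconds; used until the first RTT sample arrives

    def on_rtt_sample(self, rtt):
        """Fold one RTT measurement (seconds) into the estimate and return the new RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto

    def on_timeout(self):
        """Exponential backoff: double the RTO after the timer fires."""
        self.rto *= 2
        return self.rto


def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers at which the duplicate-ACK threshold is reached."""
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmits.append(ack)  # resend the segment starting at this byte
        else:
            last_ack = ack
            dup_count = 0
    return retransmits


est = RtoEstimator()
first_rto = est.on_rtt_sample(0.100)   # SRTT = 0.1, RTTVAR = 0.05
second_rto = est.on_rtt_sample(0.100)  # a stable RTT shrinks RTTVAR
backed_off = est.on_timeout()          # timer fired: RTO doubles
triggered = fast_retransmit_points([1501, 2001, 2001, 2001, 2001, 3501])
print(first_rto, second_rto, backed_off, triggered)
```

With a steady 100 ms RTT the RTO tightens from about 0.3 s to about 0.25 s, and the ACK stream shown triggers exactly one fast retransmit, for the segment starting at byte 2001.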

Sliding Window and Flow Control

TCP uses a sliding window to manage how much data can be in flight (sent but not yet acknowledged). The receiver advertises a window size (rwnd) in every ACK, telling the sender how much buffer space it has available. The sender limits its unacknowledged data to this window size.

The window "slides" forward as ACKs arrive: when bytes at the left edge are acknowledged, the window moves right, allowing new bytes to be sent. This prevents a fast sender from overwhelming a slow receiver. If the receiver's buffer fills up, it advertises rwnd=0, and the sender stops transmitting until the receiver opens the window again (the sender periodically sends window probes to detect when space opens up).
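The window arithmetic is small enough to show directly. In this sketch the function name is hypothetical; `snd_una` (oldest unacknowledged byte) and `snd_nxt` (next byte to send) follow the conventional TCP variable names, and only flow control is modeled.

```python
def sendable_bytes(snd_una, snd_nxt, rwnd):
    """Bytes the sender may still transmit under flow control alone.

    snd_una: sequence number of the oldest unacknowledged byte (left window edge)
    snd_nxt: sequence number of the next byte to be sent
    rwnd:    window size most recently advertised by the receiver
    """
    in_flight = snd_nxt - snd_una
    return max(0, rwnd - in_flight)


# 2000 bytes in flight against a 4000-byte window.
before_ack = sendable_bytes(snd_una=1001, snd_nxt=3001, rwnd=4000)
# An ACK for 2001 moves the left edge right, freeing 1000 more bytes.
after_ack = sendable_bytes(snd_una=2001, snd_nxt=3001, rwnd=4000)
# A zero window stops transmission entirely.
zero_window = sendable_bytes(snd_una=2001, snd_nxt=3001, rwnd=0)
print(before_ack, after_ack, zero_window)  # 2000 3000 0
```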

Congestion Control

Flow control prevents overwhelming the receiver; congestion control prevents overwhelming the network. TCP maintains a congestion window (cwnd) that limits the sending rate independently of the receiver's window. The effective window is min(cwnd, rwnd).

Slow start: A new connection starts with cwnd = 1 MSS (Maximum Segment Size). Each ACK increases cwnd by 1 MSS, so cwnd doubles every RTT (exponential growth). Despite the name "slow," this ramps up quickly: after 10 RTTs, cwnd reaches ~1000 MSS.

Congestion avoidance: When cwnd reaches a threshold (ssthresh), growth switches to linear — cwnd increases by 1 MSS per RTT. This cautious growth probes for available bandwidth without causing sudden congestion.

Fast retransmit and fast recovery: On detecting loss via triple duplicate ACK, TCP halves cwnd (sets ssthresh = cwnd/2, cwnd = ssthresh) and enters congestion avoidance — avoiding the slow start penalty. On timeout, TCP resets cwnd to 1 MSS and enters slow start — a more severe reaction because a timeout suggests severe congestion.
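The three phases above can be traced with a per-RTT simulation. This is a deliberately coarse sketch: the function name and the `loss_events` encoding are hypothetical, cwnd is counted in whole MSS units, growth is capped at ssthresh when slow start crosses it, and the temporary window inflation that real fast recovery performs is not modeled.

```python
def simulate_cwnd(rounds, ssthresh, loss_events):
    """Return cwnd (in MSS) at the start of each RTT round.

    loss_events maps a round index to "dupack" (triple duplicate ACK)
    or "timeout" (RTO expiry) for the loss detected in that round.
    """
    cwnd = 1
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        event = loss_events.get(rnd)
        if event == "dupack":
            # Fast recovery: halve the window and continue in congestion avoidance.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif event == "timeout":
            # Timeout: remember half the window, then restart from slow start.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: double per RTT, without overshooting ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: add one MSS per RTT.
            cwnd += 1
    return history


history = simulate_cwnd(rounds=12, ssthresh=16, loss_events={6: "dupack", 9: "timeout"})
print(history)  # [1, 2, 4, 8, 16, 17, 18, 9, 10, 11, 1, 2]
```

The trace shows the difference in severity: the duplicate-ACK loss at round 6 drops cwnd from 18 to 9 and growth stays linear, while the timeout at round 9 sends cwnd back to 1.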

Nagle's Algorithm and TCP_NODELAY

Nagle's algorithm reduces the number of small segments on the network: if there is unacknowledged data in flight, TCP buffers new small writes and sends them as one segment when the ACK arrives. This is efficient for bulk transfers but adds latency for interactive applications (e.g., SSH keystrokes, game inputs). Setting the TCP_NODELAY socket option disables Nagle's algorithm, causing each write to be sent immediately as its own segment — trading bandwidth efficiency for lower latency.
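As a minimal illustration, the option can be set through Python's standard `socket` module. The socket here is never connected; the snippet only shows the option being set and read back.

```python
import socket

# Create a TCP socket; no connection is needed to set the option.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm so small writes are not held back waiting for an ACK.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back; a non-zero value means Nagle's algorithm is disabled.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))

sock.close()
```

An application that sets this option usually takes over the batching itself, for example by assembling a complete message in a buffer and sending it with a single write.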

Real-Life: Downloading a Large File

When you download a 100 MB file over TCP, the following mechanisms work together:

Slow start ramp-up: The connection starts with cwnd=1 MSS (1460 bytes). After 1 RTT, cwnd=2. After 2 RTT, cwnd=4. After 10 RTTs (~500ms on a 50ms link), cwnd=1024 segments = ~1.5 MB in flight. The connection can now use most of a typical broadband link.

Steady-state flow: Once cwnd reaches the bandwidth-delay product (e.g., 100 Mbps * 50ms = 625 KB), the sender keeps the pipe full. ACKs stream back, sliding the window forward, and new data streams out at line rate.

Packet loss scenario: Suppose segment 5000 is lost. The receiver gets segments 5001, 5002, 5003 and sends duplicate ACKs for 5000 after each. When the sender receives the third duplicate ACK (total of four ACKs for 5000), it performs fast retransmit: immediately resends segment 5000 without waiting for the RTO timer. It then halves cwnd (fast recovery) and continues.

Flow control in action: If the receiver's application is slow to read from the TCP buffer (e.g., the disk write is slow), the buffer fills up. The receiver advertises rwnd=0. The sender stops sending and starts a persist timer, periodically sending 1-byte window probes. When the application reads data and frees buffer space, the receiver advertises a non-zero window, and the sender resumes.

Nagle's algorithm: During the bulk download, Nagle's algorithm has no visible effect because the sender is always sending full-sized segments. On an interactive SSH connection, by contrast, Nagle would batch keystrokes while earlier ones are still unacknowledged, causing noticeable delay. For that reason SSH implementations typically set TCP_NODELAY on interactive sessions so each keystroke is sent immediately.

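Why TCP throughput can be poor over long distances: With loss-based congestion control, effective throughput is limited by ~(MSS / RTT) * (1 / sqrt(loss)). On a 100ms RTT link with 1% loss, 1 / sqrt(0.01) = 10, so the congestion window oscillates in a "sawtooth" pattern between only about 8 and 16 segments, which corresponds to a throughput of roughly 1 to 1.5 Mbps with a 1460-byte MSS no matter how fast the link is. This is why newer congestion control algorithms such as BBR, and transports such as QUIC that make such algorithms easier to deploy, aim to fill the pipe more efficiently.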
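The arithmetic in this walkthrough can be checked with a short script. The function names are hypothetical, slow start is idealized as a clean doubling per RTT from 1 MSS, and the constant sqrt(3/2) in the throughput estimate is an assumption taken from the commonly cited Mathis model, which the formula above abbreviates by dropping the constant.

```python
import math

MSS = 1460  # bytes, as in the download example


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that fit in the pipe at once."""
    return bandwidth_bps * rtt_s / 8


def slow_start_rtts(target_bytes, mss=MSS):
    """RTTs of idealized slow start (cwnd doubling from 1 MSS) to cover target_bytes."""
    segments_needed = math.ceil(target_bytes / mss)
    cwnd, rtts = 1, 0
    while cwnd < segments_needed:
        cwnd *= 2
        rtts += 1
    return rtts


def loss_limited_throughput_bps(mss, rtt_s, loss, c=math.sqrt(3 / 2)):
    """Approximate steady-state throughput of loss-based TCP in bits per second."""
    return c * (mss * 8 / rtt_s) / math.sqrt(loss)


bdp = bdp_bytes(100e6, 0.050)                 # 100 Mbps link, 50 ms RTT
rounds = slow_start_rtts(bdp)                 # RTTs until cwnd covers the BDP
limit = loss_limited_throughput_bps(MSS, 0.100, 0.01)  # 100 ms RTT, 1% loss

print(f"BDP: {bdp / 1000:.0f} KB")
print(f"Slow start reaches the BDP after {rounds} RTTs")
print(f"Loss-limited throughput: {limit / 1e6:.2f} Mbps")
```

Under these assumptions the 625 KB pipe is covered after 9 RTTs of slow start, while the lossy 100 ms path is held to well under 2 Mbps regardless of link capacity.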
