Lossless Compression for Networking

Prerequisites(1)

Builds on to(2)

Lossless compression reduces data size without losing any information — the original data can be perfectly reconstructed. This is essential for text, code, API responses, and any data where every byte matters. Two fundamental techniques underpin most modern compression algorithms.

Huffman Coding

Huffman coding assigns variable-length binary codes to symbols based on their frequency. Frequent symbols get shorter codes, rare symbols get longer codes. A text where 'e' appears 1000 times and 'z' appears twice might encode 'e' as 10 (2 bits) and 'z' as 110100111 (9 bits). On average, this produces a shorter encoding than using a fixed 8 bits per character. Huffman coding is optimal among prefix-free codes (no code is a prefix of another, ensuring unambiguous decoding).

LZ77 (Lempel-Ziv 1977)

LZ77 exploits repeated patterns in data. It maintains a sliding window over recently seen data. When the encoder encounters a sequence it has seen before, instead of emitting the literal bytes, it emits a back-reference: (distance, length) — meaning "go back distance bytes and copy length bytes." For example, in "ABCABCABC", the second "ABC" can be encoded as "go back 3, copy 3" and the third as "go back 3, copy 3" again. Text, HTML, JSON, and source code contain enormous amounts of repetition, making LZ77 highly effective.

DEFLATE (gzip)

gzip uses the DEFLATE algorithm, which combines LZ77 and Huffman coding: first find repeated sequences (LZ77), then Huffman-encode the resulting stream of literals and back-references. gzip wraps DEFLATE with a header containing a checksum (CRC-32) and original file size. It has been the standard HTTP compression since the 1990s — virtually every web server and browser supports it. Compression level 1-9 trades speed for ratio (level 6 is the typical default).

Zstandard (zstd)

zstd was developed at Facebook (now Meta) and released in 2016. It uses finite state entropy (FSE) coding (a faster alternative to Huffman) combined with an LZ77-like scheme. Its key innovation is dictionary-based compression: you can train a dictionary on representative data and share it between compressor and decompressor. This dramatically improves compression of small payloads (like individual JSON API responses) where there is not enough data for LZ77 to find patterns within a single message. zstd typically achieves better compression ratios than gzip at 3-5x faster compression speed, and decompression is significantly faster.

Brotli

Brotli was developed by Google and standardized in RFC 7932 (2016). It uses a combination of LZ77, Huffman coding, and a pre-built static dictionary of 13,504 common words and phrases found in web content (HTML tags, CSS properties, JavaScript keywords, common English words). This static dictionary gives Brotli a significant advantage for web content — it can reference "the" or "<div" from the dictionary instead of waiting to encounter them in the data stream. Brotli achieves 20-25% better compression than gzip for web content, at the cost of slower compression speed (especially at higher levels). Decompression speed is comparable to gzip.

HTTP Content-Encoding

The client advertises supported compression via the Accept-Encoding header:

Accept-Encoding: gzip, deflate, br, zstd

The server compresses the response and indicates the algorithm used:

Content-Encoding: br

The browser decompresses transparently. br = Brotli, gzip = gzip, zstd = Zstandard.

When NOT to Compress

Compression is counterproductive for:

Already compressed formats: JPEG, PNG, MP4, MP3, ZIP, gzip — these are already entropy-coded and will not compress further. Attempting to compress them wastes CPU and may even increase size slightly.
Encrypted data: TLS/SSL ciphertext looks like random bytes with maximum entropy. Compression cannot find patterns in random data.
Very small payloads: the compression header overhead (e.g., gzip adds ~20 bytes) can exceed the savings on tiny payloads (<100 bytes).
Latency-sensitive paths: compression adds CPU latency. For sub-millisecond response time requirements, the compression overhead may not be worth the bandwidth savings.

Compression Algorithms Compared

Real-World Example

Compressing a 100 KB HTML file (typical web page):

Algorithm	Compressed Size	Ratio	Compress Time	Decompress Time
gzip -6	~25 KB	4.0x	3 ms	0.5 ms
zstd -3	~23 KB	4.3x	0.8 ms	0.3 ms
brotli -4	~21 KB	4.8x	4 ms	0.5 ms
brotli -11	~18 KB	5.6x	500 ms	0.5 ms

Key observations:

zstd -3 compresses almost as well as gzip while being 3-4x faster. It is the best choice for real-time compression (log shipping, message queues).
Brotli -4 beats gzip in both ratio and decompression speed, making it ideal for static web assets that are compressed once and served many times.
Brotli -11 achieves the best ratio but takes 100x longer to compress. Only suitable for pre-compressed static assets (CSS, JS bundles) where you compress at build time.

Real-world usage:

CDNs (Cloudflare, Akamai): serve Brotli to supported browsers, fall back to gzip for older clients.
Kafka: uses zstd for message compression — the dictionary feature lets it compress small messages efficiently across a batch.
Facebook: developed zstd specifically for compressing small key-value store entries (< 1 KB each) where gzip performed poorly.
nginx: supports gzip natively, Brotli via module. Can compress responses on-the-fly or serve pre-compressed static files (.gz, .br).

**Lossless compression** reduces data size without losing any information — the original data can be perfectly reconstructed. This is essential for text, code, API responses, and any data where every byte matters. Two fundamental techniques underpin most modern compression algorithms.

### Huffman Coding

**Huffman coding** assigns variable-length binary codes to symbols based on their frequency. Frequent symbols get shorter codes, rare symbols get longer codes. A text where 'e' appears 1000 times and 'z' appears twice might encode 'e' as `10` (2 bits) and 'z' as `110100111` (9 bits). On average, this produces a shorter encoding than using a fixed 8 bits per character. Huffman coding is **optimal** among prefix-free codes (no code is a prefix of another, ensuring unambiguous decoding).

### LZ77 (Lempel-Ziv 1977)

**LZ77** exploits **repeated patterns** in data. It maintains a sliding window over recently seen data. When the encoder encounters a sequence it has seen before, instead of emitting the literal bytes, it emits a **back-reference**: (distance, length) — meaning "go back *distance* bytes and copy *length* bytes." For example, in "ABCABCABC", the second "ABC" can be encoded as "go back 3, copy 3" and the third as "go back 3, copy 3" again. Text, HTML, JSON, and source code contain enormous amounts of repetition, making LZ77 highly effective.

### DEFLATE (gzip)

**gzip** uses the **DEFLATE** algorithm, which combines LZ77 and Huffman coding: first find repeated sequences (LZ77), then Huffman-encode the resulting stream of literals and back-references. gzip wraps DEFLATE with a header containing a checksum (CRC-32) and original file size. It has been the standard HTTP compression since the 1990s — virtually every web server and browser supports it. Compression level 1-9 trades speed for ratio (level 6 is the typical default).

### Zstandard (zstd)

**zstd** was developed at Facebook (now Meta) and released in 2016. It uses **finite state entropy (FSE)** coding (a faster alternative to Huffman) combined with an LZ77-like scheme. Its key innovation is **dictionary-based compression**: you can train a dictionary on representative data and share it between compressor and decompressor. This dramatically improves compression of small payloads (like individual JSON API responses) where there is not enough data for LZ77 to find patterns within a single message. zstd typically achieves **better compression ratios than gzip at 3-5x faster** compression speed, and decompression is significantly faster.

### Brotli

**Brotli** was developed by Google and standardized in RFC 7932 (2016). It uses a combination of LZ77, Huffman coding, and a **pre-built static dictionary** of 13,504 common words and phrases found in web content (HTML tags, CSS properties, JavaScript keywords, common English words). This static dictionary gives Brotli a significant advantage for web content — it can reference "the" or " div" from the dictionary instead of waiting to encounter them in the data stream. Brotli achieves **20-25% better compression than gzip** for web content, at the cost of slower compression speed (especially at higher levels). Decompression speed is comparable to gzip.

### HTTP Content-Encoding

The client advertises supported compression via the `Accept-Encoding` header:
```
Accept-Encoding: gzip, deflate, br, zstd
```

The server compresses the response and indicates the algorithm used:
```
Content-Encoding: br
```

The browser decompresses transparently. `br` = Brotli, `gzip` = gzip, `zstd` = Zstandard.

### When NOT to Compress

Compression is counterproductive for:
- **Already compressed formats**: JPEG, PNG, MP4, MP3, ZIP, gzip — these are already entropy-coded and will not compress further. Attempting to compress them wastes CPU and may even increase size slightly.
- **Encrypted data**: TLS/SSL ciphertext looks like random bytes with maximum entropy. Compression cannot find patterns in random data.
- **Very small payloads**: the compression header overhead (e.g., gzip adds ~20 bytes) can exceed the savings on tiny payloads ( 100 bytes).
- **Latency-sensitive paths**: compression adds CPU latency. For sub-millisecond response time requirements, the compression overhead may not be worth the bandwidth savings.

**Compressing a 100 KB HTML file (typical web page):**

| Algorithm | Compressed Size | Ratio | Compress Time | Decompress Time |
|-----------|----------------|-------|---------------|-----------------|
| gzip -6   | ~25 KB         | 4.0x  | 3 ms          | 0.5 ms          |
| zstd -3   | ~23 KB         | 4.3x  | 0.8 ms        | 0.3 ms          |
| brotli -4 | ~21 KB         | 4.8x  | 4 ms          | 0.5 ms          |
| brotli -11| ~18 KB         | 5.6x  | 500 ms        | 0.5 ms          |

Key observations:
- **zstd -3** compresses almost as well as gzip while being 3-4x faster. It is the best choice for real-time compression (log shipping, message queues).
- **Brotli -4** beats gzip in both ratio and decompression speed, making it ideal for static web assets that are compressed once and served many times.
- **Brotli -11** achieves the best ratio but takes 100x longer to compress. Only suitable for pre-compressed static assets (CSS, JS bundles) where you compress at build time.

**Real-world usage:**
- **CDNs (Cloudflare, Akamai)**: serve Brotli to supported browsers, fall back to gzip for older clients.
- **Kafka**: uses zstd for message compression — the dictionary feature lets it compress small messages efficiently across a batch.
- **Facebook**: developed zstd specifically for compressing small key-value store entries (  1 KB each) where gzip performed poorly.
- **nginx**: supports gzip natively, Brotli via module. Can compress responses on-the-fly or serve pre-compressed static files (`.gz`, `.br`).

LZ77 Back-Reference Compression

Step 1 of 2