Durability

Surviving Crashes After Commit

Durability is the "D" in ACID and provides the guarantee that once a transaction has been committed, its changes will survive any subsequent failure — power outages, kernel panics, disk errors, or application crashes. The data is permanently recorded and will be present when the system restarts.

WAL + fsync: The Core Mechanism

Durability is primarily implemented through the Write-Ahead Log (WAL) combined with the fsync system call. When a transaction commits, the database writes a COMMIT record to the WAL and then calls fsync() (or fdatasync()) to force the operating system to flush the WAL data from its in-memory page cache to the physical storage device. Only after fsync returns successfully does the database acknowledge the commit to the client. This is the critical contract: the commit response means the data is on stable storage.
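
The commit contract above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine: the `WAL` class and its `commit` method are hypothetical names invented for this example.

```python
# Minimal sketch of the WAL-commit contract. The WAL class and commit()
# method are illustrative assumptions, not a real database API.
import os

class WAL:
    def __init__(self, path):
        self.f = open(path, "ab")   # append-only log file

    def commit(self, txn_id, payload):
        # 1. Append the COMMIT record to the log.
        record = f"COMMIT {txn_id} {payload}\n".encode()
        self.f.write(record)
        self.f.flush()              # push Python's buffer into the OS page cache
        os.fsync(self.f.fileno())   # force the OS to write through to stable storage
        # 2. Only now may the database acknowledge the commit to the client.
        return "OK"

wal = WAL("/tmp/demo.wal")
print(wal.commit(1, "x=42"))
```

The ordering is the whole point: the acknowledgment is returned only after `os.fsync` has succeeded, so a crash at any earlier step loses nothing the client was told was committed.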

Why fsync Matters

Without fsync, the OS may buffer writes in memory for seconds or even minutes before flushing them to disk. If the machine loses power during that window, committed data is lost. The fsync call eliminates this window by forcing the OS and the disk controller to write through to the physical medium. However, fsync is slow — it stalls the calling thread until the I/O completes, which is typically 1-10ms for spinning disks and 50-200us for SSDs.
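
A tiny timing sketch makes the cost difference visible (the file path is an arbitrary assumption; absolute numbers vary widely by hardware, but on most machines the fsync will be orders of magnitude slower than the buffered write, in line with the latencies quoted above).

```python
# Timing sketch: a buffered write returns once the data reaches the OS
# page cache; fsync stalls until the device reports the data durable.
import os, time

f = open("/tmp/fsync_demo.dat", "wb")

t0 = time.perf_counter()
f.write(b"x" * 4096)
f.flush()                      # data is now only in the OS page cache
buffered_us = (time.perf_counter() - t0) * 1e6

t0 = time.perf_counter()
os.fsync(f.fileno())           # block until the data is on stable storage
fsync_us = (time.perf_counter() - t0) * 1e6
f.close()

print(f"buffered write: {buffered_us:.0f} us, fsync: {fsync_us:.0f} us")
```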

Group Commit

Because fsync is expensive, databases use group commit to amortize its cost. Instead of fsyncing after every single transaction, the database batches multiple transactions and issues a single fsync for the entire group. A short delay (often 1-10ms) is introduced: the database waits briefly after a commit request, collects other transactions that also committed during that window, and then issues one fsync for all of them. This can improve throughput by 10-100x under high concurrency, while adding only milliseconds of latency.
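
A sketch of the batching logic, using a timer-based window (the `GroupCommitter` class and the 2 ms window are assumptions chosen for illustration):

```python
# Group-commit sketch: commits arriving within a short window share one
# fsync. GroupCommitter and its names are illustrative, not a real API.
import os, threading

class GroupCommitter:
    def __init__(self, path, window=0.002):   # 2 ms batching window (assumed)
        self.f = open(path, "ab")
        self.window = window
        self.lock = threading.Lock()
        self.pending = []                     # (record, Event) pairs awaiting fsync
        self.flusher = None

    def commit(self, record):
        done = threading.Event()
        with self.lock:
            self.pending.append((record, done))
            if self.flusher is None:          # first committer starts the window
                self.flusher = threading.Timer(self.window, self._flush)
                self.flusher.start()
        done.wait()                           # block until the group fsync lands
        return "OK"

    def _flush(self):
        with self.lock:                       # take the whole batch atomically
            batch, self.pending, self.flusher = self.pending, [], None
        for record, _ in batch:
            self.f.write(record + b"\n")
        self.f.flush()
        os.fsync(self.f.fileno())             # one fsync covers the whole batch
        for _, done in batch:
            done.set()                        # now acknowledge every transaction
```

Three concurrent committers that land inside the same window pay for a single fsync between them, which is exactly where the throughput gain comes from.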

Battery-Backed Write Cache (BBWC)

Enterprise storage controllers often include a battery-backed write cache (also called a non-volatile write cache). This hardware component has its own battery that can preserve the cache contents during a power failure. With BBWC, the storage controller can acknowledge a write immediately (the data is safe in the battery-backed cache) without waiting for the physical platters or flash cells. The database can then skip fsync or treat it as a no-op, dramatically improving performance. However, this relies on trusting the hardware — if the battery fails, data can be lost.

Double-Write Buffer (InnoDB)

A subtle durability threat is the torn page problem. Database pages are typically 16KB, but disk sectors are 512 bytes or 4KB. If a crash occurs while a 16KB page is being written, some sectors may contain the new data and others the old data — the page is "torn" and corrupted. MySQL InnoDB solves this with a double-write buffer: before writing a dirty page to its final location, InnoDB first writes it to a dedicated double-write area on disk. If a crash tears the final write, recovery uses the intact copy from the double-write buffer to restore the page.
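
The idea can be sketched with a checksum per page (this is not InnoDB's actual on-disk format; the function names, the CRC framing, and the file paths are all assumptions for illustration):

```python
# Double-write sketch: a page goes to a scratch "doublewrite" file first,
# is fsynced, and only then goes to its final location. After a crash, a
# torn final page fails its checksum and is restored from the intact copy.
import os, zlib

PAGE_SIZE = 16 * 1024

def write_page(final_path, dw_path, page: bytes):
    framed = zlib.crc32(page).to_bytes(4, "big") + page   # checksum + payload
    with open(dw_path, "wb") as dw:          # step 1: doublewrite area
        dw.write(framed)
        os.fsync(dw.fileno())
    with open(final_path, "wb") as f:        # step 2: final location (may tear)
        f.write(framed)
        os.fsync(f.fileno())

def recover_page(final_path, dw_path) -> bytes:
    with open(final_path, "rb") as f:
        framed = f.read()
    crc, page = framed[:4], framed[4:]
    if zlib.crc32(page).to_bytes(4, "big") == crc:
        return page                          # final write was intact
    with open(dw_path, "rb") as dw:          # torn: fall back to the copy
        return dw.read()[4:]
```

Because the doublewrite copy is fsynced before the final write begins, at least one intact copy of the page exists at every instant, whichever write a crash interrupts.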

Replication for Cross-Machine Durability

A single machine's disk can fail entirely. Synchronous replication extends durability across machines: the primary database does not acknowledge a commit until at least one replica has also written and fsynced the WAL record. This protects against total machine failure but adds network latency to every commit. Asynchronous replication is faster but risks losing recently committed transactions if the primary fails before the replica catches up. The choice between synchronous and asynchronous replication is a fundamental trade-off between durability and performance.
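
A toy sketch of the synchronous commit sequence, with in-process queues standing in for the network (the `Primary`/`Replica` classes and all names are illustrative assumptions):

```python
# Synchronous-replication sketch: the primary fsyncs locally, ships the
# WAL record, and returns COMMIT only after the replica's fsync is acked.
import os, queue, threading

class Replica:
    def __init__(self, path, inbox, acks):
        self.f = open(path, "ab")
        threading.Thread(target=self._run, args=(inbox, acks), daemon=True).start()

    def _run(self, inbox, acks):
        while True:
            record = inbox.get()
            self.f.write(record + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # replica makes the record durable
            acks.put(record)            # then acknowledges to the primary

class Primary:
    def __init__(self, path, inbox, acks):
        self.f = open(path, "ab")
        self.inbox, self.acks = inbox, acks

    def commit(self, record: bytes):
        self.f.write(record + b"\n")    # 1. write the WAL record locally
        self.f.flush()
        os.fsync(self.f.fileno())
        self.inbox.put(record)          # 2. ship it to the replica
        self.acks.get(timeout=5)        # 3. wait for the replica's fsync
        return "COMMIT"                 # 4. only THEN acknowledge the client
```

The added round trip in step 3 is the latency cost the section describes; dropping that wait turns this into asynchronous replication, with the corresponding window of possible loss.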

Durability Mechanisms in Production

Real-World Example

PostgreSQL WAL + fsync:

-- PostgreSQL defaults:
-- fsync = on              (force WAL to disk on commit)
-- synchronous_commit = on (wait for fsync before acknowledging)
-- wal_sync_method = fdatasync (Linux default)

-- For maximum throughput (at risk of losing last few ms of commits):
-- synchronous_commit = off  (return immediately, fsync in background)

Group Commit in PostgreSQL:

-- commit_delay = 10        (wait 10 microseconds for group commit)
-- commit_siblings = 5      (only delay if 5+ transactions are active)
-- Result: one fsync covers multiple transactions

MySQL InnoDB Double-Write Buffer:

-- innodb_doublewrite = ON (default)
-- Write path:
--   1. Write dirty page to double-write buffer (sequential, fsynced)
--   2. Write dirty page to final tablespace location
--   3. If crash tears step 2, recovery copies from double-write buffer

-- innodb_flush_log_at_trx_commit = 1 (default, fsync on every commit)
-- innodb_flush_log_at_trx_commit = 2 (write to OS cache, fsync per second)
-- innodb_flush_log_at_trx_commit = 0 (no flush, highest risk)

Synchronous Replication (PostgreSQL):

-- On primary:
-- synchronous_standby_names = 'replica1'
-- Commit sequence:
--   1. Write WAL record locally
--   2. Send WAL record to replica1
--   3. Wait for replica1 to fsync and acknowledge
--   4. Only THEN return "COMMIT" to client
-- If primary disk dies, replica1 has the committed data

Each layer adds durability at the cost of latency. Production systems choose the combination that matches their tolerance for data loss.

Durability: From Commit to Disk

[Diagram: the commit path (client COMMIT, WAL buffer in memory, fsync() to disk, ACK once the WAL is persisted); group commit (transactions A, B, and C share one fsync for 3x throughput); the InnoDB double-write buffer (dirty 16KB page written to the double-write buffer, then to its final location, with the intact copy restoring a torn final write); and synchronous replication (the primary waits for the replica's fsync, extending durability across machines).]