What is a Write-Ahead Log?
The Write-Ahead Log (WAL) is a fundamental technique for ensuring crash recovery and durability in database systems. The core rule is simple but powerful: before any change is applied to a data page on disk, a log record describing that change must first be written to the WAL and fsynced to stable storage.
Why "write ahead"?
When a transaction modifies a row, the database first writes the change to an in-memory buffer (the dirty page in the buffer pool). If the system crashes before this dirty page is flushed to disk, the change is lost. The WAL prevents this: because the log record was persisted before the data page was modified, the database can replay the log after a crash to reconstruct all committed changes.
Log record structure
Each WAL record typically contains:
- LSN (Log Sequence Number): a monotonically increasing identifier for this log record. LSNs establish a total ordering of all changes.
- Transaction ID: which transaction produced this change.
- Previous LSN: the LSN of the previous log record for this transaction (forms a per-transaction chain for rollback).
- Page ID: which data page was modified.
- Before-image: the old value of the modified data (for UNDO during rollback).
- After-image: the new value (for REDO during recovery).
Sequential I/O = fast
The WAL is an append-only file. Writing to it requires only sequential I/O — the disk head does not need to seek. This is critical because random I/O (updating data pages in place) is 50–100x slower than sequential I/O on HDDs and still significantly slower on SSDs. By batching changes into the sequential log, databases achieve high write throughput.
Group commit
Calling fsync() after every single transaction would be slow (an fsync can take 1–10ms). Group commit batches the WAL records of multiple concurrent transactions into a single fsync. If 100 transactions commit within a 1ms window, one fsync durably persists all of them. This amortizes the cost of fsync across many transactions, dramatically improving throughput.
Crash recovery with WAL
After a crash, the database performs two phases:
- REDO: scan the WAL forward from the last checkpoint. For every log record whose LSN is greater than the page's on-disk LSN, reapply the change. This ensures all committed changes are on disk.
- UNDO: for any transaction that was active (not committed) at crash time, walk its log chain backward using the Previous LSN pointers and reverse each change using the before-images.
This is the foundation of the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) recovery algorithm used in most modern databases.
Real-Life: PostgreSQL WAL
PostgreSQL stores its WAL in the pg_wal/ directory as a sequence of 16 MB segment files (e.g., 000000010000000000000001).
What happens during a transaction:
BEGIN;— a transaction ID is assigned.UPDATE accounts SET balance = 500 WHERE id = 1;— the old and new values are written as a WAL record. The data page in the buffer pool is modified in memory (marked dirty).COMMIT;— a commit record is written to the WAL, andfsync()is called to ensure the WAL is on stable storage. The transaction is now durable. The dirty data page may or may not have been flushed yet — it does not matter because the WAL has the information needed to reconstruct it.
Group commit in PostgreSQL:
- The GUC
commit_delay(default 0) can be set to add a brief delay before fsync, allowing more transactions to join the batch. - The GUC
wal_writer_delay(default 200ms) controls how often the WAL writer flushes. - Under high concurrency, PostgreSQL naturally groups commits: multiple backends write their WAL records, and one of them calls fsync, durably persisting all of them.
Why this matters: without the WAL, PostgreSQL would need to fsync every modified data page at commit time. A single UPDATE might touch pages scattered across the disk — random I/O. The WAL converts this into one sequential write.