Back to DAG

Write-Ahead Log (WAL)

databases

What is a Write-Ahead Log?

The Write-Ahead Log (WAL) is a fundamental technique for ensuring crash recovery and durability in database systems. The core rule is simple but powerful: before any change is applied to a data page on disk, a log record describing that change must first be written to the WAL and fsynced to stable storage.

Why "write ahead"?

When a transaction modifies a row, the database first writes the change to an in-memory buffer (the dirty page in the buffer pool). If the system crashes before this dirty page is flushed to disk, the change is lost. The WAL prevents this: because the log record was persisted before the data page was modified, the database can replay the log after a crash to reconstruct all committed changes.

Log record structure

Each WAL record typically contains:

  • LSN (Log Sequence Number): a monotonically increasing identifier for this log record. LSNs establish a total ordering of all changes.
  • Transaction ID: which transaction produced this change.
  • Previous LSN: the LSN of the previous log record for this transaction (forms a per-transaction chain for rollback).
  • Page ID: which data page was modified.
  • Before-image: the old value of the modified data (for UNDO during rollback).
  • After-image: the new value (for REDO during recovery).

Sequential I/O = fast

The WAL is an append-only file. Writing to it requires only sequential I/O — the disk head does not need to seek. This is critical because random I/O (updating data pages in place) is 50–100x slower than sequential I/O on HDDs and still significantly slower on SSDs. By batching changes into the sequential log, databases achieve high write throughput.

Group commit

Calling fsync() after every single transaction would be slow (an fsync can take 1–10ms). Group commit batches the WAL records of multiple concurrent transactions into a single fsync. If 100 transactions commit within a 1ms window, one fsync durably persists all of them. This amortizes the cost of fsync across many transactions, dramatically improving throughput.

Crash recovery with WAL

After a crash, the database performs two phases:

  1. REDO: scan the WAL forward from the last checkpoint. For every log record whose LSN is greater than the page's on-disk LSN, reapply the change. This ensures all committed changes are on disk.
  2. UNDO: for any transaction that was active (not committed) at crash time, walk its log chain backward using the Previous LSN pointers and reverse each change using the before-images.

This is the foundation of the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) recovery algorithm used in most modern databases.

Real-Life: PostgreSQL WAL

Real-World Example

PostgreSQL stores its WAL in the pg_wal/ directory as a sequence of 16 MB segment files (e.g., 000000010000000000000001).

What happens during a transaction:

  1. BEGIN; — a transaction ID is assigned.
  2. UPDATE accounts SET balance = 500 WHERE id = 1; — the old and new values are written as a WAL record. The data page in the buffer pool is modified in memory (marked dirty).
  3. COMMIT; — a commit record is written to the WAL, and fsync() is called to ensure the WAL is on stable storage. The transaction is now durable. The dirty data page may or may not have been flushed yet — it does not matter because the WAL has the information needed to reconstruct it.

Group commit in PostgreSQL:

  • The GUC commit_delay (default 0) can be set to add a brief delay before fsync, allowing more transactions to join the batch.
  • The GUC wal_writer_delay (default 200ms) controls how often the WAL writer flushes.
  • Under high concurrency, PostgreSQL naturally groups commits: multiple backends write their WAL records, and one of them calls fsync, durably persisting all of them.

Why this matters: without the WAL, PostgreSQL would need to fsync every modified data page at commit time. A single UPDATE might touch pages scattered across the disk — random I/O. The WAL converts this into one sequential write.

WAL: Write Path and Crash Recovery

Write Path: WAL record is persisted BEFORE data page Transaction UPDATE x SET v=5 1. write WAL (append-only) LSN:101 old:3 new:5 LSN:102 COMMIT + fsync 2. modify in memory Buffer Pool (RAM) Page 5: v=5 (dirty) 3. flush later Data Pages on Disk Crash Recovery: 1. REDO: replay WAL from last checkpoint. Reapply committed changes whose LSN > page's on-disk LSN. 2. UNDO: for uncommitted transactions, reverse changes using before-images. Result: database is restored to a consistent state, no data loss for committed txns.
Step 1 of 3