Copy-on-Write (fork COW)

Copy-on-Write: Lazy Copying After fork()

Copy-on-Write (COW) is an optimization that keeps fork() fast even when the parent process uses a lot of memory. Without COW, fork() would have to copy the parent's entire address space (potentially gigabytes) for the child, which is enormously expensive. With COW, fork() duplicates only the page tables and defers copying each page until one of the processes actually writes to it.

How COW Works

When fork() is called:

  1. The kernel does not copy the parent's physical pages. Instead, both parent and child page tables are updated to point to the same physical frames.
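  2. All private pages that are now shared are marked read-only in both processes' page tables, even pages that were previously writable. (Memory mapped with MAP_SHARED is the exception: it stays writable and shared, and COW does not apply to it.)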
  3. Both processes can read from these shared pages with zero overhead -- they are accessing the same physical memory.

When either process attempts to write to a shared page:

  1. The CPU raises a page fault (because the page is marked read-only).
  2. The kernel's page fault handler detects that this is a COW page (not a true permissions violation).
  3. The kernel allocates a new physical frame, copies the contents of the original page into it, and updates the faulting process's page table to point to the new copy with write permissions.
  4. The other process retains its mapping to the original page. If only one process has a reference left, the original can be made writable again without copying.
  5. The write instruction is re-executed, this time succeeding on the private copy.
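
The two sequences above can be sketched as a toy simulation. This is not how any real kernel is written; the names `Frame`, `PTE`, and `Process` are hypothetical, and Python objects stand in for physical frames and page-table entries. It only illustrates the bookkeeping: sharing at fork time, copying on the first write, and reusing a frame once its reference count is back to 1.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Stand-in for a physical frame (hypothetical model, not a kernel type)."""
    data: bytearray
    refcount: int = 1


@dataclass
class PTE:
    """Stand-in for a page-table entry: which frame, and is it writable?"""
    frame: Frame
    writable: bool


class Process:
    def __init__(self, pages):
        # Each virtual page starts with its own private, writable frame.
        self.table = {vp: PTE(Frame(bytearray(d)), True) for vp, d in pages.items()}
        self.cow_copies = 0  # number of pages this process had to copy

    def fork(self):
        child = Process({})
        for vp, pte in self.table.items():
            pte.writable = False                      # parent loses write access
            pte.frame.refcount += 1                   # frame now has one more user
            child.table[vp] = PTE(pte.frame, False)   # child maps the same frame
        return child

    def read(self, vp):
        return bytes(self.table[vp].frame.data)       # reads never fault

    def write(self, vp, data):
        pte = self.table[vp]
        if not pte.writable:
            self._cow_fault(pte)                      # simulated page fault
        pte.frame.data[:] = data                      # "re-execute" the write

    def _cow_fault(self, pte):
        if pte.frame.refcount > 1:
            # Still shared: copy into a new frame and drop our reference.
            pte.frame.refcount -= 1
            pte.frame = Frame(bytearray(pte.frame.data))
            self.cow_copies += 1
        # Sole owner (or fresh copy): writing is now safe.
        pte.writable = True


parent = Process({0: b"aaaa", 1: b"bbbb"})
child = parent.fork()
parent.write(1, b"XXXX")   # copies page 1 for the parent
child.write(1, b"YYYY")    # child is now the sole owner: no copy needed
```

After these calls, page 0 is still backed by a single shared frame, and only one page was ever copied.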

Why COW Matters

Scenario | Without COW | With COW
fork() a process using 1 GB RAM | Copy 1 GB (~250 ms) | Copy page tables only (a few MB of entries, typically milliseconds)
fork() + exec() | Copies 1 GB, then immediately discards it | Copies essentially no pages (exec replaces them all)
fork() + child reads only | Copies 1 GB unnecessarily | Shares all pages, copies nothing
fork() + child writes 10 pages | Copies entire 1 GB | Copies only 10 pages (~40 KB)
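
The arithmetic behind these rows can be checked with a short calculation. The sketch assumes 4 KiB pages and 8-byte page-table entries (the x86-64 values) and counts only leaf entries, ignoring the upper page-table levels; the function names are made up for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
PTE_SIZE = 8       # assumption: 8-byte page-table entries, as on x86-64


def leaf_page_table_bytes(mapped_bytes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Bytes of leaf page-table entries needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // page_size)   # ceiling division
    return pages * pte_size


def cow_copy_bytes(pages_written, page_size=PAGE_SIZE):
    """Bytes actually copied when `pages_written` distinct pages are modified."""
    return pages_written * page_size


one_gib = 1 << 30
table_bytes = leaf_page_table_bytes(one_gib)   # 2 MiB of entries for 1 GiB of memory
copied = cow_copy_bytes(10)                    # 40 KiB for 10 written pages
```

Under these assumptions, forking a 1 GiB process means duplicating about 2 MiB of page-table entries instead of 1 GiB of data, a ratio of 512 to 1.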

The fork+exec pattern benefits enormously from COW. When a shell forks and the child immediately calls exec(), the child's address space is replaced entirely by the new program. With COW, fork() copies essentially no pages, and exec() simply drops the child's references to the shared ones. Without COW, fork() would copy gigabytes of memory only to throw it all away.
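
A minimal sketch of the fork+exec pattern, assuming a POSIX system. The helper name `spawn` is hypothetical; it uses only `os.fork`, `os.execvp`, and `os.waitpid` from the Python standard library.

```python
import os
import sys


def spawn(argv):
    """Fork, exec `argv` in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the new program.
        # Nothing the parent had in memory is copied for this.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never fall back into parent code.
            os._exit(127)
    # Parent: wait for the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1  # child was terminated by a signal


exit_code = spawn([sys.executable, "-c", "raise SystemExit(3)"])
```

The child never touches the parent's data between fork() and exec(), so the only pages that might be copied are the few it writes while making the exec call itself.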

Reference Counting

The kernel maintains a reference count for each physical frame. When fork() maps a page into the child, the count increases. When a COW fault creates a private copy, the count for the original decreases. Once the count is back to 1, the remaining process's mapping is still read-only, so its next write still faults; the fault handler sees that the frame has a single owner and simply restores write permission instead of copying. When the count drops to 0, the frame is freed.

COW Beyond fork()

Copy-on-write is a general-purpose optimization used in many contexts:

  • Redis background saving: Redis forks to create a snapshot. The child writes the dataset to disk while the parent continues serving requests. Thanks to COW, the child shares the parent's memory and only pages that the parent modifies during the save need to be copied.
  • Filesystem snapshots: Btrfs and ZFS create snapshots using COW -- a snapshot initially shares all blocks with the live filesystem. Only blocks that are subsequently modified are copied.
  • Virtual machine live migration: pre-copy migration relies on a closely related idea, dirty-page tracking. Memory is sent to the destination while the VM keeps running, and only pages modified in the meantime are re-transferred.

Real-Life: Redis Background Persistence

Redis stores its entire dataset in memory, but needs to periodically save it to disk for durability. The BGSAVE command demonstrates COW perfectly:

Step 1: Redis calls fork(). The Redis server (parent, say 10 GB RSS) forks. Thanks to COW, no data pages are copied; fork() only duplicates the page tables, which typically takes milliseconds per gigabyte rather than the seconds a full copy would need. The child shares all 10 GB of physical pages with the parent.

Step 2: Child writes the snapshot to disk. The child iterates over the shared dataset and serializes it to an RDB file on disk. It only reads the shared pages, so no COW faults occur from the child's side.

Step 3: Parent continues serving writes. While the child is saving, the parent handles new client commands. When a client modifies a key, the parent writes to a shared page, triggering a COW fault. The kernel copies that one page (~4 KB) so the parent gets a private writable copy. The child still sees the original, consistent snapshot.
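
The snapshot behavior in these steps can be observed with a small experiment, sketched here for a POSIX system. It is not Redis code: `snapshot_demo` and the `data` dictionary are hypothetical, and two pipes are used only to make sure the child looks at its memory after the parent has already modified its own.

```python
import os


def snapshot_demo():
    """Fork, mutate data in the parent, and report what each side sees."""
    data = {"counter": 1}
    go_r, go_w = os.pipe()     # parent -> child: "I have modified my copy"
    res_r, res_w = os.pipe()   # child -> parent: the value the child observes

    pid = os.fork()
    if pid == 0:
        # Child: wait for the parent's write, then read our view of the data.
        try:
            os.close(go_w)
            os.close(res_r)
            os.read(go_r, 1)
            os.write(res_w, str(data["counter"]).encode())
        finally:
            os._exit(0)

    # Parent: modify the data after the fork, then let the child proceed.
    os.close(go_r)
    os.close(res_w)
    data["counter"] = 2
    os.write(go_w, b"x")
    child_view = int(os.read(res_r, 16))
    os.waitpid(pid, 0)
    os.close(go_w)
    os.close(res_r)
    return child_view, data["counter"]


child_view, parent_view = snapshot_demo()
```

The child keeps seeing the value from the moment of the fork while the parent sees its own update, which is the property BGSAVE depends on.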

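Memory overhead: The extra memory is proportional to the number of pages the parent writes to during the save, not to the dataset size. If writes touch 5% of the pages, only those pages are copied (~500 MB for a 10 GB dataset). Because keys are scattered across pages, modifying 5% of keys can dirty more than 5% of pages, so treat this as a lower bound. Without COW, Redis would need 20 GB of memory to fork safely.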
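
As a rough worked version of this estimate, the sketch below computes the peak memory during a save from the dataset size and the fraction of pages written. It assumes 4 KiB pages and decimal gigabytes, and the function name is made up for illustration.

```python
import math

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def peak_memory_bytes(dataset_bytes, dirty_page_fraction, page_size=PAGE_SIZE):
    """Dataset size plus one private copy of every page written during the save."""
    total_pages = math.ceil(dataset_bytes / page_size)
    dirty_pages = math.ceil(total_pages * dirty_page_fraction)
    return dataset_bytes + dirty_pages * page_size


GB = 10**9
with_cow = peak_memory_bytes(10 * GB, 0.05)   # about 10.5 GB
without_cow = 2 * 10 * GB                     # a full second copy: 20 GB
```

The worst case for COW is a save during which every page is written, where the peak approaches the 20 GB figure.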

Warning: Linux's /proc/sys/vm/overcommit_memory setting matters here. In the default heuristic mode (0), Linux may refuse the fork if it judges that there is not enough memory to back both processes, even though COW means most of it will never be copied. Setting it to 1 always allows the fork, relying on COW to keep actual memory usage low. This is why the Redis documentation recommends setting vm.overcommit_memory to 1.
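
A small sketch for checking the current setting from a script. The helper names are hypothetical; the file path and the meaning of modes 0, 1, and 2 are those documented for Linux. On other systems, or if the file cannot be read, the helper returns None.

```python
from pathlib import Path

# Meanings of the Linux vm.overcommit_memory modes.
OVERCOMMIT_MODES = {
    0: "heuristic (default): obvious overcommits are refused",
    1: "always overcommit: allocations and forks are never refused on this basis",
    2: "strict accounting: commit limit enforced, no overcommit",
}


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the overcommit mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_overcommit(mode):
    return OVERCOMMIT_MODES.get(mode, "unknown (not Linux, or file unreadable)")


print(describe_overcommit(read_overcommit_mode()))
```

Changing the value requires root, for example through sysctl, and is outside the scope of this sketch.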

Another example: Docker containers. When you run multiple containers from the same image, the filesystem layers use COW (via overlay2 or similar). All containers share the same read-only image layers. With overlay2 the copy happens per file rather than per block: the first time a container modifies a file, the whole file is copied up into that container's writable layer.

Copy-on-Write: Fork and Write Fault

1. After fork(): pages shared (read-only)

  Parent PT          Child PT           Frame  rc
  VP0 -> PF0 (RO)    VP0 -> PF0 (RO)    PF0    2
  VP1 -> PF1 (RO)    VP1 -> PF1 (RO)    PF1    2
  VP2 -> PF2 (RO)    VP2 -> PF2 (RO)    PF2    2

2. Parent writes VP1: page fault, handled as a COW fault

  Parent PT          Child PT           Frame  rc
  VP0 -> PF0 (RO)    VP0 -> PF0 (RO)    PF0    2
  VP1 -> PF3 (RW)    VP1 -> PF1 (RO)    PF1    1
  VP2 -> PF2 (RO)    VP2 -> PF2 (RO)    PF2    2
                                        PF3    new copy

COW Benefits

  • fork() cost: copy page tables only, not page contents
  • fork+exec: zero pages copied (exec replaces all mappings anyway)
  • fork+partial write: only modified pages are copied (minimal overhead)