
Copy-on-Write FS (ZFS, Btrfs)


Copy-on-Write: Never Overwrite, Always Append

A copy-on-write (COW) filesystem never modifies data in place. When a block needs to be updated, the filesystem writes the new version to a new location on disk, then atomically updates the pointer in the parent metadata to reference the new block. The old block remains untouched until it is explicitly freed. This fundamental design choice -- never overwriting existing data -- has profound implications for consistency, snapshots, and data integrity.

How COW Updates Work

Consider updating a single data block in a file:

  1. The filesystem writes the new data block to a free location on disk.
  2. The inode (or equivalent metadata) is updated to point to the new block. But since the inode itself must not be modified in place, a new copy of the inode is written with the updated pointer.
  3. The directory entry pointing to the inode must be updated, so a new copy of the directory block is written.
  4. This cascades up to the filesystem root (the uberblock or superblock), which is atomically updated to point to the new metadata tree.

The key insight is that the old root pointer, old directory block, old inode, and old data block all remain valid on disk until the new root is committed. If the system crashes at any point before the root pointer update, the filesystem simply uses the old root -- the state is consistent, as if the write never happened. If the crash occurs after the root pointer update, the new data is fully committed.
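The update cascade and the atomic commit can be sketched in a few lines. This is a minimal, hypothetical model (immutable tuples standing in for on-disk blocks), not real filesystem code:

```python
# Minimal COW sketch: blocks are immutable; updating a leaf allocates new
# copies of every node on the path to the root, then the "commit" is a
# single root-pointer swap. Structures here are illustrative, not ZFS's.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)          # frozen = blocks are never modified in place
class Node:
    children: Tuple              # pointers to children (or raw data at leaves)

def cow_update(node, path: Tuple[int, ...], new_leaf):
    """Return a NEW root; the old root and its entire tree stay intact."""
    if not path:
        return new_leaf
    i = path[0]
    kids = list(node.children)
    kids[i] = cow_update(kids[i], path[1:], new_leaf)   # new copy at each level
    return Node(tuple(kids))                            # old node untouched

# Tiny tree: root -> inode -> data blocks D1..D4
old_root = Node((Node(("D1", "D2", "D3", "D4")),))
new_root = cow_update(old_root, (0, 2), "D3'")   # rewrite block D3

# The atomic commit is just replacing one pointer:
committed = new_root
# A crash before this line leaves old_root as a complete, consistent tree.
```

Note that unchanged blocks (D1, D2, D4 here) are shared between the old and new trees rather than copied, which is what makes snapshots cheap.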

No Journal Needed

Because COW never overwrites data, the filesystem is always consistent -- there is no window where metadata can reference partially written data. This eliminates the need for a journal (as used by ext4 and XFS). Recovery after a crash is instantaneous: the filesystem reads the last committed root pointer and the tree is valid.

Free Snapshots

A snapshot in a COW filesystem is trivially created by saving a copy of the current root pointer. Since COW never modifies existing blocks, the snapshot's tree of blocks remains valid even as new writes create new blocks. Shared blocks are reference-counted; a block is freed only when all snapshots referencing it are deleted.

Snapshots are effectively free in both time (O(1) -- just save a pointer) and space (zero additional space until writes diverge the active filesystem from the snapshot).
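The snapshot and reference-counting idea can be sketched with the same kind of toy structures (blocks as `(id, children)` tuples; names are illustrative only):

```python
# Sketch: a snapshot is just a saved root pointer, and a block is freeable
# only when no root (live filesystem or snapshot) can still reach it.

def reachable(root, acc=None):
    """Collect every block id reachable from a root pointer."""
    if acc is None:
        acc = set()
    block_id, children = root
    acc.add(block_id)
    for child in children:
        reachable(child, acc)
    return acc

# Live tree: root -> inode -> [d1, d2]
d1, d2 = ("d1", []), ("d2", [])
live_root = ("root", [("inode", [d1, d2])])

snapshot = live_root                    # O(1): just keep the old root pointer

# COW write replaces d2 with d2'; new inode and root, d1 stays shared
live_root = ("root'", [("inode'", [d1, ("d2'", [])])])

in_use = reachable(live_root) | reachable(snapshot)
# d2 is still pinned by the snapshot; deleting the snapshot would free it
```

Real filesystems track this with per-block reference counts or birth-time metadata rather than full tree walks, but the invariant is the same: a block is freed only when nothing references it.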

Built-in Checksumming

COW filesystems typically checksum every block of data and metadata. When a block is read, its checksum is verified against the stored checksum. This detects bit rot (silent data corruption from aging media, cosmic rays, or firmware bugs) that journaling filesystems like ext4 cannot detect because they do not checksum data blocks.
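The key design point is that the checksum is stored in the parent's pointer, not next to the block itself, so a corrupted block cannot vouch for its own integrity. A minimal sketch (hypothetical helper names, SHA-256 standing in for the filesystem's checksum):

```python
# Sketch: parent metadata stores the checksum of each child block, so every
# read can detect silent corruption. Names are illustrative, not a real API.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

disk = {}                                   # block address -> bytes

def write_block(addr: int, data: bytes) -> str:
    disk[addr] = data
    return checksum(data)                   # parent stores this with the pointer

def read_block(addr: int, expected: str) -> bytes:
    data = disk[addr]
    if checksum(data) != expected:
        raise IOError(f"checksum mismatch at block {addr} (bit rot?)")
    return data

cksum = write_block(7, b"hello")
assert read_block(7, cksum) == b"hello"     # clean read verifies fine

disk[7] = b"hellp"                          # simulate a flipped bit on disk
# read_block(7, cksum) now raises IOError instead of returning bad data
```

With redundancy (mirror or RAID-Z), the filesystem can go one step further and rewrite the bad block from a good copy instead of just reporting the error.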

ZFS

ZFS (originally by Sun Microsystems) is a combined filesystem and volume manager:

  • Storage pools (zpools): ZFS manages raw disks in a pool, abstracting away partitions and volumes. You add disks to a zpool, and ZFS manages space allocation automatically across all datasets (filesystems and volumes) in the pool.
  • ARC (Adaptive Replacement Cache): ZFS has its own sophisticated read cache that combines recency (LRU) and frequency (LFU) signals in its eviction policy, and often outperforms a plain LRU cache such as the Linux page cache for database and mixed workloads.
  • Deduplication: ZFS can detect identical blocks across the pool and store only one physical copy. This is memory-intensive (requires a dedup table in RAM) but saves enormous space for workloads with duplicate data (VMs, backups).
  • RAID-Z (1, 2, 3): ZFS implements its own RAID levels that avoid the "write hole" problem of traditional hardware RAID. RAID-Z1 tolerates one disk failure, RAID-Z2 tolerates two, RAID-Z3 tolerates three.
  • Send/receive: ZFS can serialize a snapshot (or the delta between two snapshots) into a byte stream, send it to another machine, and reconstruct the snapshot there. This enables efficient incremental backups and replication.
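The core of incremental send/receive is computing which blocks changed between two snapshots. A toy sketch with snapshots modeled as block maps (this glosses over how ZFS actually finds changed blocks, which uses per-block birth times rather than a full comparison):

```python
# Sketch: with two snapshots as {block_id: contents} maps, the incremental
# stream carries only blocks that are new or changed relative to the base.
def incremental_send(base: dict, target: dict) -> dict:
    """Blocks to transmit: present in target and differing from base."""
    return {blk: data for blk, data in target.items()
            if base.get(blk) != data}

monday  = {"a": "v1", "b": "v1", "c": "v1"}
tuesday = {"a": "v1", "b": "v2", "d": "v1"}   # b changed, d added, c deleted

delta = incremental_send(monday, tuesday)      # only b and d are sent
```

A real stream must also encode deletions (block `c` above) so the receiver can free them; the sketch shows only the data-carrying half.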

Btrfs

Btrfs (B-tree filesystem) is Linux's native COW filesystem:

  • Subvolumes: Btrfs supports multiple independent filesystem trees (subvolumes) within a single filesystem. Each can have its own snapshots and can be mounted at different mount points.
  • Send/receive: Like ZFS, Btrfs supports incremental send/receive for efficient backup.
  • Transparent compression: Btrfs can compress data with zlib, LZO, or zstd on the fly, saving disk space and often improving read performance (less I/O, at the cost of CPU).
  • Online RAID: Btrfs supports RAID 0, 1, 10 natively, and experimental RAID 5/6.

Drawbacks of COW Filesystems

  • Fragmentation: Because new blocks are always written to new locations, a file that is updated repeatedly becomes scattered across the disk. This is especially bad on HDDs; less impactful on SSDs.
  • Write amplification: Updating a single data block requires rewriting the entire metadata path up to the root. For small random writes, this amplifies the total I/O.
  • Complexity: COW filesystems are significantly more complex than traditional filesystems, leading to longer development cycles and harder debugging.
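The write-amplification cost is easy to quantify with a back-of-envelope model. The numbers below are illustrative, and real filesystems batch many updates into one transaction group, so the amortized cost is much lower than this worst case:

```python
# Worst-case COW amplification: one 4 KiB data write also rewrites every
# metadata block on the path to the root (illustrative model).
def write_amplification(tree_depth: int, block_size: int = 4096) -> float:
    logical = block_size                       # what the application wrote
    physical = block_size * (tree_depth + 1)   # data block + one copy per level
    return physical / logical

# A 3-level metadata path (inode, indirect block, root) amplifies a
# single-block write 4x:
# write_amplification(3) -> 4.0
```

Batching is why COW filesystems commit in transaction groups: a hundred dirty blocks under the same inode share one rewritten metadata path instead of paying it a hundred times.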

ZFS Snapshots and Rollback

Real-World Example

Here is how ZFS snapshots work in practice:

Creating a snapshot (instantaneous):

$ zfs snapshot pool/data@before-upgrade
# Takes < 1 second regardless of dataset size
# No data is copied -- just saves the current root pointer

Verifying the snapshot:

$ zfs list -t snapshot
NAME                       USED  REFER
pool/data@before-upgrade     0B  50.2G

The snapshot uses 0 bytes initially because it shares all blocks with the live dataset.

After writing 5 GB of new data:

$ zfs list -t snapshot
NAME                       USED  REFER
pool/data@before-upgrade   5.1G  50.2G

The snapshot now "uses" 5.1 GB because that is how much old data it is preserving -- blocks the live dataset has since replaced with new COW copies and would otherwise have freed. The total disk usage increased by only the ~5 GB of new data, not by 50.2 GB for a full copy.
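The accounting above can be modeled directly: a snapshot is charged for the blocks that only it still references. A toy sketch (block sets and a unit block size, purely illustrative):

```python
# Sketch of snapshot space accounting: USED = space in blocks the snapshot
# pins that the live filesystem has since replaced.
def snapshot_used(snapshot_blocks: set, live_blocks: set,
                  block_size: int) -> int:
    """Space charged to the snapshot alone."""
    return len(snapshot_blocks - live_blocks) * block_size

BS = 1                                  # 1 unit per block for readability
snap = {"b1", "b2", "b3"}               # snapshot taken: shares all blocks
live = set(snap)
assert snapshot_used(snap, live, BS) == 0   # USED starts at 0

live = (live - {"b2"}) | {"b2'"}        # COW rewrite of b2
assert snapshot_used(snap, live, BS) == 1   # snapshot now preserves old b2
```

This is why deleting an old snapshot can free significant space: it releases exactly the old blocks that nothing else references.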

Rolling back after a failed upgrade:

$ zfs rollback pool/data@before-upgrade
# All changes since the snapshot are discarded
# The live dataset reverts to the snapshot state

Incremental backup to a remote server:

$ zfs send -i pool/data@monday pool/data@tuesday | ssh backup-server zfs receive tank/data
# Only the blocks that changed between Monday and Tuesday are sent
# A 50 GB dataset with 500 MB of daily changes sends only 500 MB

Checksumming catches bit rot:

$ zpool scrub pool
# Reads every block, verifies checksums
# If a block is corrupted and redundancy exists (a mirror or RAID-Z vdev),
# ZFS auto-repairs it from a good copy
$ zpool status pool
  scan: scrub repaired 4K in 02:15:30
        1 data error corrected

A traditional ext4 filesystem would silently return the corrupted data without any indication of an error.

Copy-on-Write: Update and Snapshot

(Diagram: updating block D3. Before: Root → Inode → D1, D2, D3, D4, with D3 to be updated. After the COW update: a new Root' and Inode' reference the new copy D3', blocks D1/D2/D4 remain shared, and a snapshot pointer retains the old root. Benefits: atomic crash-consistent updates with no journal needed; free O(1) snapshots by saving the old root pointer; checksumming that detects bit rot, with self-healing RAID. Drawbacks: fragmentation, write amplification.)