Journaling Filesystems: Crash Consistency
A journaling filesystem maintains a special on-disk log called a journal (or write-ahead log) that records filesystem operations before they are committed to the main data structures. If the system crashes mid-operation (power failure, kernel panic), the journal allows the filesystem to recover quickly and consistently by replaying or discarding incomplete operations, rather than scanning the entire disk with fsck.
Why Journaling is Necessary
Consider what happens when the OS creates a new file without journaling:
1. Allocate an inode in the inode table.
2. Write the directory entry linking the filename to the inode.
3. Allocate data blocks and update the block bitmap.
4. Write the data.
If the system crashes between steps 2 and 3, the directory entry points to an inode that references unallocated blocks -- the filesystem is inconsistent. Without a journal, recovery requires running fsck, which scans every inode and block on the entire disk, potentially taking hours on large volumes.
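The steps above can be sketched as a toy in-memory "disk". All names here (`Disk`, `create_file`, the block numbers) are illustrative, not a real filesystem API; the point is only that a crash between steps 2 and 3 leaves a directory entry pointing at an inode with no allocated blocks.

```python
# Sketch: why a crash mid-create leaves the filesystem inconsistent.
class Disk:
    def __init__(self):
        self.inodes = {}           # inode number -> {"blocks": [...]}
        self.directory = {}        # filename -> inode number
        self.block_bitmap = set()  # allocated block numbers

def create_file(disk, name, crash_after_step=None):
    # Step 1: allocate an inode.
    ino = len(disk.inodes) + 1
    disk.inodes[ino] = {"blocks": []}
    if crash_after_step == 1:
        return
    # Step 2: write the directory entry.
    disk.directory[name] = ino
    if crash_after_step == 2:
        return
    # Step 3: allocate data blocks and update the bitmap.
    blocks = [100, 101]
    disk.block_bitmap.update(blocks)
    disk.inodes[ino]["blocks"] = blocks
    # Step 4: write the data (elided in this sketch).

def is_consistent(disk):
    # Every directory entry must point to an inode that references
    # real blocks, all of which are marked allocated in the bitmap.
    return all(
        disk.inodes[ino]["blocks"]
        and set(disk.inodes[ino]["blocks"]) <= disk.block_bitmap
        for ino in disk.directory.values()
    )

d = Disk()
create_file(d, "a.txt")                       # no crash
print(is_consistent(d))                       # True

d2 = Disk()
create_file(d2, "b.txt", crash_after_step=2)  # crash between steps 2 and 3
print(is_consistent(d2))                      # False: dangling directory entry
```

Detecting this inconsistency after the fact is exactly what `fsck` does, and it has no shortcut: it must cross-check every directory entry, inode, and bitmap bit on the disk.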
Journal Modes
Journaling filesystems typically support multiple journal modes that trade off safety against performance:
| Mode | What is Journaled | Safety | Performance |
|---|---|---|---|
| journal (full) | Both metadata AND file data | Highest -- data and metadata are consistent after crash | Slowest -- all data is written twice (journal + final location) |
| ordered (default for ext4) | Only metadata, but data blocks are written to their final location BEFORE the metadata journal entry is committed | High -- metadata is consistent, and data is guaranteed to be on disk before the metadata references it | Good balance |
| writeback | Only metadata, data blocks can be written in any order | Lowest -- after crash, metadata may reference data blocks with stale or garbage content | Fastest |
ext4: The Linux Workhorse
ext4 (fourth extended filesystem) is the default filesystem on most Linux distributions. Key features:
- Extents: instead of tracking individual block pointers (as ext3 did), ext4 uses extents -- contiguous ranges of blocks described by (start block, length). A single extent can describe millions of contiguous blocks, dramatically reducing metadata overhead for large files and improving sequential I/O performance.
- Delayed allocation: ext4 does not allocate disk blocks immediately when a process writes data. Instead, it keeps data in the page cache and defers allocation until writeback time. This allows the allocator to see the total size of the write and allocate contiguous blocks, reducing fragmentation.
- Online defragmentation: ext4 supports `e4defrag` to defragment files while the filesystem is mounted and in use.
- Checksumming: ext4 checksums the journal and metadata (group descriptors, inode table) to detect silent corruption.
- Maximum sizes: ext4 supports volumes up to 1 EB (exabyte) and files up to 16 TB with 4 KB blocks.
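The metadata savings from extents can be made concrete with back-of-the-envelope arithmetic. The 4-byte pointer and 12-byte extent-entry sizes below are the classic ext3 block pointer and the ext4 on-disk extent entry; the helper function is illustrative, not a real API.

```python
# Sketch: extents vs. per-block pointers for a fully contiguous file.
BLOCK_SIZE = 4096    # 4 KB blocks
POINTER_BYTES = 4    # ext3-style block pointer
EXTENT_BYTES = 12    # ext4 on-disk extent entry

def metadata_bytes(file_bytes, contiguous=True):
    blocks = file_bytes // BLOCK_SIZE
    if contiguous:
        # A single (start block, length) extent covers the whole run.
        return EXTENT_BYTES
    # Block-pointer scheme: one pointer per block.
    return blocks * POINTER_BYTES

one_gib = 1 << 30
print(metadata_bytes(one_gib, contiguous=False))  # 1048576 (1 MB of pointers)
print(metadata_bytes(one_gib, contiguous=True))   # 12 (one extent entry)
```

For a contiguous 1 GiB file, one extent replaces a megabyte of block pointers, which is why delayed allocation (keeping writes contiguous so one extent suffices) pairs so well with extent-based layout.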
XFS: High-Performance at Scale
XFS was originally developed by SGI for IRIX and is now widely used on Linux for large-scale storage (the RHEL default since version 7):
- B+ tree directories: XFS uses B+ trees for directory entries instead of linked lists, making directory lookups O(log n) instead of O(n). This matters enormously for directories with millions of files.
- Allocation groups: XFS divides the filesystem into independent allocation groups (AGs), each with its own inode table, free space B+ tree, and metadata. Multiple threads can allocate inodes and blocks from different AGs in parallel without contention, enabling excellent scalability on multi-core systems with many concurrent writers.
- Extent-based allocation: like ext4, XFS uses extents, and has supported them since its inception (unlike ext4, which retrofitted them onto ext3).
- Real-time subvolume: XFS can reserve a section of the disk as a "real-time" subvolume with guaranteed allocation performance for latency-sensitive applications.
- Online grow: XFS can be grown (but not shrunk) while mounted.
Choosing Between ext4 and XFS
- ext4 is simpler, well-tested, and suitable for most general-purpose workloads. Unlike XFS, an ext4 volume can also be shrunk, though only while unmounted.
- XFS excels at handling very large files, high concurrency, and directories with millions of entries. It is preferred for database storage, large NAS systems, and media processing workloads.
Journal Recovery After a Crash
Here is how journaling prevents filesystem corruption:
Scenario: Renaming a file from old.txt to new.txt requires updating two directory entries. The filesystem must:
1. Add "new.txt -> inode 500" to the directory.
2. Remove "old.txt -> inode 500" from the directory.
Without journaling (crash between steps 1 and 2):
- Both "old.txt" and "new.txt" point to inode 500. The filesystem has a duplicate entry.
- `fsck` must scan and fix this, which takes time proportional to the disk size.
With journaling (ordered mode):
1. Write a journal transaction describing both changes: "add new.txt->500, remove old.txt->500."
2. Write a commit record to the journal, marking the transaction as complete.
3. Apply the changes to the actual directory blocks.
4. Mark the journal transaction as completed (free the journal space).
Crash scenarios:
- Crash before step 2: the transaction was never committed. On recovery, the journal is replayed and this incomplete transaction is discarded. The filesystem is unchanged -- rename never happened. Consistent.
- Crash between steps 2 and 3: the transaction is committed but not yet applied. On recovery, the journal is replayed: both directory changes are applied. Rename completes. Consistent.
- Crash after step 3: the transaction is complete. Recovery sees it is already applied. No action needed. Consistent.
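The recovery logic behind all three scenarios fits in a few lines: replay a transaction only if its commit record made it to disk, and make each operation idempotent so replaying an already-applied transaction is harmless. Everything here is a toy in-memory stand-in, not a real journal format.

```python
# Sketch: journal recovery for the rename transaction above.
def recover(journal, directory):
    """Replay committed transactions; discard uncommitted ones."""
    for txn in journal:
        if not txn.get("committed"):
            continue                       # crash before commit: discard
        for op, name, ino in txn["ops"]:
            if op == "add":
                directory[name] = ino      # idempotent: re-adding is a no-op
            elif op == "remove":
                directory.pop(name, None)  # idempotent: safe to replay twice
    return directory

rename = {"ops": [("add", "new.txt", 500), ("remove", "old.txt", 500)]}

# Crash before step 2: commit record never written -> discarded.
print(recover([dict(rename)], {"old.txt": 500}))
# {'old.txt': 500} -- rename never happened; consistent

# Crash between steps 2 and 3: committed but not applied -> replayed.
print(recover([dict(rename, committed=True)], {"old.txt": 500}))
# {'new.txt': 500} -- replay completes the rename; consistent

# Crash after step 3: committed and already applied -> replay is harmless.
print(recover([dict(rename, committed=True)], {"new.txt": 500}))
# {'new.txt': 500} -- unchanged; consistent
```

The key invariant: the directory ends up in one of exactly two states, fully renamed or untouched, never the half-applied state that required `fsck` in the unjournaled case.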
Recovery time: Journal replay takes seconds (proportional to the journal size, typically 128-256 MB), not hours. This is why modern Linux systems boot cleanly even after an unexpected power loss.
Performance impact of journal mode:
- `data=journal`: a 100 MB file write is written twice (200 MB I/O total). Safe but ~50% throughput penalty.
- `data=ordered` (default): the 100 MB is written once to its final location, then a small metadata journal entry (~4 KB) is written. Minimal overhead.
- `data=writeback`: same I/O as ordered, but metadata can be committed before data is on disk. Fastest but riskiest.
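These totals are simple arithmetic on the figures from the text (100 MB write, ~4 KB metadata journal entry); the dictionary below just makes the comparison explicit.

```python
# Sketch: total I/O for a 100 MB file write under each data= mode.
MB = 1024 * 1024
file_write = 100 * MB
metadata_entry = 4 * 1024  # ~4 KB journal entry for the metadata

io_total = {
    "data=journal":   2 * file_write,               # data to journal + final location
    "data=ordered":   file_write + metadata_entry,  # data once + tiny journal entry
    "data=writeback": file_write + metadata_entry,  # same I/O, weaker ordering
}
for mode, nbytes in io_total.items():
    print(f"{mode}: {nbytes / MB:.2f} MB")
```

Note that ordered and writeback modes move the same number of bytes; the difference is purely in the ordering guarantee, which is why writeback's speed advantage shows up mainly under fsync-heavy workloads rather than in raw throughput.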