How Journal File Systems Prevent Data Loss After Crashes
Journal file systems protect against data corruption caused by power loss or crashes by recording each write operation as a transaction in a dedicated log and committing the change only after the log entry is safely on disk; after a crash, the log is replayed to restore consistency.
The key problem a file system must solve is preventing data corruption after power loss or system crashes. Such failures cause damage because file writes are not atomic; they involve both user data and metadata (superblock, inode bitmap, inode, data block bitmap), so an interruption can leave the system inconsistent.
A simplified write sequence involves:
1. Allocate a data block by marking it used in the data block bitmap.
2. Add a pointer to that block in the file's inode.
3. Write the user data into the block.
If any step is interrupted, various inconsistencies arise:
Step 2 completed, step 3 not: the file thinks it owns the block, but the block contains garbage.
Step 2 completed, step 1 not: metadata says the block is free while the file has claimed it, leading to possible double allocation.
Step 1 completed, step 2 not: a block is allocated but unused, wasting space.
Step 3 completed, step 2 not: user data is written but the file does not reference the block.
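The four failure cases above can be sketched as a small simulation. This is a hypothetical model, not real file-system code: the step names (`bitmap`, `inode`, `data`) and the `diagnose` helper are invented, and a "crash" is modeled as an arbitrary subset of the three writes reaching disk.

```python
# The three on-disk writes a file append involves (names are illustrative).
STEPS = ("bitmap", "inode", "data")

def diagnose(persisted):
    """Classify the on-disk state when only `persisted` writes survive a crash."""
    p = set(persisted)
    if p == set(STEPS) or not p:
        return "consistent"  # all-or-nothing is safe; the problem is partial writes
    if "inode" in p and "data" not in p:
        return "file points at a garbage block"
    if "inode" in p and "bitmap" not in p:
        return "block in use but marked free (double allocation risk)"
    if p == {"bitmap"}:
        return "block allocated but unreachable (space leak)"
    if "data" in p and "inode" not in p:
        return "data written but no file references it"
    return "inconsistent"

print(diagnose(("inode",)))          # file points at a garbage block
print(diagnose(("bitmap",)))         # block allocated but unreachable (space leak)
```

Note that only the empty set and the full set are "consistent": that all-or-nothing property is exactly the atomicity a journal provides.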
Journal file systems were created to solve these issues. Before performing the actual write, the system records each step as a transaction in a dedicated log (write‑ahead logging). Only after the log is safely stored does the system proceed to write metadata and user data to disk (checkpoint). If a crash occurs, the log is replayed on the next mount to restore consistency.
Because a log entry may be larger than the disk’s atomic write size (typically 512 bytes), each entry is terminated with a special end‑marker. Only entries with a valid end‑marker are considered complete; incomplete entries are discarded, ensuring the log contains only whole transactions.
Log space is limited and reused cyclically, so logs are often called circular logs. The journal workflow consists of:
Journal write – record the transaction in the log.
Journal commit – write the end‑marker after the log entry is safely stored.
Checkpoint – perform the real write of metadata and user data.
Free – reclaim the log space.
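The four phases above can be sketched as a toy circular journal. Everything here is a simplified model with invented names (`CircularJournal`, `journal_write`, and so on); a real implementation works with disk blocks, not Python dictionaries.

```python
from collections import deque

class CircularJournal:
    """Toy model of the journal-write / commit / checkpoint / free cycle."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.log = deque()   # pending transactions, in log order
        self.disk = {}       # the blocks' final on-disk locations

    def journal_write(self, txid, blocks):
        # Phase 1: record the transaction in the (bounded, circular) log.
        if len(self.log) >= self.capacity:
            raise RuntimeError("journal full: checkpoint first")
        self.log.append({"txid": txid, "blocks": blocks, "committed": False})

    def journal_commit(self, txid):
        # Phase 2: the end-marker is on disk; the transaction is now durable.
        for tx in self.log:
            if tx["txid"] == txid:
                tx["committed"] = True

    def checkpoint(self):
        # Phases 3 and 4: apply committed transactions to their final
        # locations, then free their log space for reuse.
        while self.log and self.log[0]["committed"]:
            tx = self.log.popleft()
            self.disk.update(tx["blocks"])

j = CircularJournal(capacity=2)
j.journal_write(1, {"inode#12": "points to block 7"})
j.journal_commit(1)
j.checkpoint()
print(j.disk)   # the metadata update has reached its final location
```

A crash before `journal_commit` loses only the uncommitted entry; a crash after commit but before `checkpoint` is recovered by replaying the log, which is why reclaiming log space must wait until the checkpoint completes.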
When both metadata and user data are logged (Data Journaling), each write is performed twice, which can halve performance, especially for large files. An alternative, Metadata (or Ordered) Journaling, logs only metadata; user data is written first, then the log, guaranteeing that a valid log implies valid user data. Most file systems, such as Linux EXT3, support both modes.
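On Linux, ext3 and ext4 expose this choice as a mount option. The commands below are illustrative alternatives (device and mount-point paths are examples, not from the source):

```shell
# Pick exactly one mode per mount; these lines are alternatives.
mount -o data=journal   /dev/sdb1 /mnt/data   # data journaling: data and metadata both logged
mount -o data=ordered   /dev/sdb1 /mnt/data   # ordered: data written before its metadata commits (the default)
mount -o data=writeback /dev/sdb1 /mnt/data   # metadata-only journaling, no data-ordering guarantee
```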
Reference: Crash Consistency – FSCK and Journaling.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, who regularly publish widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.