How Journal File Systems Prevent Data Corruption After Crashes
Journal file systems use write-ahead logging: each write operation is recorded as a transaction before it is applied, so that after a power loss or crash the system can replay the log and bring metadata and user data back to a consistent state. This avoids both corruption and leaked space. The main variants are data journaling and ordered (metadata) journaling.
A key problem that file systems must solve is how to prevent data damage caused by power loss or system crashes. In such unexpected events, the root cause of file system damage is that a file write is not an atomic operation: it touches not only user data but also metadata (superblock, inode bitmap, inode, data block bitmap, etc.). The write therefore cannot be completed in one step, and if it is interrupted between steps, the on-disk structures become inconsistent or corrupted.
A simplified example: writing to a file involves the following steps:
1. Allocate a data block from the data block bitmap.
2. Add a pointer to that data block in the inode.
3. Write the user data into the data block.
If any of these steps is interrupted, different problems arise:
If step 2 completes but step 3 does not, the file thinks it owns the data block, but the block contains garbage data.
If step 2 completes but step 1 does not, the metadata becomes inconsistent: the file thinks the block is allocated, while the file system still considers it free, possibly leading to double allocation and data overwrite.
If step 1 completes but step 2 does not, the file system has allocated a block that no file uses, wasting space.
If step 3 completes but step 2 does not, user data is written to a block that the file does not recognize, effectively a wasted write.
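The failure cases above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: `ToyDisk`, `write_file`, and `check` are hypothetical names invented for illustration, the `order` parameter models the fact that the three steps may reach the disk in any order, and `crash_after` is the number of steps that complete before power is lost.

```python
from dataclasses import dataclass, field

GARBAGE = b"<stale bytes>"  # stands in for whatever a block held before the write


@dataclass
class ToyDisk:
    """Hypothetical model: a 4-block bitmap, one file's inode, and the blocks."""
    bitmap: list = field(default_factory=lambda: [False] * 4)  # True = allocated
    inode_ptrs: list = field(default_factory=list)             # blocks the file owns
    blocks: list = field(default_factory=lambda: [GARBAGE] * 4)


def write_file(disk, block_no, payload, order=(1, 2, 3), crash_after=None):
    """Apply steps 1-3 in `order`; stop once `crash_after` steps have completed."""
    for done, step in enumerate(order):
        if done == crash_after:
            return  # power lost: the remaining steps never reach the disk
        if step == 1:
            disk.bitmap[block_no] = True        # allocate in the bitmap
        elif step == 2:
            disk.inode_ptrs.append(block_no)    # point the inode at the block
        elif step == 3:
            disk.blocks[block_no] = payload     # write the user data


def check(disk):
    """Report the inconsistencies described in the text."""
    problems = []
    for b in disk.inode_ptrs:
        if not disk.bitmap[b]:
            problems.append(f"block {b} referenced by inode but marked free")
        if disk.blocks[b] == GARBAGE:
            problems.append(f"block {b} referenced by inode but holds stale data")
    for b, used in enumerate(disk.bitmap):
        if used and b not in disk.inode_ptrs:
            problems.append(f"block {b} allocated but owned by no file")
    return problems
```

Calling `write_file(disk, 0, b"hello", crash_after=1)` and then `check(disk)` reports the leaked block, while `order=(2, 1, 3)` with the same crash point reports both the unallocated and the stale-data problem.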
Journal file systems were created to solve the above problems.
Their principle is to record all upcoming steps of a write operation (together called a transaction) in a separate area of the file system before performing the actual write. This area is the journal, and the technique is known as write-ahead logging. Once the journal entry is safely stored, the system proceeds with the real write, updating metadata and user data at their final locations on disk (called a checkpoint). If a power loss occurs during the checkpoint, the system can replay the saved journal entry on the next mount and finish the interrupted write, avoiding the data-corruption scenarios described earlier.
What if power loss happens while the journal itself is being saved? Writing an entire log entry atomically is not feasible in general: a disk typically guarantees atomicity only for a single sector write (traditionally 512 bytes), so a log entry larger than one sector may be only partially on disk when power fails. In practice, each log entry is given an end marker, and the entry is considered valid only after the end marker has been written successfully. Entries without an end marker are treated as invalid and discarded, ensuring that only complete log data is replayed.
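The end-marker rule can be sketched as a scan over raw journal bytes. The byte layout here is an assumption made up for illustration (the `TXB1`/`TXE1` magic values, the header and footer fields, and the function names do not come from any real file system); only the rule itself, that an entry counts only if its end marker is present, comes from the text.

```python
import struct

BEGIN = b"TXB1"                  # hypothetical begin magic
END = b"TXE1"                    # hypothetical end marker
HEADER = struct.Struct("<4sII")  # begin magic, transaction id, payload length
FOOTER = struct.Struct("<4sI")   # end marker, transaction id


def encode_entry(txid, payload):
    """Serialize one journal entry: header, payload, then the end marker."""
    return HEADER.pack(BEGIN, txid, len(payload)) + payload + FOOTER.pack(END, txid)


def scan_journal(raw):
    """Return (txid, payload) for complete entries; stop at the first torn one."""
    valid, pos = [], 0
    while pos + HEADER.size <= len(raw):
        magic, txid, length = HEADER.unpack_from(raw, pos)
        if magic != BEGIN:
            break  # no further entries
        end_pos = pos + HEADER.size + length
        if end_pos + FOOTER.size > len(raw):
            break  # torn write: the end marker never reached the disk
        marker, end_txid = FOOTER.unpack_from(raw, end_pos)
        if marker != END or end_txid != txid:
            break  # end marker missing or belongs to another transaction
        valid.append((txid, raw[pos + HEADER.size:end_pos]))
        pos = end_pos + FOOTER.size
    return valid
```

For example, if a second 600-byte entry is cut off after 300 bytes, `scan_journal` returns only the first entry and the partial one is ignored.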
A log entry becomes unnecessary once its corresponding write operation finishes, and the occupied disk space can be reclaimed. Because the journal space is limited, it is used cyclically, hence the term “circular log.”
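A circular log can be pictured as a fixed array of slots with a head and a tail index. The class below is a minimal sketch; `CircularJournal` and its methods are hypothetical names, and a real journal tracks byte offsets and transaction boundaries rather than whole slots.

```python
class CircularJournal:
    """Fixed-size journal whose slots are reused once entries are freed."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0   # next slot to write
        self.tail = 0   # oldest entry that has not been freed yet
        self.used = 0

    def append(self, entry):
        """Store an entry and return its slot; fail if no slot is free."""
        if self.used == len(self.slots):
            raise RuntimeError("journal full: checkpoint and free old entries first")
        slot = self.head
        self.slots[slot] = entry
        self.head = (self.head + 1) % len(self.slots)  # wrap around
        self.used += 1
        return slot

    def free_oldest(self):
        """Reclaim the oldest entry after its checkpoint has finished."""
        if self.used == 0:
            raise RuntimeError("journal empty")
        entry = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.used -= 1
        return entry
```

With two slots, a third `append` fails until `free_oldest` is called, after which the new entry lands in slot 0 again.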
The workflow of a journal file system can be summarized as:
Journal write: write the transaction into the journal.
Journal commit: write the end marker after the log entry is safely stored.
Checkpoint: perform the actual write, storing metadata and user data in the file system.
Free: reclaim the space used by the journal entry.
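The four phases, plus the replay that happens at mount time, can be sketched with a toy model. Everything here is illustrative: `ToyJournalFS` is a hypothetical class, the "disk" is a dictionary, and the `committed` flag stands in for the end marker.

```python
class ToyJournalFS:
    """Hypothetical model of the journal write / commit / checkpoint / free cycle."""

    def __init__(self):
        self.disk = {}      # final on-disk locations: name -> value
        self.journal = []   # pending transactions, oldest first

    def journal_write(self, txid, writes):
        """Phase 1: record the intended writes in the journal."""
        self.journal.append({"id": txid, "writes": dict(writes), "committed": False})

    def journal_commit(self, txid):
        """Phase 2: write the end marker, making the entry valid."""
        self._find(txid)["committed"] = True

    def checkpoint(self, txid):
        """Phase 3: apply the writes to their final locations."""
        tx = self._find(txid)
        if not tx["committed"]:
            raise RuntimeError("refusing to checkpoint an uncommitted transaction")
        self.disk.update(tx["writes"])

    def free(self, txid):
        """Phase 4: reclaim the journal space."""
        self.journal.remove(self._find(txid))

    def recover(self):
        """At mount: replay committed transactions, discard incomplete ones."""
        for tx in self.journal:
            if tx["committed"]:
                self.disk.update(tx["writes"])  # replaying twice is harmless
        self.journal.clear()

    def _find(self, txid):
        return next(tx for tx in self.journal if tx["id"] == txid)
```

A crash after `journal_commit` but before `checkpoint` is repaired by `recover`, which applies the logged writes. A crash before `journal_commit` leaves the disk untouched, because the uncommitted entry is simply dropped.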
Linux EXT3 supports Data Journaling, which records both metadata and user data in the journal. This approach has an efficiency drawback:
Each write operation causes both metadata and user data to be written twice—once to the journal and once to the file system. While metadata duplication is acceptable, duplicating large user data (e.g., gigabytes of video files) significantly reduces performance.
A more efficient method is Metadata Journaling, which records only metadata in the journal and never user data. In its common form, Ordered Journaling, the user data is written to its final location first, and only then is the journal entry for the metadata written and committed. As a result, whenever a journal entry is valid, the user data it refers to is already on disk. If a crash occurs, the worst case is that the last journal entry is incomplete and the update it described is lost, but the file system's consistency and integrity remain guaranteed.
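The ordering constraint can be checked with a small simulation. This is a sketch under assumptions: the step names, the dictionary-based `disk`, and the functions `ordered_write`, `apply_metadata`, and `recover` are all hypothetical, and `crash_at` names the step that is about to run when power is lost.

```python
STEPS = ["data", "journal_metadata", "commit", "checkpoint", "free"]


def new_disk():
    """Hypothetical disk: data blocks, allocation bitmap, one file's inode."""
    return {"blocks": {}, "bitmap": set(), "inode": []}


def apply_metadata(disk, tx):
    """Apply logged metadata; written to be safe to repeat during replay."""
    disk["bitmap"].add(tx["block"])
    if tx["block"] not in disk["inode"]:
        disk["inode"].append(tx["block"])


def ordered_write(disk, journal, block_no, payload, crash_at=None):
    """Run the ordered-mode steps in sequence; stop just before `crash_at`."""
    for step in STEPS:
        if step == crash_at:
            return  # power lost before this step ran
        if step == "data":
            disk["blocks"][block_no] = payload   # user data goes to disk first
        elif step == "journal_metadata":
            journal.append({"block": block_no, "committed": False})
        elif step == "commit":
            journal[-1]["committed"] = True      # end marker written
        elif step == "checkpoint":
            apply_metadata(disk, journal[-1])
        elif step == "free":
            journal.pop()


def recover(disk, journal):
    """At mount: replay committed metadata, drop incomplete entries."""
    for tx in journal:
        if tx["committed"]:
            apply_metadata(disk, tx)
    journal.clear()
```

Whichever step the crash interrupts, after `recover` every block listed in the inode already holds the payload. If the crash comes before the commit, the data block was written but no file references it, which is the lost-update case described above.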
Many journaling file systems adopt Ordered (metadata) Journaling; Linux ext3 can be configured to use either Data Journaling or Ordered Journaling through the data= mount option (data=journal or data=ordered).
Reference: Crash Consistency: FSCK and Journaling