Write-Ahead Logging (WAL): The Engineering of Durability
Write-Ahead Logging is the fundamental mechanism that allows modern databases to maintain absolute data integrity without making every transaction agonizingly slow. It is the compromise that solved the "Durability Paradox."
Part 1: The Durability Paradox
In database engineering, we face a constant conflict between speed and safety. To ensure Durability (the 'D' in ACID), data must be written to non-volatile storage (disk) before a transaction is confirmed to the user.
Random I/O (The Slow Path)
Updating the actual data pages (B-Trees) requires jumping around the disk to find specific blocks. Even on fast NVMe SSDs, this is expensive and inefficient for high-concurrency workloads.
Sequential I/O (The WAL Path)
Instead of updating the data files, we simply append a record of the change to the end of a log file. Appending is blazingly fast because the disk head (or controller) never needs to search.
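The append-only path can be sketched in a few lines of Python. The file name and the length-prefixed record format here are illustrative assumptions, not any real engine's on-disk layout:

```python
import os

def wal_append(log, record: bytes) -> None:
    # The OS only ever extends the file, so the device sees a purely
    # sequential write stream: no seeking to scattered data pages.
    log.write(len(record).to_bytes(4, "big"))  # length prefix
    log.write(record)

with open("wal.log", "ab") as log:
    wal_append(log, b"UPDATE accounts SET balance=90 WHERE id=7")
    log.flush()
    os.fsync(log.fileno())  # force the record onto the physical medium
```

Note that the change is described logically; the B-Tree pages it affects are not touched at this point.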
Part 2: The Golden Rule - "Write Ahead"
The name "Write-Ahead" is literal. The rule states: No data page should be modified on disk until the log record describing that change has been safely flushed.
Log Sequence Numbers (LSN)
In PostgreSQL and other engines, every log entry has a unique, monotonically increasing 64-bit identifier called an LSN. Data pages in memory also store the LSN of the last change made to them. The database refuses to write a page to disk if its LSN is greater than the last LSN flushed to the WAL.
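The enforcement of this rule can be sketched as follows; the class and attribute names (flushed_lsn, flush_page) are invented for illustration, not any engine's real API:

```python
class Page:
    def __init__(self, page_id: int):
        self.page_id = page_id
        self.lsn = 0  # LSN of the newest WAL record that modified this page

class BufferManager:
    def __init__(self):
        self.flushed_lsn = 0  # highest LSN durably written to the WAL

    def flush_wal_up_to(self, lsn: int) -> None:
        # Stand-in for writing and fsyncing the log through `lsn`.
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def flush_page(self, page: Page) -> None:
        # The golden rule: the log record describing the page's latest
        # change must be durable BEFORE the page itself is written.
        if page.lsn > self.flushed_lsn:
            self.flush_wal_up_to(page.lsn)
        # ...now it is safe to write the page to the data file

bm = BufferManager()
p = Page(42)
p.lsn = 100          # page modified by WAL record 100
bm.flushed_lsn = 99  # but the WAL is durable only through LSN 99
bm.flush_page(p)     # forces the log out before the page
```

After the call, the WAL is durable through LSN 100, so writing the page can no longer violate the rule.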
Part 3: The fsync() Bottleneck
Even with sequential logging, there is one unavoidable slowing factor: fsync().
When a transaction commits, the database must wait for the Operating System to confirm
the data is physically on the storage medium, not just in the kernel's write cache.
To combat this, modern databases use Group Commit. Instead of one
fsync per transaction, the database waits a few milliseconds to bundle
dozens of commits together and flushes them in a single high-speed disk operation.
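A toy sketch of the idea, assuming a 2 ms batching window (the class name and window length are illustrative, not any engine's defaults):

```python
import os
import time

class GroupCommitLog:
    def __init__(self, path: str, window_s: float = 0.002):
        self.f = open(path, "ab")
        self.window_s = window_s
        self.pending = []

    def commit(self, record: bytes) -> None:
        # Transactions queue their commit records instead of each
        # paying for a private fsync.
        self.pending.append(record)

    def flush_group(self) -> int:
        # Wait briefly so more commits can join the batch, then make
        # the whole batch durable with ONE sequential write + ONE fsync.
        time.sleep(self.window_s)
        self.f.write(b"".join(self.pending))
        self.f.flush()
        os.fsync(self.f.fileno())
        count = len(self.pending)
        self.pending.clear()
        return count

log = GroupCommitLog("group.log")
for i in range(3):
    log.commit(f"COMMIT txn-{i}\n".encode())
n = log.flush_group()  # one fsync covers all three commits
```

In a real engine the batching is done by a background writer or by the first committer fsyncing on behalf of the others, but the payoff is the same: the fsync cost is amortized across the batch.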
Part 4: Checkpointing - Turning Logs into Data
If we never updated the actual data files, the WAL would eventually consume the entire disk. Furthermore, recovery after a crash would take hours as the database replayed years of logs.
Checkpointing is the process of periodically syncing the "dirty" pages in memory (which reflect the latest changes) down to the main database files on disk. Once a checkpoint completes, all previous WAL logs are no longer strictly needed for crash recovery and can be recycled or archived for backups.
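Reduced to in-memory dicts, the checkpoint cycle looks roughly like this; the record layout ({"lsn": ..., "page": ..., "value": ...}) is invented for illustration:

```python
def checkpoint(dirty_pages: dict, data_files: dict, wal: list) -> list:
    """Sync dirty pages to the data files, then recycle old WAL records."""
    if not wal:
        return wal
    checkpoint_lsn = wal[-1]["lsn"]  # everything up to here will be on disk
    # 1. Push every dirty in-memory page down to the main data files.
    data_files.update(dirty_pages)
    dirty_pages.clear()
    # 2. Older WAL records are no longer needed for crash recovery,
    #    since the data files now contain those changes.
    return [rec for rec in wal if rec["lsn"] > checkpoint_lsn]

dirty = {"page7": "balance=90"}
data = {"page7": "balance=100"}
wal = [{"lsn": 1, "page": "page7", "value": "balance=90"}]
wal = checkpoint(dirty, data, wal)
# the data files now hold the latest change; the old WAL can be recycled
```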
Why not checkpoint constantly?
Frequent checkpoints keep recovery times low but generate heavy "background noise" writes that compete with active user transactions for disk bandwidth. Finding the right checkpoint interval is a key tuning task for any Database Administrator.
Part 5: Crash Recovery - The "Redo" Cycle
When a database restarts after an unexpected shutdown, it is in an "inconsistent" state. The main data files are old, and the RAM state is gone. It enters recovery mode:
- Analysis: Locate the last successful checkpoint in the logs.
- Redo: Starting from that checkpoint, read the WAL and "replay" every committed transaction. This brings the data back to its exact state the instant before the crash.
- Undo: Identify transactions that were in-progress during the crash but never committed, and roll back any partial changes they might have made.
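The redo/undo cycle above can be sketched over a toy log. The record layout is an invented illustration: updates carry before-images ("old") and after-images ("new") so they can be undone as well as replayed:

```python
def recover(data_files: dict, wal: list, checkpoint_lsn: int) -> dict:
    # Analysis: which transactions have a commit record in the log?
    committed = {r["txid"] for r in wal if r["type"] == "commit"}
    # Redo: replay committed changes logged after the last checkpoint.
    for r in wal:
        if (r["type"] == "update" and r["txid"] in committed
                and r["lsn"] > checkpoint_lsn):
            data_files[r["page"]] = r["new"]
    # Undo: roll back changes from transactions that never committed,
    # newest first, using the before-images in the log.
    for r in reversed(wal):
        if r["type"] == "update" and r["txid"] not in committed:
            data_files[r["page"]] = r["old"]
    return data_files

data = {"A": 1, "B": 2}
wal = [
    {"type": "update", "lsn": 1, "txid": "T1", "page": "A", "old": 1, "new": 10},
    {"type": "update", "lsn": 2, "txid": "T2", "page": "B", "old": 2, "new": 20},
    {"type": "commit", "txid": "T1"},
]
recover(data, wal, checkpoint_lsn=0)
# T1's change survives; T2's partial change is rolled back
```

Production recovery schemes (ARIES, for example) are considerably more involved, but the committed-changes-survive, uncommitted-changes-vanish outcome is the same.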
Summary: The Engine of Trust
The Write-Ahead Log is what makes a database feel like a reliable mathematical truth rather than just a fragile piece of software. By separating the intent to change data from the physical reorganization of data, engineers built systems that are both high-performance and able to survive crashes without losing a single committed transaction.