How Write-Ahead Logging Works

The secret to high-performance durability. See how databases survive crashes without sacrificing speed using append-only logs.

Engine
RAM Buffer
WAL Buffer
WAL Disk
Data Disk
Replica
Current Action
Transaction Start
ACTIVE SESSION

Transaction Start

The user makes a change

A client sends a SQL write request (like UPDATE or INSERT). The database engine validates the syntax and permissions.

Why?

Consistency. We must ensure the transaction is valid before attempting to persist it.

Technical

Transaction ID (XID) is assigned. Row locks are acquired to prevent conflicting writes.

1 / 9
Landscape view recommended

Key Takeaways

Sequential I/O Wins

Appending to a log is orders of magnitude faster than random disk writes to scattered database pages.

Durability First

A transaction is only "committed" once its record is safely flushed to the WAL disk via fsync().

Crash Recovery

On reboot, the database replays the WAL to reconstruct any state missing from the main data files.

Write-Ahead Logging (WAL): The Engineering of Durability

Write-Ahead Logging is the fundamental mechanism that allows modern databases to maintain absolute data integrity without making every transaction agonizingly slow. It is the compromise that solved the "Durability Paradox."


Part 1: The Durability Paradox

In database engineering, we face a constant conflict between speed and safety. To ensure Durability (the 'D' in ACID), data must be written to non-volatile storage (disk) before a transaction is confirmed to the user.

Random I/O (The Slow Path)

Updating the actual data pages (B-Trees) requires jumping around the disk to find specific blocks. Even on fast NVMe SSDs, this is expensive and inefficient for high-concurrency workloads.

Sequential I/O (The WAL Path)

Instead of updating the data files, we simply append a record of the change to the end of a log file. Appending is blazingly fast because the disk head (or controller) never needs to search.

Part 2: The Golden Rule - "Write Ahead"

The name "Write-Ahead" is literal. The rule states: No data page should be modified on disk until the log record describing that change has been safely flushed.

Log Sequence Numbers (LSN)

In PostgreSQL and other engines, every log entry has a unique, monotonically increasing 64-bit identifier called an LSN. Data pages in memory also store the LSN of the last change made to them. The database refuses to write a page to disk if its LSN is greater than the last LSN flushed to the WAL.

Part 3: The fsync() Bottleneck

Even with sequential logging, there is one unavoidable slowing factor: fsync(). When a transaction commits, the database must wait for the Operating System to confirm the data is physically on the storage medium, not just in the kernel's write cache.

To combat this, modern databases use Group Commit. Instead of one fsync per user, the database waits a vĂ i milliseconds to bundle dozens of commits together and flushes them in a single high-speed disk operation.

Part 4: Checkpointing - Turning Logs into Data

If we never updated the actual data files, the WAL would eventually consume the entire disk. Furthermore, recovery after a crash would take hours as the database replayed years of logs.

Checkpointing is the process of periodically syncing the "dirty" pages in memory (which reflect the latest changes) down to the main database files on disk. Once a checkpoint completes, all previous WAL logs are no longer strictly needed for crash recovery and can be recycled or archived for backups.

Why not checkpoint constantly?

Frequent checkpoints keep recovery times low but introduce heavy "background noise" writes to the disk, which can pipe-line and slow down active user transactions. Finding the right "checkpoint interval" is a key task for any Database Administrator.

Part 5: Crash Recovery - The "Redo" Cycle

When a database restarts after an unexpected shutdown, it is in an "inconsistent" state. The main data files are old, and the RAM state is gone. It enters recovery mode:

  1. Analysis: Locate the last successful checkpoint in the logs.
  2. Redo: Starting from that checkpoint, read the WAL and "replay" every committed transaction. This brings the data in RAM back to its exact state second before the crash.
  3. Undo: Identify transactions that were in-progress during the crash but never committed, and roll back any partial changes they might have made.

Summary: The Engine of Trust

The Write-Ahead Log is what makes a database feel like a reliable mathematical truth rather than just a fragile piece of software. By separating the intent to change data from the physical reorganization of data, engineers built systems that are both high-performance and virtually uncrashable.

Glossary & Concepts

💾 fsync()

A system call that forces the operating system to flush kernel page caches to the physical storage device. This is the bottleneck of database performance.

🔢 LSN (Log Sequence Number)

A unique pointer to a specific location in the log. It ensures that the database never writes data to disk before the corresponding log record.

🏁 Checkpoint

A background process that flushes all "dirty" pages in RAM to disk, allowing older WAL files to be safely archived or deleted.

🚀 Group Commit

An optimization where the database waits a few milliseconds to batch multiple transaction flushes into a single fsync call.