Write-Ahead Logging & Durability

Before modifying any data file, sequentially append a log record to stable storage -- this makes atomic crash recovery possible without data loss. WAL is the architectural foundation of reliable databases and distributed systems.

Why Sequential Writes Beat Random I/O

The core insight: Writing to the end of a file (sequential I/O) is 100-1000x faster than updating random positions in a data file (random I/O).

  • Sequential write to WAL: Append to end of log file, minimal disk head movement, ~1-2ms latency per fsync
  • Random write to data file (B-tree pages): Requires reading the page, modifying it, writing it back; multiple seeks across the disk platter, ~5-15ms per operation

By buffering data modifications in memory and writing them later to data files (asynchronously), WAL allows the database to acknowledge transactions quickly while still guaranteeing durability.
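The durability point described above can be seen in a few lines of Python: a sketch (the `durable_append` helper is hypothetical) that appends records to a log file opened in append mode and fsyncs after each write, so every record is on stable storage before the call returns.

```python
import os
import tempfile

def durable_append(fd: int, record: bytes) -> None:
    """Append a record to the log and block until the kernel
    confirms it has reached stable storage."""
    os.write(fd, record)
    os.fsync(fd)  # durability point: the record survives a crash after this

# Every append goes to the end of the file -- no seeks needed.
path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
for i in range(3):
    durable_append(fd, f"record {i}\n".encode())
os.close(fd)

with open(path) as f:
    print(f.read().splitlines())  # ['record 0', 'record 1', 'record 2']
```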

The Write Flow — Guaranteeing Durability

Transaction COMMIT
    ↓
Write to WAL buffer (in-memory)
    ↓
fsync() WAL to disk (blocks until kernel acks persistent storage)
    ↓
✅ Acknowledge client "transaction committed"
    ↓
Asynchronously apply changes to data files (buffered, may batch)
    ↓
Periodically checkpoint: mark which WAL records have been applied to disk

Key properties:

  • Durability is guaranteed by fsync to WAL, not by data file state
  • Data files lag behind WAL; during recovery, WAL is the source of truth
  • Checkpoint marks a “safe point”: before LSN X, all changes are in data files
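The commit path above can be sketched in a few lines (the `SimpleWAL` class is a hypothetical illustration): the WAL buffer is fsynced before the commit is acknowledged, while application to the data files is deferred.

```python
import os
import tempfile

class SimpleWAL:
    """Minimal sketch of the commit path: buffer, fsync, acknowledge.
    Applying changes to the data files happens after the ack."""
    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.buffer: list = []    # in-memory WAL buffer
        self.pending: list = []   # changes not yet applied to data files

    def commit(self, change: bytes) -> str:
        self.buffer.append(change + b"\n")
        os.write(self.fd, b"".join(self.buffer))
        os.fsync(self.fd)                 # durability point
        self.pending.extend(self.buffer)  # apply to data files later, async
        self.buffer.clear()
        return "committed"                # now safe to acknowledge the client

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = SimpleWAL(path)
print(wal.commit(b"UPDATE users SET name='Alice' WHERE id=1"))  # committed
```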

ARIES Recovery Algorithm — Analysis, Redo, Undo

When a crash occurs and the database restarts, recovery proceeds in three phases:

Phase 1: Analysis

Read the WAL from the last checkpoint and identify:

  • All transactions that were active (not committed) at crash time
  • All transactions that committed after the last checkpoint (changes not yet in data files)
  • The dirty page table (pages that were modified in memory and may not yet have been written back to the data files)

Phase 2: Redo

Replay ALL changes to data files in log order, including uncommitted transactions. This reconstructs the state at crash time. Even though we’re redoing uncommitted work, we know the transaction ID; we’ll undo it in phase 3 if needed.

Why redo uncommitted transactions? Because we don’t know which pages were written to disk before the crash. To be safe, we replay everything and then undo the uncommitted transactions.

Phase 3: Undo

Scan the WAL backward and undo all transactions that were active at crash time (using undo logs stored in WAL). This restores the database to the last committed state.

Result: The database is consistent with all committed transactions, and all uncommitted work is rolled back.

Log Sequence Numbers (LSNs) and Checkpoints

Every WAL record has a Log Sequence Number (LSN) — a monotonically increasing integer representing its position in the log.

  • LSN 0: First record
  • LSN 1000: 1000 bytes into the log
  • etc.

When a page in the data file is modified, it’s tagged with the LSN of the modification. During recovery, if a page has LSN 5000 but the crash happened at LSN 10000, we know changes from LSN 5000-10000 need to be replayed.

Checkpoint record: Periodically, the database writes a checkpoint record (LSN X) that says “all changes before LSN X are safely in the data files.” On restart, recovery only needs to scan the WAL from the last checkpoint, not from the beginning.

Example:

  • Checkpoint at LSN 100000
  • Transactions at LSN 100100, 100200, 100300 commit
  • Crash at LSN 100350
  • Recovery: redo changes from LSN 100100-100350, undo any uncommitted transactions in that range
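The page-LSN comparison can be expressed directly; `needs_redo` is a hypothetical helper capturing the rule that a WAL record is replayed onto a page only if the page has not already seen it.

```python
def needs_redo(page_lsn: int, record_lsn: int) -> bool:
    """A WAL record must be replayed onto a page only if the page's
    stored LSN is older than the record (page_lsn < record_lsn)."""
    return page_lsn < record_lsn

# Page was last written at LSN 5000; the crash happened at LSN 10000.
# Only records newer than the page's LSN need replaying.
print([lsn for lsn in (4000, 5000, 7500, 10000) if needs_redo(5000, lsn)])  # [7500, 10000]
```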

Durability Settings and Performance Trade-offs

  • fsync=on (PostgreSQL synchronous_commit=on): WAL is flushed to disk on every COMMIT. ✅ No data loss (durability guaranteed), at a cost of 1-5ms per transaction. Use when strong durability is required (financial and other critical systems).
  • fsync=off (PostgreSQL synchronous_commit=off): WAL sits in the OS buffer; the kernel may not have flushed it at crash time. ❌ Up to ~200ms of commits can be lost, but latency drops below 1ms per transaction. Use for high-throughput, loss-tolerant workloads (analytics, cache layers).
  • Group commit: Batches multiple transactions' WAL entries and fsyncs once. ✅ No data loss and only slightly higher latency (1-2ms per transaction, amortized). The default in MySQL and PostgreSQL (performance + safety).
  • wal_level=minimal (PostgreSQL): No WAL for some bulk operations. ❌ Those operations are not crash-recoverable, but bulk loads run ~2x faster. For development and non-critical workloads.
  • wal_level=replica (PostgreSQL): Logs the WAL needed for replication (not every operation). ✅ Safe for replication at standard latency; the usual choice for replication-enabled production.

Concrete numbers (PostgreSQL):

  • Single-transaction fsync: 1-5ms latency
  • Batch of 100 transactions, one fsync: 0.01-0.05ms per transaction (amortized)
  • OS buffer without explicit fsync: transactions become durable within ~200ms (periodic background flush)
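The amortization behind group commit is easy to sketch (the `group_commit` function is a hypothetical illustration): many transactions' WAL entries, a single fsync.

```python
import os
import tempfile

def group_commit(fd: int, batch: list) -> int:
    """Write a whole batch of transactions' WAL entries, then fsync ONCE.
    The fsync cost (~1-2ms) is amortized across the batch."""
    os.write(fd, b"".join(record + b"\n" for record in batch))
    os.fsync(fd)          # one durability point for the whole batch
    return len(batch)     # all of these transactions are now durable

path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
batch = [f"txn-{i}".encode() for i in range(100)]
n = group_commit(fd, batch)  # 100 transactions, 1 fsync
os.close(fd)
print(n)  # 100
```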

WAL as the Foundation for Replication

WAL is not just for crash recovery; it’s the foundation of all reliable distributed systems.

PostgreSQL Streaming Replication

PostgreSQL writes changes to the WAL. A replica connects to the primary and streams WAL records as they’re generated. The replica applies the same changes in order, staying consistent with the primary. If the primary crashes, the replica is only as far behind as the WAL streaming lag (typically <1 second).

Replication slots: A mechanism to tell the primary “don’t discard WAL beyond this LSN yet; my replica hasn’t applied it.” Without replication slots, a fast primary could discard WAL faster than a slow replica can apply it, causing replication to break.

Kafka — A Distributed WAL

Kafka is fundamentally a distributed WAL. Producers write events to a topic (partition), and each partition is a log of events with offsets (equivalent to LSNs). Consumers read from an offset and apply events in order. If a consumer crashes, it resumes from the last committed offset. Kafka’s guarantees (at-most-once, at-least-once, exactly-once) are all based on managing offsets relative to the log.
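Offset-based resumption can be sketched in a few lines (`OffsetConsumer` is a hypothetical stand-in for a real Kafka consumer): commit the offset after processing, and a restarted consumer picks up from the committed position.

```python
class OffsetConsumer:
    """Sketch of offset-based consumption: read records in order,
    commit the offset, and resume from it after a crash."""
    def __init__(self, log: list):
        self.log = log                 # the partition's append-only log
        self.committed_offset = 0      # durable consumer position

    def consume(self, n: int) -> list:
        start = self.committed_offset
        records = self.log[start:start + n]
        # At-least-once semantics: commit the offset AFTER processing
        self.committed_offset = start + len(records)
        return records

log = [f"event-{i}" for i in range(5)]
c = OffsetConsumer(log)
c.consume(3)                           # processes event-0..event-2
restarted = OffsetConsumer(log)        # "crash" and restart
restarted.committed_offset = c.committed_offset  # offset was persisted
print(restarted.consume(5))            # ['event-3', 'event-4']
```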

MySQL Binlog

MySQL writes all changes to the binlog (similar to WAL). Replicas connect and replay binlog events. The binlog is also used for point-in-time recovery: backup the data files + binlog from 3 days ago, then replay the binlog up to any point in time to recover the database state.

Real Systems — WAL Usage Patterns

PostgreSQL pg_wal

PostgreSQL writes to the pg_wal/ directory with segment files (each 16MB by default, e.g., 000000010000000000000001). Each transaction appends its changes to the current segment; once a segment fills, a new one is created. The background writer process periodically flushes modified pages to the main data directory. A checkpoint marks a segment as “safe to recycle.” Old segments are kept until they’re no longer needed for replication or recovery.

Replication slots are configured per replica; each slot tracks the oldest WAL segment the replica hasn’t consumed yet. Slots can cause WAL to accumulate on disk (consuming storage) if a replica falls behind or crashes.

Concrete numbers: A high-throughput system (10k transactions/sec) generates ~10GB of WAL per hour. A 7-day retention requires ~1.6TB of WAL storage.
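The storage arithmetic behind that estimate is simple enough to encode (hypothetical helper):

```python
def wal_retention_gb(wal_gb_per_hour: float, retention_days: int) -> float:
    """Back-of-envelope WAL storage needed for a retention window."""
    return wal_gb_per_hour * 24 * retention_days

# ~10GB/hour for 7 days:
print(wal_retention_gb(10, 7))  # 1680.0 GB, i.e. roughly 1.6-1.7TB
```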

MySQL InnoDB Redo Log

InnoDB maintains a fixed-size redo log (default 48MB, configurable). Changes are written to the redo log in a circular manner: once the log fills, older entries are overwritten (safe only if those changes have been flushed to data files via a checkpoint). InnoDB also maintains a doublewrite buffer (2MB) on disk, separate from the redo log: before writing a page to the main data files, InnoDB writes it to the doublewrite buffer first. If a crash occurs mid-page-write (a torn page), the doublewrite buffer provides an intact copy to recover from.

Concrete numbers: InnoDB redo log fsync latency is typically 0.5-2ms. A server with innodb_log_buffer_size=16MB can buffer many transactions before fsync. Checkpoints happen when the redo log is 75% full (configurable); if redo log fills before checkpoint completes, writes stall.
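The circular-log discipline (entries may be overwritten only after a checkpoint has flushed them) can be sketched with a toy class. `CircularRedoLog` is a hypothetical illustration; real InnoDB tracks byte positions in a fixed-size file, not entry counts.

```python
class CircularRedoLog:
    """Toy circular redo log: fixed capacity, and space is reclaimed
    only when a checkpoint confirms entries are in the data files."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: list = []
        self.checkpoint_index = 0  # everything before this is in the data files

    def append(self, entry: str) -> bool:
        if len(self.entries) - self.checkpoint_index >= self.capacity:
            return False           # log full: writes stall until a checkpoint
        self.entries.append(entry)
        return True

    def checkpoint(self):
        self.checkpoint_index = len(self.entries)  # reclaim the whole log

log = CircularRedoLog(capacity=3)
print([log.append(f"e{i}") for i in range(4)])  # [True, True, True, False]
log.checkpoint()
print(log.append("e4"))  # True -- space was reclaimed by the checkpoint
```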

RocksDB WAL

RocksDB (an embedded LSM-tree database) writes changes to an in-memory memtable. To protect against crashes, every write is first appended to a WAL file on disk. Once the memtable fills, it’s flushed to an SSTable file (immutable sorted file on disk), and the WAL is rotated. The WAL is not critical for durability once an SSTable is created (since the SSTable is persistent), but it’s essential during recovery: if a crash occurs with a partial memtable, the WAL replays all changes to that memtable.

RocksDB WAL can be disabled for read-only workloads or high-throughput cache layers (trading durability for performance).
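The write path can be sketched with a toy LSM (`MiniLSM` is a hypothetical illustration; real RocksDB sizes the memtable in bytes and flushes in the background): every write hits the WAL before the memtable, and a full memtable is flushed to an immutable SSTable, after which the WAL rotates.

```python
class MiniLSM:
    """Toy LSM write path: WAL first, then memtable; a full memtable
    is flushed to an immutable SSTable and the WAL is rotated."""
    def __init__(self, memtable_limit: int):
        self.memtable_limit = memtable_limit
        self.wal: list = []        # crash protection for the memtable
        self.memtable: dict = {}
        self.sstables: list = []   # immutable sorted files on disk

    def put(self, key: str, value: str):
        self.wal.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(self.memtable))  # flush to SSTable
            self.memtable.clear()
            self.wal.clear()  # rotate: the SSTable is now persistent

db = MiniLSM(memtable_limit=2)
db.put("a", "1"); db.put("b", "2"); db.put("c", "3")
print(len(db.sstables), db.memtable, db.wal)  # 1 {'c': '3'} [('c', '3')]
```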

Kafka Distributed Commit Log

Kafka’s commit log is distributed across a cluster of broker nodes. A partition’s log is replicated to N brokers (replication factor, typically 3). When a producer sends a message, it’s written to the leader broker’s log (WAL) and replicated to follower brokers. The producer’s request is acknowledged based on the acks setting:

  • acks=1: Ack after leader fsync (1-2ms latency, 1-broker failure tolerance)
  • acks=all: Ack after all in-sync replicas (ISRs) fsync (5-10ms latency, full replication factor tolerance)

Concrete numbers: A Kafka cluster can handle 1M messages/sec with acks=all at ~10ms end-to-end latency.
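The ack decision reduces to a small predicate (the `acknowledged` helper is hypothetical, mirroring rather than implementing Kafka's behavior):

```python
def acknowledged(acks: str, leader_done: bool, isr_done: list) -> bool:
    """acks=1 waits only for the leader; acks=all waits for every
    in-sync replica as well."""
    if acks == "1":
        return leader_done
    if acks == "all":
        return leader_done and all(isr_done)
    raise ValueError(f"unknown acks setting: {acks}")

print(acknowledged("1", True, [False, False]))   # True  (leader only)
print(acknowledged("all", True, [True, False]))  # False (an ISR is behind)
```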

SQLite WAL Mode

SQLite supports two durability modes: rollback journal mode (the default) and WAL mode. In journal mode, the original content of each page about to be modified is copied into a rollback journal first, so an interrupted transaction can be rolled back (undo logging rather than write-ahead logging). In WAL mode, a separate WAL file receives all changes, and the main data file is updated lazily. WAL mode allows readers to run concurrently with a writer (readers see a consistent snapshot of the main file plus the WAL, while the writer appends to the WAL), which journal mode doesn't support. Checkpoints happen periodically (every 1000 pages by default), moving committed changes from the WAL into the main data file.
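WAL mode can be enabled directly from Python's built-in sqlite3 module; the `PRAGMA journal_mode=WAL` statement is part of SQLite's documented interface (the file path here is illustrative).

```python
import os
import sqlite3
import tempfile

# Enable WAL mode on a real (file-backed) SQLite database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.commit()  # the change lands in demo.db-wal first; the main
               # file is updated at the next checkpoint
conn.close()
```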

Key Properties Table

  • PostgreSQL WAL segment size: 16MB (fixed)
  • PostgreSQL checkpoint frequency: configurable; typically every 30 min or 1GB of WAL written
  • MySQL InnoDB redo log size: 48MB by default (can be 4GB+)
  • MySQL InnoDB checkpoint trigger: redo log 75% full
  • RocksDB memtable size: ~64MB (flushed to an SSTable when full)
  • Kafka commit latency (acks=all): 5-10ms
  • Kafka commit latency (acks=1): 1-2ms
  • SQLite WAL checkpoint: every 1000 pages (~4MB) by default
  • fsync latency (NVMe SSD): 0.5-2ms
  • fsync latency (spinning disk): 5-15ms
  • Group commit throughput gain: 10-100x (amortized fsync across many txns)

When to Use Strong Durability vs Eventual Durability

✅ Strong Durability (fsync=on):

  • Financial transactions, banking systems
  • Medical records, legal documents
  • Critical state that cannot be reconstructed

✅ Eventual Durability (fsync=off, or OS buffer):

  • Analytics pipelines (data can be re-ingested)
  • Cache layers (backed by source of truth)
  • High-throughput event streams (loss acceptable at <200ms scale)
  • Real-time dashboards (stale data acceptable)

Implementation: WAL Recovery Simulation

from enum import Enum
from typing import Dict, List, Optional

class TransactionStatus(Enum):
    """Transaction states."""
    ACTIVE = "ACTIVE"
    COMMITTED = "COMMITTED"
    ABORTED = "ABORTED"

class WALRecord:
    """A single Write-Ahead Log record."""

    def __init__(self, lsn: int, txn_id: int, operation: str, data: str,
                 status: TransactionStatus, undo_data: Optional[str] = None):
        """
        Args:
            lsn: Log Sequence Number (monotonically increasing)
            txn_id: Transaction ID
            operation: "INSERT", "UPDATE", "DELETE", "COMMIT", "ABORT", "CHECKPOINT"
            data: Serialized change (e.g., "table=users, key=42, value=Alice")
            status: ACTIVE, COMMITTED, ABORTED
            undo_data: Before-image for UPDATE/DELETE, used by the undo phase
        """
        self.lsn = lsn
        self.txn_id = txn_id
        self.operation = operation
        self.data = data
        self.status = status
        self.undo_data = undo_data

    def __repr__(self):
        return f"WAL(LSN={self.lsn}, TXN={self.txn_id}, OP={self.operation}, STATUS={self.status})"

def parse_kv(data: str) -> Dict[str, str]:
    """Parse 'table=users, key=1, value=Alice' into a dict."""
    return dict(part.split("=", 1) for part in data.split(", ") if "=" in part)

class WALRecovery:
    """
    Simulates the ARIES recovery algorithm: Analysis, Redo, Undo.
    For simplicity, LSNs coincide with positions in the in-memory log.
    """

    def __init__(self):
        self.wal_log: List[WALRecord] = []
        self.data_files: Dict[str, str] = {}  # key -> value
        self.last_checkpoint_lsn: int = 0
        self.transaction_table: Dict[int, TransactionStatus] = {}  # txn_id -> status

    def append_wal(self, record: WALRecord):
        """Append a record to the WAL."""
        self.wal_log.append(record)
        print(f"[WAL] Appended: {record}")

    def _apply(self, data: str):
        """Apply a serialized change to the (simulated) data files."""
        fields = parse_kv(data)
        if "key" in fields and "value" in fields:
            self.data_files[fields["key"]] = fields["value"]

    def checkpoint(self, lsn: int):
        """
        Create a checkpoint at LSN: flush all changes before this LSN
        to the data files, then log the checkpoint record.
        """
        for record in self.wal_log:
            if record.lsn < lsn and record.operation in ("INSERT", "UPDATE", "DELETE"):
                self._apply(record.data)
        self.last_checkpoint_lsn = lsn
        self.append_wal(WALRecord(lsn, -1, "CHECKPOINT", f"flushed_to_lsn={lsn}", TransactionStatus.COMMITTED))
        print(f"[CHECKPOINT] Checkpoint at LSN={lsn}\n")

    def commit_transaction(self, txn_id: int):
        """Simulate committing a transaction."""
        self.transaction_table[txn_id] = TransactionStatus.COMMITTED
        self.append_wal(WALRecord(len(self.wal_log), txn_id, "COMMIT", "", TransactionStatus.COMMITTED))
        print(f"[COMMIT] Transaction {txn_id} committed\n")

    def simulate_crash(self) -> int:
        """Simulate a database crash. Return the crash LSN."""
        crash_lsn = len(self.wal_log) - 1
        print(f"\n💥 [CRASH] Database crashed at LSN={crash_lsn}\n")
        return crash_lsn

    def recovery_analysis(self, crash_lsn: int) -> Dict[int, TransactionStatus]:
        """
        Phase 1 (Analysis): Scan WAL from last checkpoint to crash LSN.
        Identify active transactions (started but not committed).
        """
        print(f"[RECOVERY] Phase 1: ANALYSIS (from checkpoint LSN={self.last_checkpoint_lsn} to crash LSN={crash_lsn})")

        active_transactions: Dict[int, TransactionStatus] = {}
        for record in self.wal_log[self.last_checkpoint_lsn:crash_lsn + 1]:
            if record.operation in ("INSERT", "UPDATE", "DELETE"):
                # Data modification — transaction is active until we see its COMMIT
                if record.txn_id not in active_transactions:
                    active_transactions[record.txn_id] = TransactionStatus.ACTIVE
                print(f"  - Found active transaction {record.txn_id}: {record}")

            elif record.operation == "COMMIT":
                active_transactions[record.txn_id] = TransactionStatus.COMMITTED
                print(f"  - Transaction {record.txn_id} committed: {record}")

            elif record.operation == "ABORT":
                active_transactions[record.txn_id] = TransactionStatus.ABORTED
                print(f"  - Transaction {record.txn_id} aborted: {record}")

        print(f"  Active transactions at crash: {[t for t, s in active_transactions.items() if s == TransactionStatus.ACTIVE]}\n")
        return active_transactions

    def recovery_redo(self, crash_lsn: int):
        """
        Phase 2 (Redo): Replay all changes (committed and uncommitted) to reconstruct state.
        """
        print(f"[RECOVERY] Phase 2: REDO (from checkpoint to crash LSN={crash_lsn})")

        for record in self.wal_log[self.last_checkpoint_lsn:crash_lsn + 1]:
            if record.operation in ("INSERT", "UPDATE", "DELETE"):
                # Re-apply the change (we'll undo uncommitted ones in phase 3)
                print(f"  - Redo: {record}")
                self._apply(record.data)

        print(f"  Data files after redo: {self.data_files}\n")

    def recovery_undo(self, active_transactions: Dict[int, TransactionStatus]):
        """
        Phase 3 (Undo): Scan WAL backward, undo changes from uncommitted
        transactions using the before-images (undo_data) stored in the WAL.
        """
        print("[RECOVERY] Phase 3: UNDO (uncommitted transactions)")

        to_undo = [t for t, s in active_transactions.items() if s == TransactionStatus.ACTIVE]
        print(f"  Transactions to undo: {to_undo}")

        # Scan WAL backward so later changes are undone first
        for record in reversed(self.wal_log):
            if record.txn_id in to_undo and record.operation in ("INSERT", "UPDATE", "DELETE"):
                print(f"  - Undo: {record}")
                if record.undo_data:
                    self._apply(record.undo_data)  # restore the before-image
                else:
                    # No before-image: the record was an INSERT, so undo by deleting
                    key = parse_kv(record.data).get("key")
                    if key is not None and key in self.data_files:
                        del self.data_files[key]
                        print(f"    Removed: {key}")

        print(f"  Data files after undo: {self.data_files}\n")

    def full_recovery(self):
        """Execute full ARIES recovery (Analysis, Redo, Undo)."""
        crash_lsn = self.simulate_crash()
        active_transactions = self.recovery_analysis(crash_lsn)
        self.recovery_redo(crash_lsn)
        self.recovery_undo(active_transactions)
        print("[RECOVERY] ✅ Recovery complete. Database is consistent with last committed state.\n")


# Example: Simulate WAL and crash recovery
if __name__ == "__main__":
    recovery = WALRecovery()

    # Transaction 1: Insert user Alice
    recovery.append_wal(WALRecord(0, 1, "INSERT", "table=users, key=1, value=Alice", TransactionStatus.ACTIVE))
    recovery.commit_transaction(1)

    # Checkpoint: Alice's insert is flushed to the data files
    recovery.checkpoint(2)

    # Transaction 2: Insert user Bob (committed)
    recovery.append_wal(WALRecord(3, 2, "INSERT", "table=users, key=2, value=Bob", TransactionStatus.ACTIVE))
    recovery.commit_transaction(2)

    # Transaction 3: Update user Alice to Alicia (uncommitted — will be lost)
    recovery.append_wal(WALRecord(5, 3, "UPDATE", "table=users, key=1, value=Alicia",
                                  TransactionStatus.ACTIVE,
                                  undo_data="table=users, key=1, value=Alice"))

    # Simulate crash (transaction 3 never commits)
    recovery.full_recovery()

    print("=" * 60)
    print("Final data (after recovery):")
    print(f"  users: {recovery.data_files}")
    print("  Note: User 1 is still 'Alice' (update from TXN 3 was undone)")

This post is licensed under CC BY 4.0 by the author.