
Replication Strategies

Replicate data across multiple nodes to improve read scalability, availability, and disaster recovery -- choose your topology (single-leader, multi-leader, leaderless) and sync mode based on your durability/latency trade-offs.


Key Properties

| Property | Single-Leader | Multi-Leader | Leaderless (Quorum) |
| --- | --- | --- | --- |
| Write availability | ❌ Primary down = no writes | ✅ Write to any leader | ✅ Write to any W of N nodes |
| Write conflicts | ✅ None | 🔶 Complex to resolve | 🔶 Possible (resolved via LWW or vector clocks) |
| Replication lag | 10ms–1s | 10ms–1s | <10ms (read from R nodes) |
| Failover time | 30s–2min (detect + promote) | Automatic (any leader active) | Automatic (quorum picks latest) |
| Consistency model | Strong (primary) / eventual (replicas) | Eventual (CRDTs or LWW) | Strong (W+R>N) / eventual (W+R≤N) (see CAP Theorem) |
| Example systems | PostgreSQL, MySQL, MongoDB | CouchDB, MySQL multi-source | Cassandra, DynamoDB, Riak |
| Use case | OLTP (banking, e-commerce) | Geo-distributed, offline-capable | Analytics, high write throughput |

Replication Fundamentals

How Replication Works

All replication systems follow the same basic pattern:

  1. Write goes to leader(s) — Client sends write to primary/leader node(s).
  2. Leader applies write to local state — Update is written to leader’s database.
  3. Leader sends change to replicas — Write operation (or full state change) is sent to replica nodes.
  4. Replicas apply change — Each replica updates its local copy.
  5. Client is acknowledged — After replicas confirm (or immediately in async mode), client gets ACK.

The key variation is when the client is acknowledged:

  • Synchronous: After replicas apply the write (durability guaranteed, but higher latency)
  • Asynchronous: Immediately after leader applies (low latency, but data loss possible if leader crashes)
  • Semi-synchronous: After leader + one replica apply (compromise)
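These three acknowledgment modes differ only in which round-trips the client waits for. A toy latency model (all timings hypothetical, not measured from any real system) makes the trade-off concrete:

```python
def ack_latency(mode: str, local_ms: float = 1.0,
                replica_rtts_ms: tuple = (20.0, 80.0)) -> float:
    """Client-visible write latency (ms) under each acknowledgment mode."""
    if mode == "async":
        return local_ms                          # ACK right after the local write
    if mode == "semi-sync":
        return local_ms + min(replica_rtts_ms)   # fastest replica ACKs first
    if mode == "sync":
        return local_ms + max(replica_rtts_ms)   # must wait for the slowest replica
    raise ValueError(f"unknown mode: {mode}")
```

With one nearby replica (20ms RTT) and one remote (80ms), semi-sync pays only the near replica's round-trip while full sync is gated by the slowest.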

Single-Leader (Primary-Follower) Replication

Architecture

One primary (leader) node accepts all writes. Multiple replica (follower) nodes asynchronously replicate the primary’s changes by consuming a replication log (PostgreSQL: WAL, MySQL: binlog, MongoDB: oplog).

Client writes → Primary (Pennsylvania)
                  ↓ (WAL/binlog log shipped)
              Replica 1 (Ohio)
              Replica 2 (Virginia)

Reads can go to replicas (if eventual consistency OK)
All writes MUST go to primary

How Replication Works in Detail

PostgreSQL Streaming Replication

PostgreSQL ships the Write-Ahead Log (WAL) to replicas. Every transaction is written to the WAL on disk before the database heap pages are modified. Replicas connect to the primary and consume the WAL stream:

  1. Primary receives INSERT INTO users VALUES (1, 'alice').
  2. Primary writes this to the WAL file (000000010000000000000001).
  3. Primary modifies heap pages in memory/disk.
  4. Replica reads from WAL stream: LSN=0/16001234 (Log Sequence Number)
  5. Replica replays the WAL: applies the INSERT to its own heap.
  6. Replica updates its own LSN to confirm it’s replayed up to that point.
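The loop above can be sketched as a toy log-shipping model (`WALPrimary`/`WALReplica` are hypothetical names, not PostgreSQL APIs): the primary appends every operation to an ordered log before changing anything else, and a replica replays everything past its last-applied LSN:

```python
class WALPrimary:
    """Toy primary: every operation is appended to an ordered log first."""
    def __init__(self):
        self.wal = []                    # list of (lsn, operation)

    def execute(self, op: str) -> int:
        lsn = len(self.wal) + 1          # LSNs are dense here, so lsn == index + 1
        self.wal.append((lsn, op))       # WAL record written before "heap" changes
        return lsn


class WALReplica:
    """Toy replica: replays everything past its last-applied LSN."""
    def __init__(self):
        self.applied_lsn = 0
        self.ops = []

    def catch_up(self, primary: WALPrimary) -> None:
        for lsn, op in primary.wal[self.applied_lsn:]:   # valid because lsn == index + 1
            self.ops.append(op)          # replay the operation locally
            self.applied_lsn = lsn       # confirm progress up to this LSN
```

Calling `catch_up` repeatedly is idempotent: a replica that is already current replays nothing.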

Synchronous vs Asynchronous:

  • Asynchronous (default): Primary acknowledges immediately after writing WAL locally. Replica replication happens in the background. If primary crashes before replica replicates, data is lost.
  • Synchronous: Primary waits for replica to acknowledge it has written the WAL before responding to client. Guarantees replica has durable copy before client succeeds. Adds 10–50ms latency (round-trip to replica and back).

Real production numbers: A medium-sized PostgreSQL primary (100GB database, 10,000 QPS) generates ~500MB of WAL per hour. Streaming to a replica in the same datacenter incurs ~1ms latency; shipping to another continent adds 50–200ms depending on network quality.

MySQL Binlog Replication

MySQL records all data-modifying statements (or row changes) in the binlog. Replicas connect to the primary and download binlog events:

Primary:   INSERT INTO orders VALUES (...); (binlog position 154)
           UPDATE orders SET ... WHERE ...; (binlog position 289)
           → streams binlog events to the replica from position 154

Replica:   Connects at position 154, receives the INSERT event
           Applies the INSERT to its local database
           Advances its relay log position to 289

Gotcha — statement-based vs row-based:

  • Statement-based (SBR): Replicate the actual SQL statement. Fast for large bulk operations. Problem: non-deterministic functions (NOW(), RAND()) may produce different results on replica.
  • Row-based (RBR): Replicate the actual row changes (INSERT these specific columns). Larger binlog (every row change becomes an event). Deterministic — replica always produces identical results. Most production systems use RBR.
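The determinism difference fits in a few lines. Under statement-based replication each node re-evaluates the nondeterministic expression itself; row-based replication ships the primary's concrete value. This is a toy model (Python's RNG standing in for `RAND()`), not MySQL internals:

```python
import random

def apply_statement_based(node_rng: random.Random) -> float:
    # SBR ships the SQL text; each node evaluates RAND() with its own RNG state
    return node_rng.random()

def apply_row_based(primary_value: float) -> float:
    # RBR ships the primary's concrete row value; the replica just stores it
    return primary_value

primary_rng, replica_rng = random.Random(1), random.Random(2)
v_primary = apply_statement_based(primary_rng)
v_replica = apply_statement_based(replica_rng)   # diverges: different RNG state
v_replica_rbr = apply_row_based(v_primary)       # identical by construction
```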

Real production numbers: A high-write MySQL primary (50,000 INSERTs/sec, row-based replication) generates ~2GB of binlog per hour. Replicas lag behind by 10–100ms depending on network congestion.

Read-Your-Writes Consistency Problem

Single-leader replication has a consistency gap: replicas lag behind the primary.

Scenario:

T1: Client writes to primary: UPDATE profile SET status='online'
T2: (1ms later) Primary returns OK
T3: Client immediately reads from replica: SELECT status FROM profile
T4: Replica hasn't replicated yet, so client sees status='offline' (stale!)

Solutions:

  1. Read-after-write consistency: For user’s own data, always read from primary.
    
    if (user_id == current_user_id) {
        read from primary  // Guarantee we see our own writes
    } else {
        read from replica  // Eventual consistency OK for others' data
    }
    
  2. Timestamp-based read routing: Client provides write timestamp; replica waits until it has replicated that timestamp before serving reads.
    
    primary_version = client.write(data)  // returns version 42
    replica.read_with_version(42)         // wait for replica to reach version 42
    
  3. Session-based consistency: Use sticky sessions; route all requests from same client to same replica, which guarantees it has replicated all previous writes.
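Solution 2 can be sketched as a replica that refuses to serve a read until it has caught up to the client's write version (a toy model; `VersionedReplica` is a hypothetical name):

```python
class VersionedReplica:
    """Toy replica whose replication stream carries monotonically increasing versions."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version: int, key: str, value) -> None:
        # the replication stream delivers writes in order
        self.data[key] = value
        self.version = version

    def read_with_version(self, key: str, min_version: int):
        # refuse to answer until this replica has seen the client's own write
        if self.version < min_version:
            return None   # caller retries, waits, or falls back to the primary
        return self.data.get(key)
```

A client that got version 42 back from its write passes `min_version=42` and either sees its own write or gets told to look elsewhere, never a silently stale answer.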

Failover Challenges

When the primary crashes, the system must detect failure and promote a replica to primary.

Failure detection latency: How fast does the system realize the primary is down?

  • Active health checking (ping) can detect failure in 100–500ms.
  • In production, you usually see 30–60 seconds of silence before manual intervention or timeout kicks in.

Failover process:

  1. Detect primary failure (30–120 seconds).
  2. Choose which replica to promote (should have latest replication LSN).
  3. Promote replica to primary (a few seconds).
  4. Redirect clients to new primary (seconds to minutes).
  5. Catch up other replicas to new primary.

Total downtime: 30 seconds (detection) + 10 seconds (promotion) + 30 seconds (client redirect) = ~70–90 seconds in a well-automated system.

Data loss scenario: If the primary crashes after writing to local disk but before replicating to any replica (asynchronous mode), committed data is lost. The newly promoted replica won’t have that write. In synchronous mode, you guarantee at least one replica has the data, so failover has zero data loss (but higher write latency).


Multi-Leader (Multi-Master) Replication

Architecture

Multiple nodes are active leaders. Each leader accepts writes; leaders replicate their changes to each other.

Leader A (San Francisco) ←→ Leader B (London)
  ↓                         ↓
Replica A1 (US)        Replica B1 (EU)

Both leaders can serve write traffic simultaneously. This is critical for geo-distributed systems where sending all writes across the Atlantic to a single primary adds 50–100ms latency.

Write Conflict Resolution

With multiple leaders, the same data can be written on different leaders simultaneously:

Scenario:

T1: User in SF writes to Leader A:  UPDATE profile SET status='online'
T2: User in London writes to Leader B: UPDATE profile SET status='offline'
T3: Leaders replicate their changes to each other
T4: Now both leaders have conflicting values!

Resolution strategies:

  1. Last Write Wins (LWW): Each write has a timestamp. Later timestamp wins. Problem: data loss (one write discarded). Acceptable for non-critical data (user status, cache).

  2. CRDT (Conflict-free Replicated Data Type): Data structure that is designed so concurrent writes always merge correctly.
    • Counter: Operations are commutative; +1 on A and +1 on B both result in +2 regardless of merge order.
    • Grow-only set: Only supports add, not remove. Conflict-free by definition.
    • Register with metadata: Each write includes a unique replica ID; merge by comparing metadata (e.g., replica ‘A’ beats replica ‘B’ lexicographically).
  3. Application logic / Manual resolution: Detect conflicts and ask application/user to resolve.
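As a concrete CRDT example, here is a minimal grow-only counter (G-Counter) sketch: each replica increments only its own slot, and merge takes the per-slot maximum, so merges are commutative, associative, and idempotent regardless of delivery order:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica; merge = per-slot max."""
    def __init__(self, replica_id: int, num_replicas: int):
        self.replica_id = replica_id
        self.slots = [0] * num_replicas

    def increment(self, n: int = 1) -> None:
        self.slots[self.replica_id] += n    # a replica only touches its own slot

    def value(self) -> int:
        return sum(self.slots)

    def merge(self, other: "GCounter") -> None:
        # per-slot max: applying the same merge twice changes nothing
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]
```

Two replicas that each see a concurrent `+1` converge to 2 no matter which merges first, which is exactly the conflict-freedom the bullet list describes.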

Use Cases

  • Offline-capable apps (CouchDB on mobile): Changes are replicated when connection is available; conflicts are resolved via CRDTs or application logic.
  • Geo-distributed systems (multi-datacenter masters): Each datacenter has a leader for local write latency. Conflicts rare if data is partitioned by geography (different users in different regions).
  • Collaborative editing (Google Docs): Uses OT (Operational Transformation) or CRDTs to merge concurrent edits.

Real Systems

  • CouchDB: Supports multi-leader replication with CRDTs and application-level conflict handlers.
  • MySQL multi-source replication: Multiple primary servers can replicate to each other; application must handle conflicts or use row-level locking.
  • Riak: Every node is a leader; uses CRDTs and vector clocks to resolve conflicts.

Leaderless (Quorum-based) Replication

Architecture

Every node is equal. Writes go to multiple nodes in parallel, and reads come from multiple nodes in parallel. There is no master/follower distinction.

Client writes to all N nodes concurrently:
  Write(key='user:42', value='alice')  →  [Node 1, Node 2, Node 3]
  Wait for W acknowledgments (W ≤ N)

Client reads from R nodes concurrently:
  Read(key='user:42')  →  [Node 1, Node 3]  (Pick latest by version/timestamp)
  Return highest version

Quorum Mathematics

The quorum intersection property ensures that reads and writes overlap (see Consensus & Quorum for detailed quorum theory):

If W + R > N, then every read overlaps with every write.

Example: N=3, W=2, R=2
  Write to 2 of 3 nodes → at least one node has the write
  Read from 2 of 3 nodes → at least one of those 2 has the write
  Proof: With 3 total nodes, 2+2=4 > 3, so at least one node is in both sets
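For small clusters the intersection property can be verified by brute force: enumerate every possible write set of size W and read set of size R and confirm each pair shares at least one node.

```python
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True iff every size-W write set intersects every size-R read set."""
    nodes = range(n)
    return all(set(ws) & set(rs)                 # non-empty intersection is truthy
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))
```

For N=3, W=2, R=2 every pairing overlaps; for W=1, R=1 a write to node 0 and a read from node 1 miss each other.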

Consistency table:

| N | W | R | Consistency | Example use |
| --- | --- | --- | --- | --- |
| 3 | 1 | 3 | Strong (R=N) | Read from all nodes; pick latest |
| 3 | 2 | 2 | Strong (W+R>N) | Cassandra "quorum" |
| 3 | 3 | 1 | Strong (W=N) | Write all before ACK; read any |
| 3 | 1 | 1 | Weak (W+R≤N) | Write one; read any (may be stale) |

Read Repair & Anti-Entropy

Nodes may temporarily have stale data. Recovery happens via:

  1. Read repair: During a read, if one node has stale data and another has newer data, the stale node is updated immediately.
  2. Anti-entropy (background repair): Periodically compare all replicas and fix mismatches (Merkle tree hashing).
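A minimal sketch of the anti-entropy idea (a flat hash over the whole store instead of a real Merkle tree): compare cheap digests first, and only walk the data when they differ. This toy version only fills in missing keys; real systems use versions or vector clocks to decide between conflicting values.

```python
import hashlib

def digest(store: dict) -> str:
    # flat hash over sorted key/value pairs; a real system hashes key ranges
    # in a Merkle tree so mismatches can be localized without a full scan
    h = hashlib.sha256()
    for k in sorted(store):
        h.update(f"{k}={store[k]};".encode())
    return h.hexdigest()

def anti_entropy(a: dict, b: dict) -> None:
    if digest(a) == digest(b):
        return                          # cheap check: replicas already agree
    for k in set(a) | set(b):           # full walk only on mismatch
        if k not in b:
            b[k] = a[k]                 # fill missing keys in either direction
        elif k not in a:
            a[k] = b[k]
```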

Use Cases

  • High write throughput: Write to W nodes in parallel (typically W=1 or W=2); asynchronously propagate to remaining nodes.
  • High availability: No single point of failure (no primary); any node can accept writes/reads.
  • Acceptable eventual consistency: Can tolerate reads returning slightly stale data if nodes are recovering.

Synchronous vs Asynchronous Replication

Synchronous Replication

Definition: Primary waits for replicas to acknowledge before responding to client.

Durability: ✅ Extremely high — client ACK guarantees replicas have written to disk.

Latency: 🔶 Higher — must wait for replica’s round-trip (10–100ms within datacenter, 100–500ms cross-continent).

Availability: ❌ If even one replica is slow/down, all writes are blocked.

Real-world impact: A synchronously replicated write from a primary in San Francisco to a replica in New York incurs ~50ms of network round-trip. A single client issuing writes serially can therefore sustain only 20 writes/sec (1s / 50ms); reaching 1,000 writes/sec would take ~50 concurrent connections, and every write still pays the ~50ms penalty. Per-connection throughput collapses.

Asynchronous Replication

Definition: Primary acknowledges immediately after writing to its own disk; replication happens in background.

Durability: ❌ Weak — if primary crashes before replicating, data is lost.

Latency: ✅ Very low — no waiting for replicas (~1–5ms to write locally).

Availability: ✅ Replicas can be down/slow without affecting writes.

Real-world impact: Same scenario (SF primary, NY replica). Each write takes ~1ms (local disk write only). You can sustain 10,000+ writes/sec. Asynchronously, replicas catch up in the background (typically within 100ms).

Semi-Synchronous Replication

Definition: Primary waits for one replica to acknowledge, then immediately responds to client. Other replicas replicate asynchronously in background.

Trade-off: Compromise between sync and async.

  • Durability is strong: at least one replica has the write.
  • Latency is reasonable: only waiting for one replica (~10–20ms).
  • Availability is reasonable: one replica can be down without blocking writes.

Real production: PostgreSQL, MySQL, and MongoDB all support a semi-synchronous mode (primary ACKs after one follower confirms). Note, however, that PostgreSQL and MySQL replicate asynchronously out of the box; semi-sync must be enabled explicitly (PostgreSQL's synchronous_standby_names, MySQL's semi-sync replication plugin).


How Real Systems Use Replication

PostgreSQL Streaming Replication

PostgreSQL uses synchronous or asynchronous WAL streaming based on synchronous_commit setting. A typical production setup: Primary in us-east-1, synchronous replica in us-east-1 (same AZ, ~1ms latency), and asynchronous replica in us-west-2 (cold standby for disaster recovery, no ACK impact).

Configuration:

synchronous_commit = 'remote_apply'  -- Wait for replica to apply WAL

Behavior: Client write → Primary writes WAL → Primary sends WAL to sync replica → Sync replica writes and applies WAL → Sync replica ACK → Primary ACK to client. Total latency: ~3–15ms (≈1ms local write + ≈1ms network round-trip + ≈1ms replica write and apply, more under load). Async replica eventually catches up (100ms lag typical). If primary crashes, promotion takes ~30 seconds (manual or automated tools like Patroni). If the sync replica is overloaded, all writes slow down; this is why many deployments wait only for the replica to write the WAL (remote_write) rather than apply it (remote_apply).

MySQL Binlog Replication

MySQL replicates binlog events. A typical production setup: Primary in us-east, replicas in us-east (for reads), and us-west (for geo-locality or warm standby). Asynchronous by default (client ACK after primary binlog write, replica lags 10–100ms).

The relay log on replica buffers the binlog from primary; a separate SQL thread applies relay log changes to the replica’s database. If this SQL thread falls behind the IO thread, replica lag increases (can reach minutes if applying complex transactions).

Production numbers: A primary handling 50,000 QPS generates ~2GB binlog/hour. Replicas with sufficient disk I/O stay within 100ms lag. If a replica is slow (I/O bound), lag can hit seconds; monitoring tools alert if lag > 10 seconds.

Cassandra Leaderless Quorum Replication

Cassandra is fully leaderless and eventually consistent. Every node is equivalent. When you write a key:

Write request → Cassandra coordinator (any node)
Coordinator determines N=3 replica nodes (by partition key hash)
Coordinator sends write to all 3 replicas in parallel
Coordinator waits for W=1 (by default) ACK
Client ACK after W ACKs received

Meanwhile, replicas that didn't ACK in time (due to network delay)
receive the write via **hinted handoff**:
  Coordinator caches write for slow replica
  Periodically retries sending the write
  Slow replica eventually catches up
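The hinted-handoff flow above can be sketched as a coordinator that stashes writes for unreachable replicas and retries them later (`HintedCoordinator` is a hypothetical name, not a Cassandra API; a node's store is `None` while it is down):

```python
class HintedCoordinator:
    """Toy coordinator: writes through to live replicas, stashes hints for dead ones."""
    def __init__(self, replicas: dict):
        self.replicas = replicas   # node_id -> dict store, or None if the node is down
        self.hints = []            # (node_id, key, value) stashed for unreachable nodes

    def write(self, key, value) -> None:
        for node_id, store in self.replicas.items():
            if store is None:
                self.hints.append((node_id, key, value))   # hinted handoff
            else:
                store[key] = value

    def replay_hints(self) -> None:
        # periodic retry: deliver stashed writes to nodes that have recovered
        remaining = []
        for node_id, key, value in self.hints:
            store = self.replicas[node_id]
            if store is None:
                remaining.append((node_id, key, value))    # still unreachable
            else:
                store[key] = value                         # catch-up delivery
        self.hints = remaining
```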

Consistency tuning:

  • W=1, R=1 (fastest writes, eventual consistency) — acceptable for cache/events
  • W=2, R=2 (balanced) — good for most data
  • W=3, R=3 (strong consistency) — use for critical data (but slower)

Conflict resolution: Cassandra uses timestamps (last-write-wins). If two writes hit the same key on different nodes at the same time, the one with the higher timestamp wins; the other is silently discarded. This LWW behavior is acceptable for caches and user preferences, but not for counters or financial data (Cassandra provides a dedicated counter type for the former).
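LWW itself reduces to a one-line merge if each write carries a (timestamp, replica_id, value) tuple; the replica ID breaks timestamp ties so every node deterministically picks the same winner. A sketch, not Cassandra's implementation:

```python
def lww_merge(write_a: tuple, write_b: tuple) -> tuple:
    # write = (timestamp, replica_id, value); tuple comparison picks the
    # highest timestamp, with replica_id as a deterministic tie-breaker
    return max(write_a, write_b)
```

Because the tie-break is deterministic, replicas that merge the same pair in either order converge on the same value, but the loser is still silently discarded, which is the data-loss caveat above.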

DynamoDB Global Replication

DynamoDB replicates across multiple AWS regions for disaster recovery and geo-locality. You create a global table that replicates across regions:

Write to us-east-1 → Local commit → Replicate to eu-west-1, ap-south-1
                       ↓ (eventually consistent)
                   Replicas eventually receive write (100ms–1sec)

By default, DynamoDB uses eventual consistency (read may return stale data from 100ms ago). You can request strongly consistent reads on a single region (costs 2x read units, adds latency).

Write conflict resolution: DynamoDB uses LWW with application-defined timestamps. Application provides the timestamp; highest timestamp wins. This is better than server timestamp because application can ensure causality (e.g., “write after transaction T100 must have timestamp > T100’s timestamp”).

MongoDB Replica Sets

MongoDB uses single-leader replication with automatic failover. A replica set is 3+ nodes: 1 primary + 2+ secondaries. When the primary crashes, the secondaries automatically elect a new primary (via a Raft-based election protocol).

Client writes to primary
Primary writes to oplog (operation log)
Secondaries read from primary oplog, apply operations locally

Read preference options:
  - primary: only read from primary (strong consistency)
  - primary_preferred: read from primary, fallback to secondary (fast)
  - secondary: read from secondary only (fast reads, stale data)
  - secondary_preferred: read from secondary, fallback to primary (balanced)

Replication lag: A busy primary might generate oplog entries faster than secondaries can apply them. Lag is measured as the gap between the primary's and the secondary's last applied oplog entry (the optime difference). MongoDB can handle 10–100ms lag easily; if lag exceeds 1 second, alerts fire (secondary I/O is overloaded).

Failover: When primary crashes, secondaries detect it (heartbeat failure) within 10 seconds and hold an election. Any secondary with the latest oplog version becomes primary. Total downtime: 10–20 seconds (detection) + 5–10 seconds (election) = ~15–30 seconds. No data loss if the primary had written to its oplog before crashing.


Implementation — Quorum Read/Write Consistency Simulator

from typing import Any, Dict, List, Optional, Tuple

class QuorumReplica:
    """
    Represents a single replica node in a leaderless system.

    Stores versioned values; newer versions overwrite older ones.
    """

    def __init__(self, node_id: int):
        """
        Initialize a replica node.

        Args:
            node_id: unique identifier for this replica
        """
        self.node_id = node_id
        self.data: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def write(self, key: str, value: Any, version: int) -> bool:
        """
        Write a key-value pair with a version.

        Returns True if write is applied (value is newer than current).
        """
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)
            return True
        return False

    def read(self, key: str) -> Optional[Tuple[int, Any]]:
        """
        Read a key-value pair with its version.

        Returns (version, value) if key exists, None otherwise.
        """
        return self.data.get(key)

    def __repr__(self):
        return f"Replica(node={self.node_id}, keys={len(self.data)})"


class QuorumCluster:
    """
    Simulates a leaderless cluster with quorum-based read/write.

    Parameters:
    - N: total replicas
    - W: replicas required for write ACK
    - R: replicas to read from for strong consistency
    """

    def __init__(self, num_replicas: int, write_quorum: int, read_quorum: int):
        """
        Initialize a quorum cluster.

        Args:
            num_replicas (N): total number of replica nodes
            write_quorum (W): replicas required for write ACK
            read_quorum (R): replicas to read from
        """
        self.N = num_replicas
        self.W = write_quorum
        self.R = read_quorum
        self.replicas = [QuorumReplica(i) for i in range(num_replicas)]
        self.version_counter = 0

        # Verify quorum property
        if self.W + self.R <= self.N:
            print(f"Warning: W+R ({self.W}+{self.R}) <= N ({self.N})")
            print("   Reads may return stale data!")

    def write(self, key: str, value: Any, target_replicas: Optional[List[int]] = None) -> bool:
        """
        Write to quorum of replicas.

        Args:
            key: key to write
            value: value to write
            target_replicas: list of replica IDs to write to (None = all)

        Returns:
            True if W replicas acknowledged, False otherwise
        """
        if target_replicas is None:
            target_replicas = list(range(self.N))

        self.version_counter += 1
        version = self.version_counter

        acks = 0
        for replica_id in target_replicas[:self.W]:  # Write to first W
            if self.replicas[replica_id].write(key, value, version):
                acks += 1

        success = acks >= self.W
        status = "ACK" if success else "FAILED"
        print(f"  Write {key}={value} (v{version}) to replicas {target_replicas[:self.W]}: {status}")

        return success

    def read(self, key: str, source_replicas: Optional[List[int]] = None) -> Optional[Any]:
        """
        Read from quorum of replicas; return highest version.

        Args:
            key: key to read
            source_replicas: list of replica IDs to read from (None = all)

        Returns:
            Latest value if found in at least R replicas, None otherwise
        """
        if source_replicas is None:
            source_replicas = list(range(self.N))

        responses = []
        for replica_id in source_replicas[:self.R]:  # Read from first R
            data = self.replicas[replica_id].read(key)
            if data:
                responses.append((data[0], data[1], replica_id))

        if not responses:
            print(f"  Read {key}: NOT FOUND")
            return None

        # Return highest version
        latest_version, latest_value, source = max(responses, key=lambda x: x[0])
        print(f"  Read {key}: {latest_value} (v{latest_version} from replica {source})")

        return latest_value

    def print_cluster_state(self):
        """Print state of all replicas."""
        print("\nCluster state:")
        for replica in self.replicas:
            print(f"  {replica}: {dict(replica.data)}")


# Demonstration: Strong vs Weak Consistency

def demo_strong_consistency():
    """Demonstrate strong consistency (W+R > N)."""
    print("=" * 60)
    print("DEMO 1: Strong Consistency (W+R > N)")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=2, read_quorum=2)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}")
    print(f"W+R={cluster.W + cluster.R} > N={cluster.N} - Strong consistency guaranteed\n")

    print("Step 1: Write 'user:42' = 'alice' to replicas [0, 1]")
    cluster.write("user:42", "alice", target_replicas=[0, 1])

    print("\nStep 2: Read 'user:42' from replicas [1, 2]")
    value = cluster.read("user:42", source_replicas=[1, 2])
    print(f"  Result: {value} (guaranteed to see the write)")

    cluster.print_cluster_state()
    print()


def demo_weak_consistency():
    """Demonstrate weak consistency (W+R <= N)."""
    print("=" * 60)
    print("DEMO 2: Weak Consistency (W+R <= N)")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=1, read_quorum=1)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}")
    print(f"W+R={cluster.W + cluster.R} <= N={cluster.N} - Eventual consistency only\n")

    print("Step 1: Write 'user:42' = 'alice' to replica [0] (only 1 of 3)")
    cluster.write("user:42", "alice", target_replicas=[0, 1, 2])

    print("\nStep 2: Read 'user:42' from replica [1]")
    print("  (Replica [1] hasn't received the write yet!)")
    value = cluster.read("user:42", source_replicas=[1, 2])
    print(f"  Result: {value} (may be stale or missing)")

    cluster.print_cluster_state()
    print("\nNote: Eventually, replica [1] and [2] receive the write via")
    print("      anti-entropy or hinted handoff.\n")


def demo_failover():
    """Demonstrate handling of node failure."""
    print("=" * 60)
    print("DEMO 3: Node Failure & Read Repair")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=2, read_quorum=2)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}\n")

    print("Step 1: Write 'user:42' = 'alice' to replicas [0, 1]")
    cluster.write("user:42", "alice", target_replicas=[0, 1, 2])

    cluster.print_cluster_state()

    print("\nStep 2: Replica [2] is down (or stale). Read from [1, 2]")
    print("  (Only [1] responds; [2] is unreachable)")
    value = cluster.read("user:42", source_replicas=[1, 2])

    print("\nStep 3: Read repair — update replica [2] with latest value")
    cluster.replicas[2].write("user:42", value, cluster.version_counter)

    cluster.print_cluster_state()
    print()


if __name__ == "__main__":
    demo_strong_consistency()
    demo_weak_consistency()
    demo_failover()

    print("=" * 60)
    print("KEY TAKEAWAY")
    print("=" * 60)
    print("W + R > N   ->  Strong consistency (read and write sets overlap)")
    print("W + R <= N  ->  Eventual consistency (stale reads possible)")

When to Use / Avoid

Use Single-Leader When

  • ✅ Consistency is critical (banking, e-commerce)
  • ✅ Workload is mostly reads (read scaling via replicas)
  • ✅ Writes are bursty or low-volume
  • ✅ All data fits on a single machine
  • ✅ Replication lag of 100ms–1sec is acceptable

Use Multi-Leader When

  • ✅ Geo-distributed (multiple datacenters with local write latency)
  • ✅ Offline-capable (mobile apps, collaborative tools)
  • ✅ Conflicts are rare (users in different regions)
  • ✅ Application can handle eventual consistency

Use Leaderless When

  • ✅ Write throughput is extremely high (no bottleneck at primary)
  • ✅ Availability is paramount (no failover needed)
  • ✅ Acceptable eventual consistency (cache, events, analytics)
  • ✅ Partitioned data (can shard by key)

Replication Lag — Real-World Scenarios

Scenario 1: Read-Your-Writes Consistency Problem

T0: User alice writes profile.status = 'online' to primary (NY)
T1: Primary ACK to client (alice's browser)
T2: (5ms later) Alice's browser reads her profile from replica (SF)
T3: SF replica hasn't replicated yet; browser sees status='offline'
T4: Confusing bug — alice sees outdated version of her own data!

Solution: Sticky routing. After writing, route alice’s reads to the primary (NY) for 100ms, then switch to replica. Adds latency but guarantees consistency.

Scenario 2: Multi-Replica Read Inconsistency

T0: Write v1 to Primary; replicates to Replica-A
T1: Replica-B (slower) hasn't received v1 yet
T2: Client reads from Replica-B, gets v0 (stale)
T3: Client reads again from Replica-A, gets v1 (fresh)
T4: Client sees conflicting data! v0, then v1

Solution: Monotonic reads. Client remembers max version it has seen. On next read, doesn’t read from replicas with older version; waits for them to catch up or reads from primary.
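The monotonic-reads rule is small enough to sketch: the client tracks the highest version it has seen and rejects responses from replicas that are behind it (hypothetical names, not a real client library):

```python
class MonotonicClient:
    """Toy client enforcing monotonic reads across replicas."""
    def __init__(self):
        self.max_seen = 0          # highest replica version observed so far

    def read(self, replica_version: int, value):
        if replica_version < self.max_seen:
            return None            # stale replica: retry another replica or the primary
        self.max_seen = replica_version
        return value
```

After seeing v1 from Replica-A, a response carrying v0 from Replica-B is rejected instead of being shown to the user, so time never appears to run backwards.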

Scenario 3: Write-Write Conflict (Multi-Leader)

T0: Alice in SF writes profile.status='online' to leader-SF
T1: Bob in London writes profile.status='offline' to leader-London
T2: Leaders replicate to each other
T3: Conflict! Both changes applied, depending on LWW resolution
T4: One write is silently lost (data loss!)

Solution: Use CRDTs or application logic to resolve conflicts explicitly (don’t silently lose data). Or partition data so Alice’s writes go to leader-SF and Bob’s go to leader-London.



This post is licensed under CC BY 4.0 by the author.