Replication Strategies
Replicate data across multiple nodes to improve read scalability, availability, and disaster recovery -- choose your topology (single-leader, multi-leader, leaderless) and sync mode based on your durability/latency trade-offs.
Key Properties
| Property | Single-Leader | Multi-Leader | Leaderless (Quorum) |
|---|---|---|---|
| Write availability | ❌ Primary down = no writes | ✅ Write to any leader | ✅ Write to N nodes |
| Write conflicts | ✅ None (single write point) | 🔶 Complex to resolve | 🔶 Possible (concurrent writes resolved via LWW or vector clocks) |
| Replication lag | 10ms–1sec | 10ms–1sec | < 10ms (read from R nodes) |
| Failover time | 30sec–2min (detect + promote) | Automatic (any leader active) | Automatic (quorum picks latest) |
| Consistency model | Strong (primary) / Eventual (replicas) | Eventual (CRDTs or LWW) | Strong (W+R>N) / Eventual (W+R≤N) (see CAP Theorem) |
| Example systems | PostgreSQL, MySQL, MongoDB | CouchDB, MySQL multi-source | Cassandra, DynamoDB, Riak |
| Use case | OLTP (banking, e-commerce) | Geo-distributed, offline-capable | Analytics, high write throughput |
Replication Fundamentals
How Replication Works
All replication systems follow the same basic pattern:
- Write goes to leader(s) — Client sends write to primary/leader node(s).
- Leader applies write to local state — Update is written to leader’s database.
- Leader sends change to replicas — Write operation (or full state change) is sent to replica nodes.
- Replicas apply change — Each replica updates its local copy.
- Client is acknowledged — After replicas confirm (or immediately in async mode), client gets ACK.
The key variation is when the client is acknowledged:
- Synchronous: After replicas apply the write (durability guaranteed, but higher latency)
- Asynchronous: Immediately after leader applies (low latency, but data loss possible if leader crashes)
- Semi-synchronous: After leader + one replica apply (compromise)
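The three acknowledgment modes can be contrasted with a toy latency model (all figures are illustrative placeholders, not measurements):

```python
# Toy model of when the client ACK fires under each replication mode.
# Latencies are illustrative milliseconds, not benchmarks.

LEADER_WRITE_MS = 1          # leader's local disk write
REPLICA_RTT_MS = [20, 45]    # round-trip to each of two replicas

def ack_latency(mode: str) -> float:
    """Return client-visible write latency for a given sync mode."""
    if mode == "async":
        # ACK as soon as the leader has written locally.
        return LEADER_WRITE_MS
    if mode == "semi-sync":
        # ACK after the fastest single replica confirms.
        return LEADER_WRITE_MS + min(REPLICA_RTT_MS)
    if mode == "sync":
        # ACK only after every replica confirms (the slowest one gates).
        return LEADER_WRITE_MS + max(REPLICA_RTT_MS)
    raise ValueError(mode)

for mode in ("async", "semi-sync", "sync"):
    print(f"{mode:10s} -> {ack_latency(mode)} ms")
```

Note that in sync mode the slowest replica sets the pace, which is why a single degraded replica can stall all writes.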
Single-Leader (Primary-Follower) Replication
Architecture
One primary (leader) node accepts all writes. Multiple replica (follower) nodes asynchronously replicate the primary’s changes by consuming a replication log (PostgreSQL: WAL, MySQL: binlog, MongoDB: oplog).
Client writes → Primary (Pennsylvania)
↓ (WAL/binlog log shipped)
Replica 1 (Ohio)
Replica 2 (Virginia)
Reads can go to replicas (if eventual consistency OK)
All writes MUST go to primary
How Replication Works in Detail
PostgreSQL Streaming Replication
PostgreSQL ships the Write-Ahead Log (WAL) to replicas. Every transaction is written to the WAL on disk before the database heap pages are modified. Replicas connect to the primary and consume the WAL stream:
- Primary receives INSERT INTO users VALUES (1, 'alice').
- Primary writes this to the WAL file (000000010000000000000001).
- Primary modifies heap pages in memory/disk.
- Replica reads from the WAL stream: LSN=0/16001234 (Log Sequence Number).
- Replica replays the WAL: applies the INSERT to its own heap.
- Replica updates its own LSN to confirm it has replayed up to that point.
Synchronous vs Asynchronous:
- Asynchronous (default): Primary acknowledges immediately after writing WAL locally. Replica replication happens in the background. If primary crashes before replica replicates, data is lost.
- Synchronous: Primary waits for replica to acknowledge it has written the WAL before responding to client. Guarantees replica has durable copy before client succeeds. Adds 10–50ms latency (round-trip to replica and back).
Real production numbers: A medium-sized PostgreSQL primary (100GB database, 10,000 QPS) generates ~500MB of WAL per hour. Streaming to a replica in the same datacenter incurs ~1ms latency; shipping to another continent adds 50–200ms depending on network quality.
MySQL Binlog Replication
MySQL records all data-modifying statements (or row changes) in the binlog. Replicas connect to the primary and download binlog events:
Primary: binlog at position 154
         INSERT INTO orders VALUES (...); → binlog advances to position 289
→ sends to replica at position 154
Replica: Connects at position 154, receives INSERT event
Applies INSERT to its local database
Updates relay log position to 289
Gotcha — statement-based vs row-based:
- Statement-based (SBR): Replicate the actual SQL statement. Fast for large bulk operations. Problem: non-deterministic functions (NOW(), RAND()) may produce different results on the replica.
- Row-based (RBR): Replicate the actual row changes (INSERT these specific columns). Larger binlog (every row change becomes an event). Deterministic — replica always produces identical results. Most production systems use RBR.
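The SBR hazard can be shown with a toy model: re-executing a statement containing a random function on the replica diverges, while applying the primary's concrete row value does not.

```python
import random

# Toy demonstration of why statement-based replication breaks with
# non-deterministic functions. We model RAND() as random.random().

def apply_statement(db: dict, rng: random.Random) -> None:
    # Statement-based: the replica RE-EXECUTES "INSERT ... VALUES (RAND())",
    # so its own RNG produces a different value than the primary's did.
    db["lucky_number"] = rng.random()

def apply_row(db: dict, row_value: float) -> None:
    # Row-based: the replica applies the concrete value the primary produced.
    db["lucky_number"] = row_value

primary, replica = {}, {}
primary_rng, replica_rng = random.Random(1), random.Random(2)

# Primary executes the statement once.
apply_statement(primary, primary_rng)

# SBR: replica re-runs the statement with its own RNG state -> divergence.
apply_statement(replica, replica_rng)
sbr_diverged = primary["lucky_number"] != replica["lucky_number"]

# RBR: replica applies the primary's actual row value -> identical copies.
apply_row(replica, primary["lucky_number"])
rbr_identical = primary["lucky_number"] == replica["lucky_number"]

print("SBR diverged:", sbr_diverged)
print("RBR identical:", rbr_identical)
```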
Real production numbers: A high-write MySQL primary (50,000 INSERTs/sec, row-based replication) generates ~2GB of binlog per hour. Replicas lag behind by 10–100ms depending on network congestion.
Read-Your-Writes Consistency Problem
Single-leader replication has a consistency gap: replicas lag behind the primary.
Scenario:
T1: Client writes to primary: UPDATE profile SET status='online'
T2: (1ms later) Primary returns OK
T3: Client immediately reads from replica: SELECT status FROM profile
T4: Replica hasn't replicated yet, so client sees status='offline' (stale!)
Solutions:
- Read-after-write consistency: For user’s own data, always read from primary.
if (user_id == current_user_id) {
    read from primary   // Guarantee we see our own writes
} else {
    read from replica   // Eventual consistency OK for others' data
}
- Timestamp-based read routing: Client provides write timestamp; replica waits until it has replicated that timestamp before serving reads.
primary_version = client.write(data)   // returns version 42
replica.read_with_version(42)          // wait for replica to reach version 42
- Session-based consistency: Use sticky sessions; route all requests from the same client to the same replica. This gives the client a monotonically advancing view; combine it with waiting for the session's last write to be applied to also get read-your-writes.
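The first solution can be sketched as a small router that pins a user's reads to the primary for a short window after a write. The window length is a tuning assumption, not a standard value:

```python
import time

# Minimal sketch of read-after-write routing: after a user writes, pin that
# user's reads to the primary for a short window so they see their own write.
# PIN_WINDOW is an illustrative tuning assumption.

PIN_WINDOW = 0.5  # seconds to keep reading from primary after a write

class Router:
    def __init__(self):
        self.last_write_at: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self.last_write_at[user_id] = time.monotonic()

    def choose_node(self, user_id: str) -> str:
        wrote = self.last_write_at.get(user_id)
        if wrote is not None and time.monotonic() - wrote < PIN_WINDOW:
            return "primary"   # guarantee read-your-writes
        return "replica"       # eventual consistency is fine here

router = Router()
print(router.choose_node("alice"))   # no recent write -> replica
router.record_write("alice")
print(router.choose_node("alice"))   # within pin window -> primary
```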
Failover Challenges
When the primary crashes, the system must detect failure and promote a replica to primary.
Failure detection latency: How fast does the system realize the primary is down?
- Active health checking (ping) can detect failure in 100–500ms.
- In production, you usually see 30–60 seconds of silence before manual intervention or timeout kicks in.
Failover process:
- Detect primary failure (30–120 seconds).
- Choose which replica to promote (should have latest replication LSN).
- Promote replica to primary (a few seconds).
- Redirect clients to new primary (seconds to minutes).
- Catch up other replicas to new primary.
Total downtime: 30 seconds (detection) + 10 seconds (promotion) + 30 seconds (client redirect) ≈ 70 seconds in a well-automated system.
Data loss scenario: If the primary crashes after writing to local disk but before replicating to any replica (asynchronous mode), committed data is lost. The newly promoted replica won’t have that write. In synchronous mode, you guarantee at least one replica has the data, so failover has zero data loss (but higher write latency).
Multi-Leader (Multi-Master) Replication
Architecture
Multiple nodes are active leaders. Each leader accepts writes; leaders replicate their changes to each other.
Leader A (San Francisco) ←→ Leader B (London)
↓ ↓
Replica A1 (US) Replica B1 (EU)
Both leaders can serve write traffic simultaneously. This is critical for geo-distributed systems where sending all writes across the Atlantic to a single primary adds 50–100ms latency.
Write Conflict Resolution
With multiple leaders, the same data can be written on different leaders simultaneously:
Scenario:
T1: User in SF writes to Leader A: UPDATE profile SET status='online'
T2: User in London writes to Leader B: UPDATE profile SET status='offline'
T3: Leaders replicate their changes to each other
T4: Now both leaders have conflicting values!
Resolution strategies:
- Last Write Wins (LWW): Each write has a timestamp. Later timestamp wins. Problem: data loss (one write discarded). Acceptable for non-critical data (user status, cache).
- CRDT (Conflict-free Replicated Data Type): Data structure designed so concurrent writes always merge correctly.
  - Counter: Operations are commutative; +1 on A and +1 on B both result in +2 regardless of merge order.
  - Grow-only set: Only supports add, not remove. Conflict-free by definition.
  - Register with metadata: Each write includes a unique replica ID; merge by comparing metadata (e.g., replica 'A' beats replica 'B' lexicographically).
- Application logic / Manual resolution: Detect conflicts and ask application/user to resolve.
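The counter case above can be made concrete with a minimal grow-only counter (G-Counter) sketch: each replica increments only its own slot, and merge takes the element-wise max, so concurrent increments on different leaders never conflict.

```python
# Minimal G-Counter CRDT sketch. Each replica owns one slot in the count
# vector; merging two states takes the per-slot maximum, which is
# commutative, associative, and idempotent -> merge order never matters.

class GCounter:
    def __init__(self, replica_id: str, replica_ids: list):
        self.replica_id = replica_id
        self.counts = {r: 0 for r in replica_ids}

    def increment(self, amount: int = 1) -> None:
        # A replica only ever increments its OWN slot.
        self.counts[self.replica_id] += amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max: safe under concurrent, out-of-order delivery.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts[r], c)

    def value(self) -> int:
        return sum(self.counts.values())

ids = ["A", "B"]
a, b = GCounter("A", ids), GCounter("B", ids)
a.increment()            # +1 applied on leader A
b.increment()            # concurrent +1 applied on leader B
a.merge(b); b.merge(a)   # replicate in both directions
print(a.value(), b.value())  # both converge to 2
```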
Use Cases
- Offline-capable apps (CouchDB on mobile): Changes are replicated when connection is available; conflicts are resolved via CRDTs or application logic.
- Geo-distributed systems (multi-datacenter masters): Each datacenter has a leader for local write latency. Conflicts rare if data is partitioned by geography (different users in different regions).
- Collaborative editing (Google Docs): Uses OT (Operational Transformation) or CRDTs to merge concurrent edits.
Real Systems
- CouchDB: Supports multi-leader replication with CRDTs and application-level conflict handlers.
- MySQL multi-source replication: Multiple primary servers can replicate to each other; application must handle conflicts or use row-level locking.
- Riak: Every node is a leader; uses CRDTs and vector clocks to resolve conflicts.
Leaderless (Quorum-based) Replication
Architecture
Every node is equal. Writes go to multiple nodes in parallel, read from multiple nodes in parallel. No master/follower distinction.
Client writes to all N nodes concurrently:
Write(key='user:42', value='alice') → [Node 1, Node 2, Node 3]
Wait for W acknowledgments (W ≤ N)
Client reads from R nodes concurrently:
Read(key='user:42') → [Node 1, Node 3] (Pick latest by version/timestamp)
Return highest version
Quorum Mathematics
The quorum intersection property ensures that reads and writes overlap (see Consensus & Quorum for detailed quorum theory):
If W + R > N, then every read overlaps with every write.
Example: N=3, W=2, R=2
Write to 2 of 3 nodes → at least one node has the write
Read from 2 of 3 nodes → at least one of those 2 has the write
Proof: With 3 total nodes, 2+2=4 > 3, so at least one node is in both sets
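The proof can be checked exhaustively for small N by enumerating every possible write set and read set:

```python
from itertools import combinations

# Exhaustively verify the quorum intersection property for N=3, W=2, R=2:
# every possible write set of size W intersects every read set of size R.

N, W, R = 3, 2, 2
nodes = range(N)

overlaps = [
    bool(set(ws) & set(rs))
    for ws in combinations(nodes, W)
    for rs in combinations(nodes, R)
]
print(all(overlaps))  # W + R > N -> no disjoint write/read pair exists

# Counterexample with W=1, R=1 (W + R <= N): a read can miss the write.
disjoint_exists = any(
    not (set(ws) & set(rs))
    for ws in combinations(nodes, 1)
    for rs in combinations(nodes, 1)
)
print(disjoint_exists)  # stale reads are possible
```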
Consistency table:
| N | W | R | Consistency | Example Use |
|---|---|---|---|---|
| 3 | 1 | 3 | Strong (R=N) | Read from all nodes; pick latest |
| 3 | 2 | 2 | Strong (W+R>N) | Cassandra “quorum” |
| 3 | 3 | 1 | Strong (W=N) | Write all before ACK; read any |
| 3 | 1 | 1 | Weak (W+R≤N) | Write one; read any (may be stale) |
Read Repair & Anti-Entropy
Nodes may temporarily have stale data. Recovery happens via:
- Read repair: During a read, if one node has stale data and another has newer data, the stale node is updated immediately.
- Anti-entropy (background repair): Periodically compare all replicas and fix mismatches (Merkle tree hashing).
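The digest-comparison idea behind anti-entropy can be sketched per key (real systems hash key ranges into a Merkle tree so that matching subtrees are skipped wholesale; this flat version just shows why digests save bandwidth):

```python
import hashlib

# Sketch of anti-entropy by digest comparison: replicas exchange hashes
# and transfer only the keys whose hashes differ, instead of all values.

def digest(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

def find_stale_keys(local: dict, remote_digests: dict) -> list:
    """Keys where our digest differs from (or is missing vs) the remote's."""
    return [
        k for k, d in remote_digests.items()
        if digest(local.get(k, "")) != d
    ]

replica_a = {"user:1": "alice", "user:2": "bob"}
replica_b = {"user:1": "alice", "user:2": "bobby"}  # diverged on one key

b_digests = {k: digest(v) for k, v in replica_b.items()}
print(find_stale_keys(replica_a, b_digests))  # only the mismatch needs repair
```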
Use Cases
- High write throughput: Write to W nodes in parallel (typically W=1 or W=2); asynchronously propagate to remaining nodes.
- High availability: No single point of failure (no primary); any node can accept writes/reads.
- Acceptable eventual consistency: Can tolerate reads returning slightly stale data if nodes are recovering.
Synchronous vs Asynchronous Replication
Synchronous Replication
Definition: Primary waits for replicas to acknowledge before responding to client.
Durability: ✅ Extremely high — client ACK guarantees replicas have written to disk.
Latency: 🔶 Higher — must wait for replica’s round-trip (10–100ms within datacenter, 100–500ms cross-continent).
Availability: ❌ If even one replica is slow/down, all writes are blocked.
Real-world impact: A synchronously replicated write on a primary in San Francisco with one replica in New York incurs ~50ms network round-trip. A client issuing writes serially can therefore sustain only 1 / 0.05s = 20 writes/sec per connection — far short of a 1,000 writes/sec target unless you open dozens of parallel connections. Per-connection throughput collapses.
Asynchronous Replication
Definition: Primary acknowledges immediately after writing to its own disk; replication happens in background.
Durability: ❌ Weak — if primary crashes before replicating, data is lost.
Latency: ✅ Very low — no waiting for replicas (~1–5ms to write locally).
Availability: ✅ Replicas can be down/slow without affecting writes.
Real-world impact: Same scenario (SF primary, NY replica). Each write takes ~1ms (local disk write only). You can sustain 10,000+ writes/sec. Asynchronously, replicas catch up in the background (typically within 100ms).
Semi-Synchronous Replication
Definition: Primary waits for one replica to acknowledge, then immediately responds to client. Other replicas replicate asynchronously in background.
Trade-off: Compromise between sync and async.
- Durability is strong: at least one replica has the write.
- Latency is reasonable: only waiting for one replica (~10–20ms).
- Availability is reasonable: one replica can be down without blocking writes.
Real production: PostgreSQL and MySQL replicate asynchronously by default; semi-synchronous is an explicit opt-in (PostgreSQL's synchronous_standby_names, MySQL's semi-synchronous replication plugin). MongoDB's default write concern has been w: "majority" since 5.0, which behaves like semi-sync (wait for a majority of the replica set before the primary ACKs).
How Real Systems Use Replication
PostgreSQL Streaming Replication
PostgreSQL uses synchronous or asynchronous WAL streaming based on synchronous_commit setting. A typical production setup: Primary in us-east-1, synchronous replica in us-east-1 (same AZ, ~1ms latency), and asynchronous replica in us-west-2 (cold standby for disaster recovery, no ACK impact).
Configuration:
synchronous_commit = 'remote_apply' -- Wait for replica to apply WAL
Behavior: Client write → Primary writes WAL → Primary sends WAL to sync replica → Sync replica writes and applies WAL → Sync replica ACK → Primary ACK to client. Total latency: ~3–15ms (≈1ms local write + ≈1ms network round-trip + ≈1ms replica write/apply, plus queuing under load). Async replica eventually catches up (100ms lag typical). If the primary crashes, promotion takes ~30 seconds (manual, or automated with tools like Patroni). If the sync replica is overloaded, all writes slow down; this is why many deployments prefer a lighter setting such as remote_write (wait for the replica to receive the WAL, not apply it).
MySQL Binlog Replication
MySQL replicates binlog events. A typical production setup: Primary in us-east, replicas in us-east (for reads), and us-west (for geo-locality or warm standby). Asynchronous by default (client ACK after primary binlog write, replica lags 10–100ms).
The relay log on replica buffers the binlog from primary; a separate SQL thread applies relay log changes to the replica’s database. If this SQL thread falls behind the IO thread, replica lag increases (can reach minutes if applying complex transactions).
Production numbers: A primary handling 50,000 QPS generates ~2GB binlog/hour. Replicas with sufficient disk I/O stay within 100ms lag. If a replica is slow (I/O bound), lag can hit seconds; monitoring tools alert if lag > 10 seconds.
Cassandra Leaderless Quorum Replication
Cassandra is fully leaderless and eventually consistent. Every node is equivalent. When you write a key:
Write request → Cassandra coordinator (any node)
Coordinator determines N=3 replica nodes (by partition key hash)
Coordinator sends write to all 3 replicas in parallel
Coordinator waits for W=1 (by default) ACK
Client ACK after W ACKs received
Meanwhile, replicas that didn't ACK in time (due to network delay)
receive the write via **hinted handoff**:
Coordinator caches write for slow replica
Periodically retries sending the write
Slow replica eventually catches up
Consistency tuning:
- W=1, R=1 (fastest writes, eventual consistency) — acceptable for cache/events
- W=2, R=2 (balanced) — good for most data
- W=3, R=3 (strong consistency) — use for critical data (but slower)
Conflict resolution: Cassandra resolves conflicts with cell timestamps (last-write-wins); counters are handled by a special commutative column type rather than LWW. If two writes happen to the same key on different nodes at the same time, the one with the higher timestamp wins; the other is discarded. This is LWW — acceptable for caches and user preferences, not for financial data.
DynamoDB Global Replication
DynamoDB replicates across multiple AWS regions for disaster recovery and geo-locality. You create a global table that replicates across regions:
Write to us-east-1 → Local commit → Replicate to eu-west-1, ap-south-1
↓ (eventually consistent)
Replicas eventually receive write (100ms–1sec)
By default, DynamoDB uses eventual consistency (read may return stale data from 100ms ago). You can request strongly consistent reads on a single region (costs 2x read units, adds latency).
Write conflict resolution: DynamoDB global tables use last-writer-wins based on a system-managed timestamp; the most recent write propagates to all regions and the conflicting write is discarded. If your application needs causal ordering, avoid concurrent writes to the same item from multiple regions (e.g., pin each item's writes to a home region).
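A generic LWW merge can be sketched as follows; the writer-ID tiebreak is an assumption added here to make the merge deterministic when two writes carry the same timestamp:

```python
# Sketch of last-writer-wins merge over (timestamp, writer_id, value)
# records. The writer_id tiebreak is an illustrative assumption, not a
# documented DynamoDB mechanism.

def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each record is (timestamp, writer_id, value); highest version wins."""
    return max(a, b, key=lambda rec: (rec[0], rec[1]))

us_east = (1700000000_123, "us-east-1", "status=online")
eu_west = (1700000000_456, "eu-west-1", "status=offline")

winner = lww_merge(us_east, eu_west)
print(winner[2])  # the later timestamp wins; the other write is discarded
```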
MongoDB Replica Sets
MongoDB uses single-leader replication with automatic failover. A replica set is 3+ nodes: 1 primary + 2+ secondaries. When the primary crashes, the secondaries automatically elect a new primary (via a Raft-based election protocol).
Client writes to primary
Primary writes to oplog (operation log)
Secondaries read from primary oplog, apply operations locally
Read preference options:
- primary: only read from primary (strong consistency)
- primaryPreferred: read from primary, fall back to a secondary
- secondary: read from secondaries only (fast reads, possibly stale data)
- secondaryPreferred: read from secondaries, fall back to primary (balanced)
Replication lag: A busy primary might generate oplog entries faster than secondaries can apply them. Lag is measured in time: the difference between the timestamps of the primary's and the secondary's most recent oplog entries. MongoDB handles 10–100ms lag easily; if lag exceeds 1 second, alerts fire (secondary I/O is overloaded).
Failover: When the primary crashes, secondaries detect it (heartbeat failure) within ~10 seconds and hold an election. An eligible secondary with the most recent oplog becomes primary. Total downtime: 10–20 seconds (detection) + 5–10 seconds (election) = ~15–30 seconds. Writes that had replicated to the newly elected secondary survive; writes present only on the crashed primary's oplog are rolled back when it rejoins.
Implementation — Quorum Read/Write Consistency Simulator
from typing import Any, Dict, List, Optional, Tuple


class QuorumReplica:
    """
    Represents a single replica node in a leaderless system.
    Stores versioned values; newer versions overwrite older ones.
    """

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.data: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def write(self, key: str, value: Any, version: int) -> bool:
        """Apply a versioned write; True if the value is newer than current."""
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)
            return True
        return False

    def read(self, key: str) -> Optional[Tuple[int, Any]]:
        """Return (version, value) if the key exists, None otherwise."""
        return self.data.get(key)

    def __repr__(self):
        return f"Replica(node={self.node_id}, keys={len(self.data)})"


class QuorumCluster:
    """
    Simulates a leaderless cluster with quorum-based read/write.

    Parameters:
    - N: total replicas
    - W: replicas required for write ACK
    - R: replicas to read from for strong consistency
    """

    def __init__(self, num_replicas: int, write_quorum: int, read_quorum: int):
        self.N = num_replicas
        self.W = write_quorum
        self.R = read_quorum
        self.replicas = [QuorumReplica(i) for i in range(num_replicas)]
        self.version_counter = 0

        # Verify quorum property
        if self.W + self.R <= self.N:
            print(f"Warning: W+R ({self.W}+{self.R}) <= N ({self.N})")
            print("         Reads may return stale data!")

    def write(self, key: str, value: Any,
              target_replicas: Optional[List[int]] = None) -> bool:
        """Write to a quorum of replicas; True if W replicas acknowledged."""
        if target_replicas is None:
            target_replicas = list(range(self.N))

        self.version_counter += 1
        version = self.version_counter

        acks = 0
        for replica_id in target_replicas[:self.W]:  # Write to first W
            if self.replicas[replica_id].write(key, value, version):
                acks += 1

        success = acks >= self.W
        status = "ACK" if success else "FAILED"
        print(f"  Write {key}={value} (v{version}) "
              f"to replicas {target_replicas[:self.W]}: {status}")
        return success

    def read(self, key: str,
             source_replicas: Optional[List[int]] = None) -> Optional[Any]:
        """Read from a quorum of replicas; return the highest-version value."""
        if source_replicas is None:
            source_replicas = list(range(self.N))

        responses = []
        for replica_id in source_replicas[:self.R]:  # Read from first R
            data = self.replicas[replica_id].read(key)
            if data:
                responses.append((data[0], data[1], replica_id))

        if not responses:
            print(f"  Read {key}: NOT FOUND")
            return None

        # Return highest version
        latest_version, latest_value, source = max(responses, key=lambda x: x[0])
        print(f"  Read {key}: {latest_value} (v{latest_version} from replica {source})")
        return latest_value

    def print_cluster_state(self):
        """Print state of all replicas."""
        print("\nCluster state:")
        for replica in self.replicas:
            print(f"  {replica}: {dict(replica.data)}")


# Demonstration: Strong vs Weak Consistency

def demo_strong_consistency():
    """Demonstrate strong consistency (W+R > N)."""
    print("=" * 60)
    print("DEMO 1: Strong Consistency (W+R > N)")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=2, read_quorum=2)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}")
    print(f"W+R={cluster.W + cluster.R} > N={cluster.N} - Strong consistency guaranteed\n")

    print("Step 1: Write 'user:42' = 'alice' to replicas [0, 1]")
    cluster.write("user:42", "alice", target_replicas=[0, 1])

    print("\nStep 2: Read 'user:42' from replicas [1, 2]")
    value = cluster.read("user:42", source_replicas=[1, 2])
    print(f"  Result: {value} (guaranteed to see the write)")

    cluster.print_cluster_state()
    print()


def demo_weak_consistency():
    """Demonstrate weak consistency (W+R <= N)."""
    print("=" * 60)
    print("DEMO 2: Weak Consistency (W+R <= N)")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=1, read_quorum=1)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}")
    print(f"W+R={cluster.W + cluster.R} <= N={cluster.N} - Eventual consistency only\n")

    print("Step 1: Write 'user:42' = 'alice' to replica [0] (only 1 of 3)")
    cluster.write("user:42", "alice", target_replicas=[0, 1, 2])

    print("\nStep 2: Read 'user:42' from replica [1]")
    print("        (Replica [1] hasn't received the write yet!)")
    value = cluster.read("user:42", source_replicas=[1, 2])
    print(f"  Result: {value} (may be stale or missing)")

    cluster.print_cluster_state()
    print("\nNote: Eventually, replicas [1] and [2] receive the write via")
    print("      anti-entropy or hinted handoff.\n")


def demo_failover():
    """Demonstrate handling of node failure."""
    print("=" * 60)
    print("DEMO 3: Node Failure & Read Repair")
    print("=" * 60)

    cluster = QuorumCluster(num_replicas=3, write_quorum=2, read_quorum=2)
    print(f"N={cluster.N}, W={cluster.W}, R={cluster.R}\n")

    print("Step 1: Write 'user:42' = 'alice' to replicas [0, 1]")
    cluster.write("user:42", "alice", target_replicas=[0, 1, 2])
    cluster.print_cluster_state()

    print("\nStep 2: Replica [2] is down (or stale). Read from [1, 2]")
    print("        (Only [1] responds; [2] is unreachable)")
    value = cluster.read("user:42", source_replicas=[1, 2])

    print("\nStep 3: Read repair — update replica [2] with latest value")
    cluster.replicas[2].write("user:42", value, cluster.version_counter)
    cluster.print_cluster_state()
    print()


if __name__ == "__main__":
    demo_strong_consistency()
    demo_weak_consistency()
    demo_failover()

    print("=" * 60)
    print("KEY TAKEAWAY")
    print("=" * 60)
    print("W + R > N -> Strong consistency")
    print("W + R = N -> Eventual consistency")
    print("W + R < N -> Weak consistency (stale reads possible)")
When to Use / Avoid
Use Single-Leader When
- ✅ Consistency is critical (banking, e-commerce)
- ✅ Workload is mostly reads (read scaling via replicas)
- ✅ Writes are bursty or low-volume
- ✅ All data fits on a single machine
- ✅ Replication lag of 100ms–1sec is acceptable
Use Multi-Leader When
- ✅ Geo-distributed (multiple datacenters with local write latency)
- ✅ Offline-capable (mobile apps, collaborative tools)
- ✅ Conflicts are rare (users in different regions)
- ✅ Application can handle eventual consistency
Use Leaderless When
- ✅ Write throughput is extremely high (no bottleneck at primary)
- ✅ Availability is paramount (no failover needed)
- ✅ Acceptable eventual consistency (cache, events, analytics)
- ✅ Partitioned data (can shard by key)
Replication Lag — Real-World Scenarios
Scenario 1: Read-Your-Writes Consistency Problem
T0: User alice writes profile.status = 'online' to primary (NY)
T1: Primary ACK to client (alice's browser)
T2: (5ms later) Alice's browser reads her profile from replica (SF)
T3: SF replica hasn't replicated yet; browser sees status='offline'
T4: Confusing bug — alice sees outdated version of her own data!
Solution: Sticky routing. After writing, route alice’s reads to the primary (NY) for 100ms, then switch to replica. Adds latency but guarantees consistency.
Scenario 2: Multi-Replica Read Inconsistency
T0: Write v1 to Primary; replicates to Replica-A
T1: Replica-B (slower) hasn't received v1 yet
T2: Client reads from Replica-B, gets v0 (stale)
T3: Client reads again from Replica-A, gets v1 (fresh)
T4: Client sees conflicting data! v0, then v1
Solution: Monotonic reads. The client remembers the highest version it has seen. On the next read, it rejects responses from replicas with an older version and retries (or falls back to the primary) until it sees that version or newer.
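A minimal client-side sketch of monotonic reads, assuming every replica response carries a version number:

```python
# Sketch of client-side monotonic reads: remember the highest version seen
# and reject any replica response that is older (the caller would then
# retry another replica or fall back to the primary).

class MonotonicClient:
    def __init__(self):
        self.max_seen = 0

    def accept(self, version: int, value: str):
        """Return the value if it is not older than anything seen before."""
        if version < self.max_seen:
            return None  # stale response: retry elsewhere
        self.max_seen = version
        return value

client = MonotonicClient()
print(client.accept(1, "v1"))  # Replica-A, fresh -> accepted
print(client.accept(0, "v0"))  # Replica-B, older than max seen -> rejected
```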
Scenario 3: Write-Write Conflict (Multi-Leader)
T0: Alice in SF writes profile.status='online' to leader-SF
T1: Bob in London writes profile.status='offline' to leader-London
T2: Leaders replicate to each other
T3: Conflict! Both changes applied, depending on LWW resolution
T4: One write is silently lost (data loss!)
Solution: Use CRDTs or application logic to resolve conflicts explicitly (don’t silently lose data). Or partition data so Alice’s writes go to leader-SF and Bob’s go to leader-London.
References
- 📄 Designing Data-Intensive Applications — Kleppmann, Ch. 5 — Comprehensive treatment of replication strategies, lag problems, and consistency models.
- 📄 The Dangers of Replication and a Solution — Gray, Helland, O'Neil & Shasha (SIGMOD 1996) — Classic analysis of why update-anywhere replication destabilizes as nodes are added.
- 🎥 ByteByteGo — Database Replication — Visual explanation of primary-replica, multi-leader, and quorum replication.
- 🔗 PostgreSQL Replication Documentation — Detailed guide to WAL archiving, streaming replication, and failover.
- 🔗 MySQL Replication Documentation — Binlog replication, relay logs, semi-synchronous replication.
- 🔗 Cassandra Replication Documentation — Quorum-based leaderless replication, hinted handoff, read repair.
- 🔗 DynamoDB Replication & Failover — Global tables, eventual vs strong consistency, cross-region replication.
- 📖 Wikipedia: Replication (computing) — Overview of replication topologies and consistency models.
- 📖 Wikipedia: Quorum — Mathematical foundation of quorum-based systems.