Consensus — Paxos & Raft
Distributed nodes agree on a single value despite failures — the foundation of replicated state machines, leader election, and coordination in systems that cannot tolerate split-brain.
The Core Problem
Achieving consensus in an asynchronous network where nodes can crash requires protocols that guarantee:
- Safety: no two nodes ever decide on different values
- Liveness: every non-failed node eventually decides, given periods of network stability
- Fault tolerance: with up to f crashed nodes, the surviving majority can still decide
Note that Paxos and Raft tolerate crash (fail-stop) failures, not Byzantine failures: a crashed node stops responding, it never sends arbitrary or malicious messages. The FLP impossibility result also means no deterministic protocol can guarantee liveness in a fully asynchronous network, which is why both protocols rely on timeouts.
In a system of N nodes, you can tolerate at most f = ⌊(N-1)/2⌋ crash failures. This is the quorum requirement: any ⌊N/2⌋ + 1 nodes form a majority, and any two majorities overlap in at least one node, which is what guarantees safety.
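The arithmetic above is easy to check in code. A small sketch; the helper names are illustrative, not from any library:

```python
def quorum_size(n: int) -> int:
    """Minimum number of votes such that any two quorums overlap."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Largest f such that the remaining n - f nodes still form a quorum."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, quorum_size(n), max_failures(n))
# n=3 -> quorum 2, tolerates 1 failure
# n=5 -> quorum 3, tolerates 2 failures
# n=7 -> quorum 4, tolerates 3 failures
```

Note that even cluster sizes buy nothing: N=4 still tolerates only 1 failure (quorum is 3), which is why production deployments use 3, 5, or 7 nodes.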
Key Properties
| Property | Paxos | Raft |
|---|---|---|
| Consensus Safety | Proven by Lamport (1998) | Proven by Ongaro & Ousterhout (2014) |
| Liveness | With leader election | Explicit leader → simpler liveness |
| Leader | Implicit (any can propose) | Explicit single leader per term |
| Log replication | Gaps allowed (complex recovery) | Strict sequencing (no gaps) |
| Message complexity | O(N²) with competing proposers | O(N) per heartbeat or log entry |
| Failure tolerance | Up to ⌊(N-1)/2⌋ | Up to ⌊(N-1)/2⌋ |
When to Use / Avoid
Use Consensus (Paxos / Raft) when:
- You need atomic distributed decisions (etcd, CockroachDB)
- Leader election is critical (Consul service discovery)
- Metadata coordination requires safety (Kubernetes configuration)
- System must survive f failures with N = 2f+1 nodes
Avoid consensus when:
- A single authoritative primary with failover is acceptable (use simpler primary-backup replication)
- Application can tolerate eventual consistency (use gossip)
- Latency-critical (consensus adds 1-2 network round trips)
- Data is too large to replicate (use data replication + coordination service separately)
Real Systems Using Consensus
| System | Protocol | Usage | Deployment |
|---|---|---|---|
| etcd | Raft | Kubernetes API state, leader election, distributed locks | 3-5 node clusters; Google Cloud, AWS, on-prem |
| CockroachDB | Raft | Range replication, writes across geo-distributed shards | 3+ replicas per range; typical 5-50 nodes per region |
| TiKV (TiDB) | Raft | Distributed KV store for TiDB engine | 3+ replicas; can span 10+ regions at scale |
| Consul | Raft | Service discovery, health checks, configuration | 3-5 server nodes; thousands of client agents |
| InfluxDB Enterprise | Raft | Cluster metadata, retention policies | 3-node meta cluster for HA; metadata writes go through the leader |
How Real Systems Use This
etcd (Kubernetes): etcd runs as a 3-5 node Raft cluster and stores all Kubernetes cluster state (Deployments, ConfigMaps, Secrets). On write, the leader appends to its log, sends AppendEntries to followers, and waits for a majority ACK before confirming to the client, so replication to a quorum is synchronous. A typical production etcd cluster sees on the order of 1MB/second of writes from the Kubernetes API server; at 3 nodes, end-to-end write latency is ~100-150ms. With a 1000ms election timeout and a 100ms heartbeat interval (the etcd defaults), a leader failure triggers re-election within ~1.5 seconds.
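The write path described above (leader appends, replicates, waits for a majority ACK) can be sketched as a simplified model; `Follower` and `replicate_write` are illustrative names, not etcd's API:

```python
class Follower:
    """Toy follower: accepts AppendEntries unless unreachable."""
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []

    def append(self, entry):
        # Stands in for the AppendEntries RPC; an unreachable
        # follower is equivalent to a timed-out RPC.
        if self.reachable:
            self.log.append(entry)
            return True
        return False

def replicate_write(entry, followers):
    """Leader-side commit rule for a cluster of len(followers) + 1 nodes:
    the entry is committed once a majority has durably appended it."""
    cluster_size = len(followers) + 1
    quorum = cluster_size // 2 + 1
    acks = 1  # the leader counts its own append
    for f in followers:
        if f.append(entry):
            acks += 1
    return acks >= quorum  # only now is it safe to reply to the client

# 3-node cluster with one follower down: 2 of 3 ACKs, still committed.
print(replicate_write("put key=value", [Follower(), Follower(reachable=False)]))  # True
```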
CockroachDB: Every key range is replicated to 3+ nodes via Raft. A client write routes to the range’s leaseholder, which proposes the entry through Raft and commits once a majority (2 of 3 replicas) ACKs. CockroachDB runs multi-Raft: each range is an independent Raft group, and traffic for the many groups sharing a pair of nodes is coalesced, keeping write latency around ~50-100ms in wide-area deployments. Failures are detected via heartbeat timeout (150ms), and a new leader is elected within 300-500ms. In multi-region setups the leaseholder is placed in the region nearest to clients, while replicas span all regions.
Consul: The Consul server cluster (typically 3-5 nodes) maintains the service registry and health state via Raft. Agents send health updates to any server; the leader applies them to the state machine. Consul’s leader election uses a 1000ms election timeout. With thousands of services, the Raft state machine can grow to tens of MB; Raft snapshots are taken every 16384 log entries to speed recovery. Consul serves strongly consistent reads through the leader, but also offers stale reads that any server can answer (a trade-off favoring availability and read throughput).
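The snapshot behavior mentioned above can be sketched as a trigger rule. This is a simplified model of log compaction, with illustrative names; in HashiCorp's Raft library the threshold is a configuration knob rather than a hard-coded constant:

```python
SNAPSHOT_THRESHOLD = 16384  # log entries between snapshots (figure from the text)

class KVStateMachine:
    """Toy state machine: a key-value map built by applying log entries."""
    def __init__(self):
        self.data = {}

    def serialize(self):
        return dict(self.data)

def maybe_snapshot(last_snapshot_index, last_applied_index, state_machine):
    """Take a snapshot once enough entries have accumulated since the last
    one, so a restarting node restores from the snapshot rather than
    replaying the entire log."""
    if last_applied_index - last_snapshot_index >= SNAPSHOT_THRESHOLD:
        snapshot = state_machine.serialize()
        # Log entries up to last_applied_index may now be truncated.
        return snapshot, last_applied_index
    return None, last_snapshot_index
```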
Implementation: Raft Leader Election
```python
import random
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class NodeState(Enum):
    """Raft node states."""
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class RaftNode:
    """
    Minimal Raft node for illustrating leader election.

    Persistent state (on stable storage in a real implementation):
    - currentTerm: latest term this node has seen
    - votedFor: candidate that received this node's vote in currentTerm
    - log: list of log entries (replication itself is omitted here)

    Volatile state:
    - state: current role (follower, candidate, leader)
    - commitIndex: highest log index known to be committed
    - lastApplied: highest log index applied to the state machine
    """
    node_id: int
    nodes: List['RaftNode']
    # Persistent state
    currentTerm: int = 0
    votedFor: Optional[int] = None
    log: List[dict] = field(default_factory=list)
    # Volatile state
    state: NodeState = NodeState.FOLLOWER
    commitIndex: int = 0
    lastApplied: int = 0
    # Timing
    last_heartbeat_time: float = 0.0
    election_timeout: float = 0.0     # randomized in __post_init__
    heartbeat_interval: float = 0.15  # 150ms
    # Leader-only replication state
    nextIndex: Dict[int, int] = field(default_factory=dict)   # next log index to send per follower
    matchIndex: Dict[int, int] = field(default_factory=dict)  # highest replicated index per follower

    def __post_init__(self):
        self._randomize_election_timeout()
        self.last_heartbeat_time = time.time()

    def _randomize_election_timeout(self):
        """Randomized election timeout (1000-1500ms here), so that nodes
        rarely time out simultaneously and split votes are unlikely."""
        self.election_timeout = random.uniform(1.0, 1.5)

    def is_election_timeout(self) -> bool:
        """True if no heartbeat arrived within the election timeout."""
        return (time.time() - self.last_heartbeat_time) > self.election_timeout

    def start_election(self):
        """
        Transition to CANDIDATE and request votes from all peers:
        1. Increment currentTerm
        2. Vote for self
        3. Reset election timer
        4. Send RequestVote to every other node
        """
        self.currentTerm += 1
        self.state = NodeState.CANDIDATE
        self.votedFor = self.node_id
        self.last_heartbeat_time = time.time()
        self._randomize_election_timeout()

        votes_received = 1  # own vote
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                # The RPC handler runs on the receiver: ask each peer to vote.
                if other_node.handle_request_vote(self):
                    votes_received += 1

        majority = len(self.nodes) // 2 + 1
        if votes_received >= majority:
            self.become_leader()

    def handle_request_vote(self, candidate: 'RaftNode') -> bool:
        """
        RequestVote RPC handler, executed by the receiving node.
        Voting rules:
        - If the candidate's term is older, refuse.
        - If already voted for someone else this term, refuse.
        - Otherwise grant the vote. (Full Raft additionally requires the
          candidate's log to be at least as up-to-date as the voter's;
          omitted here because this sketch does not replicate entries.)
        """
        if candidate.currentTerm > self.currentTerm:
            # Newer term observed: adopt it, clear the old vote, step down.
            self.currentTerm = candidate.currentTerm
            self.votedFor = None
            self.state = NodeState.FOLLOWER
        if (candidate.currentTerm == self.currentTerm and
                self.votedFor in (None, candidate.node_id)):
            self.votedFor = candidate.node_id
            self.last_heartbeat_time = time.time()  # granting a vote resets the timer
            return True
        return False

    def become_leader(self):
        """
        Transition to LEADER: initialize nextIndex/matchIndex for every
        follower, then send an initial empty AppendEntries heartbeat to
        establish authority.
        """
        self.state = NodeState.LEADER
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                self.nextIndex[other_node.node_id] = len(self.log)
                self.matchIndex[other_node.node_id] = 0
        self.send_heartbeat()

    def send_heartbeat(self):
        """Leader sends a heartbeat (empty AppendEntries) to all followers."""
        if self.state != NodeState.LEADER:
            return
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                other_node.last_heartbeat_time = time.time()
        self.last_heartbeat_time = time.time()

    def tick(self):
        """One step of the event loop: leaders heartbeat, others watch for timeouts."""
        if self.state == NodeState.LEADER:
            if time.time() - self.last_heartbeat_time > self.heartbeat_interval:
                self.send_heartbeat()
        elif self.is_election_timeout():
            self.start_election()
```
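One rule the sketch above omits is how an elected leader advances its commit index: an entry is committed once it is stored on a majority of nodes (full Raft additionally requires the entry to be from the leader's current term, which is omitted here since the sketch carries no entries). A standalone illustration, independent of the class above:

```python
def advance_commit_index(match_index, own_last_index, cluster_size):
    """Return the highest log index replicated on a majority of the cluster.
    match_index maps each follower to its highest replicated index; the
    leader's own log counts as one more copy at own_last_index."""
    indices = sorted(list(match_index.values()) + [own_last_index], reverse=True)
    majority = cluster_size // 2 + 1
    # The majority-th highest index is present on at least `majority` nodes.
    return indices[majority - 1]

# 5-node cluster: followers have replicated up to 7, 5, 4, 2; leader is at 8.
# Index 5 exists on 3 nodes (leader + two followers), so it is committed.
print(advance_commit_index({1: 7, 2: 5, 3: 4, 4: 2}, 8, 5))  # -> 5
```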
Paxos vs Raft: Detailed Comparison
| Dimension | Paxos | Raft |
|---|---|---|
| Consensus strength | Can achieve agreement with any proposal mechanism | Leader-based: simpler, but leader failure requires election |
| Proof complexity | Lamport’s original proof is dense; variants (Multi-Paxos) add complexity | Raft designed for understandability; cleaner proof in original paper |
| Log structure | Allows gaps; requires complex recovery | No gaps; strict sequencing simplifies implementation |
| Leader election | Implicit; emerges from competing proposals | Explicit; candidate with majority wins |
| Communication | O(N²) proposals in worst case | O(N) heartbeats + O(N) per log entry |
| Time to consensus | 2 round trips in basic Paxos (prepare + accept); 1 with a stable Multi-Paxos leader | 1 round trip from a stable leader (AppendEntries + majority ACK) |
| Used in production | Google Chubby, Spanner internals | etcd, CockroachDB, TiKV, Consul |
Failure Scenarios
Scenario 1: Leader crashes
- Followers detect no heartbeat after election_timeout
- First follower to timeout increments term and becomes candidate
- Majority votes for new leader within 2 round trips
- Recovery time: 150-300ms (from election timeout) + RTT
Scenario 2: Network partition (split-brain)
- If N=5 and partition splits 2/3:
- Minority (2 nodes): Cannot achieve quorum → no writes
- Majority (3 nodes): Forms new leader → continues
- When the partition heals: minority nodes see the higher term, step down, and catch up from the new leader’s log
- Safety holds: only the majority side can commit, so the two sides never make conflicting decisions
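The quorum arithmetic behind Scenario 2 can be verified mechanically; a minimal sketch:

```python
def can_commit(partition_size, cluster_size):
    """A partition can elect a leader and commit writes only if it
    contains a majority of the full cluster membership. Quorum is
    counted against cluster_size, not against the partition."""
    return partition_size >= cluster_size // 2 + 1

# N=5 split into 2 and 3: only the 3-node side makes progress, so the
# two sides can never commit conflicting writes.
print(can_commit(2, 5), can_commit(3, 5))  # False True
```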
Scenario 3: Slow follower
- Leader keeps committing with the remaining majority; it retries AppendEntries to the slow follower, decrementing nextIndex until their logs match
- If the follower is too far behind (its needed entries were compacted): the leader sends a snapshot instead of replaying the entire log
- The follower eventually catches up; committed data is never at risk, since commits only ever required a majority
References
Papers
- Lamport, L. (1998). “The Part-Time Parliament.” ACM Transactions on Computer Systems 16(2). The original Paxos paper; Lamport’s “Paxos Made Simple” (2001, ACM SIGACT News) is the gentler starting point before Multi-Paxos.
- Ongaro, D. & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm.” USENIX ATC. The Raft paper; the full safety proof and TLA+ specification appear in Ongaro’s PhD dissertation.
Videos
- Diego Ongaro: “Designing for Understandability” (USENIX 2014) — Clear explanation of Raft motivation.
- PingCAP (TiKV): “Understanding Raft” — 3-part video series.
Implementations
- etcd Raft Implementation
- Hashicorp Raft — Used in Consul, Vault
- Tikv Raft RS — Rust implementation used in TiKV
Resources
- Raft Visualization
- Raft Protocol Specification — Official site with papers, visualization, references