
Consensus — Paxos & Raft

Distributed nodes agree on a single value despite failures — the foundation of replicated state machines, leader election, and coordination in systems that cannot tolerate split-brain.


The Core Problem

Achieving consensus in an asynchronous network with crash failures (nodes stopping, messages delayed or lost — Paxos and Raft do not handle Byzantine faults) requires protocols that guarantee:

  1. Safety: No two nodes ever decide on different values
  2. Liveness: Some proposed value is eventually decided (guaranteed only under partial synchrony; the FLP result rules out a fully asynchronous guarantee)
  3. Fault tolerance: Even with f failed nodes, the surviving majority can still decide

In a system of N nodes, you can tolerate at most f = ⌊(N-1)/2⌋ failures. This is the quorum requirement: any ⌊N/2⌋ + 1 votes form a majority, and any two majorities must overlap in at least one node, which is what guarantees safety.
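The quorum arithmetic can be sketched directly (a minimal illustration, not tied to any particular implementation):

```python
def quorum_size(n: int) -> int:
    """Smallest majority in a cluster of n nodes."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Largest f such that a majority survives: f = floor((n-1)/2)."""
    return (n - 1) // 2

for n in (3, 4, 5):
    print(f"N={n}: quorum={quorum_size(n)}, tolerates f={max_failures(n)}")
```

Note that a 4-node cluster tolerates no more failures than a 3-node one (f = 1 in both), which is why production clusters use odd sizes.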


Key Properties

| Property | Paxos | Raft |
|---|---|---|
| Safety proof | Lamport (1998) | Ongaro & Ousterhout (2014) |
| Liveness | Requires a stable distinguished proposer | Explicit leader → simpler liveness |
| Leader | Implicit (any node can propose) | Explicit single leader per term |
| Log replication | Gaps allowed (complex recovery) | Strict sequencing (no gaps) |
| Message complexity | O(N²) with competing proposers | O(N) per heartbeat |
| Failure tolerance | Up to ⌊(N-1)/2⌋ | Up to ⌊(N-1)/2⌋ |

When to Use / Avoid

Use Consensus (Paxos / Raft) when:

  • You need atomic distributed decisions (etcd, CockroachDB)
  • Leader election is critical (Consul service discovery)
  • Metadata coordination requires safety (Kubernetes configuration)
  • System must survive f failures with N = 2f+1 nodes

Avoid consensus when:

  • A single writer or manual failover is acceptable (use simpler primary/backup replication)
  • Application can tolerate eventual consistency (use gossip)
  • Latency-critical (consensus adds 1-2 network round trips)
  • Data is too large to replicate (use data replication + coordination service separately)

Real Systems Using Consensus

| System | Protocol | Usage | Deployment |
|---|---|---|---|
| etcd | Raft | Kubernetes API state, leader election, distributed locks | 3-5 node clusters; Google Cloud, AWS, on-prem |
| CockroachDB | Raft | Range replication, writes across geo-distributed shards | 3+ replicas per range; typically 5-50 nodes per region |
| TiKV (TiDB) | Raft | Distributed KV store backing the TiDB engine | 3+ replicas; can span many regions at scale |
| Consul | Raft | Service discovery, health checks, configuration | 3-5 server nodes; thousands of client agents |
| InfluxDB Enterprise | Raft | Cluster metadata, retention policies | 3-node meta cluster for HA |

How Real Systems Use This

etcd (Kubernetes): etcd runs as a 3-5 node Raft cluster and stores all Kubernetes cluster state (Deployments, ConfigMaps, Secrets). On a write, the leader appends the entry to its log, sends AppendEntries to followers, and waits for a majority ACK before confirming to the client — replication to the quorum is synchronous, never fire-and-forget. A busy production cluster can see on the order of 1 MB/second of writes from the Kubernetes API server. etcd's defaults are a 100ms heartbeat interval and a 1000ms election timeout, so a leader failure typically triggers re-election within roughly 1-1.5 seconds.
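The commit rule described above — the leader confirms a write only once a majority of replicas has persisted it — can be sketched standalone (an illustrative helper, not etcd's actual code):

```python
from typing import List

def leader_commit_index(match_index: List[int]) -> int:
    """Highest log index replicated on a majority of nodes.

    match_index holds, for every node in the cluster (leader included),
    the highest log index known to be persisted on that node. Real Raft
    additionally requires the entry at that index to belong to the
    leader's current term before advancing commitIndex; omitted here.
    """
    majority = len(match_index) // 2 + 1
    # Sort descending: the majority-th largest value is persisted on at
    # least `majority` nodes, so it is safe to commit up to there.
    return sorted(match_index, reverse=True)[majority - 1]

# Leader at index 7, one follower caught up, one lagging at 3:
print(leader_commit_index([7, 7, 3]))  # → 7
```

The lagging follower does not hold back the commit point — only a majority must acknowledge.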

CockroachDB: Every key range is replicated to 3+ nodes via its own Raft group; many groups are multiplexed over shared transport (a design often called MultiRaft). When a client writes a key, the request routes to the range's leaseholder, which proposes the write through Raft and waits for a majority ACK (2 of 3 replicas) — in cross-region deployments this quorum round trip can cost ~50-100ms. Failures are detected via heartbeat timeouts, and a new leader is elected within a few hundred milliseconds. CockroachDB's multi-region setup places the leaseholder in the region nearest to clients, while replicas span regions for survivability.

Consul: The Consul server cluster (typically 3-5 nodes) maintains the service registry and health state via Raft. Agents send health updates to any server, which forwards them to the leader; the leader applies them to the replicated state machine. With thousands of services the state machine can grow to tens of MB, so Consul periodically snapshots the Raft log (after a configurable number of entries) to speed recovery. Reads go to the leader by default for strong consistency, but the stale consistency mode lets any server answer from local state — a deliberate trade-off of freshness for availability and read scalability.
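That read-consistency trade-off can be sketched as a routing decision (purely illustrative — these names are hypothetical, not Consul's API; in Consul itself the choice is made per request via the `consistent` and `stale` HTTP query parameters):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Server:
    name: str
    is_leader: bool

def route_read(servers: List[Server], mode: str) -> Server:
    """Pick which server may answer a read.

    'consistent' / 'default': only the leader answers, so the result
    reflects every committed write.
    'stale': any server answers from local state — lower latency, and
    still available without a leader, but possibly behind.
    """
    if mode in ("consistent", "default"):
        return next(s for s in servers if s.is_leader)
    if mode == "stale":
        return servers[0]  # any server will do
    raise ValueError(f"unknown mode: {mode}")

cluster = [Server("s1", False), Server("s2", True), Server("s3", False)]
print(route_read(cluster, "consistent").name)  # → s2
print(route_read(cluster, "stale").name)       # → s1
```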


Implementation: Raft Leader Election

import random
import time
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class NodeState(Enum):
    """Raft node states."""
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class RaftNode:
    """
    Minimal Raft node implementation.

    Persistent state (on stable storage):
    - currentTerm: Latest term this node has seen
    - votedFor: CandidateId that received vote in currentTerm (None if none)
    - log: List of log entries (not shown here for simplicity)

    Volatile state:
    - state: Current role (Follower, Candidate, Leader)
    - commitIndex: Index of highest log entry known to be committed
    - lastApplied: Index of highest log entry applied to state machine
    - electionTimeout: Time before triggering election
    - heartbeatInterval: How often leader sends heartbeats
    """

    node_id: int
    nodes: List['RaftNode']

    # Persistent state
    currentTerm: int = 0
    votedFor: Optional[int] = None
    log: List[dict] = None

    # Volatile state
    state: NodeState = NodeState.FOLLOWER
    commitIndex: int = 0
    lastApplied: int = 0

    # Timing
    last_heartbeat_time: float = 0
    election_timeout: float = 0  # Will be randomized
    heartbeat_interval: float = 0.15  # 150ms

    # Replication (for leaders)
    nextIndex: dict = None  # For each follower, index of next log entry to send
    matchIndex: dict = None  # For each follower, index of highest replicated entry

    def __post_init__(self):
        if self.log is None:
            self.log = []
        if self.nextIndex is None:
            self.nextIndex = {}
        if self.matchIndex is None:
            self.matchIndex = {}
        self._randomize_election_timeout()
        self.last_heartbeat_time = time.time()

    def _randomize_election_timeout(self):
        """Election timeout: randomized between 1000-1500ms."""
        self.election_timeout = random.uniform(1.0, 1.5)

    def is_election_timeout(self) -> bool:
        """Check if election timeout has elapsed since last heartbeat."""
        return (time.time() - self.last_heartbeat_time) > self.election_timeout

    def start_election(self):
        """
        Transition to CANDIDATE and request votes from all nodes.

        Implements RequestVote RPC:
        1. Increment currentTerm
        2. Vote for self
        3. Reset election timer
        4. Send RequestVote to all other nodes
        """
        self.currentTerm += 1
        self.state = NodeState.CANDIDATE
        self.votedFor = self.node_id
        self.last_heartbeat_time = time.time()
        self._randomize_election_timeout()

        # Collect votes (real implementations issue these RPCs in parallel)
        votes_received = 1  # Vote for self
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                # Ask the other node for its vote (note the direction:
                # the RPC handler runs on the voter, not the candidate)
                if other_node.request_vote(self):
                    votes_received += 1

        # Check if won election (majority)
        majority = len(self.nodes) // 2 + 1
        if votes_received >= majority:
            self.become_leader()

    def request_vote(self, candidate_node: 'RaftNode') -> bool:
        """
        RequestVote RPC handler, invoked on the voter.

        `self` is the voter; `candidate_node` is the candidate asking
        for this node's vote.

        Voting rules:
        - If candidate's term < currentTerm, reject (stale candidate)
        - If terms are equal but votedFor is already set, reject
        - Otherwise grant the vote (real Raft also requires the
          candidate's log to be at least as up-to-date as the voter's;
          omitted here because the log is not modeled)
        """
        time.sleep(0.01)  # 10ms simulated network latency

        if candidate_node.currentTerm > self.currentTerm:
            # A newer term forces this node back to follower
            self.currentTerm = candidate_node.currentTerm
            self.votedFor = None
            self.state = NodeState.FOLLOWER

        if (candidate_node.currentTerm == self.currentTerm and
            self.votedFor is None):
            self.votedFor = candidate_node.node_id
            # Granting a vote resets the election timer so the voter
            # does not immediately start a competing election
            self.last_heartbeat_time = time.time()
            return True

        return False

    def become_leader(self):
        """
        Transition to LEADER.

        On election, leader:
        1. Sends initial empty AppendEntries heartbeat to establish authority
        2. Initializes nextIndex and matchIndex
        """
        self.state = NodeState.LEADER
        self.last_heartbeat_time = time.time()

        # nextIndex starts at the index just past the leader's log
        # (0-indexed: equals len(log) when the logs are in sync)
        next_index = len(self.log)
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                self.nextIndex[other_node.node_id] = next_index
                self.matchIndex[other_node.node_id] = 0

    def send_heartbeat(self):
        """Leader sends heartbeat (empty AppendEntries) to followers."""
        if self.state != NodeState.LEADER:
            return

        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                other_node.last_heartbeat_time = time.time()

        self.last_heartbeat_time = time.time()

    def tick(self):
        """Simulate one time step."""
        if self.state == NodeState.LEADER:
            if time.time() - self.last_heartbeat_time > self.heartbeat_interval:
                self.send_heartbeat()
        else:
            if self.is_election_timeout():
                self.start_election()
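
The node above elects a leader but does not replicate a log. The consistency check at the heart of AppendEntries — a follower rejects entries unless its log matches the leader's at the preceding slot — can be sketched standalone (a hypothetical helper, not part of the class above):

```python
from typing import List, Tuple

# Each entry is (term, command)
Entry = Tuple[int, str]

def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower-side AppendEntries (log portion only, 0-indexed).

    Returns False if the follower's log has no entry at prev_index with
    term prev_term; the leader then retries with a smaller prev_index
    (backing off nextIndex) until the logs match.
    """
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False  # consistency check failed
    # Truncate any conflicting suffix, then append the new entries
    del log[prev_index + 1:]
    log.extend(entries)
    return True

follower_log: List[Entry] = [(1, "x=1"), (1, "x=2")]
# Leader's previous entry (index 1, term 1) matches → accepted
ok = append_entries(follower_log, 1, 1, [(2, "x=3")])
print(ok, follower_log)  # → True [(1, 'x=1'), (1, 'x=2'), (2, 'x=3')]
# Mismatched term at prev_index → rejected, log untouched
print(append_entries(follower_log, 0, 2, [(3, "x=4")]))  # → False
```

This single check is what lets Raft keep logs gap-free, in contrast to Paxos's per-slot recovery.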

Paxos vs Raft: Detailed Comparison

| Dimension | Paxos | Raft |
|---|---|---|
| Consensus approach | Any node can propose; agreement emerges from competing proposals | Leader-based: simpler, but leader failure requires an election |
| Proof complexity | Lamport's original proof is dense; variants (Multi-Paxos) add complexity | Designed for understandability; cleaner proof in the original paper |
| Log structure | Gaps allowed; requires complex recovery | No gaps; strict sequencing simplifies implementation |
| Leader election | Implicit; emerges from competing proposals | Explicit; candidate with a majority of votes wins |
| Communication | O(N²) messages with competing proposers | O(N) heartbeats + O(N) per log entry |
| Time to consensus | 2 round trips (prepare + accept); Multi-Paxos amortizes to 1 | 1 round trip in steady state (leader to majority and back) |
| Used in production | Google Chubby, Spanner internals | etcd, CockroachDB, TiKV, Consul |

Failure Scenarios

Scenario 1: Leader crashes

  • Followers detect the missing heartbeat once election_timeout elapses
  • The first follower to time out increments its term and becomes a candidate
  • A majority votes in the new leader within one RequestVote round trip
  • Recovery time: election timeout (e.g. 150-300ms) + RTT

Scenario 2: Network partition (split-brain)

  • If N=5 and a partition splits the cluster 2/3:
    • Minority side (2 nodes): cannot reach quorum → no writes accepted
    • Majority side (3 nodes): elects a leader if needed → continues serving
  • When the partition heals, the minority nodes discard any uncommitted entries and catch up from the leader
  • Safety: no conflicting decisions, because only the majority side can commit
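
A two-line check makes the split-brain argument concrete (illustrative sketch):

```python
def can_commit(partition_size: int, cluster_size: int) -> bool:
    """A partition can commit writes only if it holds a majority."""
    return partition_size >= cluster_size // 2 + 1

# N=5 split into 2 and 3: only the 3-node side makes progress
print(can_commit(2, 5), can_commit(3, 5))  # → False True
```

Since at most one partition can hold a majority, at most one side can ever commit — this is the whole safety argument in miniature.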

Scenario 3: Slow follower

  • The leader keeps committing with the remaining majority; one slow follower does not block progress
  • The leader retries AppendEntries, backing off nextIndex until the logs match
  • If the follower has fallen too far behind, the leader sends a snapshot (InstallSnapshot) instead of replaying the whole log
  • The follower eventually catches up; committed data was never at risk

References

Papers

  • Lamport, L. (2001). “Paxos Made Simple.” ACM SIGACT News 32(4). Foundational; read before tackling Multi-Paxos. (The original protocol appears in Lamport's 1998 “The Part-Time Parliament,” ACM TOCS.)
  • Ongaro, D. & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm.” USENIX ATC. The Raft design; a formal TLA+ specification accompanies Ongaro's dissertation.

Videos

  • Diego Ongaro: “Designing for Understandability” (USENIX 2014) — Clear explanation of Raft motivation.
  • PingCAP (TiKV): “Understanding Raft” — 3-part video series.


This post is licensed under CC BY 4.0 by the author.