Consensus — Paxos & Raft
Distributed nodes agree on a single value despite failures — the foundation of replicated state machines, leader election, and coordination in systems that cannot tolerate split-brain.
The Core Problem
Achieving consensus in an asynchronous network where nodes can crash requires protocols that guarantee:
- Safety: no two nodes ever decide on different values
- Liveness: every non-failed node eventually decides, given periods of network stability
- Fault tolerance: with up to f crashed nodes, the surviving majority can still decide
Note that Paxos and Raft tolerate crash (fail-stop) failures, not Byzantine failures: a crashed node stops responding, it never sends arbitrary or malicious messages. The FLP impossibility result also means no deterministic protocol can guarantee liveness in a fully asynchronous network, which is why both protocols rely on timeouts.
In a system of N nodes, you can tolerate at most f = ⌊(N-1)/2⌋ crash failures. This is the quorum requirement: any ⌊N/2⌋ + 1 nodes form a majority, and any two majorities overlap in at least one node, which is what guarantees safety.
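The arithmetic above is easy to check in code. A small sketch; the helper names are illustrative, not from any library:

```python
def quorum_size(n: int) -> int:
    """Minimum number of votes such that any two quorums overlap."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Largest f such that the remaining n - f nodes still form a quorum."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, quorum_size(n), max_failures(n))
# n=3 -> quorum 2, tolerates 1 failure
# n=5 -> quorum 3, tolerates 2 failures
# n=7 -> quorum 4, tolerates 3 failures
```

Note that even cluster sizes buy nothing: N=4 still tolerates only 1 failure (quorum is 3), which is why production deployments use 3, 5, or 7 nodes.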
Key Properties
| Property | Paxos | Raft |
|---|---|---|
| Consensus Safety | Proven by Lamport (1998) | Proven by Ongaro & Ousterhout (2014) |
| Liveness | With leader election | Explicit leader → simpler liveness |
| Leader | Implicit (any can propose) | Explicit single leader per term |
| Log replication | Gaps allowed (complex recovery) | Strict sequencing (no gaps) |
| Message complexity | O(N²) with competing proposers | O(N) per heartbeat or log entry |
| Failure tolerance | Up to ⌊(N-1)/2⌋ | Up to ⌊(N-1)/2⌋ |
When to Use / Avoid
Use Consensus (Paxos / Raft) when:
- You need atomic distributed decisions (etcd, CockroachDB)
- Leader election is critical (Consul service discovery)
- Metadata coordination requires safety (Kubernetes configuration)
- System must survive f failures with N = 2f+1 nodes
Avoid consensus when:
- A single authoritative primary with failover is acceptable (use simpler primary-backup replication)
- Application can tolerate eventual consistency (use gossip)
- Latency-critical (consensus adds 1-2 network round trips)
- Data is too large to replicate (use data replication + coordination service separately)
Real Systems Using Consensus
| System | Protocol | Usage | Deployment |
|---|---|---|---|
| etcd | Raft | Kubernetes API state, leader election, distributed locks | 3-5 node clusters; Google Cloud, AWS, on-prem |
| CockroachDB | Raft | Range replication, writes across geo-distributed shards | 3+ replicas per range; typical 5-50 nodes per region |
| TiKV (TiDB) | Raft | Distributed KV store for TiDB engine | 3+ replicas; can span 10+ regions at scale |
| Consul | Raft | Service discovery, health checks, configuration | 3-5 server nodes; thousands of client agents |
| InfluxDB Enterprise | Raft | Cluster metadata, retention policies | 3-node meta cluster for HA; metadata writes go through the leader |
How Real Systems Use This
etcd (Kubernetes): etcd runs as a 3-5 node Raft cluster and stores all Kubernetes cluster state (Deployments, ConfigMaps, Secrets). On write, the leader appends to its log, sends AppendEntries to followers, and waits for a majority ACK before confirming to the client, so replication to a quorum is synchronous. A typical production etcd cluster sees on the order of 1MB/second of writes from the Kubernetes API server; at 3 nodes, end-to-end write latency is ~100-150ms. With a 1000ms election timeout and a 100ms heartbeat interval (the etcd defaults), a leader failure triggers re-election within ~1.5 seconds.
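The write path described above (leader appends, replicates, waits for a majority ACK) can be sketched as a simplified model; `Follower` and `replicate_write` are illustrative names, not etcd's API:

```python
class Follower:
    """Toy follower: accepts AppendEntries unless unreachable."""
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []

    def append(self, entry):
        # Stands in for the AppendEntries RPC; an unreachable
        # follower is equivalent to a timed-out RPC.
        if self.reachable:
            self.log.append(entry)
            return True
        return False

def replicate_write(entry, followers):
    """Leader-side commit rule for a cluster of len(followers) + 1 nodes:
    the entry is committed once a majority has durably appended it."""
    cluster_size = len(followers) + 1
    quorum = cluster_size // 2 + 1
    acks = 1  # the leader counts its own append
    for f in followers:
        if f.append(entry):
            acks += 1
    return acks >= quorum  # only now is it safe to reply to the client

# 3-node cluster with one follower down: 2 of 3 ACKs, still committed.
print(replicate_write("put key=value", [Follower(), Follower(reachable=False)]))  # True
```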
CockroachDB: Every key range is replicated to 3+ nodes via Raft. A client write routes to the range’s leaseholder, which proposes the entry through Raft and commits once a majority (2 of 3 replicas) ACKs. CockroachDB runs multi-Raft: each range is an independent Raft group, and traffic for the many groups sharing a pair of nodes is coalesced, keeping write latency around ~50-100ms in wide-area deployments. Failures are detected via heartbeat timeout (150ms), and a new leader is elected within 300-500ms. In multi-region setups the leaseholder is placed in the region nearest to clients, while replicas span all regions.
Consul: The Consul server cluster (typically 3-5 nodes) maintains the service registry and health state via Raft. Agents send health updates to any server; the leader applies them to the state machine. Consul’s leader election uses a 1000ms election timeout. With thousands of services, the Raft state machine can grow to tens of MB; Raft snapshots are taken every 16384 log entries to speed recovery. Consul serves strongly consistent reads through the leader, but also offers stale reads that any server can answer (a trade-off favoring availability and read throughput).
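The snapshot behavior mentioned above can be sketched as a trigger rule. This is a simplified model of log compaction, with illustrative names; in HashiCorp's Raft library the threshold is a configuration knob rather than a hard-coded constant:

```python
SNAPSHOT_THRESHOLD = 16384  # log entries between snapshots (figure from the text)

class KVStateMachine:
    """Toy state machine: a key-value map built by applying log entries."""
    def __init__(self):
        self.data = {}

    def serialize(self):
        return dict(self.data)

def maybe_snapshot(last_snapshot_index, last_applied_index, state_machine):
    """Take a snapshot once enough entries have accumulated since the last
    one, so a restarting node restores from the snapshot rather than
    replaying the entire log."""
    if last_applied_index - last_snapshot_index >= SNAPSHOT_THRESHOLD:
        snapshot = state_machine.serialize()
        # Log entries up to last_applied_index may now be truncated.
        return snapshot, last_applied_index
    return None, last_snapshot_index
```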
Implementation: Raft Leader Election
```python
import random
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class NodeState(Enum):
    """Raft node states."""
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class RaftNode:
    """
    Minimal Raft node for illustrating leader election.

    Persistent state (on stable storage in a real implementation):
    - currentTerm: latest term this node has seen
    - votedFor: candidate that received this node's vote in currentTerm
    - log: list of log entries (replication itself is omitted here)

    Volatile state:
    - state: current role (follower, candidate, leader)
    - commitIndex: highest log index known to be committed
    - lastApplied: highest log index applied to the state machine
    """
    node_id: int
    nodes: List['RaftNode']
    # Persistent state
    currentTerm: int = 0
    votedFor: Optional[int] = None
    log: List[dict] = field(default_factory=list)
    # Volatile state
    state: NodeState = NodeState.FOLLOWER
    commitIndex: int = 0
    lastApplied: int = 0
    # Timing
    last_heartbeat_time: float = 0.0
    election_timeout: float = 0.0     # randomized in __post_init__
    heartbeat_interval: float = 0.15  # 150ms
    # Leader-only replication state
    nextIndex: Dict[int, int] = field(default_factory=dict)   # next log index to send per follower
    matchIndex: Dict[int, int] = field(default_factory=dict)  # highest replicated index per follower

    def __post_init__(self):
        self._randomize_election_timeout()
        self.last_heartbeat_time = time.time()

    def _randomize_election_timeout(self):
        """Randomized election timeout (1000-1500ms here), so that nodes
        rarely time out simultaneously and split votes are unlikely."""
        self.election_timeout = random.uniform(1.0, 1.5)

    def is_election_timeout(self) -> bool:
        """True if no heartbeat arrived within the election timeout."""
        return (time.time() - self.last_heartbeat_time) > self.election_timeout

    def start_election(self):
        """
        Transition to CANDIDATE and request votes from all peers:
        1. Increment currentTerm
        2. Vote for self
        3. Reset election timer
        4. Send RequestVote to every other node
        """
        self.currentTerm += 1
        self.state = NodeState.CANDIDATE
        self.votedFor = self.node_id
        self.last_heartbeat_time = time.time()
        self._randomize_election_timeout()

        votes_received = 1  # own vote
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                # The RPC handler runs on the receiver: ask each peer to vote.
                if other_node.handle_request_vote(self):
                    votes_received += 1

        majority = len(self.nodes) // 2 + 1
        if votes_received >= majority:
            self.become_leader()

    def handle_request_vote(self, candidate: 'RaftNode') -> bool:
        """
        RequestVote RPC handler, executed by the receiving node.
        Voting rules:
        - If the candidate's term is older, refuse.
        - If already voted for someone else this term, refuse.
        - Otherwise grant the vote. (Full Raft additionally requires the
          candidate's log to be at least as up-to-date as the voter's;
          omitted here because this sketch does not replicate entries.)
        """
        if candidate.currentTerm > self.currentTerm:
            # Newer term observed: adopt it, clear the old vote, step down.
            self.currentTerm = candidate.currentTerm
            self.votedFor = None
            self.state = NodeState.FOLLOWER
        if (candidate.currentTerm == self.currentTerm and
                self.votedFor in (None, candidate.node_id)):
            self.votedFor = candidate.node_id
            self.last_heartbeat_time = time.time()  # granting a vote resets the timer
            return True
        return False

    def become_leader(self):
        """
        Transition to LEADER: initialize nextIndex/matchIndex for every
        follower, then send an initial empty AppendEntries heartbeat to
        establish authority.
        """
        self.state = NodeState.LEADER
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                self.nextIndex[other_node.node_id] = len(self.log)
                self.matchIndex[other_node.node_id] = 0
        self.send_heartbeat()

    def send_heartbeat(self):
        """Leader sends a heartbeat (empty AppendEntries) to all followers."""
        if self.state != NodeState.LEADER:
            return
        for other_node in self.nodes:
            if other_node.node_id != self.node_id:
                other_node.last_heartbeat_time = time.time()
        self.last_heartbeat_time = time.time()

    def tick(self):
        """One step of the event loop: leaders heartbeat, others watch for timeouts."""
        if self.state == NodeState.LEADER:
            if time.time() - self.last_heartbeat_time > self.heartbeat_interval:
                self.send_heartbeat()
        elif self.is_election_timeout():
            self.start_election()
```
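One rule the sketch above omits is how an elected leader advances its commit index: an entry is committed once it is stored on a majority of nodes (full Raft additionally requires the entry to be from the leader's current term, which is omitted here since the sketch carries no entries). A standalone illustration, independent of the class above:

```python
def advance_commit_index(match_index, own_last_index, cluster_size):
    """Return the highest log index replicated on a majority of the cluster.
    match_index maps each follower to its highest replicated index; the
    leader's own log counts as one more copy at own_last_index."""
    indices = sorted(list(match_index.values()) + [own_last_index], reverse=True)
    majority = cluster_size // 2 + 1
    # The majority-th highest index is present on at least `majority` nodes.
    return indices[majority - 1]

# 5-node cluster: followers have replicated up to 7, 5, 4, 2; leader is at 8.
# Index 5 exists on 3 nodes (leader + two followers), so it is committed.
print(advance_commit_index({1: 7, 2: 5, 3: 4, 4: 2}, 8, 5))  # -> 5
```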
Paxos vs Raft: Detailed Comparison
| Dimension | Paxos | Raft |
|---|---|---|
| Consensus strength | Can achieve agreement with any proposal mechanism | Leader-based: simpler, but leader failure requires election |
| Proof complexity | Lamport’s original proof is dense; variants (Multi-Paxos) add complexity | Raft designed for understandability; cleaner proof in original paper |
| Log structure | Allows gaps; requires complex recovery | No gaps; strict sequencing simplifies implementation |
| Leader election | Implicit; emerges from competing proposals | Explicit; candidate with majority wins |
| Communication | O(N²) proposals in worst case | O(N) heartbeats + O(N) per log entry |
| Time to consensus | 2 round trips in basic Paxos (prepare + accept); 1 with a stable Multi-Paxos leader | 1 round trip from a stable leader (AppendEntries + majority ACK) |
| Used in production | Google Chubby, Spanner internals | etcd, CockroachDB, TiKV, Consul |
Failure Scenarios
Scenario 1: Leader crashes
- Followers detect no heartbeat after election_timeout
- First follower to timeout increments term and becomes candidate
- Majority votes for new leader within 2 round trips
- Recovery time: 150-300ms (from election timeout) + RTT
Scenario 2: Network partition (split-brain)
- If N=5 and partition splits 2/3:
- Minority (2 nodes): Cannot achieve quorum → no writes
- Majority (3 nodes): Forms new leader → continues
- When the partition heals: minority nodes see the higher term, step down, and catch up from the new leader’s log
- Safety holds: only the majority side can commit, so the two sides never make conflicting decisions
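The quorum arithmetic behind Scenario 2 can be verified mechanically; a minimal sketch:

```python
def can_commit(partition_size, cluster_size):
    """A partition can elect a leader and commit writes only if it
    contains a majority of the full cluster membership. Quorum is
    counted against cluster_size, not against the partition."""
    return partition_size >= cluster_size // 2 + 1

# N=5 split into 2 and 3: only the 3-node side makes progress, so the
# two sides can never commit conflicting writes.
print(can_commit(2, 5), can_commit(3, 5))  # False True
```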
Scenario 3: Slow follower
- Leader keeps committing with the remaining majority; it retries AppendEntries to the slow follower, decrementing nextIndex until their logs match
- If the follower is too far behind (its needed entries were compacted): the leader sends a snapshot instead of replaying the entire log
- The follower eventually catches up; committed data is never at risk, since commits only ever required a majority
References
Papers
- Lamport, L. (1998). “The Part-Time Parliament.” ACM Transactions on Computer Systems 16(2). The original Paxos paper; Lamport’s “Paxos Made Simple” (2001, ACM SIGACT News) is the gentler starting point before Multi-Paxos.
- Ongaro, D. & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm.” USENIX ATC. The Raft paper; the full safety proof and TLA+ specification appear in Ongaro’s PhD dissertation.
Videos
- Diego Ongaro: “Designing for Understandability” (USENIX 2014) — Clear explanation of Raft motivation.
- PingCAP (TiKV): “Understanding Raft” — 3-part video series.
Implementations
- etcd Raft Implementation
- Hashicorp Raft — Used in Consul, Vault
- Tikv Raft RS — Rust implementation used in TiKV
Resources
- Raft Visualization
- Raft Protocol Specification — Official site with papers, visualization, references