Distributed Consensus: From Paxos to Raft

The fundamental challenge of distributed computing can be stated simply: how do $n$ machines agree on a single value when any of them might fail, and messages between them may be delayed or lost?

This is the consensus problem, and it is deceptively hard.

The FLP Impossibility Result

In 1985, Fischer, Lynch, and Paterson proved that no deterministic consensus algorithm can guarantee termination in a fully asynchronous system if even one process may crash.

The formal statement: in an asynchronous message-passing system with $n \geq 2$ processes, where at most $f < n$ may fail by crashing, no algorithm satisfies all three of:

Termination — every non-faulty process eventually decides
Agreement — all non-faulty processes decide the same value
Validity — if all processes propose $v$ , then $v$ is decided

This seems to make consensus impossible. The practical escape hatch is that real systems are not fully asynchronous — they use timeouts and leader leases to make probabilistic progress guarantees.

Paxos: The Classic Algorithm

Paxos, due to Leslie Lamport (first described in 1989, published 1998), works in two phases:

Phase 1 — Prepare / Promise

A proposer picks a ballot number $n$ and sends Prepare(n) to a majority of acceptors. An acceptor responds with a promise not to accept any ballot $< n$ , and reports the highest-numbered ballot it has already accepted (if any).

Phase 2 — Accept / Accepted

If the proposer receives promises from a majority, it picks a value $v$ (either the highest-accepted value reported, or its own if no acceptor has accepted anything) and sends Accept(n, v). An acceptor accepts if it has not promised to ignore ballons $< n$ .

Why it works: any two majorities overlap in at least one acceptor. If two different proposers both succeed, they must both have contacted that overlapping acceptor — and both will have adopted the same value.

Paxos in Practice

Property	Guarantee
Safety	Always — only one value decided
Liveness	Only with eventual leader stability
Fault tolerance	Survives $⌊(n - 1) /2 ⌋$ failures
Log replication	Multi-Paxos extends single-decree Paxos

The notorious difficulty of Paxos is not the algorithm itself but its underspecification. Lamport’s paper describes single-decree Paxos; real systems need a replicated log, leader election, log compaction, membership changes — none of which are covered.

Raft: Designed for Understandability

Raft (Ongaro & Ousterhout, 2014) was explicitly designed to be easier to understand than Paxos, without sacrificing correctness. The paper’s subtitle — “In Search of an Understandable Consensus Algorithm” — is sincere.

The key decomposition:

Leader election — one leader per term
Log replication — leader accepts entries, replicates to followers
Safety — election rules ensure log consistency

Terms and Leader Election

Time is divided into terms, each starting with an election:

Term 1          Term 2    Term 3
|--Leader A-----|--elect--|--Leader B--|-->
                 ^split vote: no leader

A candidate wins by receiving votes from a majority. The election safety property guarantees at most one leader per term.

// Simplified Raft RequestVote RPC handler
func (s *RaftServer) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
    s.mu.Lock()
    defer s.mu.Unlock()

    reply.VoteGranted = false
    reply.Term = s.currentTerm

    // Reject stale candidates
    if args.Term < s.currentTerm {
        return
    }

    // Update term if we're behind
    if args.Term > s.currentTerm {
        s.currentTerm = args.Term
        s.votedFor = -1
        s.state = Follower
    }

    // Grant vote if:
    // 1. We haven't voted in this term (or already voted for this candidate)
    // 2. Candidate's log is at least as up-to-date as ours
    notVoted := s.votedFor == -1 || s.votedFor == args.CandidateID
    logOK := args.LastLogTerm > s.lastLogTerm() ||
        (args.LastLogTerm == s.lastLogTerm() && args.LastLogIndex >= s.lastLogIndex())

    if notVoted && logOK {
        s.votedFor = args.CandidateID
        reply.VoteGranted = true
        s.resetElectionTimer()
    }
}

Log Replication

Once elected, a leader:

Appends new entries to its log
Sends AppendEntries RPCs to all followers in parallel
Commits an entry once a majority acknowledges it
Applies committed entries to the state machine

The Log Matching Property ensures that if two logs have an entry with the same index and term, they are identical up to that point. This is enforced by Raft’s consistency check in AppendEntries.

Comparing Paxos and Raft

Dimension	Paxos	Raft
Design goal	Correctness proof	Understandability
Leader election	Implicit (any node can propose)	Explicit terms
Log gaps	Allowed (complex to handle)	Not allowed
Membership changes	Unspecified	Joint consensus / single-server
Implementations	Chubby, Zookeeper (ZAB), etcd (old)	etcd (new), CockroachDB, TiKV

The CAP Theorem Connection

In the presence of a network partition, a consensus system must choose between:

Consistency — all reads reflect the latest write
Availability — every request receives a response

Both Paxos and Raft are CP systems: they sacrifice availability during partitions (the minority partition cannot make progress) in order to guarantee consistency. This is the correct trade-off for a database transaction log or distributed lock service.

$Raft quorum = ⌊ \frac{n}{2} ⌋ + 1$

For a 5-node cluster, 3 nodes must agree before any entry is committed. The cluster tolerates 2 failures.

Practical Takeaways

Building production consensus is hard. Unless you have a very specific reason, prefer existing battle-tested implementations:

etcd — Raft, used by Kubernetes
CockroachDB — Multi-Raft over ranges
Apache ZooKeeper — ZAB (Zookeeper Atomic Broadcast, Paxos variant)
Consul — Raft

The algorithms are elegant. The devil is in the engineering details: log compaction, snapshot transfer, pre-vote optimization, and handling of network partitions that recover slowly.

Distributed consensus is one of the few areas in computing where the theory and the practice are equally hard.