The fundamental challenge of distributed computing can be stated simply: how do machines agree on a single value when any of them might fail, and messages between them may be delayed or lost?
This is the consensus problem, and it is deceptively hard.
The FLP Impossibility Result
In 1985, Fischer, Lynch, and Paterson proved that no deterministic consensus algorithm can guarantee termination in a fully asynchronous system if even one process may crash.
The formal statement: in an asynchronous message-passing system with processes, where at most may fail by crashing, no algorithm satisfies all three of:
- Termination — every non-faulty process eventually decides
- Agreement — all non-faulty processes decide the same value
- Validity — if all processes propose , then is decided
This seems to make consensus impossible. The practical escape hatch is that real systems are not fully asynchronous — they use timeouts and leader leases to make probabilistic progress guarantees.
Paxos: The Classic Algorithm
Paxos, due to Leslie Lamport (first described in 1989, published 1998), works in two phases:
Phase 1 — Prepare / Promise
A proposer picks a ballot number and sends Prepare(n) to a majority of acceptors. An acceptor responds with a promise not to accept any ballot , and reports the highest-numbered ballot it has already accepted (if any).
Phase 2 — Accept / Accepted
If the proposer receives promises from a majority, it picks a value (either the highest-accepted value reported, or its own if no acceptor has accepted anything) and sends Accept(n, v). An acceptor accepts if it has not promised to ignore ballons .
Why it works: any two majorities overlap in at least one acceptor. If two different proposers both succeed, they must both have contacted that overlapping acceptor — and both will have adopted the same value.
Paxos in Practice
| Property | Guarantee |
|---|---|
| Safety | Always — only one value decided |
| Liveness | Only with eventual leader stability |
| Fault tolerance | Survives failures |
| Log replication | Multi-Paxos extends single-decree Paxos |
The notorious difficulty of Paxos is not the algorithm itself but its underspecification. Lamport’s paper describes single-decree Paxos; real systems need a replicated log, leader election, log compaction, membership changes — none of which are covered.
Raft: Designed for Understandability
Raft (Ongaro & Ousterhout, 2014) was explicitly designed to be easier to understand than Paxos, without sacrificing correctness. The paper’s subtitle — “In Search of an Understandable Consensus Algorithm” — is sincere.
The key decomposition:
- Leader election — one leader per term
- Log replication — leader accepts entries, replicates to followers
- Safety — election rules ensure log consistency
Terms and Leader Election
Time is divided into terms, each starting with an election:
Term 1 Term 2 Term 3
|--Leader A-----|--elect--|--Leader B--|-->
^split vote: no leader A candidate wins by receiving votes from a majority. The election safety property guarantees at most one leader per term.
// Simplified Raft RequestVote RPC handler
func (s *RaftServer) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
s.mu.Lock()
defer s.mu.Unlock()
reply.VoteGranted = false
reply.Term = s.currentTerm
// Reject stale candidates
if args.Term < s.currentTerm {
return
}
// Update term if we're behind
if args.Term > s.currentTerm {
s.currentTerm = args.Term
s.votedFor = -1
s.state = Follower
}
// Grant vote if:
// 1. We haven't voted in this term (or already voted for this candidate)
// 2. Candidate's log is at least as up-to-date as ours
notVoted := s.votedFor == -1 || s.votedFor == args.CandidateID
logOK := args.LastLogTerm > s.lastLogTerm() ||
(args.LastLogTerm == s.lastLogTerm() && args.LastLogIndex >= s.lastLogIndex())
if notVoted && logOK {
s.votedFor = args.CandidateID
reply.VoteGranted = true
s.resetElectionTimer()
}
}Log Replication
Once elected, a leader:
- Appends new entries to its log
- Sends
AppendEntriesRPCs to all followers in parallel - Commits an entry once a majority acknowledges it
- Applies committed entries to the state machine
The Log Matching Property ensures that if two logs have an entry with the same index and term, they are identical up to that point. This is enforced by Raft’s consistency check in AppendEntries.
Comparing Paxos and Raft
| Dimension | Paxos | Raft |
|---|---|---|
| Design goal | Correctness proof | Understandability |
| Leader election | Implicit (any node can propose) | Explicit terms |
| Log gaps | Allowed (complex to handle) | Not allowed |
| Membership changes | Unspecified | Joint consensus / single-server |
| Implementations | Chubby, Zookeeper (ZAB), etcd (old) | etcd (new), CockroachDB, TiKV |
The CAP Theorem Connection
In the presence of a network partition, a consensus system must choose between:
- Consistency — all reads reflect the latest write
- Availability — every request receives a response
Both Paxos and Raft are CP systems: they sacrifice availability during partitions (the minority partition cannot make progress) in order to guarantee consistency. This is the correct trade-off for a database transaction log or distributed lock service.
For a 5-node cluster, 3 nodes must agree before any entry is committed. The cluster tolerates 2 failures.
Practical Takeaways
Building production consensus is hard. Unless you have a very specific reason, prefer existing battle-tested implementations:
- etcd — Raft, used by Kubernetes
- CockroachDB — Multi-Raft over ranges
- Apache ZooKeeper — ZAB (Zookeeper Atomic Broadcast, Paxos variant)
- Consul — Raft
The algorithms are elegant. The devil is in the engineering details: log compaction, snapshot transfer, pre-vote optimization, and handling of network partitions that recover slowly.
Distributed consensus is one of the few areas in computing where the theory and the practice are equally hard.