Leader election

Leader election is the process by which a cluster of equal nodes agrees on a single one to coordinate work, and re-elects automatically when it dies.

Why it matters

Some jobs must be done by exactly one node — advancing a replication log, running a scheduler, assigning shards — or you get split-brain double-writes and corruption. Hard-coding the leader creates a single point of failure; leader election gives you a designated coordinator plus automatic failover, so the cluster keeps a single source of truth while surviving the loss of any node.

How it works

Election rests on a consensus protocol (Raft, Paxos) or a coordination service (ZooKeeper, etcd). The core safety rule is a quorum: a leader is valid only with votes from a majority (N/2 + 1), which mathematically prevents two leaders in disjoint partitions.

  • Term / epoch — a monotonically increasing number; each election bumps it so stale leaders are fenced off.
  • Lease + heartbeat — the leader holds a time-bound lease and renews it; missed renewals trigger a new election.
  • Fencing token — the epoch is attached to every write so downstream systems reject a deposed leader’s late writes.
Raft (5 nodes): leader heartbeats every 150ms
  follower misses heartbeat past random timeout
    → becomes candidate, term++ , requests votes
    → gets 3/5 votes → becomes leader for new term
  old leader returns with lower term → steps down

This is the machinery behind a single-writer primary. It is a strict CP choice: during a partition the minority side has no leader and rejects writes to stay consistent.

Example

A 3-node Postgres cluster with Patroni stores leader state in etcd.

node-A holds the leader lease (epoch 7), serves writes
node-A's host freezes → lease expires after 30s
etcd election → node-B wins, epoch 8, promotes to primary
node-A thaws, tries to write → fenced: its epoch 7 < 8 → demoted to replica

No window with two primaries, so no conflicting writes.

Pitfalls

  • No fencing token — a paused old leader can wake and corrupt data; every write must carry and check the epoch.
  • Even node counts — 2 or 4 nodes can’t form a clean majority in a 50/50 split; use odd counts (3, 5).
  • Lease shorter than failover work — flapping leaders re-elect constantly; size the lease above realistic detection time.
  • Co-locating the coordination store — running etcd on the same hosts couples its failure to the workload it protects.

See also