In distributed systems, we often need a single coordinator to:
Make decisions: Which node handles a request?
Manage resources: Coordinate access to shared resources
Maintain consistency: Ensure all nodes agree on state
Handle failures: Detect and recover from node failures
The Challenge:
How do multiple nodes agree on who the leader is? How do we handle leader failures? How do we prevent split-brain (multiple leaders)?
Leader election is the process of selecting a single node as the coordinator (leader) in a distributed system. The leader handles coordination tasks, while other nodes (followers) follow the leader’s decisions.
Safety: At most one leader at a time (no split-brain)
Liveness: A leader is eventually elected, even after failures
Fault Tolerance: The election still completes when some nodes fail
Performance: Fast election, minimal overhead
Uniqueness: All nodes agree on the same leader (no conflicting leaders)
Think of leader election like electing a class president:
Multiple candidates (nodes) can run
Election process determines winner (leader)
If president is absent (leader fails), new election held
Only one president at a time (safety)
Eventually someone is elected (liveness)
The Bully algorithm elects a leader based on node ID. The live node with the highest ID wins.
How It Works:
When a node detects leader failure, it initiates election
Node sends an election message to all nodes with higher IDs
If no higher node responds within a timeout, the node becomes leader
If a higher node responds, that node takes over the election and the initiator waits for the leader announcement
Leader announces itself to all nodes
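The steps above can be sketched as a small synchronous simulation. This is a minimal illustration rather than a production implementation: the function name `bully_election` and the `alive` set (standing in for "which nodes answer before the timeout") are assumptions made for the example, and real message passing and timeouts are left out.

```python
from typing import Sequence, Set


def bully_election(initiator: int, node_ids: Sequence[int], alive: Set[int]) -> int:
    """Return the ID of the node that ends up announcing itself as leader."""
    if initiator not in alive:
        raise ValueError("the initiating node must be alive")
    current = initiator
    while True:
        # Send an election message to every node with a higher ID;
        # only live nodes respond.
        higher_alive = [n for n in node_ids if n > current and n in alive]
        if not higher_alive:
            # No higher node responded, so this node becomes leader.
            return current
        # A higher node responded and takes over the election.
        # Any responder leads to the same result; we follow the lowest one.
        current = min(higher_alive)
```

With nodes 1 to 5 and node 5 down, an election started by node 2 is handed up the chain until node 4, the highest live ID, becomes leader.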
Advantages:
Simple to understand and implement
Fast election in the best case (O(n) messages); the worst case is O(n²) messages, when the lowest-ID node starts the election
Deterministic (highest ID always wins)
Disadvantages:
Can have multiple elections if multiple nodes detect failure simultaneously
Requires all nodes to know all other nodes
Unstable if the highest-ID node keeps crashing and recovering, because it takes over leadership each time it returns and triggers repeated elections
Raft is a consensus algorithm that provides leader election and log replication. It’s more complex than Bully but provides stronger guarantees.
Nodes in Raft can be in one of three states:
Leader: Handles all client requests, replicates log to followers
Follower: Receives log entries from leader, votes in elections
Candidate: Campaigning to become leader
How Raft Election Works:
Follower doesn’t receive heartbeat from leader (timeout)
Follower becomes candidate, increments its term, and votes for itself
Candidate requests votes from all other nodes
If the candidate receives votes from a majority of the cluster, it becomes leader
Leader sends heartbeats to prevent new elections
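The election steps above can be sketched in a few lines. This is a simplified illustration, not Raft as specified: the names `Node`, `handle_vote_request`, and `run_election` are hypothetical, calls are made in-process instead of over RPC, and the real protocol's check that a candidate's log is at least as up to date as the voter's is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    state: str = "follower"

    def handle_vote_request(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # stale candidate from an older term
        if term > self.current_term:
            # A newer term resets our vote and demotes us to follower.
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        # Grant at most one vote per term.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: Node, reachable_peers: List[Node], cluster_size: int) -> bool:
    """Run one election round; return True if the candidate becomes leader."""
    candidate.state = "candidate"
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the candidate's own vote
    for peer in reachable_peers:
        if peer.handle_vote_request(candidate.current_term, candidate.node_id):
            votes += 1
    if votes > cluster_size // 2:  # strict majority of the whole cluster
        candidate.state = "leader"
        return True
    return False
```

Note that the majority is computed against the full cluster size, not against the peers that happen to be reachable. That is what keeps a minority partition from electing its own leader.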
Key Features:
Majority voting: Prevents split-brain
Terms: Each election has a term number (monotonically increasing)
Log replication: Leader replicates log entries to followers
Safety: At most one leader per term, because each node votes at most once per term
Apache ZooKeeper provides built-in support for leader election using ephemeral sequential nodes:
All nodes create ephemeral sequential nodes under /election
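Node with the smallest sequence number becomes leader
If the leader fails, its session expires and ZooKeeper deletes its ephemeral node; the node with the next-smallest sequence number becomes leader
Each node watches the node with the next-lower sequence number (its immediate predecessor), so a leader failure wakes only one node instead of all of them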
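The decision each participant makes after listing the children of /election can be sketched as a pure function. This is an illustrative sketch that does not talk to ZooKeeper: `sequence_of` and `election_role` are hypothetical helper names, and the znode names in the usage note are made up. It relies on ZooKeeper appending a 10-digit, zero-padded counter to sequential znode names.

```python
from typing import List, Optional, Tuple


def sequence_of(znode: str) -> int:
    # Sequential znodes end in a 10-digit, zero-padded counter.
    return int(znode[-10:])


def election_role(children: List[str], me: str) -> Tuple[bool, Optional[str]]:
    """Return (is_leader, znode_to_watch) for the participant owning `me`."""
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(me)
    if index == 0:
        # Smallest sequence number: this participant is the leader.
        return True, None
    # Otherwise watch the immediate predecessor, which may be more than
    # one below our own number if intermediate znodes were deleted.
    return False, ordered[index - 1]
```

For children `n_0000000003`, `n_0000000005`, and `n_0000000007`, the owner of `n_0000000003` is leader, and the owner of `n_0000000007` watches `n_0000000005`, even though the sequence numbers are not consecutive.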
Advantages:
Handled by ZooKeeper (no custom implementation needed)
Automatic failure detection
No split-brain (ZooKeeper provides consistency)
Problem: Network partition causes multiple leaders.
Solution: Require a majority vote (quorum). Only the partition that contains a majority of nodes can elect a leader.
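The quorum rule is simple arithmetic, sketched below. The function names `quorum_size` and `can_elect_leader` are illustrative assumptions, not part of any particular library.

```python
def quorum_size(cluster_size: int) -> int:
    # A strict majority: more than half of all configured nodes.
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    # reachable_nodes counts the nodes on this side of the partition,
    # including the node asking the question.
    return reachable_nodes >= quorum_size(cluster_size)
```

In a 5-node cluster split 3/2, only the 3-node side can elect a leader. In a 4-node cluster split 2/2, neither side can, which is one reason odd cluster sizes are common.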
Problem: Multiple nodes start election simultaneously.
Solution: Use randomized election timeouts, so nodes time out at different moments. This reduces the probability of simultaneous elections.
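A randomized timeout can be drawn as shown below. The 150 to 300 ms range is an assumption chosen for illustration, and `election_timeout_ms` is a hypothetical helper; suitable values depend on network latency and heartbeat interval.

```python
import random
from typing import Optional


def election_timeout_ms(
    base_ms: float = 150.0,
    spread_ms: float = 150.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a timeout uniformly in [base_ms, base_ms + spread_ms]."""
    if rng is None:
        rng = random.Random()
    return base_ms + rng.uniform(0.0, spread_ms)
```

Each node draws its own timeout, and redraws it after every election round, so the node with the shortest timeout usually becomes a candidate and wins before the others time out.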
Problem: How to detect leader failure quickly?
Solution: Use heartbeat mechanism. If no heartbeat received within timeout, assume leader failed.
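A follower-side failure detector along these lines might look as follows. The class name `HeartbeatMonitor` and its methods are assumptions for the sketch. The clock is injectable so the logic can be exercised without sleeping.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        # Treat construction time as the moment of the last heartbeat.
        self.last_heartbeat = clock()

    def record_heartbeat(self) -> None:
        # Call this whenever a heartbeat from the leader arrives.
        self.last_heartbeat = self.clock()

    def leader_suspected(self) -> bool:
        # True once more than timeout_s has passed since the last heartbeat.
        return self.clock() - self.last_heartbeat > self.timeout_s
```

A shorter timeout detects failures sooner but raises the risk of false suspicion when the network is slow, so the timeout is normally set to several heartbeat intervals.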
Leader handles writes, replicates to followers. Ensures consistency and handles failures.
Leader coordinates distributed operations, manages shared state.
Leader makes decisions, followers replicate state. Provides fault tolerance.
Advantages:
Provides single coordinator
Handles failures gracefully
Prevents split-brain (with majority voting)
Enables coordination
Disadvantages:
Adds complexity
Leader can become bottleneck
Network partitions can leave the minority side without a leader
Requires a majority of nodes to be reachable before a leader can be elected
Single Coordinator
Leader election ensures only one node acts as coordinator. Provides mutual exclusion for coordination tasks.
Majority Voting
Require majority votes to prevent split-brain. Only partition with majority can elect leader.
Fault Tolerance
Leader failures trigger new elections. System eventually elects new leader. Handles node failures gracefully.
Heartbeat Mechanism
Leader sends heartbeats to prevent elections. Followers detect leader failure via timeout. Triggers new election.