Replication Strategies

Copies of data that keep your system alive

What is Replication?

Replication means keeping copies of the same data on multiple machines. Think of it like having backup copies of important documents in different locations—if one is destroyed, you still have access.

Why Replicate?

Goal	How Replication Helps
High Availability	If one node dies, others continue serving
Read Performance	Distribute reads across replicas
Latency	Place replicas closer to users geographically
Disaster Recovery	Replicas in different data centers survive regional failures

Strategy 1: Leader-Follower (Primary-Replica)

The most common replication strategy. One leader handles all writes; followers replicate from the leader and serve reads.

How It Works

Client sends a write to the leader
Leader persists the data locally
Leader sends data to all followers
Followers apply the changes to their copies
Reads can go to any follower (or the leader)
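
The steps above can be sketched as a small in-memory model. This is an illustration only: the class names `LeaderNode` and `FollowerNode` are hypothetical, and a real system would ship changes over the network through a replication log rather than direct method calls.

```python
class FollowerNode:
    """Holds a copy of the data and serves reads."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        # Step 4: apply the change to the local copy.
        self.data[key] = value

    def read(self, key):
        # Step 5: reads can be served from a follower.
        return self.data.get(key)


class LeaderNode:
    """Accepts all writes and forwards them to every follower."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value           # Step 2: persist locally.
        for follower in self.followers:  # Step 3: send to all followers.
            follower.apply(key, value)
        return "ok"


followers = [FollowerNode(), FollowerNode()]
leader = LeaderNode(followers)
leader.write("user:1", "alice")          # Step 1: client writes to the leader.
print(followers[0].read("user:1"))       # alice
```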

Synchronous vs Asynchronous Replication

The critical question: Does the leader wait for followers before confirming the write?

Synchronous: The “Safe” Option

How it works: The leader waits until the required followers (all of them, or a configured subset) confirm they received the data before telling the client “write successful.”

Like: Sending a registered letter—you wait for delivery confirmation.

Trade-off: If a follower the leader is waiting on is slow or down, the write is delayed or blocked.

Asynchronous: The “Fast” Option

How it works: Leader immediately tells client “write successful,” then sends data to followers in the background.

Like: Sending a regular letter—you drop it in the mailbox and assume it’ll arrive.

Trade-off: If the leader crashes before replication completes, writes it already acknowledged but never shipped can be lost permanently once a follower takes over.
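
The difference between the two modes can be shown with a toy simulation. This is a sketch under simplifying assumptions: `ReplicatingLeader` and `Replica` are hypothetical names, "waiting for a follower" is modeled as a direct append, and the background shipping step is an explicit `flush()` call.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.log = []


class ReplicatingLeader:
    def __init__(self, followers, synchronous):
        self.log = []
        self.followers = followers
        self.synchronous = synchronous
        self.pending = deque()  # entries acknowledged but not yet shipped

    def write(self, entry):
        self.log.append(entry)
        if self.synchronous:
            # Synchronous: every follower has the entry before we acknowledge.
            for follower in self.followers:
                follower.log.append(entry)
        else:
            # Asynchronous: acknowledge now, ship in the background later.
            self.pending.append(entry)
        return "ack"

    def flush(self):
        """Background shipping step used in asynchronous mode."""
        while self.pending:
            entry = self.pending.popleft()
            for follower in self.followers:
                follower.log.append(entry)


sync_follower, async_follower = Replica(), Replica()
sync_leader = ReplicatingLeader([sync_follower], synchronous=True)
async_leader = ReplicatingLeader([async_follower], synchronous=False)

sync_leader.write("x=1")
async_leader.write("x=1")

# If both leaders crashed right now, only the synchronous follower has the write.
print(sync_follower.log)   # ['x=1']
print(async_follower.log)  # []
```

Calling `async_leader.flush()` before the crash would have delivered the entry, which is why the size of the loss window depends on replication lag.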

Strategy 2: Multi-Leader Replication

Multiple nodes can accept writes. Each leader replicates to others. This is common for geo-distributed systems.

When to Use Multi-Leader

Multi-datacenter deployment — Users in US write to US leader, EU users write to EU leader
Offline clients — Mobile apps that work offline (each device is a “leader”)
Collaborative editing — Google Docs-style real-time collaboration

The Conflict Problem

What happens when two leaders accept conflicting writes at the same time?

Imagine this scenario:

User A in the US changes their username to “alice_new”
User B in the EU changes the same username to “alice_updated”
Both leaders accept the write locally
When they sync… which one wins?

Conflict Resolution Strategies

Strategy	How It Works	Best For
Last-Write-Wins (LWW)	Most recent timestamp wins; older write is discarded	Simple data, acceptable to lose updates
First-Write-Wins	First timestamp wins; reject later writes	Immutable records
Merge	Combine both values using domain logic	Shopping carts, sets, counters
Custom/User Resolution	Store both, let app or user decide	Documents, complex data
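
Two of the strategies in the table can be sketched in a few lines. The `Versioned` record and both function names are assumptions made for illustration. Last-write-wins also assumes the leaders' clocks are comparable, which is not guaranteed in practice, so the sketch breaks timestamp ties on the origin name to stay deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: object
    timestamp: float  # assumed comparable across leaders
    origin: str       # which leader accepted the write


def last_write_wins(a, b):
    # The later timestamp wins; ties are broken by origin so every
    # leader picks the same winner regardless of argument order.
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


def merge_sets(a, b):
    # Merge strategy for set-like data such as shopping carts: keep everything.
    return a | b


us = Versioned("alice_new", 1700000000.120, "us")
eu = Versioned("alice_updated", 1700000000.450, "eu")

print(last_write_wins(us, eu).value)   # alice_updated
print(sorted(merge_sets({"book"}, {"pen"})))  # ['book', 'pen']
```

Note that last-write-wins silently drops the US write here, which is the "acceptable to lose updates" caveat from the table.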

Strategy 3: Leaderless Replication

No designated leader: any node can accept reads and writes. Used by Dynamo-style systems such as Cassandra and Riak, which follow the design of Amazon’s original Dynamo store.

Quorum: The Magic Formula

The key concept is quorum — a voting system for consistency:

N = Total number of replicas
W = Number of nodes that must confirm a write
R = Number of nodes that must respond to a read

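The Rule: If W + R > N, every read quorum overlaps every write quorum in at least one node, so a read is guaranteed to contact at least one replica that holds the latest acknowledged write.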
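
The rule can be checked by brute force for small clusters. The sketch below compares the arithmetic test with an exhaustive check over every possible write set and read set; both function names are hypothetical.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """The quorum rule itself: W + R > N."""
    return w + r > n


def always_intersect(n, w, r):
    """Exhaustively verify that every W-sized write set shares a node
    with every R-sized read set among N replicas."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(quorums_overlap(3, 2, 2), always_intersect(3, 2, 2))  # True True
print(quorums_overlap(3, 1, 1), always_intersect(3, 1, 1))  # False False
```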

Common Quorum Configurations

Config (N = 3)	W	R	Trade-off
Balanced	2	2	Good consistency + availability
Write-heavy	1	3	Fast writes, slower reads
Read-heavy	3	1	Slow writes, fast reads
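
To see why the configuration matters, here is a minimal sketch of quorum reads and writes over three in-memory replicas. `QuorumStore` is a hypothetical class, version numbers are supplied by the caller, and the lists of responding replicas are passed in by hand to simulate which nodes were reachable. Real systems send each write to all N replicas and wait for W acknowledgements.

```python
class QuorumStore:
    def __init__(self, n, w, r):
        self.replicas = [dict() for _ in range(n)]
        self.w = w
        self.r = r

    def write(self, key, value, version, acked_by):
        """acked_by: indexes of the replicas that accepted the write."""
        if len(acked_by) < self.w:
            raise RuntimeError("write quorum not reached")
        for i in acked_by:
            self.replicas[i][key] = (version, value)

    def read(self, key, asked):
        """asked: indexes of the replicas that answered the read."""
        if len(asked) < self.r:
            raise RuntimeError("read quorum not reached")
        answers = [self.replicas[i][key] for i in asked if key in self.replicas[i]]
        if not answers:
            return None
        # Return the value carrying the highest version number.
        return max(answers, key=lambda answer: answer[0])[1]


# Balanced: W=2, R=2. Any two readers include a replica with the latest write.
balanced = QuorumStore(n=3, w=2, r=2)
balanced.write("cart", "v1", version=1, acked_by=[0, 1, 2])
balanced.write("cart", "v2", version=2, acked_by=[0, 1])
print(balanced.read("cart", asked=[1, 2]))  # v2

# W=1, R=1 breaks the rule (1 + 1 is not greater than 3): stale reads are possible.
weak = QuorumStore(n=3, w=1, r=1)
weak.write("cart", "v1", version=1, acked_by=[0, 1, 2])
weak.write("cart", "v2", version=2, acked_by=[0])
print(weak.read("cart", asked=[2]))  # v1
```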

Handling Replication Lag

With async replication, followers may be behind. This creates read consistency challenges:

Read Consistency Levels

Level	What It Guarantees	How To Achieve
Eventual	Data will sync “eventually”	Read from any replica
Read-Your-Writes	See your own writes immediately	Read from leader after write, or track write timestamps
Monotonic Reads	Never see older data than before	Stick to same replica, or track read positions
Strong/Linearizable	Always see latest	Read from leader only
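
As one illustration of the "track read positions" approach to monotonic reads, the sketch below remembers the highest replication position a session has seen and refuses to read from any replica that is further behind. `ReplicaState` and `MonotonicSession` are hypothetical names, and `position` stands in for whatever log offset or sequence number the database exposes.

```python
class ReplicaState:
    def __init__(self, name, position, value):
        self.name = name
        self.position = position  # how far this replica has applied the log
        self.value = value


class MonotonicSession:
    """Per-user session that never reads from a replica older than
    the newest one it has already read from."""

    def __init__(self):
        self.last_seen = 0

    def read(self, replicas):
        for replica in replicas:
            if replica.position >= self.last_seen:
                self.last_seen = replica.position
                return replica.value
        raise RuntimeError("no replica has caught up to position %d" % self.last_seen)


fresh = ReplicaState("replica-a", position=10, value="new")
stale = ReplicaState("replica-b", position=7, value="old")

session = MonotonicSession()
print(session.read([fresh, stale]))  # new
print(session.read([stale, fresh]))  # new (the stale replica is skipped)
```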

Comparing All Three Strategies

Aspect	Leader-Follower	Multi-Leader	Leaderless
Write Scalability	Limited (1 leader)	Good (multiple leaders)	Best (any node)
Read Scalability	Good (add followers)	Good	Good
Consistency	Easier to achieve	Conflict resolution needed	Quorum-based
Latency (geo)	High (single leader)	Low (local leaders)	Low
Complexity	Simplest	Complex (conflicts)	Complex (quorums)
Examples	MySQL, PostgreSQL	CouchDB, Google Docs	Cassandra, Riak

Real-World Examples

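Example 1: MySQL Leader-Follower Replication (Semi-Synchronous)

Company: Facebook, GitHub, WordPress

Scenario: MySQL uses leader-follower replication where one primary database handles all writes and multiple replica databases handle reads. This provides high availability and read scalability.

Implementation: MySQL replication is asynchronous by default. Semi-synchronous replication is an optional plugin: the primary waits until at least one replica acknowledges receiving the transaction before confirming the commit to the client.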

Why Leader-Follower?

Simplicity: Single source of truth for writes
Consistency: Easier to maintain strong consistency
Read Scalability: Add replicas to scale reads horizontally
Failover: A replica can be promoted to primary on failure, typically automated by external orchestration tooling

Real-World Impact:

Facebook: Uses MySQL with leader-follower for billions of users
GitHub: MySQL replicas handle read traffic, reducing primary load
Performance: Read queries are distributed across replicas, so read capacity grows roughly with the number of replicas

Example 2: Google Docs Multi-Leader Replication

Company: Google

Scenario: Google Docs allows multiple users to edit the same document simultaneously. Each user’s browser acts as a “leader” that can accept writes, and changes are synchronized across all clients.

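Implementation: Uses multi-leader replication with operational transformation (OT), which adjusts concurrent edits against each other so that every client converges on the same document.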

Why Multi-Leader?

Low Latency: Users write to nearest server
Offline Support: Works offline, syncs when online
Collaboration: Multiple simultaneous editors
Conflict Resolution: Operational transformation merges edits
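
To give a feel for how operational transformation merges edits, here is a deliberately tiny sketch that handles only concurrent text insertions. It is not Google's algorithm: the `Insert` type, the `site` tie-breaker, and both functions are assumptions made for illustration, and real OT also has to handle deletions and longer histories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where text is inserted
    text: str
    site: str   # id of the editing client, used to break position ties


def apply_op(doc, op):
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op, against):
    """Rewrite `op` so it can be applied after `against` has already
    been applied, preserving the intent of both edits."""
    inserted_before = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if inserted_before:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


doc = "helo"
a = Insert(3, "l", "A")   # client A fixes the typo
b = Insert(4, "!", "B")   # client B appends punctuation at the same time

# Each client applies its own edit first, then the transformed remote edit.
at_a = apply_op(apply_op(doc, a), transform(b, a))
at_b = apply_op(apply_op(doc, b), transform(a, b))
print(at_a, at_b)  # hello! hello!
```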

Real-World Impact:

Scale: Supports 50+ simultaneous editors per document
Latency: < 100ms write latency (local server)
Consistency: All users see same document within seconds

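Example 3: Amazon DynamoDB and Dynamo-Style Leaderless Replication

Company: Amazon

Scenario: DynamoDB is Amazon’s managed NoSQL database. Its name and key-value model come from Dynamo, Amazon’s earlier internal store, which used leaderless replication: any replica of a key can accept reads and writes, providing high availability and low latency. The managed DynamoDB service itself elects a leader replica for each partition, so the leaderless description applies to the original Dynamo design and to Dynamo-style systems such as Cassandra and Riak.

Implementation: Dynamo-style stores use quorum-based leaderless replication, as described under Strategy 3.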

Why Leaderless?

No Single Point of Failure: No leader to fail
High Availability: System works even if nodes fail
Low Latency: Write to nearest nodes
Scalability: Add nodes to increase capacity

Real-World Impact:

Scale: Handles millions of requests per second
Availability: 99.99% uptime SLA
Latency: Single-digit millisecond latency
Durability: 99.999999999% (11 nines) durability

Example 4: PostgreSQL Streaming Replication (Asynchronous)

Company: Instagram, Spotify, Reddit

Scenario: PostgreSQL uses streaming replication where the primary database streams WAL (Write-Ahead Log) records to replica databases asynchronously.

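Implementation: Streaming replication is asynchronous by default. The primary commits locally, and standbys receive and replay WAL records shortly afterward. Synchronous commit to standbys is available as a configuration option.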

Why Asynchronous?

Performance: Primary doesn’t wait for replicas
Availability: Primary stays available even if replicas are slow
Scalability: Can add many replicas without impacting primary
Trade-off: Small risk of data loss if primary fails before replication

Real-World Impact:

Instagram: Uses PostgreSQL async replication for billions of photos
Spotify: PostgreSQL replicas handle read traffic
Replication Lag: Typically < 1 second lag
Failover: < 30 seconds when automated by external failover tooling
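
Replication lag can also be measured in bytes of WAL rather than in seconds. PostgreSQL prints WAL positions (LSNs) as two hexadecimal numbers separated by a slash, representing the high and low 32 bits of a 64-bit position. The sketch below converts two such strings and subtracts them. The helper names are hypothetical, and the example LSN values are made up. In practice the inputs would come from queries such as `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a replica, and the built-in `pg_wal_lsn_diff` function performs the same subtraction in SQL.

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN string such as '0/3000148' to an integer
    byte position: the high part is shifted left by 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn, replica_lsn):
    """How many bytes of WAL the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


# Illustrative values only.
print(replication_lag_bytes("0/3000148", "0/3000060"))  # 232
```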

LLD ↔ HLD Connection

Read-After-Write Consistency Implementation

The goal is to ensure users see their own writes immediately, even with async replication. More broadly, each replication concept maps onto a low-level design (LLD) pattern:

Replication Concept	LLD Implementation
Sync Replication	Write-through cache pattern, blocking writes
Async Replication	Write-behind pattern, background jobs, message queues
Conflict Resolution	Strategy pattern for merge logic, domain-specific resolvers
Read Consistency	Strategy pattern for read routing, client-side timestamp tracking
Failover	Observer pattern for leader changes, Circuit breaker for failed replicas
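
The read-routing row can be sketched as a small router that sends a user's reads to the leader for a short window after that user writes, and to a replica otherwise. `ReadRouter`, its method names, and the `pin_seconds` window are assumptions for illustration. The window should exceed the expected replication lag, and the injectable `clock` exists only to make the example deterministic.

```python
import time


class ReadRouter:
    def __init__(self, pin_seconds, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # how long to pin reads to the leader
        self.clock = clock
        self.last_write = {}            # user_id -> time of most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def choose(self, user_id):
        """Return 'leader' if this user wrote recently, else 'replica'."""
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "leader"
        return "replica"


now = [100.0]                            # fake clock for the demo
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])

print(router.choose("u1"))               # replica (no recent write)
router.record_write("u1")
print(router.choose("u1"))               # leader (just wrote)
now[0] += 6.0
print(router.choose("u1"))               # replica (window has passed)
```

Other users are unaffected by the pin: only the writer pays the cost of reading from the leader, so most read traffic still lands on replicas.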

Key Takeaways

What’s Next?

Now that you understand how data is replicated, let’s learn about building systems that survive failures:

Next up: Fault Tolerance & Redundancy - Learn to design systems that work even when things break.
