Replication Strategies
What is Replication?
Section titled “What is Replication?”Replication means keeping copies of the same data on multiple machines. Think of it like having backup copies of important documents in different locations—if one is destroyed, you still have access.
Why Replicate?
Section titled “Why Replicate?”| Goal | How Replication Helps |
|---|---|
| High Availability | If one node dies, others continue serving |
| Read Performance | Distribute reads across replicas |
| Latency | Place replicas closer to users geographically |
| Disaster Recovery | Replicas in different data centers survive regional failures |
Strategy 1: Leader-Follower (Primary-Replica)
Section titled “Strategy 1: Leader-Follower (Primary-Replica)”The most common replication strategy. One leader handles all writes; followers replicate from the leader and serve reads.
How It Works
Section titled “How It Works”- Client sends a write to the leader
- Leader persists the data locally
- Leader sends data to all followers
- Followers apply the changes to their copies
- Reads can go to any follower (or the leader)
Synchronous vs Asynchronous Replication
Section titled “Synchronous vs Asynchronous Replication”The critical question: Does the leader wait for followers before confirming the write?
Synchronous: The “Safe” Option
Section titled “Synchronous: The “Safe” Option”How it works: Leader waits until ALL (or some) followers confirm they received the data before telling the client “write successful.”
Like: Sending a registered letter—you wait for delivery confirmation.
Trade-off: If any follower is slow or down, the entire write is delayed or blocked.
Asynchronous: The “Fast” Option
Section titled “Asynchronous: The “Fast” Option”How it works: Leader immediately tells client “write successful,” then sends data to followers in the background.
Like: Sending a regular letter—you drop it in the mailbox and assume it’ll arrive.
Trade-off: If the leader crashes before replication completes, those writes are lost forever.
Strategy 2: Multi-Leader Replication
Section titled “Strategy 2: Multi-Leader Replication”Multiple nodes can accept writes. Each leader replicates to others. This is common for geo-distributed systems.
When to Use Multi-Leader
Section titled “When to Use Multi-Leader”- Multi-datacenter deployment — Users in US write to US leader, EU users write to EU leader
- Offline clients — Mobile apps that work offline (each device is a “leader”)
- Collaborative editing — Google Docs-style real-time collaboration
The Conflict Problem
Section titled “The Conflict Problem”What happens when two leaders accept conflicting writes at the same time?
Imagine this scenario:
- User A in the US changes their username to “alice_new”
- User B in the EU changes the same username to “alice_updated”
- Both leaders accept the write locally
- When they sync… which one wins?
Conflict Resolution Strategies
Section titled “Conflict Resolution Strategies”| Strategy | How It Works | Best For |
|---|---|---|
| Last-Write-Wins (LWW) | Most recent timestamp wins; older write is discarded | Simple data, acceptable to lose updates |
| First-Write-Wins | First timestamp wins; reject later writes | Immutable records |
| Merge | Combine both values using domain logic | Shopping carts, sets, counters |
| Custom/User Resolution | Store both, let app or user decide | Documents, complex data |
Strategy 3: Leaderless Replication
Section titled “Strategy 3: Leaderless Replication”No designated leader — any node can accept reads and writes. Used by systems like Cassandra, DynamoDB, and Riak.
Quorum: The Magic Formula
Section titled “Quorum: The Magic Formula”The key concept is quorum — a voting system for consistency:
- N = Total number of replicas
- W = Number of nodes that must confirm a write
- R = Number of nodes that must respond to a read
The Rule: If W + R > N, you’re guaranteed to read at least one node with the latest data.
Common Quorum Configurations
Section titled “Common Quorum Configurations”| Config | W | R | Trade-off |
|---|---|---|---|
| Balanced | 2 | 2 | Good consistency + availability |
| Write-heavy | 1 | 3 | Fast writes, slower reads |
| Read-heavy | 3 | 1 | Slow writes, fast reads |
Handling Replication Lag
Section titled “Handling Replication Lag”With async replication, followers may be behind. This creates read consistency challenges:
Read Consistency Levels
Section titled “Read Consistency Levels”| Level | What It Guarantees | How To Achieve |
|---|---|---|
| Eventual | Data will sync “eventually” | Read from any replica |
| Read-Your-Writes | See your own writes immediately | Read from leader after write, or track write timestamps |
| Monotonic Reads | Never see older data than before | Stick to same replica, or track read positions |
| Strong/Linearizable | Always see latest | Read from leader only |
Comparing All Three Strategies
Section titled “Comparing All Three Strategies”| Aspect | Leader-Follower | Multi-Leader | Leaderless |
|---|---|---|---|
| Write Scalability | Limited (1 leader) | Good (multiple leaders) | Best (any node) |
| Read Scalability | Good (add followers) | Good | Good |
| Consistency | Easier to achieve | Conflict resolution needed | Quorum-based |
| Latency (geo) | High (single leader) | Low (local leaders) | Low |
| Complexity | Simplest | Complex (conflicts) | Complex (quorums) |
| Examples | MySQL, PostgreSQL | CouchDB, Google Docs | Cassandra, DynamoDB |
Real-World Examples
Section titled “Real-World Examples”Example 1: MySQL Leader-Follower Replication (Synchronous)
Section titled “Example 1: MySQL Leader-Follower Replication (Synchronous)”Company: Facebook, GitHub, WordPress
Scenario: MySQL uses leader-follower replication where one primary database handles all writes and multiple replica databases handle reads. This provides high availability and read scalability.
Implementation: Uses semi-synchronous replication by default:
Why Leader-Follower?
- Simplicity: Single source of truth for writes
- Consistency: Easier to maintain strong consistency
- Read Scalability: Add replicas to scale reads horizontally
- Failover: Automatic promotion of replica to primary on failure
Real-World Impact:
- Facebook: Uses MySQL with leader-follower for billions of users
- GitHub: MySQL replicas handle read traffic, reducing primary load
- Performance: Read queries distributed across replicas, 10x read capacity
Example 2: Google Docs Multi-Leader Replication
Section titled “Example 2: Google Docs Multi-Leader Replication”Company: Google
Scenario: Google Docs allows multiple users to edit the same document simultaneously. Each user’s browser acts as a “leader” that can accept writes, and changes are synchronized across all clients.
Implementation: Uses multi-leader replication with operational transformation:
Why Multi-Leader?
- Low Latency: Users write to nearest server
- Offline Support: Works offline, syncs when online
- Collaboration: Multiple simultaneous editors
- Conflict Resolution: Operational transformation merges edits
Real-World Impact:
- Scale: Supports 50+ simultaneous editors per document
- Latency: < 100ms write latency (local server)
- Consistency: All users see same document within seconds
Example 3: Amazon DynamoDB Leaderless Replication
Section titled “Example 3: Amazon DynamoDB Leaderless Replication”Company: Amazon
Scenario: DynamoDB is a NoSQL database that uses leaderless replication. Any node can accept reads and writes, providing high availability and low latency.
Implementation: Uses quorum-based leaderless replication:
Why Leaderless?
- No Single Point of Failure: No leader to fail
- High Availability: System works even if nodes fail
- Low Latency: Write to nearest nodes
- Scalability: Add nodes to increase capacity
Real-World Impact:
- Scale: Handles millions of requests per second
- Availability: 99.99% uptime SLA
- Latency: Single-digit millisecond latency
- Durability: 99.999999999% (11 nines) durability
Example 4: PostgreSQL Streaming Replication (Asynchronous)
Section titled “Example 4: PostgreSQL Streaming Replication (Asynchronous)”Company: Instagram, Spotify, Reddit
Scenario: PostgreSQL uses streaming replication where the primary database streams WAL (Write-Ahead Log) records to replica databases asynchronously.
Implementation: Uses asynchronous streaming replication:
Why Asynchronous?
- Performance: Primary doesn’t wait for replicas
- Availability: Primary stays available even if replicas slow
- Scalability: Can add many replicas without impacting primary
- Trade-off: Small risk of data loss if primary fails before replication
Real-World Impact:
- Instagram: Uses PostgreSQL async replication for billions of photos
- Spotify: PostgreSQL replicas handle read traffic
- Replication Lag: Typically < 1 second lag
- Failover: < 30 seconds automatic failover
LLD ↔ HLD Connection
Section titled “LLD ↔ HLD Connection”Read-After-Write Consistency Implementation
Section titled “Read-After-Write Consistency Implementation”Ensure users see their own writes immediately, even with async replication:
| Replication Concept | LLD Implementation |
|---|---|
| Sync Replication | Write-through cache pattern, blocking writes |
| Async Replication | Write-behind pattern, background jobs, message queues |
| Conflict Resolution | Strategy pattern for merge logic, domain-specific resolvers |
| Read Consistency | Strategy pattern for read routing, client-side timestamp tracking |
| Failover | Observer pattern for leader changes, Circuit breaker for failed replicas |
Key Takeaways
Section titled “Key Takeaways”What’s Next?
Section titled “What’s Next?”Now that you understand how data is replicated, let’s learn about building systems that survive failures:
Next up: Fault Tolerance & Redundancy - Learn to design systems that work even when things break.