Fault Tolerance & Redundancy
Embrace Failure
In distributed systems, failure is not exceptional — it’s the norm. Networks partition, servers crash, disks fail, and processes get killed. The question isn’t if things will fail, but when.
Types of Failures
Understanding failure modes helps you design appropriate responses:
| Failure Type | What Happens | How Common | Solution |
|---|---|---|---|
| Crash | Component stops completely | Very common | Redundancy, auto-restart |
| Omission | Messages lost or not delivered | Common | Retries, acknowledgments |
| Timing | Response slower than acceptable | Common | Timeouts, SLOs |
| Byzantine | Component behaves incorrectly | Rare | Consensus protocols |
Pattern 1: Redundancy
Eliminate every single point of failure by having backups for everything:
Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Backup sits idle until needed | Secondary database that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind load balancer |
| N+1 | N nodes needed, have N+1 | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |
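To make active-passive failover concrete, here is a minimal Python sketch (an illustration, not taken from this page): the caller tries replicas in priority order and only touches the standby when the primary fails. `call_with_failover`, `query_primary`, and `query_standby` are hypothetical names used as stand-ins for real replica calls.

```python
from __future__ import annotations

import random
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_failover(replicas: list[Callable[[], T]]) -> T:
    """Active-passive failover: try each replica in priority order and
    return the first successful result."""
    last_error: Exception | None = None
    for replica in replicas:
        try:
            return replica()
        except Exception as err:
            last_error = err  # this copy failed; fall through to the next one
    raise RuntimeError("all replicas failed") from last_error

# Stand-ins for real replica calls (e.g. two instances behind different hosts).
def query_primary() -> str:
    if random.random() < 0.3:            # simulate an occasional primary crash
        raise ConnectionError("primary is down")
    return "result from primary"

def query_standby() -> str:
    return "result from standby"

print(call_with_failover([query_primary, query_standby]))
```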
Pattern 2: Retry with Exponential Backoff
When operations fail, retry — but do it smartly. Immediate retries can overwhelm a struggling service.
Why Exponential Backoff?
| Strategy | Delay Pattern | Outcome |
|---|---|---|
| No Retry | — | One failure = complete failure |
| Immediate Retry | 0s, 0s, 0s… | Hammers failing service, makes it worse |
| Constant Delay | 1s, 1s, 1s… | May still overwhelm recovering service |
| Exponential Backoff | 1s, 2s, 4s, 8s… | ✅ Gives service time to recover |
| Exponential + Jitter | 1.2s, 2.7s, 4.1s… | ✅ Prevents thundering herd |
The Thundering Herd Problem
When a service goes down and comes back up, thousands of clients retry at the exact same moment — overwhelming the service again. Jitter adds randomness to retry times, spreading out the load.
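Here is a short sketch of the idea in Python; `retry_with_backoff` and `flaky_call` are hypothetical names, and the exact delays and jitter range are illustrative choices rather than prescribed values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry a flaky operation, doubling the delay between attempts and adding
    jitter so many clients do not retry in lockstep (thundering herd)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= random.uniform(0.5, 1.5)  # jitter spreads retries apart
            time.sleep(delay)

# Usage with a stand-in flaky call that fails about half the time.
def flaky_call():
    if random.random() < 0.5:
        raise TimeoutError("service unavailable")
    return "ok"

print(retry_with_backoff(flaky_call))
```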
Pattern 3: The Bulkhead Pattern
Inspired by ship compartments that prevent a leak from sinking the entire ship. Isolate resources so one failing component can’t starve everything else.
Bulkhead in Practice
The Problem: You have a shared connection pool of 100 connections. Your payment service gets slow (a third-party issue). All 100 connections end up stuck waiting for payment responses. Now your entire application is frozen — including checkout, browsing, and search.
The Solution: Separate pools per dependency:
- Payment: 20 connections max
- Inventory: 30 connections max
- Search: 50 connections max
If payment slows down, only 20 connections are affected. Everything else keeps working.
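One way to sketch this in code is to give each dependency its own bounded worker pool. The pool names and sizes below simply mirror the illustrative numbers above and are not a specific library API.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools per dependency: a slow payment provider can tie up
# its own 20 workers, but inventory and search keep their full capacity.
POOLS = {
    "payment":   ThreadPoolExecutor(max_workers=20),
    "inventory": ThreadPoolExecutor(max_workers=30),
    "search":    ThreadPoolExecutor(max_workers=50),
}

def submit(dependency: str, fn, *args):
    """Run a call on the pool reserved for its dependency."""
    return POOLS[dependency].submit(fn, *args)

# Usage: even if every payment worker is stuck waiting on the third party,
# search requests still have 50 workers of their own.
future = submit("search", lambda query: f"results for {query}", "red shoes")
print(future.result())
```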
Pattern 4: Timeout Everything
Never wait indefinitely. Every external call should have a timeout. Hanging requests consume resources and cause cascading failures.
Timeout Guidelines
| Operation Type | Recommended Timeout | Why |
|---|---|---|
| In-memory cache | 10-50ms | Should be instant |
| Database query | 1-5s | Longer usually means a bad query or lock contention |
| Internal service | 1-3s | Should be fast |
| External API | 5-30s | Less control, may be slow |
| File upload | 30-120s | Depends on size |
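A rough illustration of enforcing a deadline around any call, using a worker pool and `Future.result(timeout=...)`. The helper name and the 3-second budget are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

_executor = ThreadPoolExecutor(max_workers=10)

def call_with_timeout(fn, timeout_seconds, *args):
    """Bound a call with a deadline instead of waiting indefinitely."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # best effort; a call that already started keeps running
        raise TimeoutError(f"call exceeded {timeout_seconds}s")

# Usage: give a (simulated) slow internal service a 3-second budget.
def slow_service():
    time.sleep(10)
    return "too late"

try:
    call_with_timeout(slow_service, 3)
except TimeoutError as err:
    print(err)  # call exceeded 3s
```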
Pattern 5: Fail Fast
When you know something won’t work, fail immediately instead of wasting time and resources.
Fail Fast Checklist
Before doing expensive work, check (a sketch follows the list):
- Input Validation — Reject bad requests immediately
- Capacity Check — Are we at limit? Reject with 503
- Dependency Health — Is the service we need alive?
- Feature Flags — Is this feature enabled?
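The checks above translate naturally into guard clauses at the top of a handler. The sketch below is hypothetical (the flag names, limits, and `RejectedError` type are invented for illustration), but it shows the shape: cheap rejections before any expensive work.

```python
from __future__ import annotations

MAX_IN_FLIGHT = 100
in_flight = 0                # assumed to be updated by request middleware
dependency_healthy = True    # assumed to be updated by a background health check
FEATURE_FLAGS = {"recommendations": True}

class RejectedError(Exception):
    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status

def handle_request(user_id: str | None, feature: str) -> str:
    """Guard clauses: reject cheaply before doing any expensive work."""
    if not user_id:
        raise RejectedError(400, "missing user_id")                # input validation
    if in_flight >= MAX_IN_FLIGHT:
        raise RejectedError(503, "at capacity, try again later")   # capacity check
    if not dependency_healthy:
        raise RejectedError(503, "downstream dependency is down")  # dependency health
    if not FEATURE_FLAGS.get(feature, False):
        raise RejectedError(404, "feature disabled")               # feature flag
    return do_expensive_work(user_id)                              # only now do real work

def do_expensive_work(user_id: str) -> str:
    return f"recommendations for {user_id}"

print(handle_request("u123", "recommendations"))
```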
Pattern Summary
| Pattern | What It Does | When to Use |
|---|---|---|
| Redundancy | Multiple copies of components | Always for critical paths |
| Retry + Backoff | Retry failed ops with increasing delays | Transient failures |
| Bulkhead | Isolate resources per component | Multi-dependency systems |
| Timeout | Set max wait time for operations | Every external call |
| Fail Fast | Reject early if we know it will fail | Expensive operations |
LLD ↔ HLD Connection
| Fault Tolerance Concept | LLD Implementation |
|---|---|
| Redundancy | Multiple service instances, object pools |
| Retry Logic | Strategy pattern for retry policies, decorator pattern |
| Bulkhead | Thread pools with limits, semaphores, connection pools |
| Timeout | Every external call wrapped with timeout config |
| Fail Fast | Guard clauses, precondition checks at method start |
| Circuit Breaker | State machine pattern (covered in Resiliency section) |
Key Takeaways
What’s Next?
We’ve covered how to handle failures. Now let’s learn how to monitor system health proactively:
Next up: Health Checks & Heartbeats — Learn to detect failures before they impact users.