Availability Patterns
What is Availability?
Availability measures how often your system is up and working. It’s the percentage of time users can successfully use your service.
The Nines of Availability
The industry measures availability in “nines”: each additional nine cuts the allowed downtime by a factor of ten.
Downtime by Availability Level
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Use Case |
|---|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | 1.68 hours | Development/Test |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Standard apps |
| 99.95% | 4.38 hours | 21.9 min | 5 min | E-commerce |
| 99.99% | 52.6 min | 4.38 min | 1 min | Financial services |
| 99.999% | 5.26 min | 26.3 sec | 6 sec | Life-critical systems |
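
The figures in the table follow from one multiplication: allowed downtime = period × (1 − availability). A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_seconds` is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


# Print the yearly budget for each level in the table.
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.1f} min/year")
```

Passing a different `period_seconds` (for example, the number of seconds in a week) gives the other columns.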
SLI, SLO, and SLA: The Availability Triangle
Understanding these three terms is crucial for any engineer:
The Relationship
| Term | Definition | Owner | Consequence of Miss |
|---|---|---|---|
| SLI | The metric you measure | Engineering | Investigation triggered |
| SLO | Internal target (stricter than SLA) | Engineering + Product | Team prioritizes fixes |
| SLA | External promise to customers | Business | Financial penalties, lost trust |
Common SLIs to Track
| Category | SLI | What It Measures |
|---|---|---|
| Availability | Success rate | % of requests that succeed |
| Latency | P50, P95, P99 response time | How fast responses are |
| Throughput | Requests per second | System capacity |
| Error Rate | 5xx errors / total requests | Failure frequency |
| Saturation | CPU, memory, queue depth | How “full” the system is |
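
To make the SLI/SLO relationship concrete, the sketch below computes an availability SLI (success rate) from request counts and compares it with a target. The `RequestWindow` structure and the sample numbers are hypothetical; a real system would read these counts from its metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical structure)."""
    total: int
    server_errors: int  # responses with a 5xx status


def availability_sli(window: RequestWindow) -> float:
    """Success rate as a percentage: non-5xx requests over all requests."""
    if window.total == 0:
        return 100.0  # assumption: no traffic is treated as no failures
    return 100.0 * (window.total - window.server_errors) / window.total


def meets_slo(window: RequestWindow, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_percent


window = RequestWindow(total=1_000_000, server_errors=700)
print(meets_slo(window, 99.9))   # True: 99.93% meets a 99.9% target
print(meets_slo(window, 99.95))  # False: 99.93% misses a 99.95% target
```

The same measurement can pass a looser SLA while missing the stricter internal SLO, which is the early warning the SLO exists to provide.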
High Availability Patterns
Pattern 1: Redundancy (Eliminate Single Points of Failure)
A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. HA systems eliminate SPOFs through redundancy at every layer.
Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Backup sits idle until needed | Secondary DB that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind a load balancer |
| N+1 | Run one extra node for safety | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |
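
A rough way to see why redundancy helps is to estimate combined availability. The sketch below uses the standard formulas for components in series (all must be up) and in parallel (any one suffices). It assumes failures are independent, which real deployments only approximate, so treat the results as upper bounds rather than predictions. The function names are illustrative.

```python
from math import prod
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of redundant replicas where any single one is enough.

    Assumes independent failures (an idealization).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - replica_availability) ** replicas


# Two 99%-available replicas behind a load balancer: about 99.99% combined.
print(f"{parallel_availability(0.99, 2):.4%}")

# A 99.9% app server that depends on a 99.9% database: about 99.8% combined.
print(f"{serial_availability([0.999, 0.999]):.4%}")
```

Every non-redundant component in the request path multiplies availability down, which is why a single SPOF can cap the whole system.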
Pattern 2: Failover (Automatic Recovery)
When the primary fails, traffic automatically switches to the backup.
Key Failover Metrics:
| Metric | Description | Typical Target |
|---|---|---|
| Detection Time | How quickly we notice the failure | < 10 seconds |
| Failover Time | How long to switch to backup | < 30 seconds |
| Recovery Time | Total time until service restored | < 1 minute |
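
Failover is often handled by infrastructure such as load balancers or cluster managers, but the idea can be sketched at the client level. In the hypothetical example below, `call_with_failover` tries backends in priority order and returns the first success; `primary` and `backup` are stand-ins for real service calls, and detection here is simply the failed call rather than a separate health check.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any backup could serve the call."""


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in priority order; return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # real code should catch narrower error types
            errors.append(exc)
    raise AllBackendsFailed(errors)


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated outage


def backup() -> str:
    return "served by backup"


print(call_with_failover([primary, backup]))  # served by backup
```

In this sketch, detection time is bounded by how long the failing call takes to raise, so in practice each backend call would also need a timeout.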
Pattern 3: Graceful Degradation
When parts of your system fail, continue serving users with reduced functionality rather than complete failure.
The Principle: Identify which features are core vs nice-to-have, and ensure core features work even when nice-to-haves fail.
| E-commerce Example | Category | On Failure |
|---|---|---|
| Product info | Core | Must work — show error page if down |
| Recommendations | Nice-to-have | Hide the section or show it empty |
| Reviews | Nice-to-have | Show cached reviews, or hide the section if none are cached |
| Real-time inventory | Nice-to-have | Show the last cached stock status (e.g. “In Stock”) |
| Checkout | Core | Must work — queue if payment down |
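
The core vs nice-to-have split in the table can be sketched as follows: core calls are allowed to fail the request, while nice-to-have calls are wrapped so that a failure only removes their section. All function names and the page structure are hypothetical, and the recommendation outage is simulated.

```python
def fetch_product_info(product_id: str) -> dict:
    """Core dependency (stubbed for illustration)."""
    return {"id": product_id, "name": "Example product"}


def fetch_recommendations(product_id: str) -> list:
    """Nice-to-have dependency, simulated as unavailable."""
    raise TimeoutError("recommendation service timed out")


def render_product_page(product_id: str) -> dict:
    # Core feature: let failures propagate so the caller can show an error page.
    page = {"product": fetch_product_info(product_id)}

    # Nice-to-have feature: degrade to an empty section instead of failing.
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        page["recommendations"] = []
    return page


print(render_product_page("sku-1"))
```

The page still renders with product details even though the recommendation call failed; a cached fallback could replace the empty list where stale data is acceptable.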
LLD ↔ HLD Connection
How availability concepts affect your class design:
| Availability Concept | LLD Implementation |
|---|---|
| Redundancy | Multiple service client instances, failover logic |
| Failover | Strategy pattern for switching between providers |
| Graceful Degradation | Facade pattern with fallback methods, Optional returns |
| Health Checks | Implementing health check interfaces on classes |
| SLI Tracking | Decorator pattern for measuring method performance |
| Timeouts | Configurable timeouts in service clients |
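
As one illustration of the SLI Tracking row, a decorator can record call counts, errors, and latency around any method without changing its body. This is a minimal sketch with an in-memory store; `track_sli`, `metrics`, and `get_price` are made-up names, and a real system would export these values to a metrics backend instead.

```python
import functools
import time
from collections import defaultdict

# In-memory store keyed by operation name (assumption: single process).
metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def track_sli(name: str):
    """Decorator that records calls, errors, and elapsed time under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[name]["errors"] += 1
                raise  # the caller still sees the original error
            finally:
                metrics[name]["calls"] += 1
                metrics[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@track_sli("get_price")
def get_price(sku: str) -> float:
    if sku == "missing":
        raise KeyError(sku)
    return 9.99


get_price("abc")
try:
    get_price("missing")
except KeyError:
    pass

stats = metrics["get_price"]
print(stats["calls"], stats["errors"])  # 2 1
```

From these counters, the success rate is `(calls - errors) / calls` and the mean latency is `total_seconds / calls`.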
Key Takeaways
What’s Next?
Now that you understand availability concepts, let’s dive into how systems stay consistent across replicas:
Next up: Replication Strategies — Learn how data is replicated for both availability and performance.