Availability Patterns
What is Availability?
Availability measures how often your system is up and working. It’s the percentage of time users can successfully use your service.
The Nines of Availability
The industry measures availability in “nines” — each additional nine dramatically reduces allowed downtime:
Downtime by Availability Level
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Use Case |
|---|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Development/Test |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Standard apps |
| 99.95% | 4.38 hours | 21.9 min | 5 min | E-commerce |
| 99.99% | 52.6 min | 4.38 min | 1 min | Financial services |
| 99.999% | 5.26 min | 26.3 sec | 6 sec | Life-critical systems |
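The table values come straight from the percentage: allowed downtime is simply (1 − availability) multiplied by the length of the period. A quick sanity check in Python (approximate figures, assuming a 365-day year and a 30-day month):

```python
# Convert an availability percentage into allowed downtime per period.
# Approximations (365-day year, 30-day month) match the rounded table values.
def allowed_downtime(availability_pct: float) -> dict:
    unavailable_fraction = 1 - availability_pct / 100
    return {
        "per_year_hours": round(365 * 24 * unavailable_fraction, 2),
        "per_month_minutes": round(30 * 24 * 60 * unavailable_fraction, 1),
        "per_week_minutes": round(7 * 24 * 60 * unavailable_fraction, 1),
    }

print(allowed_downtime(99.9))   # {'per_year_hours': 8.76, 'per_month_minutes': 43.2, 'per_week_minutes': 10.1}
print(allowed_downtime(99.99))  # {'per_year_hours': 0.88, 'per_month_minutes': 4.3, 'per_week_minutes': 1.0}
```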
SLI, SLO, and SLA: The Availability Triangle
Understanding these three terms is crucial for any engineer:
The Relationship
| Term | Definition | Owner | Consequence of Miss |
|---|---|---|---|
| SLI | The metric you measure | Engineering | Investigation triggered |
| SLO | Internal target (stricter than SLA) | Engineering + Product | Team prioritizes fixes |
| SLA | External promise to customers | Business | Financial penalties, lost trust |
Common SLIs to Track
| Category | SLI | What It Measures |
|---|---|---|
| Availability | Success rate | % of requests that succeed |
| Latency | P50, P95, P99 response time | How fast responses are |
| Throughput | Requests per second | System capacity |
| Error Rate | 5xx errors / total requests | Failure frequency |
| Saturation | CPU, memory, queue depth | How “full” the system is |
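To make the relationship concrete, here is a minimal sketch (the request counts and targets are hypothetical) that computes an availability SLI from raw request counts and checks it against an internal SLO that is deliberately stricter than the customer-facing SLA:

```python
# Hypothetical numbers: compute a success-rate SLI and compare it to SLO/SLA targets.
SLA_TARGET = 99.9   # external promise to customers
SLO_TARGET = 99.95  # stricter internal target, so SLO breaches fire before SLA breaches

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: percentage of requests that succeed."""
    return 100.0 * successful_requests / total_requests

sli = availability_sli(successful_requests=9_993_000, total_requests=10_000_000)
print(f"SLI = {sli:.3f}%")            # 99.930%
print("SLO met:", sli >= SLO_TARGET)  # False -> team prioritizes reliability fixes
print("SLA met:", sli >= SLA_TARGET)  # True  -> no customer-facing penalty yet
```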
High Availability Patterns
Pattern 1: Redundancy (Eliminate Single Points of Failure)
A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. HA systems eliminate SPOFs through redundancy at every layer.
Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Backup sits idle until needed | Secondary DB that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind a load balancer |
| N+1 | Run one extra node for safety | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |
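The N+1/N+2 rows boil down to simple capacity arithmetic: run the nodes the workload needs plus spares, and check that capacity survives the failures you want to tolerate. A minimal sketch:

```python
# N+k redundancy sizing: run k spare nodes beyond what the workload needs.
def nodes_to_run(nodes_needed: int, spares: int) -> int:
    return nodes_needed + spares

def survives(nodes_running: int, nodes_needed: int, failures: int) -> bool:
    """True if enough capacity remains after `failures` nodes drop out."""
    return nodes_running - failures >= nodes_needed

n = 3  # workload needs 3 servers
print(nodes_to_run(n, spares=1), survives(4, n, failures=1))  # 4 True  (N+1: one failure)
print(nodes_to_run(n, spares=2), survives(5, n, failures=2))  # 5 True  (N+2: maintenance + failure)
```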
Pattern 2: Failover (Automatic Recovery)
When the primary fails, traffic automatically switches to the backup.
Key Failover Metrics:
| Metric | Description | Typical Target |
|---|---|---|
| Detection Time | How quickly we notice the failure | < 10 seconds |
| Failover Time | How long to switch to backup | < 30 seconds |
| Recovery Time | Total time until service restored | < 1 minute |
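Detection time is largely determined by how often you health-check the primary and how many consecutive failures you require before declaring it dead. A rough sketch with hypothetical settings:

```python
# Detection time is roughly: health-check interval x consecutive failures required.
# These settings are hypothetical; tune them against the targets in the table above.
CHECK_INTERVAL_S = 2      # probe the primary every 2 seconds
FAILURE_THRESHOLD = 3     # declare it down after 3 consecutive failed probes

detection_time_s = CHECK_INTERVAL_S * FAILURE_THRESHOLD
print(f"worst-case detection ~ {detection_time_s}s")  # 6s, within the < 10 second target
```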
Pattern 3: Graceful Degradation
When parts of your system fail, continue serving users with reduced functionality rather than complete failure.
The Principle: Identify which features are core vs nice-to-have, and ensure core features work even when nice-to-haves fail.
| Feature (e-commerce example) | Category | On Failure |
|---|---|---|
| Product info | Core | Must work — show error page if down |
| Recommendations | Nice-to-have | Hide section, show empty |
| Reviews | Nice-to-have | Hide section, show cached |
| Real-time inventory | Nice-to-have | Show “In Stock” (cached) |
| Checkout | Core | Must work — queue if payment down |
Real-World Examples
Example 1: AWS S3 Availability (99.999999999% Durability)
Company: Amazon Web Services (AWS)
Scenario: AWS S3 (Simple Storage Service) stores trillions of objects for millions of customers. The service must guarantee that data is never lost, even with hardware failures, data center outages, or natural disasters.
Implementation: Uses multi-region replication and erasure coding:
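S3’s exact encoding isn’t public, but the idea can be sketched as follows: each object is split into data fragments plus parity fragments spread across availability zones, and it remains readable as long as enough fragments survive. The fragment counts below are illustrative, and the actual coding math (e.g. Reed-Solomon) is omitted:

```python
# Conceptual sketch of erasure-coded placement (the encoding math is omitted).
# With 6 data + 4 parity fragments, any 6 surviving fragments can rebuild the object,
# so up to 4 fragments (including an entire AZ's share) can be lost.
DATA_FRAGMENTS = 6
PARITY_FRAGMENTS = 4
ZONES = ["az-1", "az-2", "az-3"]

def place_fragments() -> list[str]:
    """Spread all fragments round-robin across availability zones."""
    total = DATA_FRAGMENTS + PARITY_FRAGMENTS
    return [ZONES[i % len(ZONES)] for i in range(total)]

def reconstructible(surviving_fragments: int) -> bool:
    return surviving_fragments >= DATA_FRAGMENTS

print(place_fragments())          # az-1 holds 4 fragments, az-2 and az-3 hold 3 each
print(reconstructible(10 - 4))    # True: losing all of az-1 still leaves 6 fragments
print(reconstructible(10 - 5))    # False: beyond the parity budget
```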
Why This Works:
- Erasure Coding: Can lose up to 4 fragments and still reconstruct the object
- Multi-AZ: Fragments stored across multiple availability zones
- Multi-Region: Cross-Region Replication can additionally keep complete copies in different geographic regions
- Result: 99.999999999% (11 nines) durability
Real-World Impact:
- Scale: Stores over 100 trillion objects
- Uptime: 99.99% availability SLA
- Durability: Designed for 99.999999999% (storing 10 million objects, you would expect to lose a single object about once every 10,000 years)
Example 2: Google Search Availability (99.999% Uptime)
Company: Google
Scenario: Google Search handles billions of queries daily. Even a few minutes of downtime would impact millions of users worldwide.
Implementation: Uses massive redundancy and automatic failover:
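A minimal sketch of the two mechanisms described here, with illustrative server names and counts: provision two servers beyond what peak capacity needs (N+2) and drop failed servers from rotation as soon as health checks flag them:

```python
# Illustrative pool: the workload needs 3 servers, so N+2 means 5 are running.
pool = {"srv-1": True, "srv-2": True, "srv-3": True, "srv-4": True, "srv-5": True}
SERVERS_NEEDED = 3

def mark_unhealthy(server: str) -> None:
    """Health checker flags a failed server; it leaves rotation immediately."""
    pool[server] = False

def healthy_servers() -> list[str]:
    return [name for name, healthy in pool.items() if healthy]

mark_unhealthy("srv-2")
print(healthy_servers())                             # 4 servers remain in rotation
print(len(healthy_servers()) >= SERVERS_NEEDED)      # True: capacity survives the failure
```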
Why This Works:
- Geographic Redundancy: Multiple data centers per region
- N+2 Redundancy: Always run 2 extra servers beyond capacity needs
- Automatic Failover: Failed servers removed in seconds
- Result: 99.999% (five nines) availability
Real-World Impact:
- Queries: 8.5+ billion searches per day
- Uptime: Less than 5 minutes downtime per year
- Failover Time: < 30 seconds for automatic recovery
Example 3: Netflix Streaming (99.99% Availability)
Company: Netflix
Scenario: Netflix streams content to 200+ million subscribers. Service interruptions directly impact user experience and subscription retention.
Implementation: Uses CDN distribution and graceful degradation:
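A rough sketch of both behaviors, with illustrative CDN names and bitrates: try CDNs in preference order, then pick the highest bitrate the measured bandwidth can sustain instead of failing playback:

```python
# Sketch: fall back through CDNs in preference order, then degrade quality to fit
# the available bandwidth. CDN names and bitrates are illustrative.
CDNS = ["edge-cdn-primary", "backup-cdn-a", "backup-cdn-b"]
BITRATES_KBPS = [15_000, 7_500, 3_000, 1_500]   # 4K down to SD

def pick_cdn(is_cdn_up) -> str:
    for cdn in CDNS:
        if is_cdn_up(cdn):
            return cdn                    # fall back to the next provider on failure
    raise RuntimeError("no CDN available")

def pick_bitrate(bandwidth_kbps: int) -> int:
    for rate in BITRATES_KBPS:
        if bandwidth_kbps >= rate:
            return rate                   # degrade quality instead of failing playback
    return BITRATES_KBPS[-1]

print(pick_cdn(lambda cdn: cdn != "edge-cdn-primary"))  # "backup-cdn-a"
print(pick_bitrate(bandwidth_kbps=5_000))               # 3000 kbps
```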
Why This Works:
- CDN Distribution: Content cached at edge locations worldwide
- Multiple CDN Providers: Fallback to different CDN if one fails
- Quality Degradation: Lower bitrate if bandwidth limited
- Result: 99.99% availability even during peak hours
Real-World Impact:
- Peak Traffic: Roughly 15% of global downstream internet traffic during peak hours
- Streaming Quality: Automatic quality adjustment based on network conditions
- Availability: 99.99% uptime despite massive scale
Example 4: GitHub Availability (99.95% SLA)
Company: GitHub (Microsoft)
Scenario: GitHub hosts millions of repositories and serves millions of developers. Downtime impacts productivity and developer workflows globally.
Implementation: Uses active-active replication and read replicas:
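A simplified sketch of the routing logic, with illustrative database names: writes can land on any healthy primary, reads are spread across replicas, and losing a primary promotes a replica in its place:

```python
# Sketch: active-active writes, replica reads, and automatic promotion on failure.
# Database names are illustrative.
import random

primaries = ["db-primary-us", "db-primary-eu"]
replicas = ["db-replica-us", "db-replica-eu", "db-replica-ap"]

def route_write() -> str:
    if not primaries:
        raise RuntimeError("no primary available")
    return random.choice(primaries)       # any primary can accept writes

def route_read() -> str:
    return random.choice(replicas)        # spread read load across regions

def handle_primary_failure(failed: str) -> None:
    primaries.remove(failed)
    promoted = replicas.pop(0)            # automatic promotion of a replica
    primaries.append(promoted)

handle_primary_failure("db-primary-us")
print(primaries)   # ['db-primary-eu', 'db-replica-us'] -- writes keep flowing
```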
Why This Works:
- Active-Active: Multiple primary databases can handle writes
- Read Replicas: Distribute read load across multiple regions
- Automatic Failover: Primary failure triggers automatic promotion of replica
- Result: 99.95% availability SLA
Real-World Impact:
- Repositories: 100+ million repositories
- Users: 100+ million developers
- Uptime: 99.95% SLA with credits for downtime
- Failover: < 60 seconds for automatic failover
Example 5: Stripe Payment Processing (99.99% Availability)
Company: Stripe
Scenario: Stripe processes billions of dollars in payments. Payment failures directly impact merchant revenue and customer trust.
Implementation: Uses multi-region active-active architecture:
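A simplified sketch of the idea, with illustrative region names: a charge is acknowledged only after the record is written in every healthy region, and if the home region is down the other region takes the write:

```python
# Sketch: multi-region active-active payment processing with synchronous replication.
# Region names are illustrative.
REGIONS = {"us-east": True, "us-west": True}   # region -> healthy?

def process_payment(charge_id: str, home_region: str = "us-east") -> str:
    healthy = [r for r, ok in REGIONS.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Automatic failover: process in the home region if it is up, otherwise elsewhere.
    region = home_region if home_region in healthy else healthy[0]
    # Synchronous replication: the charge is acknowledged only after every healthy
    # region has durably written the record.
    return f"{charge_id} committed in {region}, replicated to {healthy}"

REGIONS["us-east"] = False
print(process_payment("ch_123"))   # committed in us-west despite the us-east outage
```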
Why This Works:
- Multi-Region Active-Active: Both regions can process payments
- Synchronous Replication: Critical payment data replicated synchronously
- Automatic Failover: < 30 seconds failover time
- Result: 99.99% availability with financial guarantees
Real-World Impact:
- Transaction Volume: Billions of dollars processed monthly
- Uptime: 99.99% availability SLA
- Failover: < 30 seconds automatic failover
- Financial Guarantees: Credits for downtime exceeding SLA
LLD ↔ HLD Connection
How availability concepts affect your class design:
Failover Pattern Implementation
When a primary service fails, automatically switch to a backup:
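A minimal sketch of the pattern in class form (the service interface and names are assumptions, not a specific library’s API): calls go to the primary until it throws, then the client switches to the backup and retries:

```python
# Failover wrapper: route calls to the primary, switch to the backup when the
# primary stops responding. Both services only need a `call()` method here.
class FailoverClient:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.using_backup = False

    def call(self, request):
        service = self.backup if self.using_backup else self.primary
        try:
            return service.call(request)
        except Exception:
            if self.using_backup:
                raise                      # both down: surface the failure
            self.using_backup = True       # fail over and retry on the backup
            return self.backup.call(request)

class EchoService:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
    def call(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"

client = FailoverClient(EchoService("primary", healthy=False), EchoService("backup"))
print(client.call("GET /orders"))   # "backup handled GET /orders"
```

In practice this is paired with a health checker that keeps probing the primary and switches traffic back once it recovers.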
Graceful Degradation Implementation
Continue serving users with reduced functionality when dependencies fail:
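A minimal sketch, assuming each feature can be described by a fetch function, an optional fallback value, and a core/nice-to-have flag, mirroring the e-commerce table above:

```python
# Graceful degradation: core calls must succeed, nice-to-have calls fall back to
# a cached or empty result instead of failing the whole page.
class DegradableFeature:
    def __init__(self, fetch, fallback=None, core=False):
        self.fetch = fetch          # callable that hits the real dependency
        self.fallback = fallback    # cached/empty value used when it fails
        self.core = core

    def get(self):
        try:
            return self.fetch()
        except Exception:
            if self.core:
                raise               # core feature: let the error surface
            return self.fallback    # nice-to-have: degrade silently

def broken_recommendations():
    raise TimeoutError("recommendation service unavailable")

product_info = DegradableFeature(lambda: {"name": "Widget", "price": 19.99}, core=True)
recommendations = DegradableFeature(broken_recommendations, fallback=[], core=False)

print(product_info.get())     # {'name': 'Widget', 'price': 19.99}
print(recommendations.get())  # [] -- section hidden, page still renders
```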
Key Takeaways
What’s Next?
Now that you understand availability concepts, let’s dive into how systems stay consistent across replicas:
Next up: Replication Strategies — Learn how data is replicated for both availability and performance.