Availability Patterns
What is Availability?
Availability measures how often your system is up and working. It’s the percentage of time users can successfully use your service.
The Nines of Availability
The industry measures availability in “nines” — each additional nine dramatically reduces allowed downtime:
Downtime by Availability Level
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Use Case |
|---|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Development/Test |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Standard apps |
| 99.95% | 4.38 hours | 21.9 min | 5 min | E-commerce |
| 99.99% | 52.6 min | 4.38 min | 1 min | Financial services |
| 99.999% | 5.26 min | 26.3 sec | 6 sec | Life-critical systems |
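The table values come straight from the percentage: allowed downtime is simply (1 − availability) multiplied by the length of the period. A quick sanity check in Python (approximate figures, assuming a 365-day year and a 30-day month):

```python
# Convert an availability percentage into allowed downtime per period.
# Approximations (365-day year, 30-day month) match the rounded table values.
def allowed_downtime(availability_pct: float) -> dict:
    unavailable_fraction = 1 - availability_pct / 100
    return {
        "per_year_hours": round(365 * 24 * unavailable_fraction, 2),
        "per_month_minutes": round(30 * 24 * 60 * unavailable_fraction, 1),
        "per_week_minutes": round(7 * 24 * 60 * unavailable_fraction, 1),
    }

print(allowed_downtime(99.9))   # {'per_year_hours': 8.76, 'per_month_minutes': 43.2, 'per_week_minutes': 10.1}
print(allowed_downtime(99.99))  # {'per_year_hours': 0.88, 'per_month_minutes': 4.3, 'per_week_minutes': 1.0}
```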
SLI, SLO, and SLA: The Availability Triangle
Understanding these three terms is crucial for any engineer:
The Relationship
| Term | Definition | Owner | Consequence of Miss |
|---|---|---|---|
| SLI | The metric you measure | Engineering | Investigation triggered |
| SLO | Internal target (stricter than SLA) | Engineering + Product | Team prioritizes fixes |
| SLA | External promise to customers | Business | Financial penalties, lost trust |
Common SLIs to Track
| Category | SLI | What It Measures |
|---|---|---|
| Availability | Success rate | % of requests that succeed |
| Latency | P50, P95, P99 response time | How fast responses are |
| Throughput | Requests per second | System capacity |
| Error Rate | 5xx errors / total requests | Failure frequency |
| Saturation | CPU, memory, queue depth | How “full” the system is |
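To make the relationship concrete, here is a minimal sketch (the request counts and targets are hypothetical) that computes an availability SLI from raw request counts and checks it against an internal SLO that is deliberately stricter than the customer-facing SLA:

```python
# Hypothetical numbers: compute a success-rate SLI and compare it to SLO/SLA targets.
SLA_TARGET = 99.9   # external promise to customers
SLO_TARGET = 99.95  # stricter internal target, so SLO breaches fire before SLA breaches

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: percentage of requests that succeed."""
    return 100.0 * successful_requests / total_requests

sli = availability_sli(successful_requests=9_993_000, total_requests=10_000_000)
print(f"SLI = {sli:.3f}%")            # 99.930%
print("SLO met:", sli >= SLO_TARGET)  # False -> team prioritizes reliability fixes
print("SLA met:", sli >= SLA_TARGET)  # True  -> no customer-facing penalty yet
```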
High Availability Patterns
Pattern 1: Redundancy (Eliminate Single Points of Failure)
A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. HA systems eliminate SPOFs through redundancy at every layer.
Redundancy Levels
| Level | Description | Example |
|---|---|---|
| Active-Passive | Backup sits idle until needed | Secondary DB that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind a load balancer |
| N+1 | Run one extra node for safety | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |
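The N+1/N+2 rows boil down to simple capacity arithmetic: run the nodes the workload needs plus spares, and check that capacity survives the failures you want to tolerate. A minimal sketch:

```python
# N+k redundancy sizing: run k spare nodes beyond what the workload needs.
def nodes_to_run(nodes_needed: int, spares: int) -> int:
    return nodes_needed + spares

def survives(nodes_running: int, nodes_needed: int, failures: int) -> bool:
    """True if enough capacity remains after `failures` nodes drop out."""
    return nodes_running - failures >= nodes_needed

n = 3  # workload needs 3 servers
print(nodes_to_run(n, spares=1), survives(4, n, failures=1))  # 4 True  (N+1: one failure)
print(nodes_to_run(n, spares=2), survives(5, n, failures=2))  # 5 True  (N+2: maintenance + failure)
```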
Pattern 2: Failover (Automatic Recovery)
When the primary fails, traffic automatically switches to the backup.
Key Failover Metrics:
| Metric | Description | Typical Target |
|---|---|---|
| Detection Time | How quickly we notice the failure | < 10 seconds |
| Failover Time | How long to switch to backup | < 30 seconds |
| Recovery Time | Total time until service restored | < 1 minute |
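Detection time is largely determined by how often you health-check the primary and how many consecutive failures you require before declaring it dead. A rough sketch with hypothetical settings:

```python
# Detection time is roughly: health-check interval x consecutive failures required.
# These settings are hypothetical; tune them against the targets in the table above.
CHECK_INTERVAL_S = 2      # probe the primary every 2 seconds
FAILURE_THRESHOLD = 3     # declare it down after 3 consecutive failed probes

detection_time_s = CHECK_INTERVAL_S * FAILURE_THRESHOLD
print(f"worst-case detection ~ {detection_time_s}s")  # 6s, within the < 10 second target
```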
Pattern 3: Graceful Degradation
When parts of your system fail, continue serving users with reduced functionality rather than complete failure.
The Principle: Identify which features are core vs nice-to-have, and ensure core features work even when nice-to-haves fail.
| Feature (e-commerce example) | Category | On Failure |
|---|---|---|
| Product info | Core | Must work — show error page if down |
| Recommendations | Nice-to-have | Hide section, show empty |
| Reviews | Nice-to-have | Hide section, show cached |
| Real-time inventory | Nice-to-have | Show “In Stock” (cached) |
| Checkout | Core | Must work — queue if payment down |
Real-World Examples
Example 1: AWS S3 Availability (99.999999999% Durability)
Company: Amazon Web Services (AWS)
Scenario: AWS S3 (Simple Storage Service) stores trillions of objects for millions of customers. The service must guarantee that data is never lost, even with hardware failures, data center outages, or natural disasters.
Implementation: Uses multi-region replication and erasure coding:
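S3’s exact encoding isn’t public, but the idea can be sketched as follows: each object is split into data fragments plus parity fragments spread across availability zones, and it remains readable as long as enough fragments survive. The fragment counts below are illustrative, and the actual coding math (e.g. Reed-Solomon) is omitted:

```python
# Conceptual sketch of erasure-coded placement (the encoding math is omitted).
# With 6 data + 4 parity fragments, any 6 surviving fragments can rebuild the object,
# so up to 4 fragments (including an entire AZ's share) can be lost.
DATA_FRAGMENTS = 6
PARITY_FRAGMENTS = 4
ZONES = ["az-1", "az-2", "az-3"]

def place_fragments() -> list[str]:
    """Spread all fragments round-robin across availability zones."""
    total = DATA_FRAGMENTS + PARITY_FRAGMENTS
    return [ZONES[i % len(ZONES)] for i in range(total)]

def reconstructible(surviving_fragments: int) -> bool:
    return surviving_fragments >= DATA_FRAGMENTS

print(place_fragments())          # az-1 holds 4 fragments, az-2 and az-3 hold 3 each
print(reconstructible(10 - 4))    # True: losing all of az-1 still leaves 6 fragments
print(reconstructible(10 - 5))    # False: beyond the parity budget
```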
Why This Works:
- Erasure Coding: Can lose up to 4 fragments and still reconstruct the object
- Multi-AZ: Fragments stored across multiple availability zones
- Multi-Region: Cross-Region Replication can additionally keep complete copies in different geographic regions
- Result: 99.999999999% (11 nines) durability
Real-World Impact:
- Scale: Stores over 100 trillion objects
- Uptime: 99.99% availability SLA
- Durability: Designed for 99.999999999% (storing 10 million objects, you would expect to lose a single object about once every 10,000 years)
Example 2: Google Search Availability (99.999% Uptime)
Company: Google
Scenario: Google Search handles billions of queries daily. Even a few minutes of downtime would impact millions of users worldwide.
Implementation: Uses massive redundancy and automatic failover:
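A minimal sketch of the two mechanisms described here, with illustrative server names and counts: provision two servers beyond what peak capacity needs (N+2) and drop failed servers from rotation as soon as health checks flag them:

```python
# Illustrative pool: the workload needs 3 servers, so N+2 means 5 are running.
pool = {"srv-1": True, "srv-2": True, "srv-3": True, "srv-4": True, "srv-5": True}
SERVERS_NEEDED = 3

def mark_unhealthy(server: str) -> None:
    """Health checker flags a failed server; it leaves rotation immediately."""
    pool[server] = False

def healthy_servers() -> list[str]:
    return [name for name, healthy in pool.items() if healthy]

mark_unhealthy("srv-2")
print(healthy_servers())                             # 4 servers remain in rotation
print(len(healthy_servers()) >= SERVERS_NEEDED)      # True: capacity survives the failure
```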
Why This Works:
- Geographic Redundancy: Multiple data centers per region
- N+2 Redundancy: Always run 2 extra servers beyond capacity needs
- Automatic Failover: Failed servers removed in seconds
- Result: 99.999% (five nines) availability
Real-World Impact:
- Queries: 8.5+ billion searches per day
- Uptime: Less than 5 minutes downtime per year
- Failover Time: < 30 seconds for automatic recovery
Example 3: Netflix Streaming (99.99% Availability)
Company: Netflix
Scenario: Netflix streams content to 200+ million subscribers. Service interruptions directly impact user experience and subscription retention.
Implementation: Uses CDN distribution and graceful degradation:
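A rough sketch of both behaviors, with illustrative CDN names and bitrates: try CDNs in preference order, then pick the highest bitrate the measured bandwidth can sustain instead of failing playback:

```python
# Sketch: fall back through CDNs in preference order, then degrade quality to fit
# the available bandwidth. CDN names and bitrates are illustrative.
CDNS = ["edge-cdn-primary", "backup-cdn-a", "backup-cdn-b"]
BITRATES_KBPS = [15_000, 7_500, 3_000, 1_500]   # 4K down to SD

def pick_cdn(is_cdn_up) -> str:
    for cdn in CDNS:
        if is_cdn_up(cdn):
            return cdn                    # fall back to the next provider on failure
    raise RuntimeError("no CDN available")

def pick_bitrate(bandwidth_kbps: int) -> int:
    for rate in BITRATES_KBPS:
        if bandwidth_kbps >= rate:
            return rate                   # degrade quality instead of failing playback
    return BITRATES_KBPS[-1]

print(pick_cdn(lambda cdn: cdn != "edge-cdn-primary"))  # "backup-cdn-a"
print(pick_bitrate(bandwidth_kbps=5_000))               # 3000 kbps
```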
Why This Works:
- CDN Distribution: Content cached at edge locations worldwide
- Multiple CDN Providers: Fallback to different CDN if one fails
- Quality Degradation: Lower bitrate if bandwidth limited
- Result: 99.99% availability even during peak hours
Real-World Impact:
- Peak Traffic: Roughly 15% of global downstream internet traffic during peak hours
- Streaming Quality: Automatic quality adjustment based on network conditions
- Availability: 99.99% uptime despite massive scale
Example 4: GitHub Availability (99.95% SLA)
Company: GitHub (Microsoft)
Scenario: GitHub hosts millions of repositories and serves millions of developers. Downtime impacts productivity and developer workflows globally.
Implementation: Uses active-active replication and read replicas:
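A simplified sketch of the routing logic, with illustrative database names: writes can land on any healthy primary, reads are spread across replicas, and losing a primary promotes a replica in its place:

```python
# Sketch: active-active writes, replica reads, and automatic promotion on failure.
# Database names are illustrative.
import random

primaries = ["db-primary-us", "db-primary-eu"]
replicas = ["db-replica-us", "db-replica-eu", "db-replica-ap"]

def route_write() -> str:
    if not primaries:
        raise RuntimeError("no primary available")
    return random.choice(primaries)       # any primary can accept writes

def route_read() -> str:
    return random.choice(replicas)        # spread read load across regions

def handle_primary_failure(failed: str) -> None:
    primaries.remove(failed)
    promoted = replicas.pop(0)            # automatic promotion of a replica
    primaries.append(promoted)

handle_primary_failure("db-primary-us")
print(primaries)   # ['db-primary-eu', 'db-replica-us'] -- writes keep flowing
```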
Why This Works:
- Active-Active: Multiple primary databases can handle writes
- Read Replicas: Distribute read load across multiple regions
- Automatic Failover: Primary failure triggers automatic promotion of replica
- Result: 99.95% availability SLA
Real-World Impact:
- Repositories: 100+ million repositories
- Users: 100+ million developers
- Uptime: 99.95% SLA with credits for downtime
- Failover: < 60 seconds for automatic failover
Example 5: Stripe Payment Processing (99.99% Availability)
Company: Stripe
Scenario: Stripe processes billions of dollars in payments. Payment failures directly impact merchant revenue and customer trust.
Implementation: Uses multi-region active-active architecture:
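A simplified sketch of the idea, with illustrative region names: a charge is acknowledged only after the record is written in every healthy region, and if the home region is down the other region takes the write:

```python
# Sketch: multi-region active-active payment processing with synchronous replication.
# Region names are illustrative.
REGIONS = {"us-east": True, "us-west": True}   # region -> healthy?

def process_payment(charge_id: str, home_region: str = "us-east") -> str:
    healthy = [r for r, ok in REGIONS.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Automatic failover: process in the home region if it is up, otherwise elsewhere.
    region = home_region if home_region in healthy else healthy[0]
    # Synchronous replication: the charge is acknowledged only after every healthy
    # region has durably written the record.
    return f"{charge_id} committed in {region}, replicated to {healthy}"

REGIONS["us-east"] = False
print(process_payment("ch_123"))   # committed in us-west despite the us-east outage
```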
Why This Works:
- Multi-Region Active-Active: Both regions can process payments
- Synchronous Replication: Critical payment data replicated synchronously
- Automatic Failover: < 30 seconds failover time
- Result: 99.99% availability with financial guarantees
Real-World Impact:
- Transaction Volume: Billions of dollars processed monthly
- Uptime: 99.99% availability SLA
- Failover: < 30 seconds automatic failover
- Financial Guarantees: Credits for downtime exceeding SLA
LLD ↔ HLD Connection
How availability concepts affect your class design:
Failover Pattern Implementation
When a primary service fails, automatically switch to a backup:
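A minimal sketch of the pattern in class form (the service interface and names are assumptions, not a specific library’s API): calls go to the primary until it throws, then the client switches to the backup and retries:

```python
# Failover wrapper: route calls to the primary, switch to the backup when the
# primary stops responding. Both services only need a `call()` method here.
class FailoverClient:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.using_backup = False

    def call(self, request):
        service = self.backup if self.using_backup else self.primary
        try:
            return service.call(request)
        except Exception:
            if self.using_backup:
                raise                      # both down: surface the failure
            self.using_backup = True       # fail over and retry on the backup
            return self.backup.call(request)

class EchoService:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
    def call(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"

client = FailoverClient(EchoService("primary", healthy=False), EchoService("backup"))
print(client.call("GET /orders"))   # "backup handled GET /orders"
```

In practice this is paired with a health checker that keeps probing the primary and switches traffic back once it recovers.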
Graceful Degradation Implementation
Continue serving users with reduced functionality when dependencies fail:
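A minimal sketch, assuming each feature can be described by a fetch function, an optional fallback value, and a core/nice-to-have flag, mirroring the e-commerce table above:

```python
# Graceful degradation: core calls must succeed, nice-to-have calls fall back to
# a cached or empty result instead of failing the whole page.
class DegradableFeature:
    def __init__(self, fetch, fallback=None, core=False):
        self.fetch = fetch          # callable that hits the real dependency
        self.fallback = fallback    # cached/empty value used when it fails
        self.core = core

    def get(self):
        try:
            return self.fetch()
        except Exception:
            if self.core:
                raise               # core feature: let the error surface
            return self.fallback    # nice-to-have: degrade silently

def broken_recommendations():
    raise TimeoutError("recommendation service unavailable")

product_info = DegradableFeature(lambda: {"name": "Widget", "price": 19.99}, core=True)
recommendations = DegradableFeature(broken_recommendations, fallback=[], core=False)

print(product_info.get())     # {'name': 'Widget', 'price': 19.99}
print(recommendations.get())  # [] -- section hidden, page still renders
```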
Key Takeaways
What’s Next?
Now that you understand availability concepts, let’s dive into how systems stay consistent across replicas:
Next up: Replication Strategies — Learn how data is replicated for both availability and performance.