
Availability Patterns

Designing systems that never sleep

Availability measures how often your system is up and working. It’s the percentage of time users can successfully use your service.


The industry measures availability in “nines” — each additional nine dramatically reduces allowed downtime:

| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Use Case |
|--------------|---------------|----------------|---------------|----------|
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Development/Test |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Standard apps |
| 99.95% | 4.38 hours | 21.9 min | 5 min | E-commerce |
| 99.99% | 52.6 min | 4.38 min | 1 min | Financial services |
| 99.999% | 5.26 min | 26.3 sec | 6 sec | Life-critical systems |
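
These downtime budgets are just arithmetic on the availability fraction. A quick Python sketch that reproduces the yearly column:

```python
from datetime import timedelta

def downtime_per_year(availability_pct: float) -> timedelta:
    """Allowed downtime per 365-day year for a given availability percentage."""
    fraction_down = 1 - availability_pct / 100
    return timedelta(days=365 * fraction_down)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines)}")
# 99.0%    -> 3 days, 15:36:00   (3.65 days)
# 99.9%    -> 8:45:36            (8.76 hours)
# 99.99%   -> 0:52:33.6          (52.6 minutes)
# 99.999%  -> 0:05:15.36         (5.26 minutes)
```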

SLI, SLO, and SLA: The Availability Triangle


Understanding these three terms is crucial for any engineer:

| Term | Definition | Owner | Consequence of Miss |
|------|------------|-------|---------------------|
| SLI | The metric you measure | Engineering | Investigation triggered |
| SLO | Internal target (stricter than the SLA) | Engineering + Product | Team prioritizes fixes |
| SLA | External promise to customers | Business | Financial penalties, lost trust |

Common SLIs and what they measure:

| Category | SLI | What It Measures |
|----------|-----|------------------|
| Availability | Success rate | % of requests that succeed |
| Latency | P50, P95, P99 response time | How fast responses are |
| Throughput | Requests per second | System capacity |
| Error Rate | 5xx errors / total requests | Failure frequency |
| Saturation | CPU, memory, queue depth | How “full” the system is |
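
A minimal sketch of how the three fit together; the traffic numbers and thresholds here are illustrative:

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    return successful / total if total else 1.0

SLO = 0.999   # internal target: 99.9%
SLA = 0.995   # external promise: 99.5% (the stricter SLO gives a safety margin)

sli = availability_sli(successful=998_320, total=1_000_000)
print(f"SLI = {sli:.4%}")       # SLI = 99.8320%
if sli < SLO:
    print("SLO miss: team prioritizes reliability work")
if sli < SLA:
    print("SLA breach: financial penalties apply")
# Here the SLO is missed but the SLA still holds -- exactly the buffer
# the stricter internal target is meant to provide.
```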

Pattern 1: Redundancy (Eliminate Single Points of Failure)


A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. Highly available (HA) systems eliminate SPOFs by adding redundancy at every layer.

| Level | Description | Example |
|-------|-------------|---------|
| Active-Passive | Backup sits idle until needed | Secondary DB that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind a load balancer |
| N+1 | Run one extra node for safety | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |

Pattern 2: Failover (Automatic Switching)

When the primary fails, traffic automatically switches to the backup.


Key Failover Metrics:

| Metric | Description | Typical Target |
|--------|-------------|----------------|
| Detection Time | How quickly we notice the failure | < 10 seconds |
| Failover Time | How long to switch to backup | < 30 seconds |
| Recovery Time | Total time until service restored | < 1 minute |
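
One common way to meet the detection-time target is a heartbeat loop that declares failure only after several consecutive missed checks. A minimal sketch; `check_health` and `trigger_failover` are placeholders you would supply:

```python
import time

HEARTBEAT_INTERVAL = 2.0   # seconds between checks (illustrative)
FAILURE_THRESHOLD = 3      # consecutive misses before declaring failure

def monitor(check_health, trigger_failover):
    """Declare the primary dead after FAILURE_THRESHOLD consecutive missed
    heartbeats, then fail over. Worst-case detection time is bounded by
    HEARTBEAT_INTERVAL * FAILURE_THRESHOLD (6 s here, under the 10 s target)."""
    misses = 0
    while True:
        misses = 0 if check_health() else misses + 1
        if misses >= FAILURE_THRESHOLD:
            trigger_failover()   # switch traffic to the backup
            return
        time.sleep(HEARTBEAT_INTERVAL)

# Simulation: a primary that never answers triggers failover after 3 misses.
monitor(check_health=lambda: False,
        trigger_failover=lambda: print("failing over to backup"))
```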

Pattern 3: Graceful Degradation

When parts of your system fail, continue serving users with reduced functionality rather than failing completely.


The Principle: Identify which features are core vs nice-to-have, and ensure core features work even when nice-to-haves fail.

E-commerce example:

| Feature | Category | On Failure |
|---------|----------|------------|
| Product info | Core | Must work — show error page if down |
| Recommendations | Nice-to-have | Hide section, show empty |
| Reviews | Nice-to-have | Hide section, show cached |
| Real-time inventory | Nice-to-have | Show “In Stock” (cached) |
| Checkout | Core | Must work — queue if payment down |
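
In code, this maps to per-section fallbacks. A sketch of the table above; the `fetch_*` functions are illustrative stand-ins for real service calls:

```python
# Illustrative stand-ins for real service calls; each can fail independently.
def fetch_product_info(pid):    return {"id": pid, "name": "Widget"}
def fetch_recommendations(pid): raise TimeoutError("recs service down")
def fetch_reviews(pid):         raise TimeoutError("reviews service down")
def cached_reviews(pid):        return [{"stars": 5, "text": "cached review"}]

def render_product_page(product_id: str) -> dict:
    """Assemble the page, degrading nice-to-have sections per the table above."""
    # Core: let failures propagate -- the caller shows an error page.
    page = {"product": fetch_product_info(product_id)}

    # Nice-to-have: hide the section entirely on failure.
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        page["recommendations"] = []

    # Nice-to-have: fall back to a cached copy on failure.
    try:
        page["reviews"] = fetch_reviews(product_id)
    except Exception:
        page["reviews"] = cached_reviews(product_id)

    return page

print(render_product_page("sku-123"))
```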

Example 1: AWS S3 Availability (99.999999999% Durability)


Company: Amazon Web Services (AWS)

Scenario: AWS S3 (Simple Storage Service) stores trillions of objects for millions of customers. The service must guarantee that data is never lost, even with hardware failures, data center outages, or natural disasters.

Implementation: Uses multi-region replication and erasure coding:


Why This Works:

  • Erasure Coding: Can lose up to 4 fragments and still reconstruct the object
  • Multi-AZ: Fragments stored across multiple availability zones
  • Multi-Region: Complete copies in different geographic regions
  • Result: 99.999999999% (11 nines) durability

Real-World Impact:

  • Scale: Stores over 100 trillion objects
  • Uptime: 99.99% availability SLA
  • Durability: Designed for 99.999999999% (store 10 million objects and expect to lose, on average, one object every 10,000 years)
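
To see why losing up to 4 of an object's fragments is survivable, here is a toy durability model assuming independent fragment failures. The fragment count and failure rate are made up for illustration; S3's real parameters are not public:

```python
from math import comb

def object_loss_probability(n: int, max_lost: int, p: float) -> float:
    """Probability of losing more than max_lost of n fragments, assuming
    independent fragment failures with annual probability p each."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(max_lost + 1, n + 1))

# Illustrative: 10 fragments, tolerate 4 losses, 1% annual fragment-failure rate.
p_loss = object_loss_probability(n=10, max_lost=4, p=0.01)
print(f"P(object lost) ~ {p_loss:.2e}")     # ~2.42e-08
print(f"Durability     ~ {1 - p_loss:.11f}")
# Roughly eight nines from erasure coding alone in this toy model;
# multi-AZ and multi-region replication push the design target higher.
```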

Example 2: Google Search Availability (99.999% Uptime)


Company: Google

Scenario: Google Search handles billions of queries daily. Even a few minutes of downtime would impact millions of users worldwide.

Implementation: Uses massive redundancy and automatic failover:


Why This Works:

  • Geographic Redundancy: Multiple data centers per region
  • N+2 Redundancy: Always run 2 extra servers beyond capacity needs
  • Automatic Failover: Failed servers removed in seconds
  • Result: 99.999% (five nines) availability

Real-World Impact:

  • Queries: 8.5+ billion searches per day
  • Uptime: Less than 5 minutes downtime per year
  • Failover Time: < 30 seconds for automatic recovery

Example 3: Netflix Streaming (99.99% Availability)


Company: Netflix

Scenario: Netflix streams content to 200+ million subscribers. Service interruptions directly impact user experience and subscription retention.

Implementation: Uses CDN distribution and graceful degradation:


Why This Works:

  • CDN Distribution: Content cached at edge locations worldwide
  • Multiple CDN Providers: Fallback to different CDN if one fails
  • Quality Degradation: Lower bitrate if bandwidth limited
  • Result: 99.99% availability even during peak hours

Real-World Impact:

  • Peak Traffic: 15% of global internet bandwidth during peak hours
  • Streaming Quality: Automatic quality adjustment based on network conditions
  • Availability: 99.99% uptime despite massive scale
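
The quality-degradation step can be as simple as picking the highest rung of a bitrate ladder that fits the measured bandwidth. A sketch with an illustrative ladder (real ladders are tuned per title):

```python
# Illustrative bitrate ladder in kbps, lowest to highest quality.
BITRATE_LADDER = [235, 750, 1750, 3000, 5800, 15_000]

def pick_bitrate(measured_bandwidth_kbps: float, headroom: float = 0.8) -> int:
    """Choose the highest bitrate that fits within a safety margin of the
    measured bandwidth; degrade quality rather than stall playback."""
    budget = measured_bandwidth_kbps * headroom
    candidates = [b for b in BITRATE_LADDER if b <= budget]
    return max(candidates) if candidates else BITRATE_LADDER[0]

print(pick_bitrate(4000))   # 3000 -> degrade to a 3000 kbps stream
print(pick_bitrate(200))    # 235  -> lowest rung rather than failing outright
```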

Example 4: GitHub Availability (99.95% SLA)


Company: GitHub (Microsoft)

Scenario: GitHub hosts millions of repositories and serves millions of developers. Downtime impacts productivity and developer workflows globally.

Implementation: Uses active-active replication and read replicas:


Why This Works:

  • Active-Active: Multiple primary databases can handle writes
  • Read Replicas: Distribute read load across multiple regions
  • Automatic Failover: Primary failure triggers automatic promotion of replica
  • Result: 99.95% availability SLA

Real-World Impact:

  • Repositories: 100+ million repositories
  • Users: 100+ million developers
  • Uptime: 99.95% SLA with credits for downtime
  • Failover: < 60 seconds for automatic failover
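
A sketch of the routing pattern described above, not GitHub's actual implementation: writes go to the primary, reads fan out across replicas, and a replica can be promoted when the primary fails:

```python
import itertools

class Node(dict):
    """Toy storage node: a dict standing in for a database server."""

class ReplicatedStore:
    """Primary/replica routing: writes hit the primary, reads spread
    across replicas, and a replica is promoted on primary failure."""

    def __init__(self, primary: Node, replicas: list[Node]):
        self.primary = primary
        self.replicas = replicas
        self._next_replica = itertools.cycle(replicas)

    def write(self, key, value):
        self.primary[key] = value
        for r in self.replicas:        # replication shown synchronously for
            r[key] = value             # simplicity; real replicas usually
                                       # apply changes asynchronously

    def read(self, key):
        return next(self._next_replica).get(key)   # distribute read load

    def promote_replica(self):
        """On primary failure, promote a replica (< 60 s failover target)."""
        self.primary = self.replicas.pop(0)
        self._next_replica = itertools.cycle(self.replicas or [self.primary])

store = ReplicatedStore(Node(), [Node(), Node()])
store.write("repo:1", "main @ abc123")
print(store.read("repo:1"))
```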

Example 5: Stripe Payment Processing (99.99% Availability)


Company: Stripe

Scenario: Stripe processes billions of dollars in payments. Payment failures directly impact merchant revenue and customer trust.

Implementation: Uses multi-region active-active architecture:


Why This Works:

  • Multi-Region Active-Active: Both regions can process payments
  • Synchronous Replication: Critical payment data replicated synchronously
  • Automatic Failover: < 30 seconds failover time
  • Result: 99.99% availability with financial guarantees

Real-World Impact:

  • Transaction Volume: Billions of dollars processed monthly
  • Uptime: 99.99% availability SLA
  • Failover: < 30 seconds automatic failover
  • Financial Guarantees: Credits for downtime exceeding SLA
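
The key property is that a payment write is acknowledged only after every region has stored it. A toy sketch of synchronous replication with rollback on partial failure (illustrative, not Stripe's code):

```python
class Region:
    """Toy region: an in-memory store standing in for a regional database."""
    def __init__(self, name):
        self.name, self.records = name, {}
    def store(self, record):
        self.records[record["id"]] = record
    def delete(self, record):
        self.records.pop(record["id"], None)

def write_payment(record, regions):
    """Acknowledge the write only after every region has stored it
    (synchronous replication); undo partial writes on failure."""
    done = []
    for region in regions:
        try:
            region.store(record)
            done.append(region)
        except Exception:
            for r in done:           # roll back so no region holds an
                r.delete(record)     # unacknowledged payment
            raise
    return "acknowledged"

print(write_payment({"id": "pay_1", "amount": 500},
                    [Region("us-east"), Region("us-west")]))
```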

How availability concepts affect your class design:

When a primary service fails, automatically switch to a backup:
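
A minimal active-passive sketch; the service objects here are illustrative stand-ins:

```python
class FailoverClient:
    """Route requests to the primary, fall back to the backup on failure,
    and keep using the backup afterwards."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup
        self.using_backup = False

    def call(self, request):
        service = self.backup if self.using_backup else self.primary
        try:
            return service(request)
        except Exception:
            if self.using_backup:
                raise                    # backup failed too: surface the error
            self.using_backup = True     # fail over for this and future calls
            return self.backup(request)

def primary(req):
    raise ConnectionError("primary down")

def backup(req):
    return f"handled {req!r} on backup"

client = FailoverClient(primary, backup)
print(client.call("GET /orders"))   # fails over and returns the backup's response
```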

Continue serving users with reduced functionality when dependencies fail:
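
And a small decorator that swaps a failing nice-to-have dependency for a reduced fallback result (names are illustrative):

```python
from functools import wraps

def degrade_to(fallback):
    """If the wrapped dependency call fails, return a reduced result from
    `fallback` instead of propagating the error."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return fallback(*args, **kwargs)
        return wrapper
    return decorator

@degrade_to(lambda user_id: [])   # nice-to-have: empty list on failure
def recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")

print(recommendations(42))   # -> [] instead of an error page
```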



Now that you understand availability concepts, let’s dive into how systems stay consistent across replicas:

Next up: Replication Strategies — Learn how data is replicated for both availability and performance.