LowLevelDesign Mastery

Health Checks & Heartbeats

Know you're sick before your users tell you

In distributed systems, things fail silently. A service might be running but:

  • Database connection is broken
  • Memory is exhausted
  • Thread pool is full
  • External API is unreachable

Health checks detect these issues before users do.


Liveness, readiness, and startup checks serve different purposes and should be implemented separately:

| Check | Question | Failure Response | Frequency |
|---|---|---|---|
| Liveness | "Is it alive?" | Restart container | Every 10-30s |
| Readiness | "Can it serve traffic?" | Remove from load balancer | Every 5-10s |
| Startup | "Is it initialized?" | Wait for completion | During startup |
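To make the separation concrete, here is a minimal sketch of three independent probe handlers. The class name `ProbeEndpoints`, its fields, and the response strings are hypothetical choices for illustration; a real service would expose these through its web framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class ProbeEndpoints:
    """Hypothetical holder for the three probe handlers (illustrative only)."""

    # Readiness dependencies: name -> zero-argument callable returning True when usable.
    dependencies: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    # Flipped to True once startup work (config load, warm-up) has finished.
    initialized: bool = False

    def liveness(self) -> Tuple[int, str]:
        # Shallow on purpose: reaching this line proves the process can respond.
        return 200, "alive"

    def startup(self) -> Tuple[int, str]:
        return (200, "started") if self.initialized else (503, "starting")

    def readiness(self) -> Tuple[int, str]:
        if not self.initialized:
            return 503, "not initialized"
        failed = [name for name, check in self.dependencies.items() if not check()]
        if failed:
            return 503, "not ready: " + ", ".join(failed)
        return 200, "ready"
```

Note that liveness never touches a dependency. If it did, a database outage would cause every container to be restarted, which does nothing to fix the database.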

| Check Type | What to Test | Example |
|---|---|---|
| Shallow (Liveness) | Process alive, not deadlocked | Return 200 immediately |
| Database | Can connect and query | `SELECT 1` completes in under 100 ms |
| Cache | Can read and write | Write key, read back |
| Queue | Can connect to message broker | Check connection status |
| External Service | Dependency is reachable | Hit their /health endpoint |
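Two of the checks in the table can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for a real database and a plain dict as a stand-in for a cache client. The function names and the probe key are assumptions.

```python
import sqlite3
import time
from typing import Dict, Tuple


def check_database(conn: sqlite3.Connection, budget_ms: float = 100.0) -> Tuple[bool, float]:
    """Run SELECT 1 and return (healthy, latency_ms)."""
    start = time.perf_counter()
    try:
        row = conn.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        # Any driver error (closed connection, locked file, ...) counts as unhealthy.
        return False, (time.perf_counter() - start) * 1000
    latency_ms = (time.perf_counter() - start) * 1000
    # Healthy only if the query answered correctly and within the latency budget.
    return row == (1,) and latency_ms < budget_ms, latency_ms


def check_cache(cache: Dict[str, str]) -> bool:
    """Write a probe key, read it back, then remove it."""
    key, value = "__health_probe__", str(time.time_ns())
    try:
        cache[key] = value
        return cache.get(key) == value
    finally:
        cache.pop(key, None)  # never leave probe data behind
```

Both checks exercise the real code path (a query, a write plus a read) rather than only confirming that a connection object exists.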

A well-designed health check system keeps a clear separation between the individual component checks and the endpoint that aggregates their results.

A good deep health check response:

    {
      "status": "healthy",
      "timestamp": "2024-01-15T10:30:00Z",
      "version": "1.2.3",
      "components": {
        "database": {
          "status": "healthy",
          "latency_ms": 5.2
        },
        "cache": {
          "status": "healthy",
          "latency_ms": 1.1
        },
        "payment-service": {
          "status": "degraded",
          "latency_ms": 450,
          "message": "High latency detected"
        }
      }
    }

| Status | Meaning | HTTP Code |
|---|---|---|
| healthy | All systems go | 200 |
| degraded | Working but impaired | 200 (with warning) |
| unhealthy | Cannot serve traffic | 503 |
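A response of this shape could be assembled roughly as below. The names `aggregate_health`, `Checker`, `SEVERITY`, and `HTTP_CODE` are hypothetical. The rule that the overall status equals the worst component status is also an assumption: the sample response above reports an overall `healthy` despite a degraded component, so the original design may treat some dependencies as non-critical.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple

# A checker returns (status, latency_ms); status is "healthy", "degraded" or "unhealthy".
Checker = Callable[[], Tuple[str, float]]

SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
HTTP_CODE = {"healthy": 200, "degraded": 200, "unhealthy": 503}


def aggregate_health(checkers: Dict[str, Checker], version: str) -> Tuple[int, str]:
    """Run every checker and return (http_code, json_body)."""
    components = {}
    overall = "healthy"
    for name, checker in checkers.items():
        try:
            status, latency_ms = checker()
            components[name] = {"status": status, "latency_ms": latency_ms}
        except Exception as exc:
            # A checker that raises is reported as unhealthy rather than crashing the endpoint.
            status = "unhealthy"
            components[name] = {"status": status, "message": str(exc)}
        if SEVERITY[status] > SEVERITY[overall]:
            overall = status  # assumption: overall status is the worst component status
    body = {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "components": components,
    }
    return HTTP_CODE[overall], json.dumps(body, indent=2)
```

Calling `aggregate_health({"database": lambda: ("healthy", 5.2)}, "1.2.3")` returns a 200 code with a body in the format shown above.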

For distributed systems with multiple nodes, components need to know if their peers are alive. Heartbeats are periodic signals saying “I’m still here.”


When tracking heartbeats, each node moves through states (alive, suspect, dead) according to how long it has been silent:
| Parameter | Typical Value | Purpose |
|---|---|---|
| Interval | 5 seconds | How often to send |
| Suspect Timeout | 10 seconds | When to start worrying |
| Dead Timeout | 15-30 seconds | When to declare dead |
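The timeouts in the table map onto a small state tracker. The sketch below is one possible shape, not a definitive implementation: the class `HeartbeatMonitor`, its method names, and the `unknown` state for never-seen nodes are assumptions. The optional `now` argument exists only to make the logic easy to test.

```python
import time
from typing import Dict, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and derives alive/suspect/dead."""

    def __init__(self, suspect_after: float = 10.0, dead_after: float = 30.0) -> None:
        self.suspect_after = suspect_after  # seconds of silence before "suspect"
        self.dead_after = dead_after        # seconds of silence before "dead"
        self._last_seen: Dict[str, float] = {}

    def record(self, node: str, now: Optional[float] = None) -> None:
        """Call whenever a heartbeat arrives from `node`."""
        self._last_seen[node] = time.monotonic() if now is None else now

    def state(self, node: str, now: Optional[float] = None) -> str:
        if node not in self._last_seen:
            return "unknown"
        now = time.monotonic() if now is None else now
        silence = now - self._last_seen[node]
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"
```

With a 5-second interval, a 10-second suspect timeout tolerates one lost heartbeat before raising suspicion. A monotonic clock is used so that wall-clock adjustments cannot make a node appear dead.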

| Probe | Recommended Timeout | Why |
|---|---|---|
| Liveness | 1-5 seconds | Should be instant |
| Readiness | 5-10 seconds | Dependencies may be slow |
| Startup | 30-300 seconds | Initial load can take time |

| Scenario | HTTP Code | Meaning |
|---|---|---|
| Everything healthy | 200 | Keep serving traffic |
| Degraded but working | 200 | Serve but alert operators |
| Cannot serve requests | 503 | Remove from load balancer |
| Check timed out | 503 | Treat as unhealthy |
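The "check timed out means 503" rule can be enforced on the server side by running each check under a deadline. The following is a minimal sketch using the standard `concurrent.futures` module. The function name `run_check` and the pool size are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Small shared pool so that a hung dependency cannot block the health endpoint itself.
_executor = ThreadPoolExecutor(max_workers=4)


def run_check(check: Callable[[], bool], timeout_s: float) -> int:
    """Return the HTTP code for one check, treating timeouts and errors as unhealthy."""
    future = _executor.submit(check)
    try:
        healthy = future.result(timeout=timeout_s)
    except FutureTimeout:
        return 503  # check timed out: treat as unhealthy
    except Exception:
        return 503  # check raised: treat as unhealthy
    return 200 if healthy else 503
```

One caveat of this approach: a timed-out check keeps running in its worker thread until it returns, so the checks themselves should also set their own socket or query timeouts.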

Company: Google (original creator of Kubernetes), with the project now maintained under the Cloud Native Computing Foundation

Scenario: Kubernetes uses liveness, readiness, and startup probes to automatically restart unhealthy containers and to route traffic only to pods that are ready.

Implementation: Uses three types of health checks: liveness, readiness, and startup probes.

Why Three Types?

  • Liveness: Detects deadlocks, restart if needed
  • Readiness: Prevents traffic to unready pods
  • Startup: Handles slow-starting applications
  • Result: Automatic recovery and traffic management

Real-World Impact:

  • Scale: Millions of pods managed globally
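  • Recovery: Automatic restart once the liveness probe fails several times in a row; with the default 10-second period and failure threshold of 3, detection takes roughly 30 seconds
  • Availability: Failed containers are replaced without operator intervention; the availability actually achieved depends on the workload and its probe settings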

Company: Amazon Web Services

Scenario: Elastic Load Balancing (ELB) performs health checks on registered targets such as EC2 instances and routes traffic only to the healthy ones.

Implementation: Uses periodic health checks against each registered instance.

Why Health Checks?

  • Traffic Routing: Only healthy instances receive traffic
  • Automatic Recovery: Re-add instances when healthy
  • High Availability: Survives instance failures
  • Result: Instance failures are hidden from clients, which supports availability targets such as 99.99%

Real-World Impact:

  • Scale: Millions of instances behind ELBs
  • Check Frequency: Every 30 seconds by default (configurable)
  • Recovery: Automatic re-addition when healthy
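Load balancers typically avoid flapping by requiring several consecutive results before changing a target's state. The sketch below illustrates that idea with a healthy threshold and an unhealthy threshold. The class name `TargetHealth` and the threshold values are illustrative assumptions, not AWS defaults or AWS code.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Consecutive-result thresholds for one load balancer target (illustrative)."""

    healthy_threshold: int = 3    # consecutive passes needed to re-add a target
    unhealthy_threshold: int = 2  # consecutive failures needed to remove a target
    in_service: bool = True
    _streak: int = 0              # consecutive results that disagree with the current state

    def observe(self, check_passed: bool) -> bool:
        """Feed one health check result; return whether the target receives traffic."""
        if check_passed == self.in_service:
            self._streak = 0  # result agrees with the current state, so reset the streak
            return self.in_service
        self._streak += 1
        needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.in_service = check_passed
            self._streak = 0
        return self.in_service
```

Using a higher threshold for re-adding than for removing is a conservative choice: a target is pulled quickly when it fails but must prove itself before it receives traffic again.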

| Health Check Concept | LLD Implementation |
|---|---|
| Health Checker Interface | Strategy pattern: different checks for different components |
| Aggregated Health | Composite pattern: combining multiple checkers |
| State Changes | Observer pattern: notify when health changes |
| Heartbeat Thread | Daemon thread pattern: background periodic task |
| Component Checks | Dependency injection: pass dependencies for testing |
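As an illustration of the Observer row, the sketch below notifies subscribers only when a component's status actually changes. The class `HealthSubject` and the listener signature are hypothetical names chosen for this example.

```python
from typing import Callable, List

# A listener receives (component, old_status, new_status).
Listener = Callable[[str, str, str], None]


class HealthSubject:
    """Holds one component's status and notifies listeners on transitions."""

    def __init__(self, component: str, initial: str = "healthy") -> None:
        self.component = component
        self.status = initial
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def update(self, new_status: str) -> None:
        if new_status == self.status:
            return  # no transition, so nobody is notified
        old, self.status = self.status, new_status
        for listener in self._listeners:
            listener(self.component, old, new_status)
```

Notifying on transitions rather than on every check keeps alerting quiet during a long outage: operators hear about the failure once and about the recovery once.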


You’ve completed the Reliability & Availability section! You now understand:

  • Availability patterns and SLAs
  • Replication strategies
  • Fault tolerance techniques
  • Health monitoring

Continue exploring: Check out the next section on Consistency & Distributed Transactions to learn how data stays consistent across distributed systems.