Health Checks & Heartbeats

Know you're sick before your users tell you

Why Health Checks Matter

In distributed systems, things fail silently. A service might be running but:

Database connection is broken
Memory is exhausted
Thread pool is full
External API is unreachable

Health checks detect these issues before users do.

Two Types: Liveness vs Readiness

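Liveness and readiness checks answer different questions and trigger different responses, so implement them as separate endpoints. Orchestrators such as Kubernetes add a third check, the startup probe, for applications that are slow to initialize.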

The Key Difference

Check	Question	Failure Response	Frequency
Liveness	“Is it alive?”	Restart container	Every 10-30s
Readiness	“Can it serve traffic?”	Remove from load balancer	Every 5-10s
Startup	“Is it initialized?”	Keep waiting; restart if startup never completes	During startup
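As a minimal sketch of the split, the handler below exposes one endpoint per question using only the Python standard library. The paths `/livez` and `/readyz` and the `ready` flag are illustrative assumptions, not names the article prescribes.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler

# Hypothetical readiness flag: set once startup work has finished,
# cleared whenever a required dependency is lost.
ready = threading.Event()


def liveness():
    """Liveness: no dependency checks, answer immediately."""
    return 200, {"status": "alive"}


def readiness():
    """Readiness: a 503 asks the load balancer to stop routing here."""
    if ready.is_set():
        return 200, {"status": "ready"}
    return 503, {"status": "not ready"}


# Assumed paths; any pair of distinct URLs works.
ROUTES = {"/livez": liveness, "/readyz": readiness}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        probe = ROUTES.get(self.path)
        code, payload = probe() if probe else (404, {"error": "not found"})
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the access log
```

To serve it, pass the handler to `http.server.HTTPServer(("0.0.0.0", 8080), HealthHandler)` and call `serve_forever()`; the port is arbitrary. The application calls `ready.set()` once initialization is done.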

Shallow vs Deep Health Checks

What Each Check Should Test

Check Type	What to Test	Example
Shallow (Liveness)	Process alive, not deadlocked	Return 200 immediately
Database	Can connect and query	`SELECT 1` completes < 100ms
Cache	Can read and write	Write key, read back
Queue	Can connect to message broker	Check connection status
External Service	Dependency is reachable	Hit their `/health` endpoint
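The database and cache rows of the table might look like the following sketch. It uses `sqlite3` as a stand-in for whatever database driver the service really uses, and treats the cache as any dict-like object; both choices, the function names, and the probe key are assumptions made for illustration.

```python
import sqlite3
import time


def check_database(conn, max_latency_ms=100.0):
    """Run SELECT 1 and report status plus measured latency."""
    start = time.perf_counter()
    try:
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error as exc:
        return {"status": "unhealthy", "message": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    # A slow but successful query counts as degraded, not unhealthy.
    status = "healthy" if latency_ms < max_latency_ms else "degraded"
    return {"status": status, "latency_ms": round(latency_ms, 2)}


def check_cache(cache, key="health:probe"):
    """Write a unique token, read it back, then clean up."""
    token = str(time.time_ns())
    start = time.perf_counter()
    try:
        cache[key] = token
        round_trip_ok = cache.get(key) == token
        cache.pop(key, None)
    except Exception as exc:  # any client error means the cache is unusable
        return {"status": "unhealthy", "message": str(exc)}
    if not round_trip_ok:
        return {"status": "unhealthy", "message": "read-back mismatch"}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 2)}
```

Both functions return a small dictionary rather than raising, so a caller can collect results from every component even when one of them fails.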

Implementing Health Check Endpoints

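A well-designed health check system keeps two concerns separate: a cheap liveness endpoint that never touches dependencies, and a deeper endpoint that reports the state of each component.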

Response Format

A good deep health check response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.2.3",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 5.2
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 1.1
    },
    "payment-service": {
      "status": "degraded",
      "latency_ms": 450,
      "message": "High latency detected"
    }
  }
}

Status Levels

Status	Meaning	HTTP Code
healthy	All systems go	200
degraded	Working but impaired	200 (with warning)
unhealthy	Cannot serve traffic	503
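One way to turn per-component results into an overall status and HTTP code is sketched below. It assumes a "worst component wins" policy, which is only one option: a service could instead ignore non-critical dependencies and keep reporting healthy overall. The `aggregate` function and the two lookup tables are hypothetical names.

```python
from datetime import datetime, timezone

# Ordering used to pick the worst status among components.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
# Mapping from the Status Levels table.
HTTP_CODE = {"healthy": 200, "degraded": 200, "unhealthy": 503}


def aggregate(components, version="1.2.3"):
    """Return (http_code, response_body) for a dict of component results."""
    worst = max(
        (c["status"] for c in components.values()),
        key=SEVERITY.__getitem__,
        default="healthy",  # no components registered
    )
    body = {
        "status": worst,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "components": components,
    }
    return HTTP_CODE[worst], body
```

With one degraded dependency the call still returns 200, so the instance keeps serving while the response body gives operators something to alert on.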

The Heartbeat Pattern

For distributed systems with multiple nodes, components need to know if their peers are alive. Heartbeats are periodic signals saying “I’m still here.”

Heartbeat State Machine

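A node that tracks heartbeats moves each peer through three states: alive while heartbeats arrive on schedule, suspect once the suspect timeout passes without one, and dead once the dead timeout passes. A fresh heartbeat from a suspect peer returns it to alive.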

Heartbeat Timing

Parameter	Typical Value	Purpose
Interval	5 seconds	How often to send
Suspect Timeout	10 seconds	When to start worrying
Dead Timeout	15-30 seconds	When to declare dead
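The timing parameters can be sketched as a small monitor that derives a peer's state from how long it has been silent. The class name, the state strings, and the injectable `clock` parameter are assumptions for illustration; the default timeouts come from the table above.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and classifies each node's state."""

    def __init__(self, suspect_after=10.0, dead_after=30.0, clock=time.monotonic):
        self.suspect_after = suspect_after
        self.dead_after = dead_after
        self._clock = clock  # injectable so tests can control time
        self._last_seen = {}

    def record(self, node_id):
        """Call whenever a heartbeat arrives from node_id."""
        self._last_seen[node_id] = self._clock()

    def state(self, node_id):
        last = self._last_seen.get(node_id)
        if last is None:
            return "unknown"  # never heard from this node
        silence = self._clock() - last
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"
```

Using a monotonic clock avoids false timeouts when the wall clock is adjusted. In this sketch a heartbeat from a node already classified as dead makes it alive again; a real cluster may instead require the node to rejoin explicitly.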

Best Practices

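1. Don’t Check Dependencies in Liveness

A failed liveness check restarts the container, and a restart cannot fix a database or external API that is down. If liveness depends on them, one dependency outage can send every instance into a restart loop. Keep dependency checks in readiness.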

2. Set Appropriate Timeouts

Probe	Recommended Timeout	Why
Liveness	1-5 seconds	Should be instant
Readiness	5-10 seconds	Dependencies may be slow
Startup	30-300 seconds	Initial load can take time

3. Use Proper HTTP Status Codes

Scenario	HTTP Code	Meaning
Everything healthy	200	Keep serving traffic
Degraded but working	200	Serve but alert operators
Cannot serve requests	503	Remove from load balancer
Check timed out	503	Treat as unhealthy
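The last row, treating a timed-out check as unhealthy, can be sketched with a thread pool from the standard library. The helper name `run_with_timeout` and its return shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CheckTimeout

# Shared pool so that slow checks do not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check, timeout_s):
    """Run a zero-argument check; map its outcome to (http_code, message)."""
    future = _executor.submit(check)
    try:
        healthy = future.result(timeout=timeout_s)
    except CheckTimeout:
        # The worker thread keeps running; only the wait is abandoned.
        return 503, "check timed out"
    except Exception as exc:
        return 503, f"check failed: {exc}"
    return (200, "ok") if healthy else (503, "check reported unhealthy")
```

Because Python threads cannot be cancelled, a hung check continues to occupy a worker after the timeout. Checks should therefore also set their own socket or query timeouts.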

Real-World Examples

Example 1: Kubernetes Health Checks

Company: Google (original developer of Kubernetes); the project is now maintained under the Cloud Native Computing Foundation

Scenario: Kubernetes uses liveness and readiness probes to automatically restart unhealthy containers and route traffic only to pods that are ready.

Implementation: Each container can define three probes: liveness, readiness, and startup.

Why Three Types?

Liveness: Detects deadlocks and restarts the container when needed
Readiness: Keeps traffic away from pods that are not ready
Startup: Handles slow-starting applications
Result: Automatic recovery and traffic management

Real-World Impact:

Scale: Millions of pods managed globally
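Recovery: Automatic restart roughly 30 seconds after a container stops responding, with the default probe settings (10-second period, failure threshold of 3)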
Availability: 99.9%+ pod availability

Example 2: AWS ELB Health Checks

Company: Amazon Web Services

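Scenario: Elastic Load Balancing (ELB) performs health checks on EC2 instances and routes traffic only to instances that pass.

Implementation: The load balancer probes each registered instance at a fixed interval and changes the instance's state only after a configured number of consecutive successes or failures.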

Why Health Checks?

Traffic Routing: Only healthy instances receive traffic
Automatic Recovery: Re-add instances when healthy
High Availability: Survives instance failures
Result: High availability; the AWS service level agreement for Elastic Load Balancing commits to 99.99% monthly uptime

Real-World Impact:

Scale: Millions of instances behind ELBs
Check Frequency: Every 30 seconds by default (configurable)
Recovery: Automatic re-addition when healthy
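Load balancers generally avoid flapping by requiring several consecutive results before changing an instance's state. The sketch below models that idea in plain Python. It is not the AWS API, and the class name, parameter names, and threshold values are illustrative assumptions.

```python
class TargetHealth:
    """Flips in/out of service only after consecutive passes or failures."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.in_service = False  # new targets start out of rotation
        self._passes = 0
        self._failures = 0

    def observe(self, check_passed):
        """Record one health check result; return whether to route traffic."""
        if check_passed:
            self._passes += 1
            self._failures = 0  # a pass breaks any failure streak
            if not self.in_service and self._passes >= self.healthy_threshold:
                self.in_service = True
        else:
            self._failures += 1
            self._passes = 0  # a failure breaks any pass streak
            if self.in_service and self._failures >= self.unhealthy_threshold:
                self.in_service = False
        return self.in_service
```

A single failed check between passes leaves the target in service, which is the point of the thresholds: one dropped packet should not remove a healthy instance.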

LLD ↔ HLD Connection

Health Check Concept	LLD Implementation
Health Checker Interface	Strategy pattern — different checks for different components
Aggregated Health	Composite pattern — combining multiple checkers
State Changes	Observer pattern — notify when health changes
Heartbeat Thread	Daemon thread pattern — background periodic task
Component Checks	Dependency injection — pass dependencies for testing
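A compact sketch of how several of these patterns can fit together follows. It covers the strategy, composite, observer, and dependency injection rows only (the heartbeat thread is left out), and every class and method name is hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Protocol


class HealthCheck(Protocol):
    """Strategy interface: every checker exposes the same check() method."""

    name: str

    def check(self) -> bool: ...


class CallableCheck:
    """Concrete strategy. The probe is injected, so tests can pass a fake."""

    def __init__(self, name: str, probe: Callable[[], bool]):
        self.name = name
        self._probe = probe

    def check(self) -> bool:
        try:
            return bool(self._probe())
        except Exception:
            return False  # a crashing probe counts as unhealthy


class CompositeCheck:
    """Composite: healthy only if every child is. Notifies observers on change."""

    def __init__(self, name: str, children: Iterable[HealthCheck]):
        self.name = name
        self._children = list(children)
        self._listeners: List[Callable[[str, bool, bool], None]] = []
        self._last: Optional[bool] = None

    def subscribe(self, listener: Callable[[str, bool, bool], None]) -> None:
        self._listeners.append(listener)

    def check(self) -> bool:
        result = all(child.check() for child in self._children)
        # Observer: fire only on a real transition, not on the first result.
        if self._last is not None and result != self._last:
            for listener in self._listeners:
                listener(self.name, self._last, result)
        self._last = result
        return result
```

Because `CompositeCheck` itself satisfies the `HealthCheck` interface, composites can be nested, for example one per subsystem under a single service-level check.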

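Key Takeaways

Separate liveness from readiness: a liveness failure restarts the container, a readiness failure only removes it from the load balancer
Keep liveness shallow and put dependency checks in readiness or deep health checks
Return 200 for healthy or degraded states and 503 for unhealthy or timed-out checks
Use heartbeats with suspect and dead timeouts so nodes can track whether their peers are alive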

What’s Next?

You’ve completed the Reliability & Availability section! You now understand:

Availability patterns and SLAs
Replication strategies
Fault tolerance techniques
Health monitoring

Continue exploring: Check out the next section on Consistency & Distributed Transactions to learn how data stays consistent across distributed systems.
