Health Checks & Heartbeats
Why Health Checks Matter
In distributed systems, things fail silently. A service might be running but:
- Database connection is broken
- Memory is exhausted
- Thread pool is full
- External API is unreachable
Health checks detect these issues before users do.
Two Types: Liveness vs Readiness
These serve different purposes and should be implemented separately (Kubernetes adds a third, the startup probe, for services that initialize slowly):
The Key Difference
| Check | Question | Failure Response | Frequency |
|---|---|---|---|
| Liveness | “Is it alive?” | Restart container | Every 10-30s |
| Readiness | “Can it serve traffic?” | Remove from load balancer | Every 5-10s |
| Startup | “Is it initialized?” | Wait for completion | During startup |
Shallow vs Deep Health Checks
A shallow check confirms only that the process itself is responsive; a deep check also exercises dependencies such as the database, cache, and downstream services. Deep checks catch more failures but cost more per probe, so reserve them for readiness rather than liveness.
What Each Check Should Test
| Check Type | What to Test | Example |
|---|---|---|
| Shallow (Liveness) | Process alive, not deadlocked | Return 200 immediately |
| Database | Can connect and query | SELECT 1 completes < 100ms |
| Cache | Can read and write | Write key, read back |
| Queue | Can connect to message broker | Check connection status |
| External Service | Dependency is reachable | Hit their /health endpoint |
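As a sketch of the database row above, here is what a deep check might look like in Go; the function name, the 100 ms budget, and the use of database/sql are illustrative choices, not a prescribed API:

```go
package health

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// CheckDatabase runs SELECT 1 under a 100ms deadline and reports how
// long it took. The caller opens db with whatever driver it uses.
func CheckDatabase(ctx context.Context, db *sql.DB) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()

	start := time.Now()
	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		return 0, fmt.Errorf("database check failed: %w", err)
	}
	return time.Since(start), nil
}
```

The cache and queue rows follow the same shape: perform the smallest operation that proves the dependency works, measure it, and enforce a deadline so a slow dependency cannot hang the health check itself.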
Implementing Health Check Endpoints
A well-designed health check system keeps the two concerns in separate endpoints, as in the sketch below:
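A minimal sketch of that separation using only the Go standard library; the /livez and /readyz paths and the stubbed dependency check are assumptions, not fixed conventions:

```go
package main

import (
	"log"
	"net/http"
)

// checkDatabase stands in for a real dependency check (see the deep
// check sketch above); it is a stub here.
func checkDatabase() error { return nil }

func main() {
	// Liveness is shallow on purpose: if the process can answer at all,
	// it is alive. No dependencies are touched.
	http.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness is deeper: fail it while a required dependency is down
	// so the load balancer stops routing traffic here.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if err := checkDatabase(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```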
Response Format
A good deep health check response:
{ "status": "healthy", "timestamp": "2024-01-15T10:30:00Z", "version": "1.2.3", "components": { "database": { "status": "healthy", "latency_ms": 5.2 }, "cache": { "status": "healthy", "latency_ms": 1.1 }, "payment-service": { "status": "degraded", "latency_ms": 450, "message": "High latency detected" } }}Status Levels
| Status | Meaning | HTTP Code |
|---|---|---|
| healthy | All systems go | 200 |
| degraded | Working but impaired | 200 (with warning) |
| unhealthy | Cannot serve traffic | 503 |
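A sketch of how a handler might map those statuses to HTTP codes; the Status type and function names here are invented for illustration:

```go
package health

import "net/http"

// Status is ordered from best to worst so aggregation can take the maximum.
type Status int

const (
	Healthy Status = iota
	Degraded
	Unhealthy
)

// Aggregate returns the worst status across all components.
func Aggregate(components map[string]Status) Status {
	worst := Healthy
	for _, s := range components {
		if s > worst {
			worst = s
		}
	}
	return worst
}

// HTTPCode maps a status to the codes in the table: degraded still
// returns 200 so the instance keeps serving while operators are alerted.
func HTTPCode(s Status) int {
	if s == Unhealthy {
		return http.StatusServiceUnavailable // 503: remove from rotation
	}
	return http.StatusOK // 200 for both healthy and degraded
}
```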
The Heartbeat Pattern
For distributed systems with multiple nodes, components need to know if their peers are alive. Heartbeats are periodic signals saying “I’m still here.”
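On the sending side, a heartbeat is just a periodic background task. A minimal Go sketch, where the emit callback (a UDP packet, an RPC, a key refresh in a shared store) is left as an assumption:

```go
package heartbeat

import (
	"context"
	"time"
)

// Send emits one heartbeat per interval until ctx is cancelled.
// emit is where a real system would write a packet or refresh a key.
func Send(ctx context.Context, interval time.Duration, emit func(at time.Time)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case t := <-ticker.C:
			emit(t)
		}
	}
}
```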
Heartbeat State Machine
When tracking heartbeats, each node moves through three states: ALIVE while heartbeats keep arriving on time, SUSPECT once the suspect timeout passes without one, and DEAD once the dead timeout passes; a fresh heartbeat moves the node back to ALIVE. A sketch of a tracker implementing these transitions follows the timing table below.
Heartbeat Timing
Section titled “Heartbeat Timing”| Parameter | Typical Value | Purpose |
|---|---|---|
| Interval | 5 seconds | How often to send |
| Suspect Timeout | 10 seconds | When to start worrying |
| Dead Timeout | 15-30 seconds | When to declare dead |
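Putting the states and timings together, here is a sketch of a receiving-side tracker; the type names are illustrative, and the 10 s and 15 s thresholds come from the table above:

```go
package heartbeat

import (
	"sync"
	"time"
)

type State int

const (
	Alive State = iota
	Suspect
	Dead
)

type Tracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewTracker() *Tracker {
	return &Tracker{lastSeen: make(map[string]time.Time)}
}

// Record is called whenever a heartbeat arrives from a node.
func (t *Tracker) Record(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[node] = time.Now()
}

// StateOf derives the node's state from the time since its last heartbeat.
// A node never heard from has a zero lastSeen and therefore reads as Dead.
func (t *Tracker) StateOf(node string) State {
	t.mu.Lock()
	defer t.mu.Unlock()
	elapsed := time.Since(t.lastSeen[node])
	switch {
	case elapsed > 15*time.Second:
		return Dead // dead timeout exceeded: declare dead, initiate failover
	case elapsed > 10*time.Second:
		return Suspect // suspect timeout: start worrying, don't act yet
	default:
		return Alive
	}
}
```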
Best Practices
1. Don’t Check Dependencies in Liveness
If the liveness probe checks the database and the database goes down, the orchestrator restarts every instance at once, turning one dependency outage into a fleet-wide restart storm that fixes nothing. Keep liveness shallow; put dependency checks in the readiness probe instead.
2. Set Appropriate Timeouts
| Probe | Recommended Timeout | Why |
|---|---|---|
| Liveness | 1-5 seconds | Should be instant |
| Readiness | 5-10 seconds | Dependencies may be slow |
| Startup | 30-300 seconds | Initial load can take time |
3. Use Proper HTTP Status Codes
| Scenario | HTTP Code | Meaning |
|---|---|---|
| Everything healthy | 200 | Keep serving traffic |
| Degraded but working | 200 | Serve but alert operators |
| Cannot serve requests | 503 | Remove from load balancer |
| Check timed out | 503 | Treat as unhealthy |
Real-World Examples
Example 1: Kubernetes Health Checks
Company: Google (Kubernetes), Cloud Native Computing Foundation
Scenario: Kubernetes uses liveness and readiness probes to automatically restart unhealthy pods and route traffic only to healthy instances.
Implementation: each container in a pod spec can declare three probe types, as in the sketch below:
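A sketch of how the three probes might appear in a container spec; the paths, port, and timing values are illustrative choices, not Kubernetes defaults:

```yaml
# Fragment of a container spec with all three probe types.
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10        # every 10s: "is it alive?"
  timeoutSeconds: 1        # liveness should answer instantly
  failureThreshold: 3      # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5         # every 5s: controls traffic, not restarts
  timeoutSeconds: 5        # dependencies may be slow
startupProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allow up to ~300s for slow initialization
```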
Why Three Types?
- Liveness: Detects deadlocks, restart if needed
- Readiness: Prevents traffic to unready pods
- Startup: Handles slow-starting applications
- Result: Automatic recovery and traffic management
Real-World Impact:
- Scale: Millions of pods managed globally
- Recovery: Automatic restart in < 30 seconds
- Availability: 99.9%+ pod availability
Example 2: AWS ELB Health Checks
Company: Amazon Web Services
Scenario: Elastic Load Balancer (ELB) performs health checks on EC2 instances to route traffic only to healthy instances.
Implementation: the load balancer periodically probes each registered target on a configured path, port, and interval; a target must pass a configured number of consecutive checks to be marked healthy, and fail a configured number to be taken out of rotation.
Why Health Checks?
- Traffic Routing: Only healthy instances receive traffic
- Automatic Recovery: Re-add instances when healthy
- High Availability: Survives instance failures
- Result: 99.99% availability
Real-World Impact:
- Scale: Millions of instances behind ELBs
- Check Frequency: Every 30 seconds
- Recovery: Automatic re-addition when healthy
LLD ↔ HLD Connection
| Health Check Concept | LLD Implementation |
|---|---|
| Health Checker Interface | Strategy pattern — different checks for different components |
| Aggregated Health | Composite pattern — combining multiple checkers |
| State Changes | Observer pattern — notify when health changes |
| Heartbeat Thread | Daemon thread pattern — background periodic task |
| Component Checks | Dependency injection — pass dependencies for testing |
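A sketch of the first two rows in Go: each component check is a Strategy behind a common interface, and a Composite aggregates them. All names here are invented for illustration:

```go
package health

import "context"

// Checker is the Strategy interface: one implementation per component
// (database, cache, queue, downstream service, ...).
type Checker interface {
	Name() string
	Check(ctx context.Context) error
}

// Composite combines many checkers into one (the Composite pattern).
// Injecting checkers also makes the aggregate easy to test with fakes.
type Composite struct {
	checkers []Checker
}

func (c *Composite) Add(ch Checker) {
	c.checkers = append(c.checkers, ch)
}

// Check runs every checker and reports per-component results; a fuller
// version would also record latencies, as in the JSON response above.
func (c *Composite) Check(ctx context.Context) map[string]error {
	results := make(map[string]error, len(c.checkers))
	for _, ch := range c.checkers {
		results[ch.Name()] = ch.Check(ctx)
	}
	return results
}
```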
Key Takeaways
- Liveness (“Is it alive?”) and readiness (“Can it serve traffic?”) answer different questions and deserve separate endpoints
- Keep liveness shallow; reserve deep dependency checks for readiness
- Return 200 for healthy and degraded states, 503 when an instance cannot serve traffic
- Heartbeats detect dead peers; keep send interval < suspect timeout < dead timeout
What’s Next?
You’ve completed the Reliability & Availability section! You now understand:
- Availability patterns and SLAs
- Replication strategies
- Fault tolerance techniques
- Health monitoring
Continue exploring: Check out the next section on Consistency & Distributed Transactions to learn how data stays consistent across distributed systems.