Health Checks & Heartbeats
Know you're sick before your users tell you
Why Health Checks Matter
In distributed systems, things fail silently. A service might be running but:
- Database connection is broken
- Memory is exhausted
- Thread pool is full
- External API is unreachable
Health checks detect these issues before users do.
Two Types: Liveness vs Readiness
Liveness and readiness checks serve different purposes and should be implemented separately.
The Key Difference
| Check | Question | Failure Response | Frequency |
|---|---|---|---|
| Liveness | “Is it alive?” | Restart container | Every 10-30s |
| Readiness | “Can it serve traffic?” | Remove from load balancer | Every 5-10s |
| Startup | “Is it initialized?” | Wait for completion | During startup |
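To make the split concrete, here is a minimal sketch using only Python's standard library. The paths `/livez` and `/readyz`, the `READY` flag, and the `probe_response` helper are illustrative assumptions, not names from any particular framework. The liveness path answers immediately, while the readiness path reports whether the instance should receive traffic.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical readiness flag; a real service would derive this from
# its dependency checks rather than a module-level dict.
READY = {"value": False}


def probe_response(path: str) -> tuple[int, str]:
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/livez":
        # Liveness: answer immediately, without touching dependencies.
        return 200, "alive"
    if path == "/readyz":
        # Readiness: 503 signals that traffic should be routed elsewhere.
        return (200, "ready") if READY["value"] else (503, "not ready")
    return 404, "unknown probe"


class ProbeHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around probe_response."""

    def do_GET(self) -> None:
        status, body = probe_response(self.path)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Passing `ProbeHandler` to `http.server.HTTPServer` would expose both probes. Keeping the decision logic in a plain function makes it testable without opening a socket.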
Shallow vs Deep Health Checks
What Each Check Should Test
| Check Type | What to Test | Example |
|---|---|---|
| Shallow (Liveness) | Process alive, not deadlocked | Return 200 immediately |
| Database | Can connect and query | SELECT 1 completes < 100ms |
| Cache | Can read and write | Write key, read back |
| Queue | Can connect to message broker | Check connection status |
| External Service | Dependency is reachable | Hit their /health endpoint |
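As an illustration of the database row, the sketch below runs `SELECT 1` and compares the measured latency with the 100 ms budget. It uses an in-memory SQLite connection as a stand-in for a real database. The function name `check_database` and the choice to report a slow but successful query as `degraded` are assumptions made for this example.

```python
import sqlite3
import time


def check_database(conn: sqlite3.Connection, max_latency_ms: float = 100.0) -> dict:
    """Run SELECT 1 and report a status plus the measured latency."""
    start = time.perf_counter()
    try:
        row = conn.execute("SELECT 1").fetchone()
    except sqlite3.Error as exc:
        # Connection closed, database locked, and similar failures.
        return {"status": "unhealthy", "message": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    if row != (1,):
        return {"status": "unhealthy", "message": "unexpected query result"}
    # Assumption: a slow but successful query counts as degraded.
    status = "healthy" if latency_ms < max_latency_ms else "degraded"
    return {"status": status, "latency_ms": round(latency_ms, 1)}


# In-memory SQLite stands in for the real database here.
conn = sqlite3.connect(":memory:")
print(check_database(conn))
```

The same shape (time the call, catch the driver's error type, compare against a budget) carries over to cache and queue checks.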
Implementing Health Check Endpoints
A well-designed health check system keeps its shallow liveness check clearly separate from its deeper readiness and dependency checks.
Response Format
A good deep health check response:
```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.2.3",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 5.2
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 1.1
    },
    "payment-service": {
      "status": "degraded",
      "latency_ms": 450,
      "message": "High latency detected"
    }
  }
}
```
Status Levels
| Status | Meaning | HTTP Code |
|---|---|---|
| healthy | All systems go | 200 |
| degraded | Working but impaired | 200 (with warning) |
| unhealthy | Cannot serve traffic | 503 |
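One way to turn per-component results into an overall status and HTTP code is sketched below. The `aggregate` function and the "worst status wins" rule are assumptions for illustration. A service may prefer a more lenient rule, for example keeping the top-level status `healthy` when only an optional dependency is degraded.

```python
import json
from datetime import datetime, timezone

# Higher number means worse; the mapping to HTTP codes follows the table above.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
HTTP_CODE = {"healthy": 200, "degraded": 200, "unhealthy": 503}


def aggregate(components: dict, version: str) -> tuple[int, dict]:
    """Combine component results into (HTTP code, response body)."""
    # Assumed policy: the worst component status becomes the overall status.
    overall = max(
        (c["status"] for c in components.values()),
        key=SEVERITY.__getitem__,
        default="healthy",
    )
    body = {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "components": components,
    }
    return HTTP_CODE[overall], body


code, body = aggregate(
    {
        "database": {"status": "healthy", "latency_ms": 5.2},
        "payment-service": {"status": "degraded", "latency_ms": 450},
    },
    version="1.2.3",
)
print(code)
print(json.dumps(body, indent=2))
```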
The Heartbeat Pattern
For distributed systems with multiple nodes, components need to know if their peers are alive. Heartbeats are periodic signals saying “I’m still here.”
Heartbeat State Machine
When tracking heartbeats, a node moves through states as its silence grows: from alive, to suspected, to dead.
Heartbeat Timing
| Parameter | Typical Value | Purpose |
|---|---|---|
| Interval | 5 seconds | How often to send |
| Suspect Timeout | 10 seconds | When to start worrying |
| Dead Timeout | 15-30 seconds | When to declare dead |
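The timing parameters above can be applied with a small tracker that records the last heartbeat per node and classifies each node by how long it has been silent. This is a sketch under assumptions: the class name `HeartbeatMonitor`, the state names, and the injectable `clock` parameter are all hypothetical, and the defaults are taken from the table (with 30 seconds chosen from the 15-30 second range).

```python
import time
from typing import Callable, Dict


class HeartbeatMonitor:
    """Classify peers as alive, suspect, or dead by time since last heartbeat."""

    def __init__(
        self,
        suspect_after: float = 10.0,
        dead_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.suspect_after = suspect_after
        self.dead_after = dead_after
        self._clock = clock  # injectable so tests can control time
        self._last_seen: Dict[str, float] = {}

    def record(self, node_id: str) -> None:
        """Call whenever a heartbeat arrives from node_id."""
        self._last_seen[node_id] = self._clock()

    def state(self, node_id: str) -> str:
        last = self._last_seen.get(node_id)
        if last is None:
            return "unknown"  # never heard from this node
        silence = self._clock() - last
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"
```

With a 5 second sending interval, a 10 second suspect timeout tolerates one missed heartbeat before raising suspicion. A fresh heartbeat moves a suspect or dead node back to alive.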
Best Practices
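1. Don’t Check Dependencies in Liveness
A failed liveness check restarts the container, and a restart does not repair a broken database or an unreachable external API. Keep liveness shallow and put dependency checks in readiness.
2. Set Appropriate Timeouts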
| Probe | Recommended Timeout | Why |
|---|---|---|
| Liveness | 1-5 seconds | Should be instant |
| Readiness | 5-10 seconds | Dependencies may be slow |
| Startup | 30-300 seconds | Initial load can take time |
3. Use Proper HTTP Status Codes
| Scenario | HTTP Code | Meaning |
|---|---|---|
| Everything healthy | 200 | Keep serving traffic |
| Degraded but working | 200 | Serve but alert operators |
| Cannot serve requests | 503 | Remove from load balancer |
| Check timed out | 503 | Treat as unhealthy |
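The last row is easy to overlook: a check that hangs should produce a 503, not a hung health endpoint. The sketch below is one way to enforce that with the standard library. The helper name `run_with_timeout` and the pool size are assumptions for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so a hung check does not block the caller.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check, timeout_s: float) -> int:
    """Return 200 if check() returns truthy within timeout_s, else 503."""
    future = _executor.submit(check)
    try:
        healthy = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running in the background;
        # only the caller stops waiting for it.
        return 503
    except Exception:
        # A check that raises is treated as unhealthy too.
        return 503
    return 200 if healthy else 503


print(run_with_timeout(lambda: True, timeout_s=1.0))
print(run_with_timeout(lambda: time.sleep(0.5) or True, timeout_s=0.05))
```

Because a timed-out check still occupies a worker until it finishes, the checks themselves should also set their own connection and query timeouts.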
LLD ↔ HLD Connection
| Health Check Concept | LLD Implementation |
|---|---|
| Health Checker Interface | Strategy pattern — different checks for different components |
| Aggregated Health | Composite pattern — combining multiple checkers |
| State Changes | Observer pattern — notify when health changes |
| Heartbeat Thread | Daemon thread pattern — background periodic task |
| Component Checks | Dependency injection — pass dependencies for testing |
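A compact sketch of how several of these patterns might fit together follows. All class and method names here are hypothetical. `CallableChecker` plays the strategy role and receives its dependency by injection, `CompositeChecker` combines children behind the same interface, and subscribed listeners are notified only when the combined status changes.

```python
from typing import Callable, List, Optional, Protocol


class HealthChecker(Protocol):
    """Common interface shared by every checker."""
    name: str

    def check(self) -> str: ...


class CallableChecker:
    """Strategy: wraps an injected function that returns a status string."""

    def __init__(self, name: str, fn: Callable[[], str]) -> None:
        self.name = name
        self._fn = fn

    def check(self) -> str:
        try:
            return self._fn()
        except Exception:
            return "unhealthy"


class CompositeChecker:
    """Composite: combines child checkers and is itself a checker."""

    _ORDER = ["healthy", "degraded", "unhealthy"]

    def __init__(self, name: str, children: List[HealthChecker]) -> None:
        self.name = name
        self._children = list(children)
        self._listeners: List[Callable[[Optional[str], str], None]] = []
        self._last: Optional[str] = None

    def subscribe(self, listener: Callable[[Optional[str], str], None]) -> None:
        self._listeners.append(listener)

    def check(self) -> str:
        statuses = [child.check() for child in self._children]
        # Assumed policy: the worst child status wins.
        current = max(statuses, key=self._ORDER.index, default="healthy")
        if current != self._last:
            # Observer: notify only on a change of overall status.
            for listener in self._listeners:
                listener(self._last, current)
            self._last = current
        return current
```

Because each check is an injected function, tests can swap in a stub that returns any status without touching a real database or network.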
Key Takeaways
What’s Next?
You’ve completed the Reliability & Availability section! You now understand:
- ✅ Availability patterns and SLAs
- ✅ Replication strategies
- ✅ Fault tolerance techniques
- ✅ Health monitoring
Continue exploring: Check out the next section on Consistency & Distributed Transactions to learn how data stays consistent across distributed systems.