LowLevelDesign Mastery

Circuit Breaker Pattern

Preventing cascade failures by breaking the circuit

Without a circuit breaker, a failing service continues receiving requests. Each request times out after 30 seconds, consuming threads and resources. The failure spreads to other services as they wait for responses that never come. Eventually, the entire system crashes.

With a circuit breaker, a failing service is isolated immediately. When the circuit opens, requests are rejected instantly without waiting for timeouts. The failure is contained and doesn’t spread. Other services continue operating normally, protected from the cascade.

Cascade failures are one of the most dangerous failure modes in distributed systems. A single failing service can bring down an entire system if not properly isolated. The circuit breaker pattern is specifically designed to prevent this.



A circuit breaker is a design pattern that prevents cascade failures by stopping requests to failing services. It’s named after electrical circuit breakers that protect electrical systems from overload.

Think of an electrical circuit breaker in your house. During normal operation, electricity flows normally and everything works. When an overload is detected, the circuit breaker trips, stopping electricity flow to protect the wiring from damage. After some time, you can try again—if the problem is fixed, the circuit closes; if not, it stays open.

The circuit breaker pattern works the same way in software. During normal operation, requests flow through to the service. When failures exceed a threshold, the circuit opens, stopping requests to protect the system. After a timeout period, the circuit enters a half-open state to test if the service recovered. If the test succeeds, the circuit closes; if it fails, the circuit opens again.



Netflix developed Hystrix, one of the most famous circuit breaker implementations, to handle failures in their microservices architecture. Netflix has hundreds of microservices, and failures in one service can cascade to others if not properly handled.

The problem Netflix faced: When a downstream service failed, all requests to that service would timeout after 30 seconds. This consumed threads and resources, eventually causing the calling service to fail as well. The failure cascaded through the system.

The solution: Hystrix circuit breakers. When a service fails repeatedly, the circuit opens. Requests are rejected immediately without waiting for timeouts. The failing service gets time to recover. After a timeout period, the circuit enters half-open state to test recovery. If the test succeeds, the circuit closes and normal operation resumes.

The impact: Netflix’s system became much more resilient. Failures were isolated and contained. Services could continue operating even when dependencies failed. The system could handle partial outages gracefully.


A circuit breaker has three states: CLOSED, OPEN, and HALF-OPEN. Understanding these states and their transitions is crucial for implementing circuit breakers correctly.


In the CLOSED state, the circuit breaker allows requests to pass through normally. It monitors failures, tracking failure counts and error rates. If failures exceed a threshold, the circuit transitions to OPEN state.

Normal operation: Requests flow through to the service. Failures are tracked but don’t prevent requests. The service is healthy and responding normally.

Failure detection: The circuit breaker tracks consecutive failures or error rates. When failures exceed the threshold (e.g., 5 failures in 10 seconds), the circuit opens.
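The threshold check just described (e.g., 5 failures in 10 seconds) can be sketched as a sliding window of failure timestamps. This is an illustrative sketch, not any particular library's API; the class and parameter names are our own:

```python
import time
from collections import deque

class FailureWindow:
    """Tracks failure timestamps inside a sliding time window."""

    def __init__(self, max_failures=5, window_seconds=10.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures.append(now)

    def threshold_exceeded(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop failures that have fallen out of the window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        return len(self.failures) >= self.max_failures
```

When `threshold_exceeded` returns true, the breaker would transition CLOSED → OPEN. An alternative is to track an error *rate* (failures divided by total calls in the window) rather than a raw count.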

In the OPEN state, the circuit breaker rejects requests immediately without calling the service. This prevents resources from being consumed by requests that will fail anyway. The circuit stays open for a timeout period, then transitions to HALF-OPEN to test recovery.

Immediate rejection: Requests are rejected instantly, returning a fallback response or error. No timeouts, no resource consumption. The failing service is completely isolated.

Timeout period: The circuit stays open for a configurable timeout period (e.g., 60 seconds). This gives the failing service time to recover before testing again.

In the HALF-OPEN state, the circuit breaker allows a limited number of requests (typically one) to test if the service recovered. If the test request succeeds, the circuit closes. If it fails, the circuit opens again.

Testing recovery: A single request is allowed through to test if the service is healthy again. This is a cautious approach—if the service is still failing, only one request is wasted.

State transition: If the test succeeds, the circuit closes and normal operation resumes. If the test fails, the circuit opens again and the timeout period restarts.



Understanding state transitions is key to implementing circuit breakers correctly:

CLOSED → OPEN: This transition occurs when failures exceed the threshold. The threshold can be based on consecutive failures (e.g., 5 failures) or error rate (e.g., 50% errors). The circuit opens to protect the system.

OPEN → HALF-OPEN: This transition occurs after the timeout period elapses. The circuit is ready to test if the service recovered. Depending on the implementation, this happens automatically on a timer, or lazily when the first request arrives after the timeout has elapsed.

HALF-OPEN → CLOSED: This transition occurs when the test request succeeds. The service appears healthy, so normal operation resumes. The failure count is reset.

HALF-OPEN → OPEN: This transition occurs when the test request fails. The service is still failing, so the circuit opens again. The timeout period restarts.
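The four transitions above can be captured in a small pure function. This is a sketch of the state machine only (the state names and event strings here are assumptions, not a specific library's API):

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

def next_state(state, event, failure_count, failure_threshold, timeout_elapsed):
    """Return the next circuit state for an event: 'success', 'failure', or 'tick'."""
    if state is State.CLOSED and event == "failure" and failure_count >= failure_threshold:
        return State.OPEN        # CLOSED -> OPEN: failures exceeded the threshold
    if state is State.OPEN and event == "tick" and timeout_elapsed:
        return State.HALF_OPEN   # OPEN -> HALF-OPEN: timeout period elapsed
    if state is State.HALF_OPEN and event == "success":
        return State.CLOSED      # HALF-OPEN -> CLOSED: test request succeeded
    if state is State.HALF_OPEN and event == "failure":
        return State.OPEN        # HALF-OPEN -> OPEN: test request failed
    return state                 # every other event keeps the current state
```

Keeping the transition logic pure like this makes it easy to unit-test every edge of the state diagram in isolation.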



Circuit breakers have several configuration parameters that affect their behavior:

Failure threshold is the number of failures before opening the circuit. Too low, and the circuit opens too easily (false positives). Too high, and too many failures occur before opening. Common values are 5-10 failures.

Timeout is the time, in seconds, to wait before testing recovery (OPEN → HALF-OPEN). Too short, and the circuit tests too frequently. Too long, and the circuit stays open needlessly after the service has recovered. Common values are 30-120 seconds.

Success threshold (HALF-OPEN) is the number of successes needed to close the circuit. Typically one success is sufficient; requiring several consecutive successes is more conservative, closing the circuit only once the service has proven stable.
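These three parameters are naturally grouped into a single configuration object. A minimal sketch, with defaults mirroring the common values mentioned above (the field names are our own, not any library's):

```python
from dataclasses import dataclass

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5     # failures before CLOSED -> OPEN (common: 5-10)
    timeout_seconds: float = 60.0  # wait before OPEN -> HALF-OPEN (common: 30-120)
    success_threshold: int = 1     # successes in HALF-OPEN before closing (typically 1)
```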



When a circuit breaker is OPEN, requests are rejected. You need a fallback strategy to handle these rejections gracefully.

Return default value - Return a sensible default when the service is unavailable. For example, return an empty list when a recommendation service is down.

Return cached data - Serve stale but available data from cache. This is common for read-heavy services where slightly stale data is acceptable.

Return error response - Return a clear error message indicating the service is temporarily unavailable. This is honest and sets user expectations.

Queue for later - Queue requests for processing when the service recovers. This works for non-critical operations that can be delayed.

Real-world example: An e-commerce site’s payment service goes down. Instead of showing an error page, the site queues payment requests and shows a message: “Payment processing is temporarily unavailable. Your order has been saved and will be processed shortly.” This provides a better user experience than a complete failure.
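The strategies above can be wired in as a thin wrapper around the protected call. A hedged sketch; `fetch_recommendations` is a hypothetical failing service call, and in a real breaker you would catch the breaker's rejection exception rather than `Exception` broadly:

```python
def call_with_fallback(protected_call, fallback):
    """Invoke the protected call; on rejection or failure, fall back gracefully."""
    try:
        return protected_call()
    except Exception:
        # Circuit open, or the call itself failed: degrade instead of erroring out.
        return fallback()

# Example: a hypothetical recommendation service that is currently down.
def fetch_recommendations():
    raise ConnectionError("service unavailable")

# Return-default-value strategy: an empty recommendation list.
result = call_with_fallback(fetch_recommendations, fallback=lambda: [])
```

The same wrapper supports the other strategies by swapping the fallback: a cache lookup, a canned error response, or a function that enqueues the request for later.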


Circuit Breaker Implementation (Interview Focus)


Circuit breakers are a classic interview topic because they combine state machines, failure handling, and distributed systems concepts. Here’s how to implement one:
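A minimal single-threaded sketch in Python, combining the three states, fail-fast rejection, and lazy OPEN → HALF-OPEN recovery. The class and method names are our own, not Hystrix's or Resilience4j's API, and a production version would also need thread safety and sliding-window counting:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is OPEN and the request is rejected immediately."""

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold=5, timeout_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock            # injectable for testing
        self.state = self.CLOSED
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.timeout_seconds:
                self.state = self.HALF_OPEN   # timeout elapsed: allow one test request
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Success in CLOSED resets the count; success in HALF-OPEN closes the circuit.
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        # Any failure in HALF-OPEN, or too many in CLOSED, opens the circuit.
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = self.clock()
```

Usage: wrap each call to the downstream service in `breaker.call(...)` and catch `CircuitOpenError` to trigger your fallback strategy.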


Three States

CLOSED (normal), OPEN (failing), HALF-OPEN (testing). State transitions based on failure/success counts.

Fail Fast

When OPEN, reject requests immediately. No timeout wait. Prevents resource exhaustion.

Automatic Recovery

After timeout, test recovery in HALF-OPEN. If successful, close circuit. Automatic healing.

Prevent Cascades

Circuit breaker prevents cascade failures by isolating failing services. Critical for resilience.

Configure Carefully

Set failure threshold and timeout appropriately. Too low = false positives, too high = too many failures.

Fallback Strategy

Always provide fallback (cached data, default value, error response). Don’t leave users hanging.



  • “Release It!” by Michael Nygard - Circuit breaker pattern and production patterns
  • Netflix Hystrix - Circuit breaker library (now archived, but concepts remain)
  • Resilience4j - Modern Java resilience library
  • “Building Microservices” by Sam Newman - Resilience patterns
  • Martin Fowler’s Blog - Circuit breaker pattern explanation