Retry Patterns & Backoff

Retrying intelligently to handle transient failures

Transient failures are common in distributed systems. Network hiccups cause temporary connection failures. Services experience temporary overload. Database connections time out under load. Rate limiting returns 429 errors that resolve after a short wait.

Without retry logic, you give up on the first failure. Users see errors unnecessarily, and many requests that would have succeeded on a retry fail permanently, turning recoverable failures into a poor user experience.

With retry logic, transient failures are handled automatically. If a request fails temporarily, you wait and try again. Most transient failures resolve quickly, so retries often succeed. This improves user experience and increases success rates.



Retry patterns handle transient failures by automatically retrying failed requests. The key insight is that most failures are temporary—network glitches resolve, overloaded services recover, rate limits reset. By retrying intelligently, you can handle these failures without user intervention.

Think of calling a friend. If you call once and get a busy signal, you don’t give up forever. You wait a bit and call again. If it’s still busy, you wait longer and try again. Eventually, you get through. Retry patterns work the same way in software.



Not all failures should be retried. Understanding which failures are retryable is crucial for effective retry logic.

Network errors like connection refused, timeout, or network unreachable are transient. Networks are unreliable, and temporary glitches are common. These failures often resolve on retry.

5xx server errors like 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), or 504 (Gateway Timeout) indicate server-side problems that are often temporary. Overloaded servers recover, bugs get fixed, dependencies come back online.

Rate limiting (429) indicates the service is temporarily limiting requests. After waiting, retries often succeed. This is why 429 is retryable even though it’s a 4xx error.

Timeout errors indicate the service didn’t respond in time. This could be due to temporary overload or network issues. Retries often succeed when the service recovers.

4xx client errors like 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), or 404 (Not Found) indicate client-side problems. Retrying won’t help—the request is invalid, authentication failed, or the resource doesn’t exist.

Validation errors indicate the request data is invalid. Retrying with the same invalid data will always fail. Fix the data instead of retrying.

Authentication errors like 401 (Unauthorized) indicate credentials are invalid. Retrying won’t help—you need valid credentials.

Real-world example: An API call returns 400 (Bad Request) because a required field is missing. Retrying won’t help—you need to fix the request. However, if the same call returns 503 (Service Unavailable), retrying makes sense because the service might recover.
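
As a sketch of how this classification might look in code (the status-code sets and the is_retryable helper are illustrative assumptions, not taken from a specific library):

```python
# Illustrative helper: decide whether a failure is worth retrying.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}  # rate limiting and server-side errors

def is_retryable(status_code=None, exc=None):
    """Return True if the failure is likely transient and worth retrying."""
    # Network-level failures (connection refused, timeouts) are usually transient.
    if exc is not None and isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # 429 and 5xx usually resolve after a wait; other 4xx client errors do not.
    return status_code in RETRYABLE_STATUS_CODES
```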



Exponential backoff increases wait time exponentially between retries. The first retry waits 1 second, the second waits 2 seconds, the third waits 4 seconds, the fourth waits 8 seconds, and so on. This prevents overwhelming a failing service with retry requests.

Why exponential? When a service is failing, it needs time to recover. Sending retry requests too quickly doesn’t help—the service is still failing. By waiting longer between retries, you give the service time to recover.

The math: Wait time = base_delay * 2^(attempt - 1). For base_delay = 1 second: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s, attempt 4 waits 8s.

Maximum wait time: Most implementations cap the wait time (e.g., max 60 seconds) to prevent extremely long waits. After the cap, all retries wait the maximum time.
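
A small sketch of the calculation above (the one-second base and 60-second cap are example values, not a standard):

```python
def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Exponential backoff: base_delay * 2^(attempt - 1), capped at max_delay."""
    return min(max_delay, base_delay * (2 ** (attempt - 1)))

# attempt 1 -> 1s, attempt 2 -> 2s, attempt 3 -> 4s, attempt 4 -> 8s, ...
# From attempt 7 onward the 60-second cap applies (2^6 = 64 > 60).
```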

Real-world example: AWS SDKs use exponential backoff with jitter. When an API call fails, the SDK waits a short base delay before retrying and roughly doubles the wait after each subsequent failure, up to a maximum, giving AWS services time to recover from temporary issues.


Jitter adds randomness to backoff time to prevent thundering herd problems. Without jitter, all clients retry at the same time, creating synchronized retry storms that overwhelm the service. With jitter, clients retry at slightly different times, spreading the load.


The problem: Imagine 1000 clients all trying to access a service that just recovered. Without jitter, they all retry at exactly the same time (e.g., after 4 seconds). This creates a sudden spike that can overwhelm the service again.

The solution: Add randomness to the wait time. Instead of waiting exactly 4 seconds, wait 4 seconds plus a random amount (e.g., 0-1 seconds). This spreads retries over a time window, preventing synchronized retries.

Types of jitter:

  • Full jitter: Random between 0 and calculated backoff time. More spread, less predictable.
  • Equal jitter: Half fixed, half random. More predictable, still spreads retries.
  • Decorrelated jitter: Bases each wait on the previous wait rather than the attempt count, keeping retries spread out while still growing the delay.

Real-world example: When AWS S3 had an outage, thousands of clients were retrying. Without jitter, they would all retry simultaneously, creating a retry storm. AWS SDKs use jitter to spread retries over time, preventing the retry storm from overwhelming the recovering service.
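
A sketch of the three jitter variants above, layered on top of the backoff calculation (the function names are mine; the decorrelated version follows one commonly described form, min(cap, random(base, previous_sleep * 3))):

```python
import random

def full_jitter(delay):
    """Full jitter: sleep a random amount between 0 and the computed backoff."""
    return random.uniform(0, delay)

def equal_jitter(delay):
    """Equal jitter: half the backoff is fixed, the other half is random."""
    return delay / 2 + random.uniform(0, delay / 2)

def decorrelated_jitter(previous_sleep, base_delay=1.0, max_delay=60.0):
    """Decorrelated jitter: base the next sleep on the previous sleep, not the attempt count."""
    return min(max_delay, random.uniform(base_delay, previous_sleep * 3))
```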



Retry limits prevent infinite retries. Without limits, a permanently failing service causes infinite retry loops that waste resources and never succeed.

Common values: Most systems use 3-5 retries. Too few retries mean giving up too early on transient failures. Too many retries waste resources on permanent failures.

Combining with timeouts: Retry limits should be combined with timeouts. If each retry has a 30-second timeout and you retry 5 times, the maximum time is 150 seconds plus backoff time. This might be too long for user-facing requests.

Combining with circuit breakers: Circuit breakers can prevent retries when a service is known to be failing. If the circuit is open, don’t retry—fail fast instead.

Real-world example: A payment processing service retries up to 3 times with exponential backoff. If all retries fail, it returns an error to the user. This balances handling transient failures with not keeping users waiting too long.


Retry logic with exponential backoff and jitter is a common interview topic. Here’s how to implement it:
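
A minimal, self-contained sketch combining the pieces above: classify the failure, cap the number of attempts, back off exponentially, and add full jitter. The function names and defaults are illustrative.

```python
import random
import time

def retry_with_backoff(operation, is_retryable, max_attempts=3,
                       base_delay=1.0, max_delay=60.0):
    """Call operation(); on retryable failures, wait with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on permanent failures, or once the attempt budget is spent.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, randomized with full jitter.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# Usage (illustrative): retry a flaky network call, 3 attempts in total.
# result = retry_with_backoff(
#     operation=lambda: call_payment_service(),  # hypothetical function
#     is_retryable=lambda exc: isinstance(exc, (ConnectionError, TimeoutError)),
# )
```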



Retry patterns work best when combined with circuit breakers. Circuit breakers prevent retries when a service is known to be failing, while retry patterns handle transient failures when the circuit is closed.

The flow: When a request fails, check the circuit breaker first. If it is open, don't retry; fail fast. If it is closed, retry with exponential backoff, recording each failure with the breaker. If the retries are exhausted and failures keep accumulating, the circuit breaker opens and prevents further retries.

Benefits: This combination provides the best of both worlds. Retry handles transient failures, while circuit breakers prevent wasting resources on permanent failures. The circuit breaker learns from retry failures and opens when failures persist.

Real-world example: A microservice uses retry with exponential backoff for transient failures. After 3 retries fail, the circuit breaker opens. Future requests fail fast without retrying. After a timeout, the circuit enters half-open state and allows one test request. If it succeeds, the circuit closes and retries resume. If it fails, the circuit opens again.
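
A rough sketch of this combined flow, assuming a circuit breaker object with allow_request(), record_success(), and record_failure() methods (these names are illustrative, not from a particular library):

```python
import random
import time

def call_with_retry_and_breaker(operation, breaker, is_retryable,
                                max_attempts=3, base_delay=1.0, max_delay=60.0):
    """Fail fast when the circuit is open; otherwise retry with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        if not breaker.allow_request():
            # Circuit is open: the service is known to be failing, so don't retry.
            raise RuntimeError("circuit breaker is open")
        try:
            result = operation()
            breaker.record_success()   # successes help close a half-open circuit
            return result
        except Exception as exc:
            breaker.record_failure()   # persistent failures eventually open the circuit
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))
```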


Retry Transient Failures

Retry only transient failures (network errors, 5xx, timeouts, 429). Don't retry permanent failures (other 4xx errors, validation errors).

Exponential Backoff

Increase wait time exponentially (1s, 2s, 4s, 8s). Prevents overwhelming failing services. Standard approach.

Add Jitter

Add randomness to prevent thundering herd. Spreads retries over time, preventing synchronized retry storms.

Limit Retries

Set retry limits (3-5 attempts). Too few = give up early, too many = waste resources. Combine with timeouts.

Combine with Circuit Breaker

Use circuit breaker to prevent retries on permanent failures. Retry handles transient failures, circuit breaker prevents waste.

Monitor Retry Rates

Track retry success rates. High retry rates indicate problems. Low success rates after retries indicate permanent failures.



  • AWS SDK Retry Logic - Production retry implementation with exponential backoff and jitter
  • “Release It!” by Michael Nygard - Retry patterns and production patterns
  • Google Cloud Retry Logic - Retry strategies for cloud services
  • “Building Microservices” by Sam Newman - Resilience patterns including retry
  • Exponential Backoff Algorithm - Detailed explanation of backoff strategies