Timeouts & Deadlines

Preventing hung requests with timeouts and deadlines

The Problem: Hung Requests

Without timeouts, slow operations hang indefinitely. A slow database query can block a thread for minutes or hours. Threads accumulate, resources are exhausted, and the system crashes. Users wait forever for responses that never come.

With timeouts, operations are cancelled after a maximum wait time. If a database query doesn’t complete within 5 seconds, it’s cancelled, the thread is freed, and an error is returned. Resources are released quickly, the system remains responsive, and users get timely feedback.

Timeouts are essential for resilience. They prevent hung requests from consuming resources indefinitely and ensure systems remain responsive even when operations are slow.

What are Timeouts and Deadlines?

A timeout is the maximum duration to wait for an operation to complete. If the operation doesn’t complete within the timeout, it’s cancelled and an error is returned. Think of ordering food—if the restaurant doesn’t respond in 30 minutes, you cancel the order and try another restaurant.

A deadline is the absolute time when an operation must complete. It’s calculated as start time plus timeout. Deadlines are useful in distributed systems because they can propagate across services—if an upstream service has a 5-second deadline, downstream services must respect the remaining time.
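As a minimal sketch of how the two relate, the hypothetical `Deadline` class below (not taken from any library) turns a timeout into an absolute expiry on a monotonic clock and reports how much time is left. The optional `now` argument is an assumption added so the arithmetic can be checked without waiting.

```python
import time
from typing import Optional


class Deadline:
    """Absolute point in time by which work must finish (monotonic clock)."""

    def __init__(self, timeout_seconds: float, now: Optional[float] = None):
        start = time.monotonic() if now is None else now
        # deadline = start time + timeout
        self.expires_at = start + timeout_seconds

    def remaining(self, now: Optional[float] = None) -> float:
        """Seconds left before the deadline, never negative."""
        current = time.monotonic() if now is None else now
        return max(0.0, self.expires_at - current)

    def expired(self, now: Optional[float] = None) -> bool:
        return self.remaining(now) == 0.0


# A 5-second timeout started at t=100 expires at t=105;
# two seconds later, three seconds remain.
deadline = Deadline(5.0, now=100.0)
left = deadline.remaining(now=102.0)
```

A monotonic clock only has meaning inside one process, so a value like `expires_at` should not be sent to another service as-is.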

Real-World Scenario: Google’s Request Timeouts

Google handles billions of requests per day. Without timeouts, slow operations would exhaust resources and crash the system. Google uses aggressive timeouts at multiple levels:

Frontend timeout: User-facing requests have short timeouts (1-2 seconds). If a backend doesn’t respond quickly, the frontend returns a cached response or error. Users get fast feedback.

Backend timeout: Internal services have longer timeouts (5-30 seconds) depending on the operation. Database queries might have 10-second timeouts, while external API calls might have 30-second timeouts.

Deadline propagation: Google uses gRPC with deadline propagation. When a request starts with a 5-second deadline, all downstream services respect the remaining time. If 2 seconds have elapsed, downstream services have 3 seconds remaining.

The impact: Google’s systems remain responsive even when individual operations are slow. Timeouts prevent resource exhaustion, and deadline propagation ensures end-to-end latency is bounded.

Timeout vs Deadline

Understanding the difference between timeouts and deadlines is important for implementing them correctly.

Timeout (Duration)

A timeout is a duration—the maximum time to wait. It’s relative to when the operation starts. For example, “wait 5 seconds” means wait 5 seconds from now.

Use cases: Timeouts are simple and work well for single operations. They’re easy to implement and understand. Use timeouts when you don’t need to propagate timing constraints across services.

Limitations: Timeouts don’t propagate well. If Service A calls Service B with a 5-second timeout, and Service B calls Service C, Service C doesn’t know about Service A’s timeout. Each service sets its own timeout independently.

Deadline (Absolute Time)

A deadline is an absolute time—when the operation must complete. It’s calculated as start time plus timeout. For example, if you start at 10:00:00 with a 5-second timeout, the deadline is 10:00:05.

Use cases: Deadlines propagate across services. If Service A has a deadline of 10:00:05, Service B knows it must complete by 10:00:05. Service B can calculate remaining time and set appropriate timeouts for Service C.

Benefits: Deadlines ensure end-to-end latency is bounded. All services respect the same deadline, preventing cascading timeouts. This is critical for user-facing requests where total latency matters.

Real-world example: A user request starts at 10:00:00 with a 5-second deadline (10:00:05). Service A processes for 1 second, then calls Service B at 10:00:01 with deadline 10:00:05 (4 seconds remaining). Service B processes for 2 seconds, then calls Service C at 10:00:03 with deadline 10:00:05 (2 seconds remaining). Service C completes within 2 seconds, and the total request completes before the deadline.
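The arithmetic in this example can be reproduced in a few lines. The sketch below uses an arbitrary calendar date (only the clock time matters) and a hypothetical `remaining_seconds` helper; the hop durations are the ones stated in the example.

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 1, 10, 0, 0)      # arbitrary date, 10:00:00
deadline = start + timedelta(seconds=5)      # 10:00:05


def remaining_seconds(now: datetime, deadline: datetime) -> float:
    """Time budget left at `now`, never negative."""
    return max(0.0, (deadline - now).total_seconds())


# (service, seconds of local processing before it calls the next hop)
hops = [("A", 1), ("B", 2)]

now = start
budget_at_call = {}
for service, work in hops:
    now += timedelta(seconds=work)
    # Budget the service can hand to the next hop when it makes its call.
    budget_at_call[service] = remaining_seconds(now, deadline)

# budget_at_call == {"A": 4.0, "B": 2.0}
```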

Choosing Timeout Values

Choosing appropriate timeout values is crucial. Too short, and you cancel operations that would succeed. Too long, and you waste resources waiting for operations that will fail.

API calls: 1-5 seconds. Most APIs should respond quickly. If an API doesn’t respond in 5 seconds, it’s likely having problems.

Database queries: 2-10 seconds. Simple queries should complete quickly. Complex queries might take longer, but 10 seconds is usually sufficient.

External services: 5-30 seconds. External services are less reliable. Give them more time, but not too much—30 seconds is usually the maximum.

File I/O: 10-60 seconds. File operations can be slow, especially for large files. Set longer timeouts, but still bound them.

Adjust based on P95 latency: Set the timeout above the P95 latency but below what users will tolerate. If P95 latency is 2 seconds, a timeout of 3-4 seconds lets most requests succeed while still cutting off the extremely slow ones.

Real-world example: An e-commerce site sets API timeouts to 3 seconds (P95 latency is 1.5 seconds). Database query timeouts are 5 seconds (P95 latency is 2 seconds). External payment gateway timeouts are 10 seconds (external services are less reliable). These values balance success rate with resource usage.
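The P95 rule can be turned into a small calculation. This is a sketch only: the `headroom` multiplier of 1.5 and the 5-second `user_tolerance` default are assumptions chosen to match the "P95 of 2 seconds, timeout of 3-4 seconds" guidance, and `suggest_timeout` is a hypothetical helper name.

```python
import statistics


def p95(latencies):
    """95th percentile of observed latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies, n=100)[94]


def suggest_timeout(latencies, headroom=1.5, user_tolerance=5.0):
    """Timeout above P95 but capped at what users will tolerate."""
    return min(p95(latencies) * headroom, user_tolerance)


# With a steady 2-second latency, the suggestion is 3 seconds.
timeout = suggest_timeout([2.0] * 100)
```

When the cap wins (P95 times headroom exceeds user tolerance), that usually signals the operation is too slow for an interactive path rather than a timeout that needs raising.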

Deadline Propagation

Deadline propagation passes deadlines from upstream services to downstream services. All services respect the same deadline, ensuring end-to-end latency is bounded.

How it works: When Service A calls Service B, it passes its deadline. Service B calculates remaining time and uses it for its operations. When Service B calls Service C, it passes the same deadline. All services work toward the same deadline.

Benefits: Propagation bounds end-to-end latency and prevents cascading timeouts, because every service works toward the same deadline. This matters most for user-facing requests.

Implementation: Use gRPC with deadline propagation, or implement custom deadline headers. Calculate remaining time at each service and set timeouts accordingly.

Real-world example: A user request has a 5-second deadline. Service A uses 1 second, then calls Service B with 4 seconds remaining. Service B uses 2 seconds, then calls Service C with 2 seconds remaining. Service C completes within 2 seconds. Total latency is at most 5 seconds, so the deadline is met.
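One possible shape for custom deadline headers is sketched below. It assumes a hypothetical header name, `X-Deadline-Remaining-Ms`, and sends the remaining budget rather than a wall-clock timestamp so that services do not need synchronized clocks. That is a design choice of this sketch, not something the text above prescribes, and it ignores network transit time.

```python
import time
from typing import Optional

HEADER = "X-Deadline-Remaining-Ms"  # hypothetical header name


def outgoing_headers(deadline: float, now: Optional[float] = None) -> dict:
    """Build headers for a downstream call from a local monotonic deadline."""
    current = time.monotonic() if now is None else now
    remaining_ms = int(max(0.0, deadline - current) * 1000)
    if remaining_ms == 0:
        # No budget left: skip the downstream call entirely.
        raise TimeoutError("deadline exceeded before downstream call")
    return {HEADER: str(remaining_ms)}


def incoming_deadline(headers: dict, default_timeout: float = 5.0,
                      now: Optional[float] = None) -> float:
    """Rebuild a local monotonic deadline from the caller's remaining budget."""
    current = time.monotonic() if now is None else now
    raw = headers.get(HEADER)
    timeout = int(raw) / 1000 if raw is not None else default_timeout
    return current + timeout


# Service A has 4 seconds left when it calls Service B.
headers = outgoing_headers(deadline=105.0, now=101.0)
# Service B rebuilds the deadline on its own clock.
local_deadline = incoming_deadline(headers, now=50.0)
```

Each service repeats the same two steps: read the incoming budget, do its work, and pass on whatever is left.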

Timeout Implementation (Interview Focus)

Timeout wrappers are a common interview topic. A wrapper runs an operation, cancels it if it exceeds a time limit, and surfaces a clear error to the caller.
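A minimal sketch of such a wrapper, assuming the operation is an `asyncio` coroutine. The decorator name `with_timeout`, the `OperationTimeout` exception, and the `fetch` function are hypothetical names introduced for illustration; `asyncio.wait_for` is the standard-library call that does the cancelling.

```python
import asyncio
import functools


class OperationTimeout(Exception):
    """Raised when a wrapped coroutine exceeds its time limit."""


def with_timeout(seconds: float):
    """Decorator that cancels the wrapped coroutine after `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                # wait_for cancels the inner coroutine when the limit is hit.
                return await asyncio.wait_for(func(*args, **kwargs), timeout=seconds)
            except asyncio.TimeoutError as exc:
                raise OperationTimeout(
                    f"{func.__name__} exceeded {seconds}s"
                ) from exc
        return wrapper
    return decorator


@with_timeout(0.05)
async def fetch(delay: float) -> str:
    # Stand-in for a slow network or database call.
    await asyncio.sleep(delay)
    return "ok"
```

Calling `asyncio.run(fetch(0.0))` returns `"ok"`, while `asyncio.run(fetch(1.0))` raises `OperationTimeout`. A thread-based wrapper for blocking code is also possible, but Python cannot forcibly stop a running thread, so the caller would stop waiting while the work continued in the background.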

Combining Timeouts with Other Patterns

Timeouts work best when combined with other resilience patterns:

With Circuit Breaker: Circuit breakers prevent requests to failing services, while timeouts prevent hanging operations. If a service’s circuit is open, don’t wait for timeouts—fail fast instead.

With Retry: Retry patterns handle transient failures, while timeouts prevent retries from hanging. Set timeouts on each retry attempt, and limit total retry time.

With Bulkhead: Bulkheads isolate resources, while timeouts prevent threads from blocking indefinitely. Combined, they ensure resources are released even when operations are slow.
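For the retry case above (a timeout on each attempt plus a cap on total retry time), a sketch might look like the following. It assumes `operation` accepts a `timeout` keyword and raises `TimeoutError` when that limit is hit; `retry_with_budget` and its parameters are hypothetical, and backoff between attempts is omitted for brevity.

```python
import time


def retry_with_budget(operation, attempts=3, per_attempt_timeout=1.0,
                      total_budget=2.5, clock=time.monotonic):
    """Retry `operation`, bounding each attempt and the overall time spent."""
    deadline = clock() + total_budget
    last_error = None
    for _ in range(attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # overall budget spent: stop retrying
        try:
            # Never give an attempt more time than the overall budget has left.
            return operation(timeout=min(per_attempt_timeout, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("retry budget exhausted") from last_error
```

With the defaults, three attempts that all time out receive 1.0, 1.0, and 0.5 seconds, so the caller waits about 2.5 seconds in total rather than 3.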

Real-world example: A microservices application uses timeouts to prevent hanging operations, circuit breakers to prevent requests to failing services, retry patterns to handle transient failures, and bulkheads to isolate resources. This combination provides comprehensive resilience—operations are bounded, failures are isolated, and resources are protected.
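To illustrate how timeouts and a circuit breaker interact, here is a deliberately simplified breaker that counts timeouts as failures and fails fast while open. `CircuitBreaker`, `CircuitOpenError`, and the threshold values are hypothetical; production breakers typically track error rates and a proper half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a service whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic time at which the circuit opened

    def call(self, func, now=None):
        current = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if current - self.opened_at < self.reset_after:
                # Fail fast: do not spend a full timeout on a failing service.
                raise CircuitOpenError("circuit open: failing fast")
            # Cool-down elapsed: allow a trial call; one more timeout reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = current
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Here `func` is expected to enforce its own timeout and raise `TimeoutError`; the breaker only decides whether the call is attempted at all.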

Key Takeaways

Set Timeouts

Always set timeouts on operations. Prevent hung requests from consuming resources indefinitely. Essential for resilience.

Choose Values Carefully

Set timeout > P95 latency but < user tolerance. Too short = false cancellations, too long = resource waste.

Use Deadlines

Use deadlines for distributed systems. Deadlines propagate across services, ensuring end-to-end latency is bounded.

Propagate Deadlines

Pass deadlines from upstream to downstream services. All services respect the same deadline. Critical for user-facing requests.

Combine Patterns

Combine with circuit breakers, retries, and bulkheads. Patterns work together for comprehensive resilience.

Monitor Timeout Rates

Track timeout rates. A high timeout rate indicates a problem with a dependency or with the timeout value itself. If retries after a timeout also rarely succeed, the failure is likely persistent rather than transient.
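A sliding-window counter is one simple way to track the timeout rate. The `TimeoutRateTracker` class, the window size, and the 5% alert threshold below are assumptions for illustration; a real deployment would more likely export these counts to a metrics system.

```python
from collections import deque


class TimeoutRateTracker:
    """Tracks the share of recent calls that timed out."""

    def __init__(self, window: int = 100):
        # Each entry is True if that call timed out; old entries fall off.
        self._outcomes = deque(maxlen=window)

    def record(self, timed_out: bool) -> None:
        self._outcomes.append(timed_out)

    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self, threshold: float = 0.05) -> bool:
        return self.rate() > threshold


tracker = TimeoutRateTracker(window=4)
for outcome in (True, False, False, False):
    tracker.record(outcome)
# One timeout in the last four calls is a 25% timeout rate.
```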
