
API Rate Limiting & Throttling

Protecting your API from abuse, one request at a time

Without rate limiting:

  • One client can overwhelm your API - A single client can send unlimited requests, consuming all resources
  • DDoS attacks possible - Attackers can easily overwhelm your service with excessive requests
  • Unfair resource usage - Aggressive clients consume resources meant for all users
  • Backend services crash - Overload can cause services to fail, affecting all users

With rate limiting:

  • Fair usage for all clients - Each client gets equal access within defined limits
  • Protection from abuse - Prevents malicious or buggy clients from causing harm
  • Backend services protected - Services remain available even under high load
  • Predictable costs - Resource usage stays within expected bounds

Rate limiting is used everywhere in production systems. Here are examples from popular services:

GitHub

Rate Limits:

  • Authenticated requests: 5,000 requests/hour
  • Unauthenticated requests: 60 requests/hour
  • Search API: 30 requests/minute

Implementation: GitHub uses a token bucket algorithm with different limits for authenticated vs unauthenticated users. When you exceed the limit, you receive a 403 Forbidden response with headers indicating when the limit resets:

HTTP/1.1 403 Forbidden
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1609459200
X-RateLimit-Used: 5000

Why these limits? GitHub needs to protect their infrastructure while allowing legitimate developers to build applications. The higher limit for authenticated users encourages API key usage, which helps GitHub identify and manage traffic better.

Twitter

Rate Limits (v2 API):

  • Tweet lookup: 300 requests/15 minutes per user
  • User lookup: 300 requests/15 minutes per user
  • Search: 180 requests/15 minutes per user
  • Post tweet: 300 requests/15 minutes per user

Implementation: Twitter uses a sliding window algorithm. Different endpoints have different limits based on resource cost. More expensive operations (like search) have lower limits.

Real scenario: A social media analytics tool needs to fetch tweets for 1,000 users. At 300 requests/15 minutes, that is four windows of work, so at least 45 minutes end to end, requiring careful request scheduling and rate limit tracking.

AWS API Gateway

Rate Limits:

  • Default: 10,000 requests/second per region
  • Burst: 5,000 requests (bucket capacity for short spikes)
  • Per-key: Configurable (e.g., 100 requests/second per API key)

Implementation: AWS uses a token bucket with burst capacity. The burst allows short spikes above the steady-state rate, which is perfect for handling traffic patterns with occasional peaks.

Use case: An e-commerce site during Black Friday. Normal traffic is 1,000 requests/second, but during flash sales, traffic spikes to 5,000 requests/second for 30 seconds. The burst capacity handles these spikes without rejecting requests.

Stripe

Rate Limits:

  • Test mode: 25 requests/second
  • Live mode: 100 requests/second per API key
  • Idempotency: Separate limits for idempotent requests

Implementation: Stripe uses a combination of token bucket and sliding window. They also implement idempotency keys to prevent duplicate charges, which have separate rate limits.

Critical use case: Payment processing. Stripe must prevent both abuse and accidental duplicate charges. Rate limiting protects their infrastructure while idempotency keys protect customers from double-charging.

Google Maps

Rate Limits:

  • Geocoding: 40 requests/second
  • Directions: 40 requests/second
  • Places: 100 requests/second
  • Quota: 40,000 requests/day (free tier)

Implementation: Google uses both per-second rate limits and daily quotas. This dual approach prevents both short-term abuse and long-term overuse.

Example: A delivery app needs to geocode addresses. With 40 requests/second, it can process 2,400 addresses per minute. For a delivery service handling 10,000 orders/day, this requires careful batching and caching of geocoded addresses.

Reddit

Rate Limits:

  • OAuth requests: 60 requests/minute
  • Unauthenticated: 30 requests/minute
  • Per-user: Tracks usage per OAuth token

Implementation: Reddit uses a fixed window algorithm with per-user tracking. This prevents individual users from overwhelming the API while allowing fair distribution across all users.

Real-world impact: A Reddit bot that posts comments needs to respect the 60 requests/minute limit. Posting too quickly results in temporary bans, requiring exponential backoff and retry logic.
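
A sketch of that retry logic in Python; make_request is a stand-in for whatever call hits the API and is assumed to return a requests-style response object:

import random
import time

def call_with_backoff(make_request, max_retries=5):
    """Retry on 429, sleeping exponentially longer (plus jitter) each time."""
    for attempt in range(max_retries):
        response = make_request()  # stand-in for the actual API call
        if response.status_code != 429:
            return response
        # Honor Retry-After when the server provides it; otherwise back off
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    return response  # still rate limited after all retries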

Cloudflare

Rate Limiting Rules:

  • Custom rules: Configurable (e.g., 100 requests/minute per IP)
  • Challenge pages: CAPTCHA after limit exceeded
  • Bypass rules: For trusted IPs or authenticated users

Implementation: Cloudflare uses distributed rate limiting across their global network. Rules are evaluated at edge locations, providing protection before traffic reaches origin servers.

DDoS protection: During a DDoS attack, Cloudflare’s rate limiting automatically blocks excessive requests from individual IPs while allowing legitimate traffic through. This protects origin servers from being overwhelmed.

Netflix

Rate Limits (before shutdown):

  • Per application: 5,000 requests/hour
  • Per user: 20 requests/second

Why it mattered: Netflix’s API was used by third-party applications to access movie metadata. Rate limiting prevented abuse while allowing legitimate developers to build applications. The API was eventually shut down in favor of direct partnerships, but rate limiting was crucial during its operation.



Several classic algorithms implement rate limiting, each trading off burst tolerance, accuracy, memory, and complexity.

Token Bucket

Bucket holds tokens. Tokens refill at a fixed rate. Each request consumes one token.


Characteristics:

  • Allows bursts (up to bucket capacity) - Can handle sudden spikes in traffic up to bucket size
  • Smooths to average rate over time - Over long periods, rate averages to refill rate
  • Simple to implement - Straightforward algorithm with clear logic
  • Most popular algorithm - Widely used and well-understood

Algorithm:

  1. Initialize bucket with capacity tokens
  2. Refill tokens at refill_rate per second
  3. When request arrives:
    • If tokens available → consume token, allow request
    • If no tokens → reject request (429)
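
Below is a minimal in-memory sketch in Python. It matches the TokenBucket(capacity, refill_rate) and is_allowed(key) interface used in the snippets later in this article (a convention of this article, not a library API); refill_rate is tokens per second, and state is kept per key so one limiter instance can track many clients.

import time

class TokenBucket:
    """Minimal in-memory token bucket, one bucket of tokens per key."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.buckets = {}               # key -> (tokens, last_refill_time)

    def is_allowed(self, key):
        now = time.monotonic()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False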

Leaky Bucket

Bucket holds requests. Requests leak out at a fixed rate. If the bucket is full, new requests are rejected.


Characteristics:

  • Smooths traffic to constant rate - Requests are processed at a steady, predictable rate
  • No bursts allowed - Cannot handle sudden spikes in traffic like token bucket
  • Predictable output rate - Output rate is constant, making it easier to plan capacity
  • Less flexible than token bucket - Cannot accommodate bursts, which may be needed for some use cases

Algorithm:

  1. Initialize bucket with capacity (queue size)
  2. Process requests at leak_rate per second
  3. When request arrives:
    • If bucket has space → add to bucket
    • If bucket full → reject (429)
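
A minimal sketch of the counter-based ("leaky bucket as meter") variant in Python; a production version would also need a worker draining accepted requests at the leak rate, which is omitted here:

import time

class LeakyBucket:
    """Leaky bucket as a meter: a water level that drains at a fixed rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0.0            # current occupancy
        self.last_leak = time.monotonic()

    def is_allowed(self):
        now = time.monotonic()
        # Drain the bucket in proportion to elapsed time
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water < self.capacity:
            self.water += 1  # request fits in the bucket
            return True
        return False  # bucket full: reject with 429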

Sliding Window

Tracks each request's timestamp in a sliding time window. More accurate than a fixed window.


Characteristics:

  • More accurate than fixed window - Provides more precise rate limiting without boundary bursts
  • No burst at boundaries - Eliminates the burst problem that occurs in fixed window at window boundaries
  • More memory intensive (stores timestamps) - Requires storing timestamps for each request, increasing memory usage
  • More complex to implement - Requires more sophisticated logic to track sliding windows
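
A minimal sketch of the sliding window log variant in Python; storing one timestamp per request is exactly what makes this accurate but memory-hungry:

import time
from collections import deque

class SlidingWindowLog:
    """Keeps a timestamp log per key; counts only requests inside the window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # key -> deque of request timestamps

    def is_allowed(self, key):
        now = time.monotonic()
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have slid out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False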

Fixed Window

Divides time into fixed windows with a counter per window. Simple, but allows bursts at window boundaries.


Characteristics:

  • Simple to implement - Straightforward logic with minimal complexity
  • Low memory usage - Only needs to store a counter per window, very efficient
  • Allows bursts at boundaries - Can allow double the limit when windows reset (e.g., 100 requests at 10:00:59 and 10:01:00)
  • Less accurate - Less precise than sliding window due to boundary burst issue
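
A minimal sketch in Python; the single counter per (key, window) is why memory usage is so low:

import time

class FixedWindowCounter:
    """One counter per key; the counter resets whenever a new window starts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> (window_start, count)

    def is_allowed(self, key):
        now = time.monotonic()
        window_start = now - (now % self.window)  # align to a window boundary
        start, count = self.counters.get(key, (window_start, 0))
        if start != window_start:
            count = 0  # a new window has started: reset
        if count < self.limit:
            self.counters[key] = (window_start, count + 1)
            return True
        self.counters[key] = (window_start, count)
        return False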

Distributed Rate Limiting

For multiple servers, share the counters in Redis so every instance enforces the same limit:
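
A minimal sketch, assuming a local Redis instance and the redis-py client: a fixed-window counter written as a Lua script so the increment and the expiry run atomically on the server (the takeaways below recommend Lua for exactly this reason).

import redis

r = redis.Redis()  # assumes a local Redis instance

# Increment the per-key counter; set its TTL on first use in the window
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""
rate_limit_script = r.register_script(RATE_LIMIT_LUA)

def is_allowed(key, limit=100, window_seconds=60):
    count = rate_limit_script(keys=["rl:" + key], args=[window_seconds])
    return count <= limit

Because every app server increments the same Redis key, the limit holds across the whole fleet rather than per instance.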


Rate limiting can be applied at different levels depending on your use case. Here are common strategies with real-world examples:

Per-IP Rate Limiting

Use case: Public APIs where you don’t have user authentication, or as a first line of defense.

Example: A public weather API limits each IP to 100 requests/hour. This prevents a single user from scraping all weather data while allowing legitimate usage.

# 100 requests/hour per IP: capacity 100, refilled at 100/3600 tokens per second
rate_limiter = TokenBucket(capacity=100, refill_rate=100 / 3600)
client_ip = request.remote_addr
if not rate_limiter.is_allowed(client_ip):
    return "Rate limit exceeded", 429

Per-User Rate Limiting

Use case: Authenticated APIs where you want to limit per-user usage, regardless of which device or IP they use.

Example: A social media API limits each authenticated user to 1,000 posts/day. This prevents spam while allowing legitimate users to post from multiple devices (phone, tablet, desktop).

Real-world scenario: A user tries to post 1,500 times in one day. After 1,000 posts, all subsequent requests return 429 until the next day. This protects the platform from spam while being fair to legitimate users.
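
Using the TokenBucket sketch from the algorithms section above, the only change from IP-based limiting is the key; request.user_id stands in for however your framework exposes the authenticated user:

# Key the limiter by authenticated user instead of IP (user_id is illustrative)
rate_limiter = TokenBucket(capacity=1000, refill_rate=1000 / 86400)  # 1,000/day
if not rate_limiter.is_allowed(request.user_id):
    return "Rate limit exceeded", 429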

Per-API-Key Rate Limiting

Use case: Third-party integrations where each application gets its own API key with specific limits.

Example: A payment processing API provides each merchant with an API key. Free tier merchants get 1,000 transactions/month, while enterprise merchants get 100,000 transactions/month.

Real-world scenario: An e-commerce platform integrates with a payment API. They receive an API key with a 10,000 requests/day limit. During peak shopping season, they might hit this limit and need to upgrade their plan or implement request queuing.
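
A sketch of per-key limits; get_plan_for_key is a hypothetical lookup of the merchant's plan, and buckets are cached so token state survives across requests:

# One bucket per API key, sized by that key's plan
key_limiters = {}

api_key = request.headers.get("X-API-Key")
if api_key not in key_limiters:
    plan = get_plan_for_key(api_key)  # hypothetical, e.g. {"limit": 10000, "window": 86400}
    key_limiters[api_key] = TokenBucket(plan["limit"], plan["limit"] / plan["window"])
if not key_limiters[api_key].is_allowed(api_key):
    return "Rate limit exceeded", 429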

Tier-Based Rate Limiting

Use case: SaaS products with multiple pricing tiers. Each tier gets different rate limits as part of the subscription.

Example: A cloud storage API offers three tiers:

  • Free: 100 API calls/hour
  • Pro: 1,000 API calls/hour
  • Enterprise: 10,000 API calls/hour

Real-world scenario: A file backup application uses the API. Free users can sync files 100 times per hour, which is sufficient for personal use. Pro users (developers) get 1,000 calls/hour, enough for automated backups. Enterprise customers get 10,000 calls/hour for large-scale operations.

Business value: Tiered limits encourage upgrades. Users hitting free tier limits often upgrade to Pro, increasing revenue.
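
A sketch of tier-based limits using the hourly numbers above; get_user_tier is a hypothetical lookup:

# Shared buckets per tier, keyed by user
tier_limits = {
    'free': TokenBucket(100, 100 / 3600),            # 100/hour
    'pro': TokenBucket(1000, 1000 / 3600),           # 1,000/hour
    'enterprise': TokenBucket(10000, 10000 / 3600),  # 10,000/hour
}
tier = get_user_tier(user_id)  # hypothetical tier lookup
if not tier_limits[tier].is_allowed(user_id):
    return "Rate limit exceeded", 429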

Per-Endpoint Rate Limiting

Use case: Different API endpoints have different costs. Expensive operations get lower limits.

Example: A machine learning API:

  • Image classification: 100 requests/minute (cheap, fast)
  • Video processing: 10 requests/minute (expensive, slow)
  • Model training: 1 request/hour (very expensive, takes time)

Real-world scenario: A video editing app uses the ML API. Users can classify images quickly (100/min), but video processing is limited to 10/min to prevent resource exhaustion. Model training requests are queued and processed one at a time.

# Different limits per endpoint
endpoint_limits = {
    '/api/classify-image': TokenBucket(100, 100 / 60),  # 100/min
    '/api/process-video': TokenBucket(10, 10 / 60),     # 10/min
    '/api/train-model': TokenBucket(1, 1 / 3600),       # 1/hour
}
endpoint = request.path
if not endpoint_limits[endpoint].is_allowed(user_id):
    return "Rate limit exceeded", 429

Regional Rate Limiting

Use case: Global APIs that want to distribute load or comply with regional regulations.

Example: A content delivery API:

  • US/EU: 1,000 requests/second
  • Asia: 500 requests/second (smaller infrastructure)
  • Other regions: 100 requests/second

Real-world scenario: A global news aggregator API serves different regions. Traffic from high-traffic regions (US/EU) gets higher limits, while emerging markets get lower limits initially, scaling up as infrastructure grows.

# Regional limits (requests/second)
regional_limits = {
    'us': TokenBucket(1000, 1000),
    'eu': TokenBucket(1000, 1000),
    'asia': TokenBucket(500, 500),
    'other': TokenBucket(100, 100),
}
region = get_region_from_ip(request.remote_addr)
if not regional_limits[region].is_allowed(request.remote_addr):
    return "Rate limit exceeded", 429

Dynamic Rate Limiting

Use case: Adjust limits based on system load or user behavior.

Example: During normal load, users get 100 requests/minute. During high load, limits drop to 50 requests/minute to protect the system. Trusted users (good history) get 150 requests/minute.

Real-world scenario: A ride-sharing API dynamically adjusts limits. During rush hour (high load), all users get reduced limits. During off-peak hours, limits increase. Users with good payment history get higher limits.

def get_dynamic_limit(user_id):
    base_limit = 100  # requests/minute
    # Adjust based on system load
    current_load = get_system_load()
    if current_load > 0.8:  # High load
        base_limit = base_limit * 0.5  # Reduce to 50
    # Adjust based on user trust
    user_trust_score = get_user_trust_score(user_id)
    if user_trust_score > 0.9:  # Trusted user
        base_limit = base_limit * 1.5  # Increase to 75-150
    return int(base_limit)

# Cache one bucket per user so token state survives across requests;
# building a fresh TokenBucket on every request would never run out of tokens
user_limiters = {}

limit = get_dynamic_limit(user_id)
if user_id not in user_limiters:
    user_limiters[user_id] = TokenBucket(limit, limit / 60)
if not user_limiters[user_id].is_allowed(user_id):
    return "Rate limit exceeded", 429

Inform clients about rate limits:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200

When limit exceeded:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60
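
A sketch of attaching these headers in a Flask-style handler (the earlier snippets' request.remote_addr suggests Flask; the helper name and arguments are illustrative):

import time
from flask import make_response

def with_rate_limit_headers(body, status, limit, remaining, reset_epoch):
    """Attach the standard rate limit headers to any response (illustrative helper)."""
    resp = make_response(body, status)
    resp.headers["X-RateLimit-Limit"] = str(limit)
    resp.headers["X-RateLimit-Remaining"] = str(remaining)
    resp.headers["X-RateLimit-Reset"] = str(reset_epoch)
    if status == 429:
        # Tell well-behaved clients exactly how long to wait
        resp.headers["Retry-After"] = str(max(0, reset_epoch - int(time.time())))
    return resp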

Algorithm        Bursts  Accuracy   Memory  Complexity
Token Bucket     Yes     High       Low     Medium
Leaky Bucket     No      High       Medium  Medium
Sliding Window   No      Very High  High    High
Fixed Window     Yes     Medium     Low     Low

Recommendation: Use Token Bucket for most cases. It’s simple, accurate, and allows bursts.


Key Takeaways

🪣 Token Bucket: Most Popular

Token bucket allows bursts, smooths to average rate. Most widely used algorithm.

📊 Sliding Window: Most Accurate

Sliding window is most accurate but uses more memory. Use when accuracy is critical.

🌐 Distributed: Use Redis

For multiple servers, use Redis with Lua scripts for atomic operations.

🔢 Return 429

When rate limit exceeded, return HTTP 429 with Retry-After header. Inform clients about limits.


Next Steps

  • Review API Gateway - rate limiting is often implemented in gateways
  • Learn Caching - cache rate limit data for performance
  • Understand Distributed Systems - distributed rate limiting challenges