
API Rate Limiting & Throttling

Protecting your API from abuse, one request at a time

Without rate limiting:

  • One client can overwhelm your API - A single client can send unlimited requests, consuming all resources
  • DDoS attacks possible - Attackers can easily overwhelm your service with excessive requests
  • Unfair resource usage - Aggressive clients consume resources meant for all users
  • Backend services crash - Overload can cause services to fail, affecting all users

With rate limiting:

  • Fair usage for all clients - Each client gets equal access within defined limits
  • Protection from abuse - Prevents malicious or buggy clients from causing harm
  • Backend services protected - Services remain available even under high load
  • Predictable costs - Resource usage stays within expected bounds

Rate limiting is used everywhere in production systems. Here are examples from popular services:

GitHub

Rate Limits:

  • Authenticated requests: 5,000 requests/hour
  • Unauthenticated requests: 60 requests/hour
  • Search API: 30 requests/minute

Implementation: GitHub uses a token bucket algorithm with different limits for authenticated vs unauthenticated users. When you exceed the limit, you receive a 403 Forbidden response with headers indicating when the limit resets:

HTTP/1.1 403 Forbidden
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1609459200
X-RateLimit-Used: 5000

Why these limits? GitHub needs to protect their infrastructure while allowing legitimate developers to build applications. The higher limit for authenticated users encourages API key usage, which helps GitHub identify and manage traffic better.

Twitter

Rate Limits (v2 API):

  • Tweet lookup: 300 requests/15 minutes per user
  • User lookup: 300 requests/15 minutes per user
  • Search: 180 requests/15 minutes per user
  • Post tweet: 300 requests/15 minutes per user

Implementation: Twitter uses a sliding window algorithm. Different endpoints have different limits based on resource cost. More expensive operations (like search) have lower limits.

Real scenario: A social media analytics tool needs to fetch tweets for 1,000 users. At 300 requests/15 minutes, that is four windows of work, so at least 45 minutes end to end, requiring careful request scheduling and rate limit tracking.

AWS API Gateway

Rate Limits:

  • Default: 10,000 requests/second per region
  • Burst: 5,000 requests (bucket capacity for short spikes)
  • Per-key: Configurable (e.g., 100 requests/second per API key)

Implementation: AWS uses a token bucket with burst capacity. The burst allows short spikes above the steady-state rate, which is perfect for handling traffic patterns with occasional peaks.

Use case: An e-commerce site during Black Friday. Normal traffic is 1,000 requests/second, but during flash sales, traffic spikes to 5,000 requests/second for 30 seconds. The burst capacity handles these spikes without rejecting requests.

Stripe

Rate Limits:

  • Test mode: 25 requests/second
  • Live mode: 100 requests/second per API key
  • Idempotency: Separate limits for idempotent requests

Implementation: Stripe uses a combination of token bucket and sliding window. They also implement idempotency keys to prevent duplicate charges, which have separate rate limits.

Critical use case: Payment processing. Stripe must prevent both abuse and accidental duplicate charges. Rate limiting protects their infrastructure while idempotency keys protect customers from double-charging.

Google Maps

Rate Limits:

  • Geocoding: 40 requests/second
  • Directions: 40 requests/second
  • Places: 100 requests/second
  • Quota: 40,000 requests/day (free tier)

Implementation: Google uses both per-second rate limits and daily quotas. This dual approach prevents both short-term abuse and long-term overuse.

Example: A delivery app needs to geocode addresses. With 40 requests/second, it can process 2,400 addresses per minute. For a delivery service handling 10,000 orders/day, this requires careful batching and caching of geocoded addresses.

Reddit

Rate Limits:

  • OAuth requests: 60 requests/minute
  • Unauthenticated: 30 requests/minute
  • Per-user: Tracks usage per OAuth token

Implementation: Reddit uses a fixed window algorithm with per-user tracking. This prevents individual users from overwhelming the API while allowing fair distribution across all users.

Real-world impact: A Reddit bot that posts comments needs to respect the 60 requests/minute limit. Posting too quickly results in temporary bans, requiring exponential backoff and retry logic.
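
A sketch of that retry logic in Python; make_request is a stand-in for whatever call hits the API and is assumed to return a requests-style response object:

import random
import time

def call_with_backoff(make_request, max_retries=5):
    """Retry on 429, sleeping exponentially longer (plus jitter) each time."""
    for attempt in range(max_retries):
        response = make_request()  # stand-in for the actual API call
        if response.status_code != 429:
            return response
        # Honor Retry-After when the server provides it; otherwise back off
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    return response  # still rate limited after all retries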

Cloudflare

Rate Limiting Rules:

  • Custom rules: Configurable (e.g., 100 requests/minute per IP)
  • Challenge pages: CAPTCHA after limit exceeded
  • Bypass rules: For trusted IPs or authenticated users

Implementation: Cloudflare uses distributed rate limiting across their global network. Rules are evaluated at edge locations, providing protection before traffic reaches origin servers.

DDoS protection: During a DDoS attack, Cloudflare’s rate limiting automatically blocks excessive requests from individual IPs while allowing legitimate traffic through. This protects origin servers from being overwhelmed.

Netflix

Rate Limits (before shutdown):

  • Per application: 5,000 requests/hour
  • Per user: 20 requests/second

Why it mattered: Netflix’s API was used by third-party applications to access movie metadata. Rate limiting prevented abuse while allowing legitimate developers to build applications. The API was eventually shut down in favor of direct partnerships, but rate limiting was crucial during its operation.



Several classic algorithms implement rate limiting, each trading off burst tolerance, accuracy, memory, and complexity.

Token Bucket

Bucket holds tokens. Tokens refill at a fixed rate. Each request consumes one token.


Characteristics:

  • Allows bursts (up to bucket capacity) - Can handle sudden spikes in traffic up to bucket size
  • Smooths to average rate over time - Over long periods, rate averages to refill rate
  • Simple to implement - Straightforward algorithm with clear logic
  • Most popular algorithm - Widely used and well-understood

Algorithm:

  1. Initialize bucket with capacity tokens
  2. Refill tokens at refill_rate per second
  3. When request arrives:
    • If tokens available → consume token, allow request
    • If no tokens → reject request (429)
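
Below is a minimal in-memory sketch in Python. It matches the TokenBucket(capacity, refill_rate) and is_allowed(key) interface used in the snippets later in this article (a convention of this article, not a library API); refill_rate is tokens per second, and state is kept per key so one limiter instance can track many clients.

import time

class TokenBucket:
    """Minimal in-memory token bucket, one bucket of tokens per key."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.buckets = {}               # key -> (tokens, last_refill_time)

    def is_allowed(self, key):
        now = time.monotonic()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False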

Leaky Bucket

Bucket holds requests. Requests leak out at a fixed rate. If the bucket is full, new requests are rejected.


Characteristics:

  • Smooths traffic to constant rate - Requests are processed at a steady, predictable rate
  • No bursts allowed - Cannot handle sudden spikes in traffic like token bucket
  • Predictable output rate - Output rate is constant, making it easier to plan capacity
  • Less flexible than token bucket - Cannot accommodate bursts, which may be needed for some use cases

Algorithm:

  1. Initialize bucket with capacity (queue size)
  2. Process requests at leak_rate per second
  3. When request arrives:
    • If bucket has space → add to bucket
    • If bucket full → reject (429)
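
A minimal sketch of the counter-based ("leaky bucket as meter") variant in Python; a production version would also need a worker draining accepted requests at the leak rate, which is omitted here:

import time

class LeakyBucket:
    """Leaky bucket as a meter: a water level that drains at a fixed rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0.0            # current occupancy
        self.last_leak = time.monotonic()

    def is_allowed(self):
        now = time.monotonic()
        # Drain the bucket in proportion to elapsed time
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water < self.capacity:
            self.water += 1  # request fits in the bucket
            return True
        return False  # bucket full: reject with 429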

Sliding Window

Tracks each request's timestamp in a sliding time window. More accurate than a fixed window.


Characteristics:

  • More accurate than fixed window - Provides more precise rate limiting without boundary bursts
  • No burst at boundaries - Eliminates the burst problem that occurs in fixed window at window boundaries
  • More memory intensive (stores timestamps) - Requires storing timestamps for each request, increasing memory usage
  • More complex to implement - Requires more sophisticated logic to track sliding windows
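
A minimal sketch of the sliding window log variant in Python; storing one timestamp per request is exactly what makes this accurate but memory-hungry:

import time
from collections import deque

class SlidingWindowLog:
    """Keeps a timestamp log per key; counts only requests inside the window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # key -> deque of request timestamps

    def is_allowed(self, key):
        now = time.monotonic()
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have slid out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False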

Fixed Window

Divides time into fixed windows with a counter per window. Simple, but allows bursts at window boundaries.


Characteristics:

  • Simple to implement - Straightforward logic with minimal complexity
  • Low memory usage - Only needs to store a counter per window, very efficient
  • Allows bursts at boundaries - Can allow double the limit when windows reset (e.g., 100 requests at 10:00:59 and 10:01:00)
  • Less accurate - Less precise than sliding window due to boundary burst issue
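
A minimal sketch in Python; the single counter per (key, window) is why memory usage is so low:

import time

class FixedWindowCounter:
    """One counter per key; the counter resets whenever a new window starts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> (window_start, count)

    def is_allowed(self, key):
        now = time.monotonic()
        window_start = now - (now % self.window)  # align to a window boundary
        start, count = self.counters.get(key, (window_start, 0))
        if start != window_start:
            count = 0  # a new window has started: reset
        if count < self.limit:
            self.counters[key] = (window_start, count + 1)
            return True
        self.counters[key] = (window_start, count)
        return False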

Distributed Rate Limiting

For multiple servers, share the counters in Redis so every instance enforces the same limit:
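
A minimal sketch, assuming a local Redis instance and the redis-py client: a fixed-window counter written as a Lua script so the increment and the expiry run atomically on the server (the takeaways below recommend Lua for exactly this reason).

import redis

r = redis.Redis()  # assumes a local Redis instance

# Increment the per-key counter; set its TTL on first use in the window
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""
rate_limit_script = r.register_script(RATE_LIMIT_LUA)

def is_allowed(key, limit=100, window_seconds=60):
    count = rate_limit_script(keys=["rl:" + key], args=[window_seconds])
    return count <= limit

Because every app server increments the same Redis key, the limit holds across the whole fleet rather than per instance.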


Rate limiting can be applied at different levels depending on your use case. Here are common strategies with real-world examples:

Per-IP Rate Limiting

Use case: Public APIs where you don’t have user authentication, or as a first line of defense.

Example: A public weather API limits each IP to 100 requests/hour. This prevents a single user from scraping all weather data while allowing legitimate usage.

# 100 requests/hour per IP: capacity 100, refilled at 100/3600 tokens per second
rate_limiter = TokenBucket(capacity=100, refill_rate=100 / 3600)
client_ip = request.remote_addr
if not rate_limiter.is_allowed(client_ip):
    return "Rate limit exceeded", 429

Per-User Rate Limiting

Use case: Authenticated APIs where you want to limit per-user usage, regardless of which device or IP they use.

Example: A social media API limits each authenticated user to 1,000 posts/day. This prevents spam while allowing legitimate users to post from multiple devices (phone, tablet, desktop).

Real-world scenario: A user tries to post 1,500 times in one day. After 1,000 posts, all subsequent requests return 429 until the next day. This protects the platform from spam while being fair to legitimate users.
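
Using the TokenBucket sketch from the algorithms section above, the only change from IP-based limiting is the key; request.user_id stands in for however your framework exposes the authenticated user:

# Key the limiter by authenticated user instead of IP (user_id is illustrative)
rate_limiter = TokenBucket(capacity=1000, refill_rate=1000 / 86400)  # 1,000/day
if not rate_limiter.is_allowed(request.user_id):
    return "Rate limit exceeded", 429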

Per-API-Key Rate Limiting

Use case: Third-party integrations where each application gets its own API key with specific limits.

Example: A payment processing API provides each merchant with an API key. Free tier merchants get 1,000 transactions/month, while enterprise merchants get 100,000 transactions/month.

Real-world scenario: An e-commerce platform integrates with a payment API. They receive an API key with a 10,000 requests/day limit. During peak shopping season, they might hit this limit and need to upgrade their plan or implement request queuing.
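
A sketch of per-key limits; get_plan_for_key is a hypothetical lookup of the merchant's plan, and buckets are cached so token state survives across requests:

# One bucket per API key, sized by that key's plan
key_limiters = {}

api_key = request.headers.get("X-API-Key")
if api_key not in key_limiters:
    plan = get_plan_for_key(api_key)  # hypothetical, e.g. {"limit": 10000, "window": 86400}
    key_limiters[api_key] = TokenBucket(plan["limit"], plan["limit"] / plan["window"])
if not key_limiters[api_key].is_allowed(api_key):
    return "Rate limit exceeded", 429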

Tier-Based Rate Limiting

Use case: SaaS products with multiple pricing tiers. Each tier gets different rate limits as part of the subscription.

Example: A cloud storage API offers three tiers:

  • Free: 100 API calls/hour
  • Pro: 1,000 API calls/hour
  • Enterprise: 10,000 API calls/hour

Real-world scenario: A file backup application uses the API. Free users can sync files 100 times per hour, which is sufficient for personal use. Pro users (developers) get 1,000 calls/hour, enough for automated backups. Enterprise customers get 10,000 calls/hour for large-scale operations.

Business value: Tiered limits encourage upgrades. Users hitting free tier limits often upgrade to Pro, increasing revenue.
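
A sketch of tier-based limits using the hourly numbers above; get_user_tier is a hypothetical lookup:

# Shared buckets per tier, keyed by user
tier_limits = {
    'free': TokenBucket(100, 100 / 3600),            # 100/hour
    'pro': TokenBucket(1000, 1000 / 3600),           # 1,000/hour
    'enterprise': TokenBucket(10000, 10000 / 3600),  # 10,000/hour
}
tier = get_user_tier(user_id)  # hypothetical tier lookup
if not tier_limits[tier].is_allowed(user_id):
    return "Rate limit exceeded", 429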

Per-Endpoint Rate Limiting

Use case: Different API endpoints have different costs. Expensive operations get lower limits.

Example: A machine learning API:

  • Image classification: 100 requests/minute (cheap, fast)
  • Video processing: 10 requests/minute (expensive, slow)
  • Model training: 1 request/hour (very expensive, takes time)

Real-world scenario: A video editing app uses the ML API. Users can classify images quickly (100/min), but video processing is limited to 10/min to prevent resource exhaustion. Model training requests are queued and processed one at a time.

# Different limits per endpoint
endpoint_limits = {
    '/api/classify-image': TokenBucket(100, 100 / 60),  # 100/min
    '/api/process-video': TokenBucket(10, 10 / 60),     # 10/min
    '/api/train-model': TokenBucket(1, 1 / 3600),       # 1/hour
}
endpoint = request.path
if not endpoint_limits[endpoint].is_allowed(user_id):
    return "Rate limit exceeded", 429

Regional Rate Limiting

Use case: Global APIs that want to distribute load or comply with regional regulations.

Example: A content delivery API:

  • US/EU: 1,000 requests/second
  • Asia: 500 requests/second (smaller infrastructure)
  • Other regions: 100 requests/second

Real-world scenario: A global news aggregator API serves different regions. Traffic from high-traffic regions (US/EU) gets higher limits, while emerging markets get lower limits initially, scaling up as infrastructure grows.

# Regional limits (requests/second)
regional_limits = {
    'us': TokenBucket(1000, 1000),
    'eu': TokenBucket(1000, 1000),
    'asia': TokenBucket(500, 500),
    'other': TokenBucket(100, 100),
}
region = get_region_from_ip(request.remote_addr)
if not regional_limits[region].is_allowed(request.remote_addr):
    return "Rate limit exceeded", 429

Dynamic Rate Limiting

Use case: Adjust limits based on system load or user behavior.

Example: During normal load, users get 100 requests/minute. During high load, limits drop to 50 requests/minute to protect the system. Trusted users (good history) get 150 requests/minute.

Real-world scenario: A ride-sharing API dynamically adjusts limits. During rush hour (high load), all users get reduced limits. During off-peak hours, limits increase. Users with good payment history get higher limits.

def get_dynamic_limit(user_id):
    base_limit = 100  # requests/minute
    # Adjust based on system load
    current_load = get_system_load()
    if current_load > 0.8:  # High load
        base_limit = base_limit * 0.5  # Reduce to 50
    # Adjust based on user trust
    user_trust_score = get_user_trust_score(user_id)
    if user_trust_score > 0.9:  # Trusted user
        base_limit = base_limit * 1.5  # Increase to 75-150
    return int(base_limit)

# Cache one bucket per user so token state survives across requests;
# building a fresh TokenBucket on every request would never run out of tokens
user_limiters = {}

limit = get_dynamic_limit(user_id)
if user_id not in user_limiters:
    user_limiters[user_id] = TokenBucket(limit, limit / 60)
if not user_limiters[user_id].is_allowed(user_id):
    return "Rate limit exceeded", 429

Inform clients about rate limits:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200

When limit exceeded:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60
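
A sketch of attaching these headers in a Flask-style handler (the earlier snippets' request.remote_addr suggests Flask; the helper name and arguments are illustrative):

import time
from flask import make_response

def with_rate_limit_headers(body, status, limit, remaining, reset_epoch):
    """Attach the standard rate limit headers to any response (illustrative helper)."""
    resp = make_response(body, status)
    resp.headers["X-RateLimit-Limit"] = str(limit)
    resp.headers["X-RateLimit-Remaining"] = str(remaining)
    resp.headers["X-RateLimit-Reset"] = str(reset_epoch)
    if status == 429:
        # Tell well-behaved clients exactly how long to wait
        resp.headers["Retry-After"] = str(max(0, reset_epoch - int(time.time())))
    return resp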

Algorithm        Bursts  Accuracy   Memory  Complexity
Token Bucket     Yes     High       Low     Medium
Leaky Bucket     No      High       Medium  Medium
Sliding Window   No      Very High  High    High
Fixed Window     Yes     Medium     Low     Low

Recommendation: Use Token Bucket for most cases. It’s simple, accurate, and allows bursts.


Key Takeaways

🪣 Token Bucket: Most Popular

Token bucket allows bursts, smooths to average rate. Most widely used algorithm.

📊 Sliding Window: Most Accurate

Sliding window is most accurate but uses more memory. Use when accuracy is critical.

🌐 Distributed: Use Redis

For multiple servers, use Redis with Lua scripts for atomic operations.

🔢 Return 429

When rate limit exceeded, return HTTP 429 with Retry-After header. Inform clients about limits.


Next Steps

  • Review API Gateway - rate limiting is often implemented in gateways
  • Learn Caching - cache rate limit data for performance
  • Understand Distributed Systems - distributed rate limiting challenges