Rate Limiting Algorithms

Protect services from overload and abuse by controlling request rates: token bucket is the industry standard, and sliding window counter provides accuracy at scale.

Key Properties Table

| Algorithm | Time Complexity | Memory | Allows Burst | Edge Burst Issue | Accuracy | Error Rate |
|---|---|---|---|---|---|---|
| Fixed Window | O(1) | O(1) | ❌ | Major | Low | ~2% (edge case) |
| Leaky Bucket | O(1) per op | O(queue) | ❌ Smoothed | ✅ None | High | ~0.1% |
| Token Bucket | O(1) | O(1) | ✅ Controlled | ✅ None | High | ~0.1% |
| Sliding Window Log | O(n) | O(n) | ✅ | None | Perfect | 0% |
| Sliding Window Counter | O(1) | O(1) | ✅ | None | ~Perfect | ~0.003% |

When to Use / Avoid

Fixed Window Counter

  • ✅ Use when: Simple implementations, edge bursts are acceptable, stateless counters required
  • ❌ Avoid: High-traffic APIs, accuracy-critical scenarios, protecting against DDoS

Leaky Bucket

  • ✅ Use when: Smooth, constant-rate output required, protecting downstream systems from spikes
  • ❌ Avoid: Burst traffic (queuing delays), user-facing APIs (delayed requests frustrate users)

Token Bucket

  • ✅ Use when: API rate limiting, burst traffic allowed, industry-standard approach needed, cost control required
  • ❌ Avoid: Extremely strict constant-rate requirements (use leaky bucket instead)

Sliding Window Log

  • ✅ Use when: Low-traffic scenarios, accuracy critical, request timestamps needed for auditing
  • ❌ Avoid: High-traffic APIs (memory overhead O(n) per user), single machine deployments

Sliding Window Counter

  • ✅ Use when: High-traffic, high-accuracy requirements, distributed systems, memory-sensitive deployments
  • ❌ Avoid: Scenarios requiring perfect accuracy (0.003% error may be unacceptable)

Real Systems Table

| System | Algorithm | Config | Production Detail |
|---|---|---|---|
| AWS API Gateway | Token Bucket | 10K RPS default, 5K burst | Account-level throttling with per-method overrides |
| Stripe API | Token Bucket | 25 RPS default per key | Redis-backed, per-customer buckets with burst allowance |
| Cloudflare | Sliding Window Counter | Per-domain, per-IP rules | 0.003% error over 400M+ sampled requests; PoP-local counters |
| Nginx limit_req | Leaky Bucket | Configurable rate + burst | burst parameter with nodelay option |
| Kong Gateway | Configurable | Multiple algorithms | Token bucket, sliding window, fixed window per plugin |
| GitHub API | Token Bucket | 5,000/hour (authenticated) | X-RateLimit headers; exponential backoff recommended |

How Real Systems Use This

AWS API Gateway

AWS API Gateway implements rate limiting using the token bucket algorithm with account-level and stage-level quotas. The default account limit is 10,000 requests per second with a burst capacity of 5,000 requests. When a client exceeds the limit, API Gateway returns a 429 “Too Many Requests” response. Organizations can customize throttling per method by setting a stage-level rate (tokens added per second) and burst size (bucket capacity). The token bucket approach allows brief traffic spikes from legitimate clients while preventing sustained abuse. For example, a mobile app experiencing sudden user surge can consume burst tokens without immediate rejection, then refill at the configured rate.
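
The numbers above translate directly into token-bucket arithmetic. A minimal sketch with a simulated clock (an illustration of the mechanism, not AWS's implementation; the class and numbers just restate the paragraph):

```python
class Bucket:
    """Minimal token bucket using API Gateway's default numbers:
    10,000 tokens/s refill rate, 5,000-token burst capacity."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity          # burst size
        self.rate = rate                  # tokens added per second
        self.tokens = float(capacity)

    def request(self) -> bool:
        """Consume one token; False models a 429 Too Many Requests."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def advance(self, seconds: float) -> None:
        """Advance the simulated clock, refilling tokens up to capacity."""
        self.tokens = min(self.capacity, self.tokens + seconds * self.rate)

aws = Bucket(capacity=5_000, rate=10_000)
spike = [aws.request() for _ in range(6_000)]   # sudden surge
print(spike.count(True))    # 5000 pass on burst capacity
print(spike.count(False))   # 1000 would receive 429s
aws.advance(0.5)            # half a second later, the bucket is full again
```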

Stripe API

Stripe uses token bucket rate limiting configured at the per-API-key level, with a default limit of 25 requests per second for each authenticated API key. Each API key has its own token bucket stored in Redis, allowing Stripe to track refill timestamps and current token count. Stripe returns rate limit information in response headers (e.g., X-RateLimit-Remaining), enabling clients to implement intelligent backoff. The system allows brief bursts above the 25 RPS baseline, so a batch operation can momentarily spike to 40 RPS, then throttle back. This design balances customer flexibility with system protection — Stripe can sustain millions of API keys by using per-key isolation and horizontal scaling across Redis clusters.
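
The per-key state Stripe keeps in Redis is just a token count and a refill timestamp. A sketch with a plain dict standing in for a Redis hash (field names are illustrative; the 25 RPS rate follows the paragraph, the burst of 50 follows the simulation later in this post, and a real deployment would make the read-modify-write atomic, e.g. with a Lua script):

```python
import time

RATE, BURST = 25.0, 50.0     # tokens/sec and bucket capacity per API key
store = {}                   # stand-in for Redis: key -> {"tokens", "ts"}

def take(key, now=None):
    """Refill the key's bucket from its timestamp, then try to spend a token."""
    now = time.time() if now is None else now
    bucket = store.setdefault(key, {"tokens": BURST, "ts": now})
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
    bucket["ts"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False

# A 60-request burst against one key at the same instant:
t0 = 1_000.0
results = [take("sk_test_key", now=t0) for _ in range(60)]
print(results.count(True))   # 50 pass on the burst allowance, 10 rejected
```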

Cloudflare

Cloudflare operates rate limiting at massive scale (processing 45M+ requests per second globally) using a sliding window counter algorithm for high accuracy and low memory overhead. Rather than storing individual request timestamps (which would consume O(n) memory), Cloudflare maintains just two counters per rate limit key: the count from the previous time window and the count from the current window. When a request arrives, the system calculates a weighted average: rate = prev_count × (1 − elapsed_time / window_size) + current_count. This approach achieves 0.003% error rate — accurate enough for real-world rate limiting. Cloudflare implements this logic inside each Point of Presence (PoP) to avoid centralized bottlenecks; each edge location independently tracks rate limits for its incoming traffic, then reports aggregates.
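
Plugging hypothetical counts into the formula shows how the estimate works:

```python
def estimated_rate(prev_count, current_count, elapsed, window_size):
    """Sliding window counter estimate: weight the previous window's count
    by the fraction of it the sliding window still overlaps."""
    weight = 1.0 - elapsed / window_size
    return prev_count * weight + current_count

# 1-second windows: 100 requests last window, 20 so far, 0.3 s into this one.
# The sliding window still overlaps 70% of the previous window:
print(estimated_rate(100, 20, elapsed=0.3, window_size=1.0))  # 90.0
```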

Nginx limit_req

Nginx implements rate limiting using the leaky bucket algorithm via the limit_req_zone and limit_req directives. A typical configuration: limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s; limit_req zone=api burst=20 nodelay; allows 10 requests per second per IP address with a burst capacity of 20 requests. Without the nodelay parameter, Nginx queues bursting requests and delays them to enforce the 10 RPS rate — if 30 requests arrive at once, Nginx immediately allows 20 (filling burst), queues 10, then releases them at 10 RPS (one per 100ms). With nodelay, all 30 are sent to the backend immediately but marked as “burst” — Nginx accounts for them and future requests are throttled. This approach smooths outgoing load to backends, protecting them from sudden traffic spikes.
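
The queueing behaviour described above (the non-nodelay case) can be modelled by computing when each request in a 30-request spike reaches the backend. This is a sketch of the directive's documented semantics, not Nginx's code:

```python
def release_times(n, rate, burst):
    """Seconds each of n simultaneous requests waits before reaching the
    backend: the first `burst` pass at t=0, the rest drain at 1/rate."""
    return [0.0 if i < burst else (i - burst + 1) / rate for i in range(n)]

t = release_times(30, rate=10, burst=20)
print(t[19])   # 0.0  (inside the burst, sent immediately)
print(t[20])   # 0.1  (first queued request, released 100 ms later)
print(t[29])   # 1.0  (last queued request leaves after one second)
```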

Kong Gateway

Kong is an API gateway that supports multiple configurable rate limiting algorithms via its rate-limiting plugin: token bucket (most common), sliding window counter, and fixed window counter. Administrators can choose per route or per service. For example, a public API endpoint might use token bucket with 100 RPS, while an internal endpoint uses sliding window counter with 10,000 RPS to support batch operations. Kong stores state in Redis for distributed scaling — multiple Kong nodes share the same rate limit buckets across a Redis cluster. The plugin returns rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) so clients can implement adaptive behavior.
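
The per-route choice can be pictured as a dispatch table (routes, names, and numbers below are illustrative; real Kong configuration lives in the rate-limiting plugin's declarative config, not Python):

```python
# Hypothetical per-route policies mirroring the example in the text.
ROUTE_POLICIES = {
    "/public/api":     {"algorithm": "token_bucket",   "limit_rps": 100},
    "/internal/batch": {"algorithm": "sliding_window", "limit_rps": 10_000},
}

def policy_for(path):
    """Longest-prefix match of a request path against the policy table."""
    matches = [p for p in ROUTE_POLICIES if path.startswith(p)]
    return ROUTE_POLICIES[max(matches, key=len)] if matches else None

print(policy_for("/public/api/v1/charges")["algorithm"])   # token_bucket
```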

GitHub API

GitHub API implements token bucket rate limiting allowing 5,000 requests per hour for authenticated users (about 83 requests per minute) and 60 requests per hour for unauthenticated users. The response includes headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. When a user approaches their limit, GitHub recommends exponential backoff: wait 1 second, then 2, 4, 8 seconds before retrying, so legitimate clients slow down gracefully. The refill happens once per hour: at the reset timestamp, the user's token bucket is replenished in full. This design encourages clients to batch requests (e.g., fetch 100 repositories in one call) rather than making individual API calls.
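
The recommended backoff schedule is easy to sketch (the request function here is a stand-in for a real HTTP call that would inspect the 429 status and the X-RateLimit-* headers):

```python
import time

def backoff_delays(max_retries=4, base=1.0):
    """Exponential schedule recommended above: 1 s, 2 s, 4 s, 8 s."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]

def call_with_backoff(request_fn, max_retries=4):
    """Call request_fn, sleeping per the schedule whenever it reports 429."""
    for delay in backoff_delays(max_retries):
        status = request_fn()
        if status != 429:
            return status
        time.sleep(delay)
    return request_fn()        # final attempt after the last wait

print(backoff_delays())        # [1.0, 2.0, 4.0, 8.0]
```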

Implementation

Token Bucket Implementation

import time
from typing import Dict

class TokenBucket:
    """
    A token bucket rate limiter implementation.

    Tokens are added to the bucket at a fixed refill_rate (tokens per second).
    A request consumes one token; if no tokens available, request is rejected.
    The bucket capacity is limited to max_tokens to prevent unlimited accumulation.
    """

    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize the token bucket.

        Args:
            capacity: Maximum tokens in the bucket (burst size).
            refill_rate: Tokens added per second (rate).
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill_time = time.time()

    def _refill(self):
        """
        Refill tokens based on elapsed time since last refill.

        Calculates how many tokens to add: (elapsed_seconds × refill_rate).
        Caps tokens at capacity to prevent unlimited accumulation.
        """
        now = time.time()
        elapsed = now - self.last_refill_time
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill_time = now

    def allow_request(self, tokens_needed: int = 1) -> bool:
        """
        Attempt to consume tokens for a request.

        Returns:
            True if tokens are available and consumed (request allowed).
            False if insufficient tokens (request rejected / rate limited).
        """
        self._refill()
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True
        return False


class RateLimiter:
    """
    Multi-user rate limiter using token buckets.

    Each user (identified by a key) has their own independent token bucket.
    This is how AWS API Gateway, Stripe, and GitHub API implement per-user limits.
    """

    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize the rate limiter.

        Args:
            capacity: Burst capacity per user.
            refill_rate: Tokens per second per user.
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.buckets: Dict[str, TokenBucket] = {}

    def is_allowed(self, user_id: str, tokens_needed: int = 1) -> bool:
        """
        Check if a user can make a request.

        Creates a new bucket on first request; reuses existing buckets.
        This lazy initialization pattern is used in production systems
        to avoid pre-allocating buckets for inactive users.
        """
        if user_id not in self.buckets:
            self.buckets[user_id] = TokenBucket(self.capacity, self.refill_rate)
        return self.buckets[user_id].allow_request(tokens_needed)


# Example usage
if __name__ == "__main__":
    # Stripe-like: 25 requests/sec per key, burst of 50
    limiter = RateLimiter(capacity=50, refill_rate=25.0)

    print("Stripe API Rate Limiter Simulation (25 RPS, burst=50)")
    print("-" * 50)

    # Simulate burst from user_123
    for i in range(60):
        allowed = limiter.is_allowed("user_123")
        status = "✅ Allowed" if allowed else "❌ Rejected"
        print(f"Request {i+1}: {status}")
        if i == 49:
            print("... [tokens exhausted, waiting for refill] ...")
            time.sleep(1.0)  # Wait 1 second for tokens to refill

Sliding Window Counter Implementation

import time
from typing import Dict

class SlidingWindowCounter:
    """
    Sliding window counter: hybrid approach for high accuracy with O(1) memory.

    Maintains two counters:
    - current_window: requests in the current time window
    - prev_window: requests in the previous time window

    Estimated rate = prev_window × (1 - elapsed/window_size) + current_window

    This is the approach Cloudflare described, measured at ~0.003% error over 400M+ sampled requests.
    """

    def __init__(self, window_size: int, max_requests: int):
        """
        Initialize the sliding window counter.

        Args:
            window_size: Time window in seconds (e.g., 1 second).
            max_requests: Maximum allowed requests per window.
        """
        self.window_size = window_size
        self.max_requests = max_requests
        self.current_window_count = 0
        self.prev_window_count = 0
        self.window_start_time = time.time()

    def _slide_window(self):
        """
        Slide the window forward if time elapsed exceeds window_size.

        When sliding, the "current" window becomes "previous" and
        a new empty "current" window begins.
        """
        now = time.time()
        elapsed = now - self.window_start_time

        if elapsed >= self.window_size:
            # Move current to previous and reset current
            self.prev_window_count = self.current_window_count
            self.current_window_count = 0
            self.window_start_time = now

    def _get_estimated_count(self) -> float:
        """
        Calculate the estimated request count using weighted average.

        Formula: prev_count × (1 - elapsed/window_size) + current_count

        This estimates how many requests occurred in a sliding window
        that ended at the current moment, without storing individual timestamps.
        """
        now = time.time()
        elapsed = now - self.window_start_time
        weight = 1.0 - (elapsed / self.window_size)
        return self.prev_window_count * weight + self.current_window_count

    def is_allowed(self) -> bool:
        """
        Check if a request is allowed.

        Returns:
            True if estimated count < max_requests (allow request).
            False if estimated count >= max_requests (reject request).
        """
        self._slide_window()
        estimated_count = self._get_estimated_count()

        if estimated_count < self.max_requests:
            self.current_window_count += 1
            return True
        return False


class DistributedRateLimiter:
    """
    Distributed sliding window counter (like Cloudflare per-PoP).

    Each "location" (data center / edge) has its own counter.
    Production systems would sync these counters across locations,
    but for high-frequency limits, local-only counters work well.
    """

    def __init__(self, window_size: int, max_requests: int):
        self.counters: Dict[str, SlidingWindowCounter] = {}
        self.window_size = window_size
        self.max_requests = max_requests

    def is_allowed(self, key: str) -> bool:
        """Check if request from key is allowed (per-location)."""
        if key not in self.counters:
            self.counters[key] = SlidingWindowCounter(self.window_size, self.max_requests)
        return self.counters[key].is_allowed()


# Example usage
if __name__ == "__main__":
    # Cloudflare-like: 1000 requests per 1-second window
    counter = SlidingWindowCounter(window_size=1, max_requests=1000)

    print("Sliding Window Counter Simulation (1000 RPS)")
    print("-" * 50)

    for i in range(1100):
        allowed = counter.is_allowed()
        status = "✅ Allowed" if allowed else "❌ Rejected"
        if i % 100 == 0 or not allowed:
            print(f"Request {i+1}: {status}")

Comparison & Trade-offs

Fixed Window Counter

  • Mechanism: Increment a counter; reset every N seconds
  • Pros: Simplest implementation (one counter per user)
  • Cons: Edge burst problem — 2× requests possible at window boundary
  • When unacceptable: Any production API (edge case is a real issue)
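
The boundary problem is easy to reproduce with a minimal sketch: a 100-request limit per one-second window admits 200 requests in the 0.2 s straddling a reset.

```python
class FixedWindow:
    """Minimal fixed window counter driven by an explicit clock."""

    def __init__(self, limit, window=1.0):
        self.limit, self.window = limit, window
        self.count, self.window_start = 0, 0.0

    def allow(self, now):
        if now - self.window_start >= self.window:
            self.window_start = now - (now % self.window)  # snap to boundary
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

fw = FixedWindow(limit=100)
# 100 requests at t=0.9s fill the first window; 100 more at t=1.1s land in a
# fresh window, so all 200 succeed despite arriving within 0.2 seconds.
allowed = sum(fw.allow(0.9) for _ in range(100)) + sum(fw.allow(1.1) for _ in range(100))
print(allowed)   # 200, double the per-window limit
```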

Leaky Bucket

  • Mechanism: Requests go into a queue; exit at constant rate
  • Pros: Smooth output, no bursty downstream traffic
  • Cons: Legitimate bursts get queued, adding user-visible delay
  • When unacceptable: User-facing APIs (perceived slowness)

Token Bucket

  • Mechanism: Tokens refill at fixed rate; each request costs one token
  • Pros: Allows controlled bursts, simple, O(1) memory, industry standard
  • Cons: Few; bursts up to bucket capacity can briefly exceed the average rate
  • Winner: Most common for API gateways (AWS, Stripe, GitHub, Kong)

Sliding Window Log

  • Mechanism: Store exact timestamp of every request in window
  • Pros: Perfect accuracy (no edge cases or approximations)
  • Cons: O(n) memory per user (stores every timestamp)
  • When unacceptable: High-traffic systems (too much memory)

Sliding Window Counter

  • Mechanism: Two counters (current + previous window) with weighted calculation
  • Pros: Near-perfect accuracy + O(1) memory + distributable
  • Cons: ~0.003% approximation error (negligible in practice)
  • Winner: Cloudflare’s choice for 45M+ RPS scale

This post is licensed under CC BY 4.0 by the author.