
How to Design a Rate Limiter That Actually Works at Scale

Raunak Gupta
Mar 22, 2026 · 28 min read · Updated 2 days ago

Rate limiting looks simple - until you deploy it across multiple servers and it starts failing silently.

Most backend engineers implement rate limiting on a single server, run a few tests, and call it done. Then traffic grows. You add more instances behind a load balancer. Suddenly, users are blowing past their limits because each server is counting independently. No errors. No alerts. Just a quietly broken system letting 5x the intended traffic through.

This is not a tutorial on "what is rate limiting." If you need that, you're in the wrong place. This article covers what breaks in production, the real implementation choices you'll face, and how to build a rate limiter that doesn't fall apart when it actually matters.


Why Single-Node Rate Limiting Is a Lie

On a single server, rate limiting is trivial. You keep a counter in memory, increment it per request, and reject anything above the threshold. It works perfectly - in development.
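
For reference, a minimal in-process version of that counter might look like this (the class and its `allow` signature are illustrative, not from any library):

```python
import time

class LocalRateLimiter:
    """Naive single-process fixed-window limiter: one counter per key,
    reset whenever a new window starts. Works only on one instance."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # {key: (window_start, count)}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        start, count = self.counts.get(key, (window_start, 0))
        if start != window_start:   # new window: reset the counter
            start, count = window_start, 0
        if count >= self.limit:
            return False
        self.counts[key] = (start, count + 1)
        return True
```

Twenty lines, correct on one box - and, as the next section shows, wrong the moment there are two.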

Here's where it falls apart:

Multiple application instances. The moment you have two or more servers behind a load balancer, each instance maintains its own counter. A user with a 100 requests/minute limit effectively gets 100 × N, where N is the number of instances. Your limit is meaningless.

Distributed traffic patterns. Requests from the same user don't always hit the same server. Sticky sessions help, but they introduce their own scaling problems and aren't reliable under high load.

Shared limits per user or API key. If user abc123 has a global limit of 1000 requests/hour, every instance needs to agree on the current count. Without shared state, this is impossible.

The bottom line: If your rate limiter doesn't share state across instances, it's not a rate limiter. It's a suggestion.


The Real Problems at Scale

Before jumping into solutions, you need to understand what actually goes wrong. These are the issues that show up in production, not in whiteboard interviews.

1. Inconsistent Counters

Two instances read the counter at the same time, both see 99 against a limit of 100, and both allow the request. Now you've served 101. This is a classic race condition, and it gets worse under load.

2. Race Conditions on Increment

Even with a shared store like Redis, a naive read-then-write pattern creates a window where concurrent requests can slip through. If you're doing GET → check → SET, you're vulnerable.

3. Burst Traffic Bypassing Limits

Fixed window algorithms are especially bad here. If a user sends 100 requests at 11:59:59 and another 100 at 12:00:01, they've sent 200 requests in 2 seconds - but both windows see only 100. Perfectly "within limits."

4. Redis Becoming a Bottleneck

Everyone reaches for Redis as the shared counter store. It works great - until it doesn't. At high request volumes, Redis can become the single point of failure for your entire rate limiting layer. CPU spikes, connection pool exhaustion, and increased latency under load are all real issues.

5. Clock Drift Across Systems

If your servers disagree on what time it is - even by a few hundred milliseconds - your time-window-based counters become unreliable. In distributed systems, clocks are never perfectly synchronized.


Rate Limiting Algorithms: What You Actually Need to Know

You don't need a deep dive into every algorithm. You need to know when each one breaks so you can pick the right one for your use case.

Fixed Window Counter

How it works: Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject if the count exceeds the limit.

When it breaks: Boundary burst problem. A user can send double the allowed requests by timing them around the window boundary. If your limit is 100/min and a user sends 100 requests at XX:59 and 100 at XX+1:00, they've sent 200 in ~1 second.

plaintext
Window 1: [XX:00 ---- XX:59]  →  100 requests (OK)
Window 2: [XX+1:00 ---- XX+1:59]  →  100 requests (OK)

Reality: 200 requests in ~1 second. Your "100/min" limit is useless.

Use it when: You need something dead simple, the burst issue is acceptable, and precision isn't critical.


Sliding Window Log

How it works: Store the timestamp of every request. When a new request comes in, count all timestamps within the last N seconds. Reject if the count exceeds the limit.

When it breaks: Memory. Storing individual timestamps for every request, for every user, gets expensive fast. If you have 1 million users each making 100 requests/minute, you're storing 100 million timestamps. Your Redis memory usage will spike hard.

Use it when: You need high accuracy and have a relatively small number of rate-limited entities.
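
An in-memory sketch of the log (illustrative; in production the log usually lives in a Redis sorted set - ZADD a timestamp per request, ZREMRANGEBYSCORE to trim expired ones, ZCARD to count - so all instances share it):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: keeps one timestamp per request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # {key: deque of timestamps}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        log = self.logs.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible right in the data structure: one entry per request, per key, for the length of the window.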


Sliding Window Counter

How it works: A hybrid approach. Uses the current window count plus a weighted portion of the previous window count to approximate a sliding window.

plaintext
Effective count = (previous window count × overlap percentage) + current window count

Example:
- Previous window: 80 requests
- Current window: 30 requests
- We're 25% into the current window (75% overlap with previous)
- Effective count = (80 × 0.75) + 30 = 90

When it breaks: It's an approximation. Under very specific burst patterns, it can be slightly too lenient or too strict. For most use cases, this trade-off is worth it.

Use it when: You want better accuracy than fixed window without the memory cost of sliding log. This is the sweet spot for most applications.
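
The weighted calculation above can be sketched as a standalone limiter (class and field names are illustrative; a real version would also prune windows older than the previous one):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: current count plus a weighted
    share of the previous window's count."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # {key: {window_index: count}}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        windows = self.counts.setdefault(key, {})
        prev = windows.get(idx - 1, 0)
        curr = windows.get(idx, 0)
        # Weight the previous window by how much of it still overlaps
        effective = prev * (1 - elapsed_fraction) + curr
        if effective >= self.limit:
            return False
        windows[idx] = curr + 1
        return True
```

Plugging in the numbers from the example (previous window 80, current 30, 25% elapsed) gives an effective count of 90 - under a 100/min limit, the request is allowed.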


Token Bucket

How it works: Each user has a "bucket" with a maximum number of tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

When it breaks: Implementation complexity in distributed systems. Maintaining accurate token counts across multiple instances with proper refill timing requires careful synchronization.

Use it when: You need to allow controlled bursts. A user who hasn't made requests in a while should be able to make several quick requests. This maps well to API rate limits for paid tiers.


Leaky Bucket

How it works: Requests enter a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests are dropped.

When it breaks: It enforces a strict output rate, which means even legitimate burst traffic gets queued or dropped. For APIs where response time matters, the queueing introduces latency.

Use it when: You need a constant, predictable processing rate - like sending webhook deliveries or processing background jobs.
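
A sketch of the leaky bucket as a meter - the bucket drains at a fixed rate and a request is rejected when it's full (names are illustrative; a true queueing variant would hold requests and release them at the drain rate instead of rejecting):

```python
import time

class LeakyBucket:
    """Leaky-bucket meter: level rises by 1 per request and drains
    continuously at leak_rate per second; full bucket = reject."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last = None

    def allow(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            # Drain the bucket for the elapsed time
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False
        self.level += 1
        return True
```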


Algorithm Comparison at a Glance

Algorithm              | Burst Handling                | Memory Usage                  | Accuracy    | Complexity
Fixed Window           | Poor - boundary exploit       | Low                           | Low         | Very simple
Sliding Window Log     | Excellent                     | High - stores every timestamp | Very high   | Moderate
Sliding Window Counter | Good - approximate            | Low                           | Good enough | Moderate
Token Bucket           | Excellent - controlled bursts | Low                           | High        | Moderate-High
Leaky Bucket           | Strict - no bursts allowed    | Low                           | High        | Moderate

My recommendation: Start with sliding window counter for most use cases. Move to token bucket if you need burst control. The other algorithms are either too simple (fixed window) or too expensive (sliding log) for production at scale.


What Actually Works in Production

Theory is one thing. Here's what you'll actually deploy.

Option A: Redis + Atomic Operations

This is the most common approach, and for good reason. Redis is fast, widely supported, and has built-in atomic operations that solve the race condition problem.

The naive approach (don't do this):

Naive approach - race condition
# ❌ WRONG - race condition between GET and SET
count = redis.get(f"rate_limit:{user_id}")
if count and int(count) >= LIMIT:
    return 429  # Too Many Requests
redis.incr(f"rate_limit:{user_id}")
redis.expire(f"rate_limit:{user_id}", WINDOW_SIZE)

The gap between get and incr is a race condition. Two concurrent requests can both read the same count and both pass the check.

The correct approach - atomic INCR + EXPIRE:

Atomic increment - correct
# ✅ CORRECT - atomic increment, no race condition
key = f"rate_limit:{user_id}:{current_window}"
count = redis.incr(key)
if count == 1:
    redis.expire(key, WINDOW_SIZE)
if count > LIMIT:
    return 429

This works because INCR is atomic in Redis - it increments and returns the new value in a single operation. No read-then-write race.

But there's still a problem. The INCR and EXPIRE are two separate commands. If your service crashes between them, the key lives forever. Fix this with a Lua script:

Atomic rate limiting - Lua script
-- Lua script for atomic rate limiting in Redis
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window)
end

if count > limit then
    return 0  -- rejected
end
return 1  -- allowed

Lua scripts in Redis are executed atomically. No partial execution, no race conditions. This is the baseline implementation you should use.

Key design:

plaintext
rate_limit:{user_id}:{window_timestamp}

Example: rate_limit:user_123:1700000000 where the timestamp represents the start of the current window.
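
Deriving that window timestamp is a one-liner - round the current time down to the window boundary (WINDOW_SIZE and the helper name here are illustrative):

```python
import time

WINDOW_SIZE = 60  # seconds

def window_key(user_id, now=None):
    """Key for the current fixed window, named by the window's start time."""
    now = int(time.time() if now is None else now)
    window_start = now - (now % WINDOW_SIZE)
    return f"rate_limit:{user_id}:{window_start}"
```

Every instance that computes the key this way lands on the same Redis key for the same window, which is what makes the shared counter work.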

Problems with this approach:

  • Hot keys. If one user generates massive traffic, all their requests hit the same Redis key. This can cause uneven load distribution across Redis cluster slots.
  • Redis latency under load. Every single request now requires a Redis round-trip. At 50,000+ requests/second, that latency adds up.
  • Single point of failure. If Redis goes down, your rate limiting goes with it. What's your fallback?

Option B: Distributed Rate Limiting at the Gateway

Here's an insight that saves a lot of complexity: offloading rate limiting to the edge or API gateway reduces application complexity dramatically.

Instead of implementing rate limiting in your application code, push it to:

  • AWS API Gateway - built-in throttling with per-key, per-method limits
  • Cloudflare Rate Limiting - edge-level, before traffic even hits your infrastructure
  • Kong / NGINX - configurable rate limiting plugins
  • Envoy Proxy - supports both local and global rate limiting

Why this works better than you think:

  1. Rate limiting happens before your application processes the request. No wasted compute on requests that should be rejected.
  2. Gateway-level solutions are purpose-built for this. They handle distributed counting, clock synchronization, and failover internally.
  3. Your application code stays clean. No Redis connections for rate limiting, no Lua scripts, no counter management.

Example - NGINX rate limiting configuration:

nginx.conf
http {
    # Define a rate limiting zone
    # $binary_remote_addr = client IP (binary form, saves memory)
    # zone=api_limit:10m = 10MB shared memory zone (~160,000 IPs)
    # rate=10r/s = 10 requests per second per IP
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        location /api/ {
            # burst=20 allows 20 requests to queue
            # nodelay processes burst requests immediately
            limit_req zone=api_limit burst=20 nodelay;
            limit_req_status 429;

            proxy_pass http://backend;
        }
    }
}

Example - AWS API Gateway throttling (via CloudFormation):

cloudformation.yaml
Resources:
  ApiUsagePlan:
    Type: AWS::ApiGateway::UsagePlan
    Properties:
      Throttle:
        RateLimit: 100      # requests per second
        BurstLimit: 200     # maximum concurrent requests
      Quota:
        Limit: 10000        # requests per day
        Period: DAY

My recommendation: Use gateway-level rate limiting as your first line of defense. Add application-level limiting only for business-logic-specific rules that the gateway can't handle.


Option C: Token Bucket - The Best Practical Model for APIs

If you're building a public API with tiered rate limits (free, pro, enterprise), the token bucket algorithm is the best fit.

It handles bursts gracefully. A user who hasn't made requests in 30 seconds has accumulated tokens. They can make a quick burst of requests without hitting limits.

Implementation with Redis:

token_bucket.py
import time
import redis

def check_rate_limit(redis_client, user_id, max_tokens, refill_rate):
    """
    Token bucket rate limiter using Redis.

    max_tokens: Maximum burst size (bucket capacity)
    refill_rate: Tokens added per second
    """
    key = f"token_bucket:{user_id}"
    now = time.time()

    # Lua script for atomic token bucket
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1])
    local last_refill = tonumber(bucket[2])

    -- Initialize bucket if it doesn't exist
    if tokens == nil then
        tokens = max_tokens
        last_refill = now
    end

    -- Calculate tokens to add based on elapsed time
    local elapsed = now - last_refill
    local new_tokens = elapsed * refill_rate
    tokens = math.min(max_tokens, tokens + new_tokens)

    -- Try to consume a token
    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end

    -- Update bucket state
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)

    return {allowed, math.floor(tokens)}
    """

    result = redis_client.eval(lua_script, 1, key, max_tokens, refill_rate, now)
    allowed, remaining = result

    return {
        "allowed": bool(allowed),
        "remaining": remaining,
        "limit": max_tokens,
        "retry_after": None if allowed else 1.0 / refill_rate
    }


# Usage per tier
RATE_LIMITS = {
    "free":       {"max_tokens": 10,  "refill_rate": 1},    # 1 req/sec, burst of 10
    "pro":        {"max_tokens": 50,  "refill_rate": 10},   # 10 req/sec, burst of 50
    "enterprise": {"max_tokens": 200, "refill_rate": 100},  # 100 req/sec, burst of 200
}

Why token bucket wins for tiered APIs:

  • The max_tokens parameter directly maps to burst allowance per tier
  • The refill_rate maps to sustained request rate
  • Both are independently tunable per plan
  • Users get clear, predictable behavior

The Trade-Offs - Choosing Your Approach

Every solution has costs. Here's the honest breakdown:

Approach               | Pros                       | Cons                             | Best For
Fixed Window + Redis   | Dead simple, low memory    | Boundary burst exploit           | Internal services
Sliding Window Counter | Good accuracy, low memory  | Slightly complex, approximate    | General-purpose API limits
Token Bucket + Redis   | Burst-friendly, intuitive  | More complex implementation      | Public APIs with tiers
Gateway-Level          | No app code, handles scale | Less control over complex rules  | First line of defense
Sliding Window Log     | Extremely accurate         | High memory, expensive           | Precision-critical limits
There is no universally correct choice. The right approach depends on your traffic volume, accuracy requirements, and how much operational complexity you're willing to take on.


What Breaks at Scale - The Section Nobody Talks About

Redis CPU Spikes

Rate limiting means hitting Redis on every single request. At 50,000 requests/second, that's 50,000 Redis operations per second just for rate limiting.

Mitigation strategies:

  • Dedicated Redis instance for rate limiting. Don't share with your cache or session store.
  • Redis Cluster for horizontal sharding. Distribute keys across multiple nodes.
  • Local caching with periodic sync. Keep a local counter, sync to Redis every N requests. Trade accuracy for performance.
Hybrid rate limiter - local + Redis
# Local counter with periodic Redis sync
# Trades strict accuracy for massive performance gain
import threading
import time

class HybridRateLimiter:
    def __init__(self, redis_client, sync_interval=1.0):
        self.redis = redis_client
        self.local_counts = {}  # {key: count}
        self.sync_interval = sync_interval
        self._lock = threading.Lock()
        self._start_sync_thread()

    def _start_sync_thread(self):
        # Daemon thread pushes accumulated local counts to Redis periodically
        t = threading.Thread(target=self._sync_to_redis, daemon=True)
        t.start()

    def check(self, key, limit):
        # Check the local count first - no Redis call on the hot path.
        # Note: between syncs this only enforces a per-instance limit.
        with self._lock:
            local = self.local_counts.get(key, 0)
            if local >= limit:
                return False
            self.local_counts[key] = local + 1
            return True

    def _sync_to_redis(self):
        while True:
            time.sleep(self.sync_interval)
            with self._lock:
                counts = self.local_counts.copy()
                self.local_counts.clear()
            for key, count in counts.items():
                self.redis.incrby(f"rate:{key}", count)

Key Explosion

If you're creating a Redis key per user per time window, the math gets ugly fast.

plaintext
1 million users × 60 windows/hour = 60 million keys per hour

Mitigation:

  • Use short, binary-efficient key names. rl:u123:1700000 instead of rate_limit:user_id_123:window_1700000000.
  • Set aggressive TTLs. Don't keep keys longer than 2× your window size.
  • Monitor INFO memory and dbsize in Redis.
  • Consider hash-based grouping. Store multiple users' counts in a single Redis hash to reduce per-key overhead.
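
A sketch of the hash-grouping idea: bucket users into a bounded number of Redis hashes per window and count with HINCRBY (the bucket count, key format, and helper name are all illustrative):

```python
import zlib

WINDOW = 60  # seconds

def hash_slot(user_id, now, num_buckets=1024):
    """Map a user to a (hash key, field) pair - many users share one hash."""
    window_start = int(now) - (int(now) % WINDOW)
    # crc32 is stable across processes, unlike Python's built-in hash()
    bucket = zlib.crc32(user_id.encode()) % num_buckets
    key = f"rl:h:{bucket}:{window_start}"  # one hash per bucket per window
    field = user_id                        # one field per user
    return key, field

# Usage (assumes a redis-py client named redis_client):
# key, field = hash_slot("user_123", time.time())
# count = redis_client.hincrby(key, field, 1)
# redis_client.expire(key, WINDOW * 2)
```

Instead of millions of tiny keys with individual TTLs, you get at most num_buckets keys per window, each expiring as a unit.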

Clock Drift Across Systems

If server A thinks it's 12:00:00.000 and server B thinks it's 11:59:59.700, their window calculations will disagree.

Mitigation:

  • Use the Redis server's clock, not the application server's. Let Redis compute timestamps via Lua scripts with redis.call('TIME').
  • Use larger windows. A 300ms drift matters for a 1-second window. It's irrelevant for a 60-second window.
  • Run NTP with tight sync intervals across all servers.
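
The first mitigation can be sketched as a Lua script that computes the window from the Redis server's clock, so every instance agrees on the window regardless of its own clock (one caveat: building the key inside the script is fine on a single Redis node but bypasses slot routing on Redis Cluster):

```python
# Fixed-window limiter keyed by the Redis server's clock, not the app's.
FIXED_WINDOW_BY_REDIS_TIME = """
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

-- TIME returns {seconds, microseconds} from the Redis server's clock
local t = redis.call('TIME')
local now = tonumber(t[1])
local window_start = now - (now % window)
local key = KEYS[1] .. ':' .. window_start

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window * 2)
end
if count > limit then
    return 0
end
return 1
"""

# Usage (redis-py):
# allowed = redis_client.eval(FIXED_WINDOW_BY_REDIS_TIME, 1,
#                             f"rate_limit:{user_id}", LIMIT, 60)
```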

Network Latency Affecting Limits

Every rate limit check requires a network round-trip to Redis. Under normal conditions, this is 0.5-2ms. During network congestion or Redis failover, it can spike to 50-200ms.

The cascading failure scenario:

  1. Redis latency increases to 100ms
  2. Request threads block waiting for rate limit checks
  3. Thread pool exhausts
  4. Application starts rejecting all requests - not because of rate limits, but because it can't check them

Mitigation:

  • Set aggressive timeouts on Redis connections (5-10ms).
  • Fail open, not closed. If Redis is unreachable, allow the request. Rate limiting is a protection mechanism, not a gatekeeper.
  • Circuit breaker pattern. After N consecutive Redis failures, stop trying for M seconds.
Fail-open rate limiter with circuit breaker
# Fail-open rate limiter with circuit breaker
import time
import redis

class ResilientRateLimiter:
    def __init__(self, redis_client, failure_threshold=5, recovery_time=30):
        self.redis = redis_client
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.circuit_open_until = 0

    def check(self, key, limit):
        # Circuit is open - skip Redis, allow the request
        if time.time() < self.circuit_open_until:
            return True  # fail open

        try:
            # _check_redis is your actual Redis-backed check
            # (e.g., the Lua script from earlier in this article)
            result = self._check_redis(key, limit)
            self.failure_count = 0  # reset on success
            return result
        except (redis.ConnectionError, redis.TimeoutError):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.circuit_open_until = time.time() + self.recovery_time
            return True  # fail open

This is a critical design decision. Failing open feels wrong - you're letting unlimited traffic through. But the alternative is failing closed, which means a Redis hiccup takes down your entire service. In almost every case, failing open is the right call.


The Recommended Production Architecture

Layer 1: Edge / Gateway - Broad Protection

Use your CDN or API gateway for the first line of defense.

What it handles:

  • Per-IP rate limiting (e.g., 1000 requests/minute per IP)
  • Geographic blocking
  • Basic bot detection
  • DDoS mitigation

Tools: Cloudflare, AWS WAF, NGINX, or your load balancer's built-in throttling.

Layer 2: Application - Business Logic Limits

Use Redis-backed rate limiting in your application for fine-grained control.

What it handles:

  • Per-user limits based on subscription tier
  • Per-endpoint limits (e.g., stricter limits on write operations)
  • Per-organization limits for B2B APIs

Layer 3: Fallback - Graceful Degradation

python
if redis_available:
    enforce_rate_limit()
elif local_cache_available:
    enforce_approximate_limit()  # local counters
else:
    allow_request()  # fail open
    log_warning("Rate limiting disabled - Redis unreachable")
    increment_metric("rate_limit.fallback")

Critical insight: Always have a fallback strategy. "Redis never goes down" is not a strategy.

Architecture Diagram

plaintext
Client Request
      │
      ▼
┌─────────────────────┐
│  CDN / Edge Layer    │  ← Layer 1: IP-based limits, DDoS protection
│  (Cloudflare / AWS)  │     Drops ~90% of abusive traffic
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  API Gateway /       │  ← Optional: Additional throttling
│  Load Balancer       │     Route-level rate limits
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐       ┌──────────────┐
│  Application Server  │──────▶│  Redis       │  ← Layer 2: User/tier limits
│  (Rate Limit Check)  │◀──────│  (Dedicated) │     Token bucket / sliding window
└─────────┬───────────┘       └──────────────┘
          │
          │  Redis down?
          │  ──────────▶ Fail open + alert    ← Layer 3: Fallback
          ▼
┌─────────────────────┐
│  Application Logic   │
│  (Process Request)   │
└─────────────────────┘

Advanced Considerations

Per-User vs. Per-IP Limiting

Per-IP is simple but breaks behind NATs, corporate proxies, and VPNs. Thousands of legitimate users can share a single IP.

Per-user (authenticated) is more accurate but doesn't protect against unauthenticated abuse.

The right answer is both. Per-IP at the gateway for unauthenticated traffic. Per-user in the application for authenticated requests.

Tier-Based Rate Limits

If you're running a SaaS with free and paid tiers, your rate limits are a product feature, not just an infrastructure concern.

Rate limits as product config
# Rate limit configuration as product feature
TIER_LIMITS = {
    "free": {
        "requests_per_minute": 60,
        "burst": 10,
        "daily_quota": 1000,
    },
    "pro": {
        "requests_per_minute": 600,
        "burst": 100,
        "daily_quota": 50000,
    },
    "enterprise": {
        "requests_per_minute": 6000,
        "burst": 1000,
        "daily_quota": None,  # unlimited
    },
}
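
Those product-level numbers map directly onto token-bucket parameters - burst becomes bucket capacity, requests_per_minute becomes the refill rate. A sketch of that mapping (the helper and the trimmed config are illustrative):

```python
# Map product-tier config onto token-bucket parameters
TIERS = {
    "free": {"requests_per_minute": 60,  "burst": 10},
    "pro":  {"requests_per_minute": 600, "burst": 100},
}

def bucket_params(tier):
    """Resolve a tier name to token-bucket settings (unknown tiers -> free)."""
    cfg = TIERS.get(tier, TIERS["free"])
    return {
        "max_tokens": cfg["burst"],                        # burst allowance
        "refill_rate": cfg["requests_per_minute"] / 60.0,  # sustained rate
    }
```

Keeping this mapping in one place means a pricing change is a config edit, not a rate-limiter rewrite.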

Important: Make rate limits visible to users. Include headers in every response:

http
HTTP/1.1 200 OK
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 542
X-RateLimit-Reset: 1700000060
Retry-After: 30           # Only on 429 responses

The 429 status code comes from IETF RFC 6585; the X-RateLimit-* headers are a widely used convention that the IETF draft-ietf-httpapi-ratelimit-headers specification is standardizing. Developers expect them. Not including them turns every rate limit hit into a debugging session for your users.


Common Mistakes to Avoid

1. Not setting rate limit headers on responses. Your users shouldn't have to guess their limits. Always return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.

2. Failing closed when Redis is down. A Redis outage should not take down your service. Fail open, log it, alert on it, and fix Redis.

3. Using the same Redis instance for rate limiting and caching. A cache stampede can starve your rate limiter. Dedicated instances for dedicated purposes.

4. Not rate limiting internal services. "It's internal" doesn't mean it can't overwhelm downstream dependencies. A buggy service can DDoS your database through an unprotected internal API.

5. Ignoring the cost of rate limiting itself. Every rate limit check is a Redis call. At scale, this adds latency to every request. Benchmark the overhead and optimize accordingly.

6. Hardcoding rate limits. Limits change. Tiers change. Make limits configurable - ideally through a config service or feature flags, not deployment.


Summary: What You Should Actually Do

If you're just starting out: Use Redis with the atomic INCR + Lua script approach. Sliding window counter algorithm. It's simple, well-understood, and handles most traffic patterns well.

If you're handling significant traffic: Add gateway-level rate limiting as your first layer. Use Cloudflare, AWS API Gateway, or NGINX to handle the broad strokes. Keep Redis-based limiting for fine-grained, business-logic-specific rules.

If you're building a public API: Implement token bucket for tiered rate limits. Always include rate limit headers. Document your limits clearly. Provide a way for users to check their current usage.

Regardless of your scale:

  • Always fail open when your rate limiting infrastructure is unavailable
  • Always monitor rate limit hits, Redis latency, and fallback activations
  • Always plan for Redis failure - it will happen
  • Layer your defenses - no single rate limiting strategy handles everything

Rate limiting isn't a solved problem you implement once and forget. It's an evolving system that needs monitoring, tuning, and adaptation as your traffic patterns change. Build it with that mindset, and it'll actually work when you need it to.


Written by

Raunak Gupta

DevOps engineer and technical writer with experience in cloud infrastructure, CI/CD pipelines, and system design. Passionate about making complex engineering topics accessible through clear, practical writing backed by real production experience.
