
How to Design a Rate Limiter That Actually Works at Scale

Raunak Gupta
Mar 22, 2026 · 28 min read · Updated 2 days ago

Rate limiting looks simple - until you deploy it across multiple servers and it starts failing silently.

Most backend engineers implement rate limiting on a single server, run a few tests, and call it done. Then traffic grows. You add more instances behind a load balancer. Suddenly, users are blowing past their limits because each server is counting independently. No errors. No alerts. Just a quietly broken system letting 5x the intended traffic through.

This is not a tutorial on "what is rate limiting." If you need that, you're in the wrong place. This article covers what breaks in production, the real implementation choices you'll face, and how to build a rate limiter that doesn't fall apart when it actually matters.


Why Single-Node Rate Limiting Is a Lie

On a single server, rate limiting is trivial. You keep a counter in memory, increment it per request, and reject anything above the threshold. It works perfectly - in development.
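
For reference, a minimal in-process version of that counter might look like this (the class and its `allow` signature are illustrative, not from any library):

```python
import time

class LocalRateLimiter:
    """Naive single-process fixed-window limiter: one counter per key,
    reset whenever a new window starts. Works only on one instance."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # {key: (window_start, count)}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        start, count = self.counts.get(key, (window_start, 0))
        if start != window_start:   # new window: reset the counter
            start, count = window_start, 0
        if count >= self.limit:
            return False
        self.counts[key] = (start, count + 1)
        return True
```

Twenty lines, correct on one box - and, as the next section shows, wrong the moment there are two.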

Here's where it falls apart:

Multiple application instances. The moment you have two or more servers behind a load balancer, each instance maintains its own counter. A user with a 100 requests/minute limit effectively gets 100 × N, where N is the number of instances. Your limit is meaningless.

Distributed traffic patterns. Requests from the same user don't always hit the same server. Sticky sessions help, but they introduce their own scaling problems and aren't reliable under high load.

Shared limits per user or API key. If user abc123 has a global limit of 1000 requests/hour, every instance needs to agree on the current count. Without shared state, this is impossible.

The bottom line: If your rate limiter doesn't share state across instances, it's not a rate limiter. It's a suggestion.


The Real Problems at Scale

Before jumping into solutions, you need to understand what actually goes wrong. These are the issues that show up in production, not in whiteboard interviews.

1. Inconsistent Counters

Two instances read the counter at the same time, both see 99 against a limit of 100, and both allow the request. Now you've served 101. This is a classic race condition, and it gets worse under load.

2. Race Conditions on Increment

Even with a shared store like Redis, a naive read-then-write pattern creates a window where concurrent requests can slip through. If you're doing GET → check → SET, you're vulnerable.

3. Burst Traffic Bypassing Limits

Fixed window algorithms are especially bad here. If a user sends 100 requests at 11:59:59 and another 100 at 12:00:01, they've sent 200 requests in 2 seconds - but both windows see only 100. Perfectly "within limits."

4. Redis Becoming a Bottleneck

Everyone reaches for Redis as the shared counter store. It works great - until it doesn't. At high request volumes, Redis can become the single point of failure for your entire rate limiting layer. CPU spikes, connection pool exhaustion, and increased latency under load are all real issues.

5. Clock Drift Across Systems

If your servers disagree on what time it is - even by a few hundred milliseconds - your time-window-based counters become unreliable. In distributed systems, clocks are never perfectly synchronized.


Rate Limiting Algorithms: What You Actually Need to Know

You don't need a deep dive into every algorithm. You need to know when each one breaks so you can pick the right one for your use case.

Fixed Window Counter

How it works: Divide time into fixed windows (e.g., 1-minute intervals). Count requests in the current window. Reject if the count exceeds the limit.

When it breaks: Boundary burst problem. A user can send double the allowed requests by timing them around the window boundary. If your limit is 100/min and a user sends 100 requests at XX:59 and 100 at XX+1:00, they've sent 200 in ~1 second.

plaintext
Window 1: [XX:00 ---- XX:59]  →  100 requests (OK)
Window 2: [XX+1:00 ---- XX+1:59]  →  100 requests (OK)

Reality: 200 requests in ~1 second. Your "100/min" limit is useless.

Use it when: You need something dead simple, the burst issue is acceptable, and precision isn't critical.


Sliding Window Log

How it works: Store the timestamp of every request. When a new request comes in, count all timestamps within the last N seconds. Reject if the count exceeds the limit.

When it breaks: Memory. Storing individual timestamps for every request, for every user, gets expensive fast. If you have 1 million users each making 100 requests/minute, you're storing 100 million timestamps. Your Redis memory usage will spike hard.

Use it when: You need high accuracy and have a relatively small number of rate-limited entities.
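
An in-memory sketch of the log (illustrative; in production the log usually lives in a Redis sorted set - ZADD a timestamp per request, ZREMRANGEBYSCORE to trim expired ones, ZCARD to count - so all instances share it):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding-window limiter: keeps one timestamp per request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # {key: deque of timestamps}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        log = self.logs.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible right in the data structure: one entry per request, per key, for the length of the window.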


Sliding Window Counter

How it works: A hybrid approach. Uses the current window count plus a weighted portion of the previous window count to approximate a sliding window.

plaintext
Effective count = (previous window count × overlap percentage) + current window count

Example:
- Previous window: 80 requests
- Current window: 30 requests
- We're 25% into the current window (75% overlap with previous)
- Effective count = (80 × 0.75) + 30 = 90

When it breaks: It's an approximation. Under very specific burst patterns, it can be slightly too lenient or too strict. For most use cases, this trade-off is worth it.

Use it when: You want better accuracy than fixed window without the memory cost of sliding log. This is the sweet spot for most applications.
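
The weighted calculation above can be sketched as a standalone limiter (class and field names are illustrative; a real version would also prune windows older than the previous one):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: current count plus a weighted
    share of the previous window's count."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # {key: {window_index: count}}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        windows = self.counts.setdefault(key, {})
        prev = windows.get(idx - 1, 0)
        curr = windows.get(idx, 0)
        # Weight the previous window by how much of it still overlaps
        effective = prev * (1 - elapsed_fraction) + curr
        if effective >= self.limit:
            return False
        windows[idx] = curr + 1
        return True
```

Plugging in the numbers from the example (previous window 80, current 30, 25% elapsed) gives an effective count of 90 - under a 100/min limit, the request is allowed.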


Token Bucket

How it works: Each user has a "bucket" with a maximum number of tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

When it breaks: Implementation complexity in distributed systems. Maintaining accurate token counts across multiple instances with proper refill timing requires careful synchronization.

Use it when: You need to allow controlled bursts. A user who hasn't made requests in a while should be able to make several quick requests. This maps well to API rate limits for paid tiers.


Leaky Bucket

How it works: Requests enter a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests are dropped.

When it breaks: It enforces a strict output rate, which means even legitimate burst traffic gets queued or dropped. For APIs where response time matters, the queueing introduces latency.

Use it when: You need a constant, predictable processing rate - like sending webhook deliveries or processing background jobs.
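
A sketch of the leaky bucket as a meter - the bucket drains at a fixed rate and a request is rejected when it's full (names are illustrative; a true queueing variant would hold requests and release them at the drain rate instead of rejecting):

```python
import time

class LeakyBucket:
    """Leaky-bucket meter: level rises by 1 per request and drains
    continuously at leak_rate per second; full bucket = reject."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last = None

    def allow(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            # Drain the bucket for the elapsed time
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False
        self.level += 1
        return True
```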


Algorithm Comparison at a Glance

Algorithm              | Burst Handling                | Memory Usage                  | Accuracy    | Complexity
Fixed Window           | Poor - boundary exploit       | Low                           | Low         | Very simple
Sliding Window Log     | Excellent                     | High - stores every timestamp | Very high   | Moderate
Sliding Window Counter | Good - approximate            | Low                           | Good enough | Moderate
Token Bucket           | Excellent - controlled bursts | Low                           | High        | Moderate-High
Leaky Bucket           | Strict - no bursts allowed    | Low                           | High        | Moderate

My recommendation: Start with sliding window counter for most use cases. Move to token bucket if you need burst control. The other algorithms are either too simple (fixed window) or too expensive (sliding log) for production at scale.


What Actually Works in Production

Theory is one thing. Here's what you'll actually deploy.

Option A: Redis + Atomic Operations

This is the most common approach, and for good reason. Redis is fast, widely supported, and has built-in atomic operations that solve the race condition problem.

The naive approach (don't do this):

Naive approach - race condition
# ❌ WRONG - race condition between GET and SET
count = redis.get(f"rate_limit:{user_id}")
if count and int(count) >= LIMIT:
    return 429  # Too Many Requests
redis.incr(f"rate_limit:{user_id}")
redis.expire(f"rate_limit:{user_id}", WINDOW_SIZE)

The gap between get and incr is a race condition. Two concurrent requests can both read the same count and both pass the check.

The correct approach - atomic INCR + EXPIRE:

Atomic increment - correct
# ✅ CORRECT - atomic increment, no race condition
key = f"rate_limit:{user_id}:{current_window}"
count = redis.incr(key)
if count == 1:
    redis.expire(key, WINDOW_SIZE)
if count > LIMIT:
    return 429

This works because INCR is atomic in Redis - it increments and returns the new value in a single operation. No read-then-write race.

But there's still a problem. The INCR and EXPIRE are two separate commands. If your service crashes between them, the key lives forever. Fix this with a Lua script:

Atomic rate limiting - Lua script
-- Lua script for atomic rate limiting in Redis
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window)
end

if count > limit then
    return 0  -- rejected
end
return 1  -- allowed

Lua scripts in Redis are executed atomically. No partial execution, no race conditions. This is the baseline implementation you should use.

Key design:

plaintext
rate_limit:{user_id}:{window_timestamp}

Example: rate_limit:user_123:1700000000 where the timestamp represents the start of the current window.
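
Deriving that window timestamp is a one-liner - round the current time down to the window boundary (WINDOW_SIZE and the helper name here are illustrative):

```python
import time

WINDOW_SIZE = 60  # seconds

def window_key(user_id, now=None):
    """Key for the current fixed window, named by the window's start time."""
    now = int(time.time() if now is None else now)
    window_start = now - (now % WINDOW_SIZE)
    return f"rate_limit:{user_id}:{window_start}"
```

Every instance that computes the key this way lands on the same Redis key for the same window, which is what makes the shared counter work.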

Problems with this approach:

  • Hot keys. If one user generates massive traffic, all their requests hit the same Redis key. This can cause uneven load distribution across Redis cluster slots.
  • Redis latency under load. Every single request now requires a Redis round-trip. At 50,000+ requests/second, that latency adds up.
  • Single point of failure. If Redis goes down, your rate limiting goes with it. What's your fallback?

Option B: Distributed Rate Limiting at the Gateway

Here's an insight that saves a lot of complexity: offloading rate limiting to the edge or API gateway reduces application complexity dramatically.

Instead of implementing rate limiting in your application code, push it to:

  • AWS API Gateway - built-in throttling with per-key, per-method limits
  • Cloudflare Rate Limiting - edge-level, before traffic even hits your infrastructure
  • Kong / NGINX - configurable rate limiting plugins
  • Envoy Proxy - supports both local and global rate limiting

Why this works better than you think:

  1. Rate limiting happens before your application processes the request. No wasted compute on requests that should be rejected.
  2. Gateway-level solutions are purpose-built for this. They handle distributed counting, clock synchronization, and failover internally.
  3. Your application code stays clean. No Redis connections for rate limiting, no Lua scripts, no counter management.

Example - NGINX rate limiting configuration:

nginx.conf
http {
    # Define a rate limiting zone
    # $binary_remote_addr = client IP (binary form, saves memory)
    # zone=api_limit:10m = 10MB shared memory zone (~160,000 IPs)
    # rate=10r/s = 10 requests per second per IP
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        location /api/ {
            # burst=20 allows 20 requests to queue
            # nodelay processes burst requests immediately
            limit_req zone=api_limit burst=20 nodelay;
            limit_req_status 429;

            proxy_pass http://backend;
        }
    }
}

Example - AWS API Gateway throttling (via CloudFormation):

cloudformation.yaml
Resources:
  ApiUsagePlan:
    Type: AWS::ApiGateway::UsagePlan
    Properties:
      Throttle:
        RateLimit: 100      # requests per second
        BurstLimit: 200     # maximum concurrent requests
      Quota:
        Limit: 10000        # requests per day
        Period: DAY

My recommendation: Use gateway-level rate limiting as your first line of defense. Add application-level limiting only for business-logic-specific rules that the gateway can't handle.


Option C: Token Bucket - The Best Practical Model for APIs

If you're building a public API with tiered rate limits (free, pro, enterprise), the token bucket algorithm is the best fit.

It handles bursts gracefully. A user who hasn't made requests in 30 seconds has accumulated tokens. They can make a quick burst of requests without hitting limits.

Implementation with Redis:

token_bucket.py
import time
import redis

def check_rate_limit(redis_client, user_id, max_tokens, refill_rate):
    """
    Token bucket rate limiter using Redis.

    max_tokens: Maximum burst size (bucket capacity)
    refill_rate: Tokens added per second
    """
    key = f"token_bucket:{user_id}"
    now = time.time()

    # Lua script for atomic token bucket
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1])
    local last_refill = tonumber(bucket[2])

    -- Initialize bucket if it doesn't exist
    if tokens == nil then
        tokens = max_tokens
        last_refill = now
    end

    -- Calculate tokens to add based on elapsed time
    local elapsed = now - last_refill
    local new_tokens = elapsed * refill_rate
    tokens = math.min(max_tokens, tokens + new_tokens)

    -- Try to consume a token
    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end

    -- Update bucket state
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)

    return {allowed, math.floor(tokens)}
    """

    result = redis_client.eval(lua_script, 1, key, max_tokens, refill_rate, now)
    allowed, remaining = result

    return {
        "allowed": bool(allowed),
        "remaining": remaining,
        "limit": max_tokens,
        "retry_after": None if allowed else 1.0 / refill_rate
    }


# Usage per tier
RATE_LIMITS = {
    "free":       {"max_tokens": 10,  "refill_rate": 1},    # 1 req/sec, burst of 10
    "pro":        {"max_tokens": 50,  "refill_rate": 10},   # 10 req/sec, burst of 50
    "enterprise": {"max_tokens": 200, "refill_rate": 100},  # 100 req/sec, burst of 200
}

Why token bucket wins for tiered APIs:

  • The max_tokens parameter directly maps to burst allowance per tier
  • The refill_rate maps to sustained request rate
  • Both are independently tunable per plan
  • Users get clear, predictable behavior

The Trade-Offs - Choosing Your Approach

Every solution has costs. Here's the honest breakdown:

Approach               | Pros                       | Cons                             | Best For
Fixed Window + Redis   | Dead simple, low memory    | Boundary burst exploit           | Internal services
Sliding Window Counter | Good accuracy, low memory  | Slightly complex, approximate    | General-purpose API limits
Token Bucket + Redis   | Burst-friendly, intuitive  | More complex implementation      | Public APIs with tiers
Gateway-Level          | No app code, handles scale | Less control over complex rules  | First line of defense
Sliding Window Log     | Extremely accurate         | High memory, expensive           | Precision-critical limits
There is no universally correct choice. The right approach depends on your traffic volume, accuracy requirements, and how much operational complexity you're willing to take on.


What Breaks at Scale - The Section Nobody Talks About

Redis CPU Spikes

Rate limiting means hitting Redis on every single request. At 50,000 requests/second, that's 50,000 Redis operations per second just for rate limiting.

Mitigation strategies:

  • Dedicated Redis instance for rate limiting. Don't share with your cache or session store.
  • Redis Cluster for horizontal sharding. Distribute keys across multiple nodes.
  • Local caching with periodic sync. Keep a local counter, sync to Redis every N requests. Trade accuracy for performance.
Hybrid rate limiter - local + Redis
# Local counter with periodic Redis sync
# Trades strict accuracy for massive performance gain
import threading
import time

class HybridRateLimiter:
    def __init__(self, redis_client, sync_interval=1.0):
        self.redis = redis_client
        self.local_counts = {}  # {key: count}
        self.sync_interval = sync_interval
        self._lock = threading.Lock()
        self._start_sync_thread()

    def _start_sync_thread(self):
        # Daemon thread pushes accumulated local counts to Redis periodically
        t = threading.Thread(target=self._sync_to_redis, daemon=True)
        t.start()

    def check(self, key, limit):
        # Check the local count first - no Redis call on the hot path.
        # Note: between syncs this only enforces a per-instance limit.
        with self._lock:
            local = self.local_counts.get(key, 0)
            if local >= limit:
                return False
            self.local_counts[key] = local + 1
            return True

    def _sync_to_redis(self):
        while True:
            time.sleep(self.sync_interval)
            with self._lock:
                counts = self.local_counts.copy()
                self.local_counts.clear()
            for key, count in counts.items():
                self.redis.incrby(f"rate:{key}", count)

Key Explosion

If you're creating a Redis key per user per time window, the math gets ugly fast.

plaintext
1 million users × 60 windows/hour = 60 million keys per hour

Mitigation:

  • Use short, binary-efficient key names. rl:u123:1700000 instead of rate_limit:user_id_123:window_1700000000.
  • Set aggressive TTLs. Don't keep keys longer than 2× your window size.
  • Monitor INFO memory and dbsize in Redis.
  • Consider hash-based grouping. Store multiple users' counts in a single Redis hash to reduce per-key overhead.
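
A sketch of the hash-grouping idea: bucket users into a bounded number of Redis hashes per window and count with HINCRBY (the bucket count, key format, and helper name are all illustrative):

```python
import zlib

WINDOW = 60  # seconds

def hash_slot(user_id, now, num_buckets=1024):
    """Map a user to a (hash key, field) pair - many users share one hash."""
    window_start = int(now) - (int(now) % WINDOW)
    # crc32 is stable across processes, unlike Python's built-in hash()
    bucket = zlib.crc32(user_id.encode()) % num_buckets
    key = f"rl:h:{bucket}:{window_start}"  # one hash per bucket per window
    field = user_id                        # one field per user
    return key, field

# Usage (assumes a redis-py client named redis_client):
# key, field = hash_slot("user_123", time.time())
# count = redis_client.hincrby(key, field, 1)
# redis_client.expire(key, WINDOW * 2)
```

Instead of millions of tiny keys with individual TTLs, you get at most num_buckets keys per window, each expiring as a unit.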

Clock Drift Across Systems

If server A thinks it's 12:00:00.000 and server B thinks it's 11:59:59.700, their window calculations will disagree.

Mitigation:

  • Use the Redis server's clock, not the application server's. Let Redis compute timestamps via Lua scripts with redis.call('TIME').
  • Use larger windows. A 300ms drift matters for a 1-second window. It's irrelevant for a 60-second window.
  • Run NTP with tight sync intervals across all servers.
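
The first mitigation can be sketched as a Lua script that computes the window from the Redis server's clock, so every instance agrees on the window regardless of its own clock (one caveat: building the key inside the script is fine on a single Redis node but bypasses slot routing on Redis Cluster):

```python
# Fixed-window limiter keyed by the Redis server's clock, not the app's.
FIXED_WINDOW_BY_REDIS_TIME = """
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

-- TIME returns {seconds, microseconds} from the Redis server's clock
local t = redis.call('TIME')
local now = tonumber(t[1])
local window_start = now - (now % window)
local key = KEYS[1] .. ':' .. window_start

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window * 2)
end
if count > limit then
    return 0
end
return 1
"""

# Usage (redis-py):
# allowed = redis_client.eval(FIXED_WINDOW_BY_REDIS_TIME, 1,
#                             f"rate_limit:{user_id}", LIMIT, 60)
```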

Network Latency Affecting Limits

Every rate limit check requires a network round-trip to Redis. Under normal conditions, this is 0.5-2ms. During network congestion or Redis failover, it can spike to 50-200ms.

The cascading failure scenario:

  1. Redis latency increases to 100ms
  2. Request threads block waiting for rate limit checks
  3. Thread pool exhausts
  4. Application starts rejecting all requests - not because of rate limits, but because it can't check them

Mitigation:

  • Set aggressive timeouts on Redis connections (5-10ms).
  • Fail open, not closed. If Redis is unreachable, allow the request. Rate limiting is a protection mechanism, not a gatekeeper.
  • Circuit breaker pattern. After N consecutive Redis failures, stop trying for M seconds.
Fail-open rate limiter with circuit breaker
# Fail-open rate limiter with circuit breaker
import time
import redis

class ResilientRateLimiter:
    def __init__(self, redis_client, failure_threshold=5, recovery_time=30):
        self.redis = redis_client
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.circuit_open_until = 0

    def check(self, key, limit):
        # Circuit is open - skip Redis, allow the request
        if time.time() < self.circuit_open_until:
            return True  # fail open

        try:
            # _check_redis is your actual Redis-backed check
            # (e.g., the Lua script from earlier in this article)
            result = self._check_redis(key, limit)
            self.failure_count = 0  # reset on success
            return result
        except (redis.ConnectionError, redis.TimeoutError):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.circuit_open_until = time.time() + self.recovery_time
            return True  # fail open

This is a critical design decision. Failing open feels wrong - you're letting unlimited traffic through. But the alternative is failing closed, which means a Redis hiccup takes down your entire service. In almost every case, failing open is the right call.


The Recommended Production Architecture

Layer 1: Edge / Gateway - Broad Protection

Use your CDN or API gateway for the first line of defense.

What it handles:

  • Per-IP rate limiting (e.g., 1000 requests/minute per IP)
  • Geographic blocking
  • Basic bot detection
  • DDoS mitigation

Tools: Cloudflare, AWS WAF, NGINX, or your load balancer's built-in throttling.

Layer 2: Application - Business Logic Limits

Use Redis-backed rate limiting in your application for fine-grained control.

What it handles:

  • Per-user limits based on subscription tier
  • Per-endpoint limits (e.g., stricter limits on write operations)
  • Per-organization limits for B2B APIs

Layer 3: Fallback - Graceful Degradation

python
if redis_available:
    enforce_rate_limit()
elif local_cache_available:
    enforce_approximate_limit()  # local counters
else:
    allow_request()  # fail open
    log_warning("Rate limiting disabled - Redis unreachable")
    increment_metric("rate_limit.fallback")

Critical insight: Always have a fallback strategy. "Redis never goes down" is not a strategy.

Architecture Diagram

plaintext
Client Request
      │
      ▼
┌─────────────────────┐
│  CDN / Edge Layer    │  ← Layer 1: IP-based limits, DDoS protection
│  (Cloudflare / AWS)  │     Drops ~90% of abusive traffic
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  API Gateway /       │  ← Optional: Additional throttling
│  Load Balancer       │     Route-level rate limits
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐       ┌──────────────┐
│  Application Server  │──────▶│  Redis       │  ← Layer 2: User/tier limits
│  (Rate Limit Check)  │◀──────│  (Dedicated) │     Token bucket / sliding window
└─────────┬───────────┘       └──────────────┘
          │
          │  Redis down?
          │  ──────────▶ Fail open + alert    ← Layer 3: Fallback
          ▼
┌─────────────────────┐
│  Application Logic   │
│  (Process Request)   │
└─────────────────────┘

Advanced Considerations

Per-User vs. Per-IP Limiting

Per-IP is simple but breaks behind NATs, corporate proxies, and VPNs. Thousands of legitimate users can share a single IP.

Per-user (authenticated) is more accurate but doesn't protect against unauthenticated abuse.

The right answer is both. Per-IP at the gateway for unauthenticated traffic. Per-user in the application for authenticated requests.

Tier-Based Rate Limits

If you're running a SaaS with free and paid tiers, your rate limits are a product feature, not just an infrastructure concern.

Rate limits as product config
# Rate limit configuration as product feature
TIER_LIMITS = {
    "free": {
        "requests_per_minute": 60,
        "burst": 10,
        "daily_quota": 1000,
    },
    "pro": {
        "requests_per_minute": 600,
        "burst": 100,
        "daily_quota": 50000,
    },
    "enterprise": {
        "requests_per_minute": 6000,
        "burst": 1000,
        "daily_quota": None,  # unlimited
    },
}
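
Those product-level numbers map directly onto token-bucket parameters - burst becomes bucket capacity, requests_per_minute becomes the refill rate. A sketch of that mapping (the helper and the trimmed config are illustrative):

```python
# Map product-tier config onto token-bucket parameters
TIERS = {
    "free": {"requests_per_minute": 60,  "burst": 10},
    "pro":  {"requests_per_minute": 600, "burst": 100},
}

def bucket_params(tier):
    """Resolve a tier name to token-bucket settings (unknown tiers -> free)."""
    cfg = TIERS.get(tier, TIERS["free"])
    return {
        "max_tokens": cfg["burst"],                        # burst allowance
        "refill_rate": cfg["requests_per_minute"] / 60.0,  # sustained rate
    }
```

Keeping this mapping in one place means a pricing change is a config edit, not a rate-limiter rewrite.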

Important: Make rate limits visible to users. Include headers in every response:

http
HTTP/1.1 200 OK
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 542
X-RateLimit-Reset: 1700000060
Retry-After: 30           # Only on 429 responses

The 429 status code comes from IETF RFC 6585; the X-RateLimit-* headers are a widely used convention that the IETF draft-ietf-httpapi-ratelimit-headers specification is standardizing. Developers expect them. Not including them turns every rate limit hit into a debugging session for your users.


Common Mistakes to Avoid

1. Not setting rate limit headers on responses. Your users shouldn't have to guess their limits. Always return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.

2. Failing closed when Redis is down. A Redis outage should not take down your service. Fail open, log it, alert on it, and fix Redis.

3. Using the same Redis instance for rate limiting and caching. A cache stampede can starve your rate limiter. Dedicated instances for dedicated purposes.

4. Not rate limiting internal services. "It's internal" doesn't mean it can't overwhelm downstream dependencies. A buggy service can DDoS your database through an unprotected internal API.

5. Ignoring the cost of rate limiting itself. Every rate limit check is a Redis call. At scale, this adds latency to every request. Benchmark the overhead and optimize accordingly.

6. Hardcoding rate limits. Limits change. Tiers change. Make limits configurable - ideally through a config service or feature flags, not deployment.


Summary: What You Should Actually Do

If you're just starting out: Use Redis with the atomic INCR + Lua script approach. Sliding window counter algorithm. It's simple, well-understood, and handles most traffic patterns well.

If you're handling significant traffic: Add gateway-level rate limiting as your first layer. Use Cloudflare, AWS API Gateway, or NGINX to handle the broad strokes. Keep Redis-based limiting for fine-grained, business-logic-specific rules.

If you're building a public API: Implement token bucket for tiered rate limits. Always include rate limit headers. Document your limits clearly. Provide a way for users to check their current usage.

Regardless of your scale:

  • Always fail open when your rate limiting infrastructure is unavailable
  • Always monitor rate limit hits, Redis latency, and fallback activations
  • Always plan for Redis failure - it will happen
  • Layer your defenses - no single rate limiting strategy handles everything

Rate limiting isn't a solved problem you implement once and forget. It's an evolving system that needs monitoring, tuning, and adaptation as your traffic patterns change. Build it with that mindset, and it'll actually work when you need it to.


Written by

Raunak Gupta

DevOps engineer and technical writer with experience in cloud infrastructure, CI/CD pipelines, and system design. Passionate about making complex engineering topics accessible through clear, practical writing backed by real production experience.
