Rate Limiting Strategies for API Services: Design, Implementation, and Load Shedding
This article explains why availability and reliability are critical for web APIs, outlines four common rate‑limiting techniques used at Stripe, describes how to choose and implement request, concurrent, usage‑based, and worker‑utilization limiters, and provides practical guidance for safely deploying them in production.
Rate Limiters and Load Shedding
Rate limiters control the rate of traffic sent or received over a network. They are appropriate when users can tolerate a slower request pace without affecting the outcome of their calls. In real‑time communication scenarios where inserting delays is impossible, other strategies are required.
Load shedding is the practice of deliberately dropping low‑priority requests to ensure critical ones are processed, based on overall system state rather than per‑user behavior.
Using Different Types of Rate Limiters
Stripe employs four distinct rate‑limiting mechanisms in production.
Request Rate Limiter
This limiter caps the number of requests each user may send per second (N requests/second). It has rejected millions of requests, especially accidental test scripts, and behaves identically in test and production modes to preserve developer experience. The limiter also adapts to traffic spikes such as flash sales.
Concurrent Request Rate Limiter
Unlike the request limiter, this limiter caps the maximum number of concurrent requests. It helps prevent resource contention when API endpoints depend on external services and users retry frequently. Although triggered rarely (about 12,000 times this month), it effectively controls CPU‑intensive endpoints.
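A concurrency cap like this can be sketched with a counting semaphore that rejects rather than queues when full. The class and the limit of 2 below are illustrative, not Stripe's implementation:

```python
import threading
from contextlib import contextmanager

class ConcurrencyLimiter:
    """Caps the number of in-flight requests per resource."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    @contextmanager
    def acquire(self):
        # Non-blocking acquire: when the limit is reached we reject
        # immediately instead of making the caller wait.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("concurrent request limit exceeded")
        try:
            yield
        finally:
            self._sem.release()

limiter = ConcurrencyLimiter(max_concurrent=2)

def handle_request():
    with limiter.acquire():
        ...  # call the slow external dependency here
```

In a real service the rejection would surface as an error response to the client rather than an exception.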
Usage‑Based Load Shedding
Traffic is divided into critical API requests (e.g., order creation) and non‑critical requests (e.g., listing historical orders). A Redis cluster tracks the current count of each type. A reserve of 20 % is kept for critical traffic; when non‑critical usage exceeds 80 % of capacity, those requests are rejected with HTTP 503.
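The reservation logic can be sketched as follows. The in-memory counters stand in for the Redis cluster the article describes, and the capacity of 100 slots is a hypothetical number:

```python
CAPACITY = 100            # total request slots (hypothetical)
CRITICAL_RESERVE = 0.20   # 20% held back for critical traffic

# In production these counts would live in Redis, shared across hosts.
counts = {"critical": 0, "non_critical": 0}

def admit(request_type: str) -> bool:
    """Return True if the request may proceed, False for an HTTP 503."""
    in_flight = counts["critical"] + counts["non_critical"]
    if request_type == "non_critical":
        # Non-critical traffic may only occupy 80% of total capacity.
        if in_flight >= CAPACITY * (1 - CRITICAL_RESERVE):
            return False
    elif in_flight >= CAPACITY:
        return False
    counts[request_type] += 1
    return True
```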
Worker‑Utilization Based Load Shedding
Most API services run a pool of worker threads or coroutines. Traffic is classified into four categories: critical API requests, HTTP POST, HTTP GET, and test requests. The system monitors available workers; if a worker is overloaded, non‑critical traffic (starting with test requests) is gradually shed. When capacity recovers, traffic is slowly restored to avoid avalanche effects.
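One way to express the shedding order is a priority ladder keyed on worker utilization. The specific thresholds below are illustrative assumptions, not figures from the article:

```python
# The four categories above, most important first.
PRIORITY = ["critical", "post", "get", "test"]

def allowed_categories(free_workers: int, total_workers: int) -> list:
    """Shed categories from the bottom as worker availability drops."""
    utilization = 1 - free_workers / total_workers
    if utilization < 0.70:
        return PRIORITY          # healthy: serve everything
    if utilization < 0.80:
        return PRIORITY[:3]      # shed test requests first
    if utilization < 0.90:
        return PRIORITY[:2]      # also shed GETs
    return PRIORITY[:1]          # overloaded: critical only
```

To avoid the avalanche effect the article warns about, a production version would also smooth the utilization signal and restore categories gradually rather than flipping thresholds instantly.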
Building Rate Limiters – Practical Guidance
Stripe implements its limiters using the token‑bucket algorithm: a central bucket holds tokens, each request consumes a token, and tokens are replenished at a steady rate. Each user has an independent bucket.
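A minimal in-memory sketch of the token-bucket scheme looks like this; Stripe's production version keeps the token counts in Redis rather than process memory, and the rates here are arbitrary:

```python
import time

class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Replenish based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1          # each request consumes one token
            return True
        return False

# One independent bucket per user, as described above.
buckets = {}

def allow_request(user_id: str, rate: float = 5, capacity: float = 10) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

The capacity parameter is what lets the bucket absorb short bursts while still enforcing the average rate.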
Redis is used to store token counts, either self‑hosted or via managed services such as AWS ElastiCache.
Insert the limiter safely into your middleware chain. Ensure that internal errors (e.g., a Redis failure) never affect request processing: catch all exceptions and fail open.
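A fail-open middleware wrapper can be sketched like this; the handler and limiter interfaces are hypothetical stand-ins for whatever framework you use:

```python
import logging

log = logging.getLogger("ratelimit")

def rate_limit_middleware(handler, limiter, user_id):
    """Fail open: a broken limiter must never block real traffic."""
    try:
        allowed = limiter.allow(user_id)
    except Exception:
        # e.g. Redis is unreachable -- log it and let the request through.
        log.exception("rate limiter unavailable; failing open")
        allowed = True
    if not allowed:
        return {"status": 429, "body": "Too many requests, please retry later."}
    return handler()
```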
Return clear error responses. Decide whether to expose HTTP 429 or HTTP 503 and provide actionable error messages.
Provide an emergency kill-switch. Add monitoring and alerting so the limiter can be disabled quickly if it misbehaves.
Deploy gradually and monitor traffic. Tune thresholds so existing user traffic patterns are not disrupted, collaborating with client developers where necessary.
Conclusion
Rate limiting is one of the most effective ways to improve an API's availability and reliability. The strategies described are not required from day one; they can be introduced incrementally as the need arises.
Start with a request rate limiter, the most essential and commonly used.
Gradually add the other three limiters to address different problem classes.
When adding new limiters, follow safe rollout practices, handle errors gracefully, keep a kill‑switch, and rely on metrics to monitor trigger frequency.
For implementation details, see the public gist linked below.
Related links:
English version of the article: https://stripe.com/blog/rate-limiters
Token bucket algorithm Wikipedia page: https://en.wikipedia.org/wiki/Token_bucket
Implementation gist: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.