Case Study: Scaling Redis with Twemproxy and Optimizing Connection Pools
The team rescued a crashing Redis cache handling 100K–120K QPS by shortening Nginx Lua timeouts, trimming connection pools, adding four Redis nodes behind Twemproxy, and splitting hot keys to raise cardinality. These changes eliminated connection spikes, balanced load across shards, and restored stable performance.
Background
Project A collects data generated by other projects and stores it for a limited time without persistence, using a single Redis instance as a cache. During peak periods Redis QPS reached 100K, at times 120K, and eventually the instance crashed, a textbook case of Murphy's Law. The incident highlighted the need for proactive operations monitoring.
Analysis and Solution
2.1 Preliminary Analysis
The crash manifested as inability to establish new connections, severe timeouts, and data read failures. System logs showed many "kernel: Possible SYN flooding on port xxxx. Sending cookies" messages, indicating the Redis instance was consuming connections. At the time, Redis had about 7K connections and QPS of 100K. The team investigated whether Redis pipelining could help.
The application uses Nginx Lua with the lua-resty-redis client. The relevant configuration is:

```lua
red:set_keepalive(5000, 20)
```

The first argument is the maximum idle timeout in milliseconds, the second is the pool size. The total connection count can be estimated as:
connectionNum = machineNum × nginxWorkerProcess × pool_size
With four web servers, each running 18 Nginx workers, and a pool_size of 20, the theoretical connection count is 4 × 18 × 20 = 1,440, far below the observed 7K, indicating other factors were at play.
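The estimate is simple enough to check directly (function and parameter names below are illustrative, following the formula in the text):

```python
def estimate_connections(machine_num, nginx_worker_process, pool_size):
    # theoretical upper bound on pooled client connections:
    # connectionNum = machineNum x nginxWorkerProcess x pool_size
    return machine_num * nginx_worker_process * pool_size

# 4 web servers, 18 Nginx workers each, pool_size of 20
print(estimate_connections(4, 18, 20))  # 1440
```

The gap between this upper bound and the observed 7K connections is what pointed the team away from pool sizing as the root cause.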
The team considered using Redis Pipeline, but most commands are INCR (+1), so pipeline gains were limited.
2.2 Horizontal Scaling with Twemproxy
To scale out, four additional Redis instances were added and a Twemproxy (nutcracker) proxy was deployed on each web server. Twemproxy provides sharding at the proxy layer, simplifying client logic.
However, QPS remained high and the local Twemproxy degraded web server performance. Data skew was observed: some Redis nodes handled up to 80K QPS while others only ~5K.
The root cause was identified as a 60‑second timeout in Nginx Lua, leading to many lingering connections.
2.3 Problem Resolution
The team reduced the timeout to 2 seconds and adjusted the connection pool. Connections dropped to around 1K and socket usage decreased markedly.
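A sketch of what the tuned lua-resty-redis settings might look like; only the 2-second timeout comes from the text, while the host, port, idle timeout, and pool size are illustrative assumptions:

```lua
local redis = require "resty.redis"
local red = redis:new()

red:set_timeout(2000)  -- cut from 60s to 2s so stale connections die quickly

local ok, err = red:connect("127.0.0.1", 6379)  -- address illustrative
if not ok then
    ngx.log(ngx.ERR, "failed to connect to redis: ", err)
    return
end

-- ... issue commands ...

-- smaller pool: max idle timeout 5000 ms, pool size 10 (values illustrative)
red:set_keepalive(5000, 10)
```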
To address data skew, they increased key cardinality. Instead of a single key per minute, they generated 10+ keys per minute, distributing load across shards. After key splitting, QPS was balanced and web performance improved.
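One way to implement the key split is to append a random suffix on write and fan out on read. The split factor and key format below are assumptions for illustration, not necessarily the team's exact scheme:

```python
import random

SPLIT_FACTOR = 10  # assumed number of sub-keys per logical per-minute key

def write_key(base, minute):
    # pick one of SPLIT_FACTOR sub-keys at random,
    # so INCRs for the same minute spread across shards
    return "%s:%s:%d" % (base, minute, random.randrange(SPLIT_FACTOR))

def read_keys(base, minute):
    # reading the total means summing the counters of every sub-key
    return ["%s:%s:%d" % (base, minute, i) for i in range(SPLIT_FACTOR)]
```

Because Twemproxy routes by key name, the ten distinct sub-key names hash to different points on the ring, so a single hot minute no longer concentrates its load on one shard.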
Principle Discussion
The author briefly analyzes Twemproxy’s consistent‑hash function. Twemproxy supports several hash algorithms (fnv1a_64, etc.) and distribution modes (ketama, modula, random). In the production setup, fnv1a_64 with ketama is used.
A Python sketch of the fnv1a_64 hash (the constants and function wrapper are added here for completeness; the modulo by UINT32_MAX is as in the original snippet):

```python
FNV_64_INIT = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV_64_PRIME = 0x100000001b3      # standard 64-bit FNV prime
UINT32_MAX = 2 ** 32 - 1

def fnv1a_64(s):
    hval = FNV_64_INIT
    for c in s:
        hval = hval ^ ord(c)
        hval = (hval * FNV_64_PRIME) % UINT32_MAX
    return hval
```

The more often a key name repeats, the more traffic lands on the single shard its hash maps to, causing skew. Ketama lookup has O(log N) complexity, while modula is O(1).
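To make the complexity comparison concrete, a consistent-hash ring can be sketched as a sorted list of points searched with binary search. This is an illustration of the ketama idea, not Twemproxy's exact implementation:

```python
import bisect
import hashlib

class Ring:
    def __init__(self, nodes, vnodes=40):
        # place vnodes points per node on a 32-bit circle
        self.ring = sorted(
            (self._point("%s-%d" % (node, i)), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _point(s):
        # first 8 hex digits of md5 -> a 32-bit point on the circle
        return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

    def get(self, key):
        # binary search over the sorted points: O(log N) per lookup,
        # wrapping around the circle when the key hashes past the last point
        idx = bisect.bisect(self.points, self._point(key)) % len(self.points)
        return self.ring[idx][1]
```

A modula distribution, by contrast, is just `hash(key) % num_nodes`: O(1) per lookup, but it remaps almost every key when the node count changes, whereas ketama only moves the keys adjacent to the added or removed node.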
Case Summary
This case illustrates three key take‑aways: systematic problem‑identification, Redis bottleneck mitigation, and scaling‑out analysis. When encountering performance limits, consider code and server optimizations, product selection, data flow redesign, and appropriate scaling strategy (scale‑out vs. scale‑up).
37 Interactive Technology Team
37 Interactive Technology Center