Root Cause Analysis of Redis Timeout in a Spring Cloud Service Using Lettuce and Netty
A Docker image upgrade reduced the number of Netty EventLoop threads. A Pub/Sub listener's blocking Future.get() then stalled one of those threads, leaving the receive buffer of a Redis cluster connection unread and triggering widespread Redis timeouts in the custom Lettuce-based cache framework. The timeouts were eliminated by increasing the I/O thread count or by making the callback asynchronous.
In a Spring Cloud micro‑service used by iQIYI overseas, Redis is heavily used for caching, message queues and distributed locks. After upgrading the Docker image of the service, a large number of Redis timeout errors appeared, while rolling back to the previous image eliminated the problem.
The service accesses Redis in two ways: directly via Spring's RedisTemplate and indirectly through a custom cache framework that relies on Lettuce. The timeout only occurs when the custom framework accesses a Redis cluster; the same framework works fine with a single‑node Redis, and RedisTemplate never times out.
Investigation revealed that one of the six TCP connections to the Redis cluster had a persistently non-empty receive buffer, indicating that data arrived at the socket but was never consumed. Thread-dump analysis with Alibaba Arthas showed that a Netty EventLoop thread (named epollEventLoop-9-3) was in TIMED_WAITING because a Pub/Sub listener executed a blocking Future.get() call.
Lettuce uses Netty’s NIO model: a limited number of EventLoop threads manage many connections. If an EventLoop thread is blocked, all connections registered to it cannot process I/O events, leading to the observed buffer buildup and request timeouts.
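The effect can be reproduced without Redis at all. The sketch below, using only the JDK, models an EventLoop as a single-threaded executor serving two "connections": once connection A's handler blocks, connection B's work sits in the queue and times out, just as the stalled Redis connection did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EventLoopBlockDemo {
    public static void main(String[] args) throws Exception {
        // One "EventLoop": a single thread multiplexing several connections.
        ExecutorService loop = Executors.newSingleThreadExecutor();

        // Connection A's handler blocks, like Future.get() in a Pub/Sub listener.
        CountDownLatch release = new CountDownLatch(1);
        loop.submit(() -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        });

        // Connection B's I/O event is queued behind A and cannot be processed.
        Future<String> b = loop.submit(() -> "pong");
        try {
            b.get(200, TimeUnit.MILLISECONDS);
            System.out.println("B completed");
        } catch (TimeoutException e) {
            System.out.println("B timed out: event loop blocked by A");
        }

        // Unblock A; B's event is finally handled.
        release.countDown();
        System.out.println(b.get());
        loop.shutdown();
    }
}
```

Note that B's task is perfectly fast on its own; it times out purely because it shares a thread with A, which is exactly the failure mode of sharing one EventLoop between the Pub/Sub and data connections.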
The upgraded image created only three EventLoop threads, so the Pub/Sub connection and a data-routing connection were assigned to the same thread. The older image created 13 EventLoop threads (its older JDK mis-detected the host's CPU count inside the container), so the two connections landed on different threads and the problem never manifested.
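Since Netty sizes its default EventLoop group from the reported processor count, the discrepancy is easy to probe. This minimal check prints what the JVM sees; pre-container-aware JDKs (before 8u191 / JDK 10's UseContainerSupport) report the host's cores rather than the container's cgroup limit, which is how the old image ended up with far more EventLoop threads.

```java
public class CpuCountProbe {
    public static void main(String[] args) {
        // Container-aware JDKs return the cgroup CPU limit here;
        // older JDKs return the physical host's core count instead.
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

On container-aware JDKs the value can also be pinned explicitly with the -XX:ActiveProcessorCount flag, which makes thread-pool sizing independent of the deployment host.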
Two remediation approaches were proposed:
Increase the number of Netty I/O threads (e.g., via ClientResources or the io.netty.eventLoopThreads system property). After this adjustment, the timeouts disappeared.
Make the Pub/Sub callback asynchronous, moving the blocking logic out of the Netty EventLoop thread. This is the approach taken by Spring‑Data‑Redis.
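Both remediations can be sketched together against the Lettuce API. This is an illustrative configuration, not the article's exact code: the cluster URI, channel name, and handleInvalidation helper are hypothetical, and the blocking work is offloaded to a dedicated executor so the listener returns to the EventLoop immediately.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;
import io.lettuce.core.pubsub.RedisPubSubAdapter;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class PubSubRemediation {
    public static void main(String[] args) {
        // Remediation 1: give Netty more I/O threads so the Pub/Sub and
        // data connections land on different EventLoops
        // (alternatively: -Dio.netty.eventLoopThreads=8 on the JVM command line).
        ClientResources resources = DefaultClientResources.builder()
                .ioThreadPoolSize(8)
                .build();
        RedisClusterClient client = RedisClusterClient.create(
                resources, "redis://redis-cluster:6379"); // hypothetical URI

        // Remediation 2: keep the callback non-blocking by handing slow work
        // to a separate executor instead of running it on the EventLoop thread.
        ExecutorService callbackPool = Executors.newFixedThreadPool(4);
        StatefulRedisClusterPubSubConnection<String, String> pubSub =
                client.connectPubSub();
        pubSub.addListener(new RedisPubSubAdapter<String, String>() {
            @Override
            public void message(String channel, String message) {
                // Returns immediately; the EventLoop stays free for I/O.
                callbackPool.submit(() -> handleInvalidation(message));
            }
        });
        pubSub.sync().subscribe("cache-invalidation"); // hypothetical channel
    }

    private static void handleInvalidation(String message) {
        // Blocking calls (e.g., Future.get() on another Redis command) are
        // safe here because this runs on callbackPool, not on an EventLoop.
    }
}
```

The second approach mirrors what Spring Data Redis does internally: its message listener container dispatches notifications on its own task executor rather than on the connection's I/O thread.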
Additional insights include:
Spring‑Data‑Redis creates its own RedisClusterClient with a separate EventLoop group, so its connections are unaffected by the custom framework’s blocked thread.
When accessing a single‑node Redis, Lettuce creates only one connection plus a Pub/Sub connection, avoiding the contention seen with cluster mode.
The analysis highlights the importance of not performing blocking operations on Netty EventLoop threads, understanding EventLoop allocation, and being aware of JDK version differences that affect CPU core detection in container environments.
References: Lettuce documentation, Spring Data Redis Pub/Sub guide, Netty learning resources, JetCache Redis‑Lettuce guide, Arthas thread diagnostics, and Oracle’s blog on Docker‑aware JDK behavior.
iQIYI Technical Product Team