Root Causes and Troubleshooting of Redis Timeout Exceptions
This article analyzes why Redis service nodes may experience massive TimeoutException errors. It covers external influences such as CPU and memory contention and network resource exhaustion, as well as internal Redis usage issues like slow queries, persistence overhead, and configuration pitfalls, and it provides concrete diagnostic commands and mitigation steps.
An alert email reported a large number of Redis service nodes timing out, prompting an investigation that revealed extensive TimeoutException errors caused by overloaded connections exceeding Redis capacity.
1. External factors affecting Redis service nodes
Redis runs on physical servers that share CPU, memory, and network resources with other applications, leading to resource competition.
1.1 CPU resource competition
Redis is CPU‑intensive; co‑located CPU‑heavy workloads can degrade its performance, especially when those workloads have unstable CPU usage.
Generally avoid mixing Redis with other service types on the same host.
Even same‑type Redis instances should be isolated per upstream application.
Binding Redis to specific CPUs can reduce context‑switch overhead, but when persistence (AOF/RDB) forks a child process, the child shares the same CPU, potentially causing severe instability.
1.2 Memory pressure and swapping
When Redis memory is swapped to disk, latency spikes dramatically. A mem_fragmentation_ratio below 1 in the info memory output can indicate that part of Redis memory has been swapped out.
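As a rough sketch of the fragmentation check described above, the helpers below read info memory text on stdin and flag a ratio below 1. The function names frag_ratio and maybe_swapping are hypothetical, and the sketch assumes the usual field:value layout of INFO output with CRLF line endings.

```shell
# Print mem_fragmentation_ratio from `INFO memory` text read on stdin.
# INFO lines end in CRLF, so carriage returns are stripped first.
frag_ratio() {
  tr -d '\r' | awk -F: '$1 == "mem_fragmentation_ratio" { print $2 }'
}

# Exit status 0 when the given ratio is below 1, which suggests that the
# resident memory is smaller than what Redis allocated (possible swapping).
maybe_swapping() {
  awk -v r="$1" 'BEGIN { exit !(r + 0 < 1) }'
}
```

A possible use is `redis-cli -h {host} -p {port} info memory | frag_ratio`, followed by `maybe_swapping` on the printed value. A low ratio is only a hint, so it is worth confirming against the per-process swap figures.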
To inspect swap usage for a Redis process:
cat /proc/1686/smaps
Ensure swap values are 0 KB or 4 KB.
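Because smaps prints one Swap entry per memory mapping, scanning the output by eye is tedious. The following is a minimal sketch that totals those entries; total_swap_kb is a hypothetical helper name, and the input is assumed to be in the standard smaps layout.

```shell
# Sum every "Swap:" entry (reported in kB) from smaps-formatted text on stdin.
# Matching the first field exactly keeps "SwapPss:" lines out of the total.
total_swap_kb() {
  awk '$1 == "Swap:" { sum += $2 } END { print sum + 0 }'
}
```

For example, `total_swap_kb < /proc/{pid}/smaps` prints a single number in kB, where {pid} is the Redis process ID.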
Configure maxmemory so the total allocated memory for all Redis instances stays below physical RAM, and disable swap at the OS level when possible.
1.3 Network problems
Network bandwidth exhaustion, exhausted file‑descriptor limits, or a full TCP backlog can all cause connection failures.
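On Linux, the accept queue a listening socket actually gets is capped by net.core.somaxconn, so a larger Redis tcp-backlog value has no effect unless the kernel limit is at least as high. The sketch below makes that arithmetic explicit; effective_backlog is a hypothetical helper, not part of Redis or any standard tool.

```shell
# Print the accept-queue length the kernel will actually grant:
# the smaller of the requested backlog and net.core.somaxconn.
effective_backlog() {
  requested=$1
  somaxconn=$2
  if [ "$requested" -lt "$somaxconn" ]; then
    echo "$requested"
  else
    echo "$somaxconn"
  fi
}
```

One way to call it is `effective_backlog 511 "$(cat /proc/sys/net/core/somaxconn)"`. If the result is lower than the configured backlog, the kernel parameter is the limiting factor.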
Check the current file‑descriptor limit:
ulimit -n
Increase it if necessary:
ulimit -n {num}
Adjust the TCP backlog (default 511) and the kernel parameter net.core.somaxconn when under high concurrency:
echo {num} > /proc/sys/net/core/somaxconn
Detect backlog overflow with:
netstat -s | grep overflowed
Test network latency with Redis CLI:
redis-cli -h {host} -p {port} --latency
Collect historical latency data:
redis-cli -h {host} -p {port} --latency-history
Visualize latency distribution:
redis-cli -h {host} -p {port} --latency-dist
2. Redis usage issues
2.1 Slow queries
Slow queries often stem from poor key design, inappropriate data types, lack of batch operations, or large‑scale data manipulations in production.
Keep keys short yet meaningful.
Choose the right data structure (hash vs. string, set vs. zset) to avoid storing huge objects.
Use MGET or pipelines instead of many individual GET calls.
Avoid massive data operations on live systems.
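To illustrate the batching advice, here is a small sketch that turns a list of keys (one per line) into MGET commands of a fixed batch size. mget_batches is a hypothetical helper, and it assumes keys contain no whitespace or quote characters.

```shell
# Read keys from stdin, one per line, and print one MGET command
# for every n keys (the final batch may be shorter).
mget_batches() {
  awk -v n="$1" '
    { line = line " " $0 }
    NR % n == 0 { print "MGET" line; line = "" }
    END { if (line != "") print "MGET" line }
  '
}
```

Since redis-cli reads commands from standard input, something like `mget_batches 100 < keys.txt | redis-cli -h {host} -p {port}` would issue one round trip per 100 keys instead of one per key. Here keys.txt is a placeholder file name.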
2.2 Monitoring Redis health
Run the following to view key count, memory usage, client connections, blocked clients, total requests, and connections:
redis-cli -h {host} -p {port} --stat
2.3 Persistence impact
Forking for AOF/RDB persistence consumes CPU and memory. A fork should complete in under 1 second; check the latest_fork_usec value (in microseconds) reported by info stats.
With AOF fsync set to every second (appendfsync everysec), the main thread blocks when the previous fsync has still not completed after more than 2 seconds.
Transparent Huge Pages (THP) raise the copy-on-write unit from 4 KB to 2 MB pages, so writes that arrive during a fork-based persistence run copy far more memory per touched page, leading to slow queries and connection issues.
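Two of the persistence checks above can be scripted. The sketch below reads the duration of the last fork from info stats text and reports which THP mode is active. The helper names are hypothetical, and the THP parser assumes the kernel's usual format, in which the active mode is shown in square brackets.

```shell
# Print latest_fork_usec (duration of the most recent fork, in microseconds)
# from `INFO stats` text read on stdin.
latest_fork_usec() {
  tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { print $2 }'
}

# Exit status 0 when the fork took longer than the budget in microseconds
# (second argument, defaulting to 1 second).
fork_too_slow() {
  [ "$1" -gt "${2:-1000000}" ]
}

# Print the active THP mode from text such as "always madvise [never]".
thp_mode() {
  sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}
```

Possible invocations are `redis-cli -h {host} -p {port} info stats | latest_fork_usec` and `thp_mode < /sys/kernel/mm/transparent_hugepage/enabled`. If the second prints always, writing never to that file as root disables THP until the next reboot.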
Source: cnblogs.com/niejunlei/p/12900578.html