Detecting and Solving Redis Hot Keys in Ten‑Million‑QPS Systems – A Complete Guide
The article analyzes how hot keys in Redis cause node overload, cache breakdown, and system avalanches in ten‑million‑QPS distributed systems, outlines four detection techniques, compares their trade‑offs, and presents five practical mitigation strategies with Go code examples and real‑world performance results.
Hot Key Impact and Causes
A hot key is a cache key whose request rate far exceeds the average, often by more than 50×, for at least 10 seconds. In a distributed system handling tens of millions of requests per second, hot keys can overload a single Redis node, trigger cache breakdown, and cascade into a system‑wide avalanche.
Node overload: CPU spikes above 95% and a 10 Gbps network link saturates; latency jumps from ~10 ms to >500 ms; the node may crash.
Cache breakdown: When the hot key expires, millions of requests flood the database, exhausting connection pools (e.g., MySQL connections rise from 100 to >1,000) and causing DB crashes.
System avalanche: Failure of one cache node forces traffic onto the remaining nodes, leading to chain overload and a service outage lasting dozens of minutes.
Root causes identified in production include traffic concentration (e.g., flash‑sale product ID), data skew (celebrity user profiles), sudden business spikes (breaking news), and overly coarse cache‑key design.
Hot Key Detection Methods
Client‑side instrumentation (low‑cost, fast rollout)
Each application instance counts accesses to keys locally and reports the counts to a monitoring platform every 10 seconds. When a configurable threshold (e.g., ≥100 000 accesses per 10 s) is exceeded, the key is flagged as hot. This method requires no changes to the Redis cluster.
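package main

import (
	"sync"
	"time"
)

// HotKeyDetector counts key accesses inside one application instance and
// reports the keys that reach the threshold in each 10-second interval.
type HotKeyDetector struct {
	keyCount   map[string]int
	mu         sync.Mutex
	threshold  int
	reportChan chan map[string]int
}

func NewHotKeyDetector(threshold int) *HotKeyDetector {
	d := &HotKeyDetector{
		keyCount:   make(map[string]int),
		threshold:  threshold,
		reportChan: make(chan map[string]int, 10),
	}
	go d.startReport()
	return d
}

// IncrKey records one access to key; call it on every cache read.
func (h *HotKeyDetector) IncrKey(key string) {
	h.mu.Lock()
	h.keyCount[key]++
	h.mu.Unlock()
}

// Reports returns the channel on which hot-key snapshots are delivered.
func (h *HotKeyDetector) Reports() <-chan map[string]int {
	return h.reportChan
}

func (h *HotKeyDetector) startReport() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		// Swap the map under a single lock. Copying under a read lock and
		// resetting under a later write lock would drop the increments
		// that arrive between the two steps.
		h.mu.Lock()
		snapshot := h.keyCount
		h.keyCount = make(map[string]int, len(snapshot))
		h.mu.Unlock()

		hot := make(map[string]int)
		for k, v := range snapshot {
			if v >= h.threshold {
				hot[k] = v
			}
		}
		if len(hot) > 0 {
			select {
			case h.reportChan <- hot:
			default: // consumer is lagging: drop this report instead of blocking
			}
		}
	}
}
Pros: low development cost, no impact on the Redis cluster, detection latency in seconds.
Cons: each client maintains its own counters, so a key that is hot only in aggregate across instances can be missed unless the monitoring platform sums the reports; counting also adds memory and lock overhead on the client.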
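The lock overhead mentioned above comes from every request taking the same mutex. One possible way to soften it, sketched here under assumptions and not taken from the original detector, is to stripe the counters across several independently locked shards chosen by key hash. The names `StripedCounter`, `IncrKey`, and `Drain` are hypothetical.

```go
package main

import (
	"hash/fnv"
	"sync"
)

// counterShard is one independently locked slice of the key space.
type counterShard struct {
	mu     sync.Mutex
	counts map[string]int
}

// StripedCounter spreads key counters over several shards so that
// concurrent increments of different keys rarely share a lock.
type StripedCounter struct {
	shards []counterShard
}

// NewStripedCounter creates a counter with n shards (at least one).
func NewStripedCounter(n int) *StripedCounter {
	if n < 1 {
		n = 1
	}
	s := &StripedCounter{shards: make([]counterShard, n)}
	for i := range s.shards {
		s.shards[i].counts = make(map[string]int)
	}
	return s
}

// shardFor picks the shard that owns key using an FNV-1a hash.
func (s *StripedCounter) shardFor(key string) *counterShard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%uint32(len(s.shards))]
}

// IncrKey records one access to key.
func (s *StripedCounter) IncrKey(key string) {
	sh := s.shardFor(key)
	sh.mu.Lock()
	sh.counts[key]++
	sh.mu.Unlock()
}

// Drain swaps out every shard's map and returns the keys whose count
// reached threshold since the previous Drain.
func (s *StripedCounter) Drain(threshold int) map[string]int {
	hot := make(map[string]int)
	for i := range s.shards {
		sh := &s.shards[i]
		sh.mu.Lock()
		old := sh.counts
		sh.counts = make(map[string]int)
		sh.mu.Unlock()
		for k, v := range old {
			if v >= threshold {
				hot[k] = v
			}
		}
	}
	return hot
}
```

A reporting loop would call `Drain` once per interval. Striping helps when many different keys are counted concurrently; a single hot key still maps to one shard, so its increments continue to contend on that shard's lock.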
Redis native HOTKEYS command (Redis 8.6+)
Redis 8.6 introduced the HOTKEYS command, which samples command frequencies server‑side and returns the top hot keys. It provides high‑precision detection without client changes.
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-node1:6379", "redis-node2:6379", "redis-node3:6379"},
	})
	ctx := context.Background()

	// Start hot‑key tracking: top 10 keys, full sampling (sample=1)
	startArgs := &redis.HotkeysStartArgs{
		Metrics: []redis.HotkeysMetric{redis.HotkeysMetricCPU, redis.HotkeysMetricNet},
		Count:   10,
		Sample:  1,
	}
	if err := rdb.HotkeysStart(ctx, startArgs).Err(); err != nil {
		log.Fatalf("start hotkey tracking failed: %v", err)
	}

	// Poll the tracked results every 5 seconds.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		result, err := rdb.HotkeysGet(ctx).Result()
		if err != nil {
			log.Printf("get hotkey failed: %v", err)
			continue
		}
		for _, key := range result.ByCPUTime {
			log.Printf("Key: %s, CPU: %d, Count: %d", key.Key, key.Value, key.Count)
		}
	}
}
Pros: high detection accuracy, zero client intrusion, works with Redis clusters, supports multiple metrics (CPU, network).
Cons: requires Redis 8.6 or newer; full sampling adds <5% CPU overhead on the server.
Proxy‑layer interception (e.g., Twemproxy, Codis)
A cache proxy can intercept all Redis commands, count key accesses with a sliding‑window algorithm, and flag hot keys in real time. This approach avoids changes to both clients and Redis servers.
package proxy

import (
	"sync"
	"time"
)

// SlidingWindow counts events over windowSize using per-interval buckets.
type SlidingWindow struct {
	windowSize time.Duration // e.g., 10s
	interval   time.Duration // e.g., 1s
	counts     map[int64]int // bucket ID -> count
	mu         sync.RWMutex
}

func NewSlidingWindow(windowSize, interval time.Duration) *SlidingWindow {
	return &SlidingWindow{windowSize: windowSize, interval: interval, counts: make(map[int64]int)}
}

// bounds returns the current bucket ID and the newest expired bucket ID.
// Dividing nanoseconds avoids a zero divisor for sub-second intervals.
func (sw *SlidingWindow) bounds() (cur, expired int64) {
	cur = time.Now().UnixNano() / int64(sw.interval)
	return cur, cur - int64(sw.windowSize/sw.interval)
}

func (sw *SlidingWindow) Incr() {
	sw.mu.Lock()
	defer sw.mu.Unlock()
	cur, expired := sw.bounds()
	sw.counts[cur]++
	// Prune lazily here instead of running one cleanup goroutine per key,
	// which would never stop and would leak at proxy scale.
	for id := range sw.counts {
		if id <= expired {
			delete(sw.counts, id)
		}
	}
}

// TotalCount sums the buckets inside the window. The comparison is strict
// so that exactly windowSize/interval buckets are counted, not one extra.
func (sw *SlidingWindow) TotalCount() int {
	sw.mu.RLock()
	defer sw.mu.RUnlock()
	_, expired := sw.bounds()
	total := 0
	for id, cnt := range sw.counts {
		if id > expired {
			total += cnt
		}
	}
	return total
}

type HotKeyProxy struct {
	hotKeyMap map[string]*SlidingWindow
	mu        sync.RWMutex
	threshold int
}

func NewHotKeyProxy(threshold int) *HotKeyProxy {
	return &HotKeyProxy{hotKeyMap: make(map[string]*SlidingWindow), threshold: threshold}
}

// HandleRequest counts one access and reports whether key is currently hot.
func (h *HotKeyProxy) HandleRequest(key string) bool {
	h.mu.RLock()
	window, ok := h.hotKeyMap[key]
	h.mu.RUnlock()
	if !ok {
		h.mu.Lock()
		// Re-check under the write lock: another goroutine may have created
		// the window already, and overwriting it would lose its counts.
		if window, ok = h.hotKeyMap[key]; !ok {
			window = NewSlidingWindow(10*time.Second, 1*time.Second)
			h.hotKeyMap[key] = window
		}
		h.mu.Unlock()
	}
	window.Incr()
	return window.TotalCount() >= h.threshold
}
Pros: no client or Redis changes, real‑time alerts at second‑level granularity.
Cons: requires deployment and maintenance of a proxy layer, which can become a new bottleneck if not clustered.
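To count keys, a proxy first has to pull the key out of each intercepted command. The sketch below shows one way that step could look for the RESP array form that clients use to send commands (`*<n>` followed by `$<len>` bulk strings). It is a simplified illustration and not how Twemproxy or Codis are implemented; the function names, the 1 MB argument cap, and the short command list in `FirstKey` are assumptions.

```go
package main

import (
	"bufio"
	"errors"
	"io"
	"strconv"
	"strings"
)

const maxArgSize = 1 << 20 // assumed cap on one argument for this sketch

// readLine reads one CRLF-terminated line and strips the terminator.
func readLine(r *bufio.Reader) (string, error) {
	s, err := r.ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(s, "\r\n"), nil
}

// ReadCommand parses one RESP array of bulk strings, e.g.
// "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n" -> ["GET", "foo"].
func ReadCommand(r *bufio.Reader) ([]string, error) {
	line, err := readLine(r)
	if err != nil {
		return nil, err
	}
	if len(line) == 0 || line[0] != '*' {
		return nil, errors.New("expected RESP array")
	}
	n, err := strconv.Atoi(line[1:])
	if err != nil || n < 0 {
		return nil, errors.New("bad array length")
	}
	args := make([]string, 0, n)
	for i := 0; i < n; i++ {
		hdr, err := readLine(r)
		if err != nil {
			return nil, err
		}
		if len(hdr) == 0 || hdr[0] != '$' {
			return nil, errors.New("expected bulk string")
		}
		size, err := strconv.Atoi(hdr[1:])
		if err != nil || size < 0 || size > maxArgSize {
			return nil, errors.New("bad bulk string length")
		}
		buf := make([]byte, size+2) // payload plus trailing CRLF
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		args = append(args, string(buf[:size]))
	}
	return args, nil
}

// ParseCommandString is a convenience wrapper for tests and examples.
func ParseCommandString(s string) ([]string, error) {
	return ReadCommand(bufio.NewReader(strings.NewReader(s)))
}

// FirstKey returns the key of a few single-key commands, where the key
// is always the first argument. Multi-key commands need per-command rules.
func FirstKey(args []string) (string, bool) {
	if len(args) < 2 {
		return "", false
	}
	switch strings.ToUpper(args[0]) {
	case "GET", "SET", "INCR", "HGET", "HGETALL", "EXPIRE":
		return args[1], true
	}
	return "", false
}
```

The key returned by `FirstKey` is what a proxy would feed into a per-key counter such as the sliding window. Commands that carry several keys (MGET, DEL) or none (PING) need their own handling, which this sketch leaves out.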
Real‑time analytics platform (Kafka → Flink/Spark)
Cache access logs (e.g., Nginx or application logs) are streamed to Kafka, then processed by Flink or Spark Streaming to aggregate per‑key request counts. Hot keys are identified when the count exceeds a preset threshold, enabling minute‑level alerts. This solution scales to billions of requests per minute.
Pros : handles massive traffic, supports multi‑dimensional analysis (IP, time window, etc.).
Cons : high architectural complexity, requires Kafka/Flink deployment, detection latency is minute‑level.
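The aggregation such a streaming job performs is a per-key count over fixed (tumbling) time windows. The sketch below reproduces that logic in plain Go on an in-memory slice so the computation is visible; a real pipeline would run it as a Flink or Spark job over a Kafka topic. The `AccessEvent` record shape and the function name are hypothetical.

```go
package main

import "sort"

// AccessEvent is one parsed cache-access log record (hypothetical shape).
type AccessEvent struct {
	UnixSec int64
	Key     string
}

// HotKeyAlert reports a key that crossed the threshold in one window.
type HotKeyAlert struct {
	WindowStart int64
	Key         string
	Count       int
}

// DetectHotKeys groups events into tumbling windows of windowSec seconds,
// counts accesses per (window, key), and returns every pair whose count
// reaches threshold, ordered by window start and then by key.
func DetectHotKeys(events []AccessEvent, windowSec int64, threshold int) []HotKeyAlert {
	if windowSec <= 0 {
		return nil
	}
	type bucket struct {
		start int64
		key   string
	}
	counts := make(map[bucket]int)
	for _, e := range events {
		start := e.UnixSec - e.UnixSec%windowSec // align to window boundary
		counts[bucket{start, e.Key}]++
	}
	var alerts []HotKeyAlert
	for b, n := range counts {
		if n >= threshold {
			alerts = append(alerts, HotKeyAlert{WindowStart: b.start, Key: b.key, Count: n})
		}
	}
	sort.Slice(alerts, func(i, j int) bool {
		if alerts[i].WindowStart != alerts[j].WindowStart {
			return alerts[i].WindowStart < alerts[j].WindowStart
		}
		return alerts[i].Key < alerts[j].Key
	})
	return alerts
}
```

With 60-second windows, an alert can only be emitted once its window has closed, which is where the minute-level detection latency of this approach comes from.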
Hot Key Mitigation Strategies
Local cache degradation (most common)
Cache hot‑key values in the application process using a zero‑GC in‑memory cache such as freecache. The local cache is consulted first; on miss the value is fetched from Redis and stored locally with a short TTL (e.g., 10 s). This reduces the request rate to the Redis node.
package main

import (
	"context"
	"sync"

	"github.com/coocood/freecache"
	"github.com/redis/go-redis/v9"
)

// LocalCacheDegrade serves flagged hot keys from an in-process cache and
// falls back to Redis on a local miss.
type LocalCacheDegrade struct {
	localCache *freecache.Cache
	rdb        *redis.ClusterClient
	hotKeys    map[string]bool
	mu         sync.RWMutex
	expireTime int // local TTL in seconds
}

func NewLocalCacheDegrade(rdb *redis.ClusterClient, expireTime int) *LocalCacheDegrade {
	lc := freecache.NewCache(1024 * 1024 * 100) // 100 MB
	return &LocalCacheDegrade{localCache: lc, rdb: rdb, hotKeys: make(map[string]bool), expireTime: expireTime}
}

// AddHotKey marks key as hot so that Get starts caching it locally.
func (l *LocalCacheDegrade) AddHotKey(key string) {
	l.mu.Lock()
	l.hotKeys[key] = true
	l.mu.Unlock()
}

func (l *LocalCacheDegrade) Get(ctx context.Context, key string) (string, error) {
	l.mu.RLock()
	isHot := l.hotKeys[key]
	l.mu.RUnlock()
	if isHot {
		// Local hit: no Redis round trip.
		if v, err := l.localCache.Get([]byte(key)); err == nil {
			return string(v), nil
		}
		// Local miss: read from Redis and keep the value for expireTime seconds.
		val, err := l.rdb.Get(ctx, key).Result()
		if err != nil {
			return "", err
		}
		l.localCache.Set([]byte(key), []byte(val), l.expireTime)
		return val, nil
	}
	return l.rdb.Get(ctx, key).Result()
}
Pros: low development cost, immediate latency reduction.
Cons: consumes memory on each client, possible lock contention under extreme concurrency.
Hot‑key sharding (split static hot keys)
A static hot key is split into multiple sub‑keys (e.g., product:10086_0 … product:10086_9) that store the same value. Requests randomly pick a sub‑key, spreading the load across several Redis slots.
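Whether the sub-keys really spread out depends on how Redis Cluster maps key names to slots: it takes CRC16 of the key modulo 16384, and if the key contains a non-empty `{...}` hash tag, only the tag is hashed. The sketch below computes slots locally so a sharding scheme can be checked before rollout. The function names are hypothetical; the `_0` … `_9` suffix scheme is the one described above.

```go
package main

import (
	"strconv"
	"strings"
)

// crc16 implements CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum Redis Cluster uses for key slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// HashSlot returns the cluster slot (0..16383) for key. If the key has a
// non-empty hash tag between the first '{' and the next '}', only the
// tag contents are hashed.
func HashSlot(key string) int {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return int(crc16([]byte(key)) % 16384)
}

// ShardKeys builds the n sub-key names for a hot key: key_0 ... key_(n-1).
func ShardKeys(key string, n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = key + "_" + strconv.Itoa(i)
	}
	return keys
}
```

Two practical consequences follow. A sub-key name such as `{product:10086}_3` would defeat the purpose, because the hash tag pins every sub-key to one slot. Different slots can also still be served by the same node, so it is worth comparing the computed slots with the cluster's slot map (or `CLUSTER KEYSLOT`) when choosing the shard count.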
package main

import (
	"context"
	"math/rand"
	"strconv"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type HotKeyShard struct {
	rdb        *redis.ClusterClient
	shardCount int
	hotKeys    map[string]bool
	mu         sync.RWMutex
}

func NewHotKeyShard(rdb *redis.ClusterClient, shardCount int) *HotKeyShard {
	return &HotKeyShard{rdb: rdb, shardCount: shardCount, hotKeys: make(map[string]bool)}
}

func (h *HotKeyShard) AddHotKey(key string) {
	h.mu.Lock()
	h.hotKeys[key] = true
	h.mu.Unlock()
}

func (h *HotKeyShard) isHot(key string) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.hotKeys[key]
}

// shardKey builds the i-th sub-key, e.g. product:10086_3.
// strconv.Itoa is required here: string(rune(i)) would produce the
// control character with code point i, not the decimal digits.
func shardKey(key string, i int) string {
	return key + "_" + strconv.Itoa(i)
}

// getShardKey returns a random sub-key for hot keys and the key itself otherwise.
func (h *HotKeyShard) getShardKey(key string) string {
	if !h.isHot(key) {
		return key
	}
	return shardKey(key, rand.Intn(h.shardCount))
}

// Set writes every sub-key for a hot key and the plain key otherwise,
// so that Get always reads a name that Set has written.
func (h *HotKeyShard) Set(ctx context.Context, key, val string, expire time.Duration) error {
	if !h.isHot(key) {
		return h.rdb.Set(ctx, key, val, expire).Err()
	}
	for i := 0; i < h.shardCount; i++ {
		if err := h.rdb.Set(ctx, shardKey(key, i), val, expire).Err(); err != nil {
			return err
		}
	}
	return nil
}

func (h *HotKeyShard) Get(ctx context.Context, key string) (string, error) {
	return h.rdb.Get(ctx, h.getShardKey(key)).Result()
}
Cache warm‑up (prevent cache breakdown)
Before a known traffic spike (e.g., flash‑sale start), hot keys are pre‑loaded into Redis and optionally into the local cache. This eliminates the initial miss surge.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coocood/freecache"
	"github.com/redis/go-redis/v9"
)

type CacheWarmup struct {
	rdb        *redis.ClusterClient
	localCache *freecache.Cache
	expireTime time.Duration
}

func NewCacheWarmup(rdb *redis.ClusterClient, lc *freecache.Cache, et time.Duration) *CacheWarmup {
	return &CacheWarmup{rdb: rdb, localCache: lc, expireTime: et}
}

// Warmup loads each hot key through getData and writes it to Redis and the
// local cache. Failures are logged and skipped so one bad key does not
// abort the whole run.
func (c *CacheWarmup) Warmup(ctx context.Context, hotKeys []string, getData func(string) (string, error)) error {
	log.Printf("starting warmup for %d hot keys", len(hotKeys))
	for _, k := range hotKeys {
		val, err := getData(k)
		if err != nil {
			log.Printf("warmup %s failed: %v", k, err)
			continue
		}
		if err := c.rdb.Set(ctx, k, val, c.expireTime).Err(); err != nil {
			log.Printf("redis set %s failed: %v", k, err)
			continue
		}
		c.localCache.Set([]byte(k), []byte(val), int(c.expireTime.Seconds()))
		log.Printf("warmup %s success", k)
	}
	log.Println("cache warmup completed")
	return nil
}
Circuit‑breaker throttling (fallback protection)
Alibaba's Sentinel flow‑control library (sentinel-golang, not Redis Sentinel) is used to limit QPS for hot‑key requests. When the threshold is exceeded, the request returns a static fallback (e.g., "System busy, try later").
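The reject-above-threshold behaviour can be illustrated without any library. The sketch below is a minimal fixed-window limiter using only the standard library, meant to show the control flow and not to replace a flow-control framework; `QPSLimiter` and `GetWithFallback` are hypothetical names, and the clock is passed in as a Unix second so the logic is easy to test.

```go
package main

import "sync"

// QPSLimiter admits at most limit requests per one-second window.
type QPSLimiter struct {
	mu     sync.Mutex
	limit  int
	second int64 // the Unix second the current window covers
	used   int   // requests admitted in that second
}

func NewQPSLimiter(limit int) *QPSLimiter {
	return &QPSLimiter{limit: limit}
}

// AllowAt reports whether a request arriving at Unix second now is
// admitted. Production callers would pass time.Now().Unix().
func (l *QPSLimiter) AllowAt(now int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if now != l.second { // a new window starts: reset the budget
		l.second = now
		l.used = 0
	}
	if l.used >= l.limit {
		return false
	}
	l.used++
	return true
}

// GetWithFallback calls load only if the limiter admits the request.
// Otherwise it returns fallback with throttled set to true, so callers
// can tell a degraded answer apart from real data.
func GetWithFallback(l *QPSLimiter, now int64, fallback string, load func() (string, error)) (val string, throttled bool, err error) {
	if !l.AllowAt(now) {
		return fallback, true, nil
	}
	val, err = load()
	return val, false, err
}
```

In this sketch `load` would wrap the Redis read. A fixed window allows short bursts at window boundaries; token-bucket or sliding-window limiters smooth that out at the cost of a little more state.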
package main

import (
	"context"
	"log"

	"github.com/alibaba/sentinel-golang/api"
	"github.com/alibaba/sentinel-golang/core/flow"
	"github.com/redis/go-redis/v9"
)

type HotKeyCircuitBreaker struct {
	rdb *redis.ClusterClient
}

func NewHotKeyCircuitBreaker(rdb *redis.ClusterClient, threshold int) *HotKeyCircuitBreaker {
	if err := api.InitDefault(); err != nil {
		log.Fatalf("sentinel init failed: %v", err)
	}
	// Rule field names are kept as in the original article; they differ
	// between sentinel-golang releases, so check them against the version in use.
	_, err := flow.LoadRules([]*flow.Rule{{
		Resource:        "hotkey_request",
		MetricType:      flow.QPS,
		Count:           float64(threshold),
		ControlBehavior: flow.Reject,
	}})
	if err != nil {
		log.Fatalf("load rule failed: %v", err)
	}
	return &HotKeyCircuitBreaker{rdb: rdb}
}

// Get reads key from Redis unless the flow rule rejects the request, in
// which case it returns the static fallback message.
func (c *HotKeyCircuitBreaker) Get(ctx context.Context, key string) (string, error) {
	entry, err := api.Entry("hotkey_request", api.WithResourceType(api.ResTypeCommon))
	if err != nil {
		log.Printf("hotkey %s throttled", key)
		return "System busy, try later", nil
	}
	defer entry.Exit()
	return c.rdb.Get(ctx, key).Result()
}
Read‑write splitting (reduce read pressure)
Read requests are routed to Redis replica nodes while writes go to the primary. Combined with local cache, this keeps read latency low even under heavy hot‑key traffic.
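The routing rule itself is small. The sketch below shows it in isolation, using only the standard library: writes always resolve to the primary, reads rotate over the replicas. `RWRouter` and the address strings are hypothetical; with go-redis in cluster mode the same effect is normally obtained through client options such as `ClusterOptions.ReadOnly` instead of hand-written routing.

```go
package main

import "sync/atomic"

// RWRouter picks a node address for each request: the primary for
// writes, and replicas in round-robin order for reads.
type RWRouter struct {
	next     uint64 // round-robin cursor, updated atomically
	primary  string
	replicas []string
}

func NewRWRouter(primary string, replicas []string) *RWRouter {
	return &RWRouter{primary: primary, replicas: replicas}
}

// ForWrite returns the primary address.
func (r *RWRouter) ForWrite() string {
	return r.primary
}

// ForRead returns the next replica address, or the primary when no
// replica is configured.
func (r *RWRouter) ForRead() string {
	if len(r.replicas) == 0 {
		return r.primary
	}
	n := atomic.AddUint64(&r.next, 1)
	return r.replicas[(n-1)%uint64(len(r.replicas))]
}
```

Because Redis replication is asynchronous, a read served by a replica can briefly return a value older than the latest write, so this pattern suits data that tolerates short staleness.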
Real‑World Cases
Case 1 – E‑commerce flash‑sale (static hot key) : product:10086 generated 800 k QPS, Redis CPU 96 %. After applying pre‑warm, 10‑shard distribution and local cache, node CPU dropped to 28 %, latency to 18 ms, cache hit rate 99.8 %.
Case 2 – Social platform celebrity profile (dynamic hot key) : user:8888 spiked to 500 k QPS. Read‑write splitting plus a 5 s local cache reduced replica CPU to 30 %, read success 99.9 %, latency 12 ms.
Case 3 – News breaking event (burst hot key) : news:9999 reached 300 k QPS. Circuit‑breaker (10 k QPS limit), 16‑shard distribution and emergency warm‑up restored the system within 10 minutes; Redis CPU stabilized at 25 %, DB CPU at 30 %.
Conclusion
In ten‑million‑QPS environments, hot keys are a primary cause of cache node overload, cache breakdown, and cascading system failures. Selecting the appropriate detection method—client instrumentation for early stages, Redis HOTKEYS for cluster‑level precision, proxy interception for zero‑intrusion, or real‑time analytics for massive scale—directly influences mitigation effectiveness.
Mitigation should focus on dispersing traffic: local cache degradation, hot‑key sharding, cache warm‑up, circuit‑breaker throttling, and read‑write splitting. Combining these techniques according to hot‑key type (static, dynamic, burst) yields high availability and sub‑20 ms latency even under extreme load.
Architecture & Thinking
🍭 Frontline tech director and chief architect at top-tier companies 🥝 Years of deep experience in internet, e‑commerce, social, and finance sectors 🌾 Committed to publishing high‑quality articles covering core technologies of leading internet firms, application architecture, and AI breakthroughs.
