Understanding and Solving BigKey and HotKey Issues in Redis Clusters
BigKey and HotKey problems are common in Redis clusters and can degrade performance, cause timeouts and network congestion, and even trigger system-wide failures. This article explains their definitions, impacts, and detection methods, along with practical mitigation strategies (key splitting, local caching, and migration optimizations) drawn from real-world production cases.
Problem Severity
BigKey and HotKey are frequent issues in Redis clusters that not only reduce service performance but also affect user experience, potentially leading to large‑scale service outages, economic loss, and brand damage. Both developers and DBAs must actively prevent and mitigate these problems.
What Are BigKey and HotKey?
BigKey
A BigKey refers to a Redis key whose stored value occupies an excessively large amount of memory. Typical thresholds are:
String type: value larger than 1 MB.
Non‑string types (hash, list, set, zset, etc.): element count exceeding 2 000.
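As a rough illustration, the thresholds above can be expressed as a small check. This is a minimal sketch: the function name and constants are hypothetical, and the limits are simply the typical values quoted above rather than anything enforced by Redis.

```python
# Thresholds quoted in the text above; tune them for your own workload.
STRING_MAX_BYTES = 1 * 1024 * 1024   # 1 MB for string values
COLLECTION_MAX_ELEMENTS = 2000       # element count for hash/list/set/zset


def is_big_key(key_type: str, size: int) -> bool:
    """Return True if a key exceeds the threshold for its type.

    `size` is the value length in bytes for strings and the element
    count for every other type.
    """
    if key_type == "string":
        return size > STRING_MAX_BYTES
    return size > COLLECTION_MAX_ELEMENTS
```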
HotKey
A HotKey is a key that receives a disproportionately high number of requests within a short period, causing a single shard to become a performance bottleneck. Examples include a key receiving 7 000 QPS out of a total 10 000 QPS on a shard, or a large hash that is repeatedly accessed with HGETALL.
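One way to make "disproportionately high" concrete is to measure each key's share of the requests seen on a shard within a sampling window. The sketch below assumes the accessed key names have already been collected (for example by client-side instrumentation); the function name and the 50% threshold are illustrative assumptions, not part of the original article.

```python
from collections import Counter


def find_hot_keys(accessed_keys, share_threshold=0.5):
    """Return {key: share} for keys whose share of requests in the
    sample is at least `share_threshold` (a value between 0 and 1)."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        key: count / total
        for key, count in counts.items()
        if count / total >= share_threshold
    }
```

With a sample of 10 000 requests in which one key appears 7 000 times, that key is reported with a share of about 0.7, matching the example above.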
Impact of BigKey and HotKey
BigKey Issues
1. Massive request timeouts: because Redis is single-threaded, a slow BigKey response blocks subsequent commands.
2. Bandwidth congestion: large values consume significant network resources.
3. Memory overflow or processing blockage: large keys can cause memory exhaustion, long deletion times, and master-slave sync anomalies.
HotKey Issues
1. Shard service paralysis: the overloaded shard may become unresponsive.
2. Diminished cluster advantages: uneven request distribution weakens Redis's distributed nature.
3. Potential financial loss: in extreme cases, delayed processing can cause order-related losses.
4. Cache breakdown: excessive Redis load forces fallback to the database, risking a system-wide avalanche.
5. High CPU usage: the hot shard monopolizes CPU, affecting other shards.
How to Detect BigKey and HotKey
Business‑driven analysis
Examine use‑case scenarios (e.g., unbounded shopping‑cart keys, massive activity‑qualification lists) to anticipate large or hot keys.
Redis commands
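Redis 4.0+ provides the --bigkeys and --hotkeys options in redis-cli. Note that --hotkeys only works when maxmemory-policy is set to an LFU policy (allkeys-lfu or volatile-lfu). Example usage:
redis-cli -a <password> --bigkeys
redis-cli -a <password> --hotkeys
Tools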
Visual clients (e.g., Another Redis Desktop Manager), open-source utilities such as redis-rdb-tools, and internal platforms (e.g., DaaS) can help locate large keys.
RDB file analysis
Export the RDB snapshot and analyze it with tools like rdb‑tools to enumerate oversized keys.
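Where the built-in scans or RDB analysis are not convenient, a SCAN-based pass from application code is another option. The sketch below is an assumption-laden illustration rather than a recommended tool: `scan_big_keys` is a hypothetical helper, and `client` is assumed to behave like a redis-py `redis.Redis(decode_responses=True)` instance, so that `type()` returns plain strings. The thresholds are the typical values quoted earlier.

```python
def scan_big_keys(client, string_max_bytes=1024 * 1024,
                  max_elements=2000, batch=1000):
    """Yield (key, type, size) for keys above the thresholds.

    Size is the byte length for strings and the element count for
    hashes, lists, sets, and sorted sets.
    """
    size_commands = {
        "string": client.strlen,
        "hash": client.hlen,
        "list": client.llen,
        "set": client.scard,
        "zset": client.zcard,
    }
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(count=batch):
        key_type = client.type(key)
        size_of = size_commands.get(key_type)
        if size_of is None:
            # Other types (e.g. streams) or keys that expired mid-scan.
            continue
        size = size_of(key)
        limit = string_max_bytes if key_type == "string" else max_elements
        if size > limit:
            yield key, key_type, size
```

Running such a scan against a replica rather than the master keeps the extra load away from production traffic.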
Mitigation Strategies
BigKey solutions
Split the large value into multiple smaller keys (e.g., fragment a huge JSON into several keys via MSET, divide a large list into list_1, list_2, …). Apply similar partitioning for hashes, sets, and sorted sets.
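The splitting idea can be sketched as two small helpers: one that routes each hash field to one of several smaller hashes, and one that cuts a long list into numbered chunks. The function names, the bucket count of 64, the chunk size of 1000, and the use of CRC32 are all illustrative assumptions; any stable hash function would do.

```python
import zlib


def bucket_key(base_key: str, field: str, buckets: int = 64) -> str:
    """Map a hash field to one of `buckets` smaller hashes.

    The same field always maps to the same bucket, so reads and
    writes for that field agree on which key to use.
    """
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"


def chunk_list(base_key: str, items, chunk_size: int = 1000):
    """Yield (key, chunk) pairs named base_1, base_2, ..."""
    for start in range(0, len(items), chunk_size):
        chunk_no = start // chunk_size + 1
        yield f"{base_key}_{chunk_no}", items[start:start + chunk_size]
```

A write then becomes `HSET bucket_key("user:ips", field) field value` instead of targeting one giant hash. In a cluster, the resulting keys usually land in different slots, which also spreads the load across shards.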
HotKey solutions
Introduce a local cache (e.g., Caffeine) on the client side to reduce hot‑key traffic to Redis. Ensure cache size remains manageable and that expiration policies stay consistent with Redis.
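The local-cache idea can be illustrated with a small in-process cache that has a bounded size and a short time-to-live. This is a simplified Python stand-in for what a library such as Caffeine provides on the JVM, not a replacement for it; the class name, the parameters, and the `loader` callable (assumed to fetch the value from Redis) are hypothetical.

```python
import time


class LocalTtlCache:
    """Tiny in-process cache with a size bound and per-entry TTL."""

    def __init__(self, loader, ttl_seconds=1.0, max_entries=1000,
                 clock=time.monotonic):
        self._loader = loader        # called on a miss, e.g. a Redis GET
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh local hit: Redis is not touched
        value = self._loader(key)
        if key not in self._entries and len(self._entries) >= self._max:
            # Evict the oldest inserted entry (dicts keep insertion order).
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = (now + self._ttl, value)
        return value
```

Keeping the TTL short limits how long a client can serve a value that has since changed in Redis, which is the consistency concern mentioned above.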
Production Cases (vivo team)
The vivo database team operates more than 45,000 Redis instances across more than 2,200 clusters. A full-network BigKey scan would take years, so they rely on targeted analysis.
Typical BigKey sources include:
Statistics keys that continuously record user IPs.
Cache‑aside patterns that serialize massive datasets into a single key.
Queue usage where unconsumed items accumulate.
Case studies:
Timeout blocking caused by a hash with >12 million fields; after splitting the hash by date or secondary hash, latency spikes disappeared.
Network congestion from a 10 MB BigKey accessed 100 times per second, overwhelming a 1 Gbps NIC.
Migration failures during horizontal scaling because a single BigKey prolonged the MIGRATE command, exceeding the migrate timeout and triggering master‑slave failover.
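The network congestion case is easy to verify with back-of-the-envelope arithmetic. The helper below is a hypothetical illustration that takes 1 MB as 10^6 bytes and ignores protocol overhead.

```python
def required_gbps(value_mb: float, reads_per_second: float) -> float:
    """Outbound bandwidth in Gbit/s needed to serve a value of
    `value_mb` megabytes `reads_per_second` times per second."""
    bits_per_second = value_mb * 1_000_000 * 8 * reads_per_second
    return bits_per_second / 1_000_000_000


print(required_gbps(10, 100))  # 8.0
```

A 10 MB value read 100 times per second therefore needs about 8 Gbps, roughly eight times what a 1 Gbps NIC can carry.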
Optimization of Detection and Analysis
To speed up cluster‑wide scans, the team runs --bigkeys on slave nodes in parallel (up to 10 concurrent analyses), limits results to the top‑50 keys per type, and adds pause/resume capabilities.
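The parallel scanning approach can be sketched as follows. This is not the team's implementation: `top_keys_per_type` and the `scan_node` callable (assumed to return `(key, key_type, size)` tuples for one slave node) are hypothetical, and pause/resume is left out for brevity.

```python
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def top_keys_per_type(nodes, scan_node, max_workers=10, top_n=50):
    """Scan nodes concurrently and keep the largest keys per type.

    Returns {key_type: [(size, key), ...]} with at most `top_n`
    entries per type, sorted from largest to smallest.
    """
    by_type = defaultdict(list)
    # At most `max_workers` node scans run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for results in pool.map(scan_node, nodes):
            for key, key_type, size in results:
                by_type[key_type].append((size, key))
    return {
        key_type: heapq.nlargest(top_n, rows)
        for key_type, rows in by_type.items()
    }
```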
Horizontal Scaling Migration Improvements
Adjustments include extending cluster-node-timeout to 15 minutes, fixing migrate timeout to 10 seconds with three retries spaced 30 seconds apart, and enhancing logs to record node, slot, and key information for rapid troubleshooting.
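The retry policy described above can be sketched as a small wrapper. Everything here is an assumption for illustration: `migrate_once` is a hypothetical callable that performs one MIGRATE attempt (with its own 10-second timeout) and raises `TimeoutError` on failure, and "three retries" is read as three additional attempts after the first one.

```python
import time


def migrate_with_retry(migrate_once, retries=3, wait_seconds=30,
                       sleep=time.sleep):
    """Run migrate_once(); on TimeoutError, wait and retry.

    Makes up to 1 + `retries` attempts in total and re-raises the
    last TimeoutError if every attempt fails.
    """
    last_error = None
    for attempt in range(1 + retries):
        try:
            return migrate_once()
        except TimeoutError as exc:
            last_error = exc
            if attempt < retries:
                sleep(wait_seconds)  # pause before the next attempt
    raise last_error
```

Logging the node, slot, and key inside `migrate_once` on each failure gives the troubleshooting detail the team describes.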
Summary
Prevent BigKey and HotKey at the source, educate developers, and establish robust detection mechanisms. Apply key‑splitting, local caching, and refined migration parameters to maintain Redis performance and reliability in large‑scale production environments.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn