Mastering Redis Cluster in Production: Real-World Practices from Vipshop
This article shares Vipshop's extensive production experience with Redis Cluster, covering use cases, storage architecture evolution, detailed best-practice guidelines, common pitfalls, operational automation, monitoring strategies, and useful open-source tools for large-scale deployments.
Outline
The presentation covers four main parts: production use cases, storage architecture evolution, application best practices, and operations experience.
1. Production Use Cases
1.1 Business Scope
Redis Cluster is used as an in‑memory storage service for backend workloads at Vipshop, supporting real‑time recommendation/ETL, risk control, and marketing systems.
It replaces a three‑layer Twemproxy architecture, simplifying the storage stack and enabling online node scaling.
Vipshop currently runs dozens of clusters with about 2,000 instances; a single cluster can have over 250 instances.
1.2 Characteristics of Big‑Data, Risk‑Control, and Marketing Systems
Clusters store tens of GB to multiple TB of data.
Data sources include:
Kafka → Redis Cluster (real‑time Storm/Spark)
Hive → Redis Cluster (MapReduce)
MySQL → Redis Cluster (Java/C++)
High read/write volume with strict performance requirements.
Peak traffic spikes increase load severalfold, requiring many Redis instances.
Frequent schema changes and rapid business requirement shifts.
Frequent scaling during major promotional events.
1.3 Why Choose Redis Cluster
1) Fits backend production scenarios
Horizontal scaling capability.
Failover and high availability.
Backend can tolerate minor data loss after failover.
2) Simpler architecture
No central component; slaves provide redundancy and are promoted to master on failure.
Replaces Twemproxy, reducing system complexity.
Saves hardware resources; eliminates a thousand‑plus physical machines previously used for LVS + Twemproxy.
Read/write latency improves from 100‑200 µs to 50‑100 µs.
Fewer bottlenecks compared to the previous three‑layer setup.
2. Storage Architecture Evolution
2.1 Evolution Timeline
In July 2014, to prepare for a major sales event, Vipshop migrated single Redis instances to Twemproxy for sharding and scaling. Later, Twemproxy's limitations and resource waste prompted a switch to Redis Cluster.
Redis Cluster was adopted after its GA, initially using version 3.0.2, then 3.0.3, and later 3.0.7.
2.2 Twemproxy Architecture
Advantages
Transparent sharding for developers; API identical to single Redis.
Can act as cache and storage proxy (auto‑eject).
Disadvantages
Complex multi‑layer architecture (LVS, Twemproxy, Redis, Sentinel, control programs).
High management and hardware costs.
Network bottlenecks (e.g., 2 × 1 Gbps NICs max ~1.4 Mpps).
Scaling limitations of Redis layer.
2.3 Redis Cluster Architecture
Advantages
Decentralized design.
Data distributed across slots on multiple instances.
Slave replicas provide standby for automatic failover.
Gossip protocol and voting enable rapid role promotion.
Supports manual failover for upgrades and migrations.
Reduces hardware and operational costs while improving scalability and availability.
Disadvantages
Client implementation complexity; requires smart client with slot mapping.
JedisCluster is the most mature Java client, but still has issues like “max redirect” exceptions.
Immature clients can affect stability and increase development difficulty.
Nodes may be mistakenly marked offline due to long‑running commands, causing unnecessary failover.
(Cluster architecture diagram not reproduced here.)
3. Application Best Practices
Cluster stability assessment.
Common pitfalls.
Development guidelines & best practices.
3.1 Stability
Clusters are very stable when not scaling.
During resharding, early Jedis versions may throw “max-redirect” errors once the redirect retry limit is reached.
Health‑check mechanism flaws can cause unnecessary failover when a master is slow or blocked.
Optimization Strategies
a) Increase the default `cluster-node-timeout` (15 s) as needed.
b) Avoid long-blocking commands (e.g., `SAVE`, `FLUSHDB`) and slow `KEYS` patterns.
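The first adjustment maps to a one-line redis.conf fragment; the 30 s value below is purely an illustrative starting point, not a recommendation from the talk:

```conf
# Give slow-but-alive masters more headroom before peers mark them as
# failing (default is 15000 ms); too small a value triggers spurious
# failovers when a master briefly blocks on a long-running command.
cluster-node-timeout 30000
```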
3.2 Common Pitfalls
1) Jedis “Max Redirect” during migration
Retry logic is needed; increase `DEFAULT_MAX_REDIRECTIONS` (default 5).
Avoid multi-key commands like `MSET`/`MGET`, which some clients don't support.
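The redirects stem from Redis Cluster's key-to-slot mapping: each key hashes to one of 16384 slots via CRC16, so a multi-key command whose keys land in different slots cannot be served by one node. A minimal sketch of the slot computation, including the hash-tag rule that lets related keys share a slot (per the cluster specification; function names here are illustrative):

```python
# Key-to-slot mapping used by Redis Cluster: CRC16-CCITT (XModem),
# polynomial 0x1021, taken modulo 16384 slots.

def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    # Hash-tag rule: if the key contains a non-empty {...} section,
    # only the content between the first '{' and the following '}'
    # is hashed, forcing related keys into the same slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

With a shared hash tag such as `{user1000}`, related keys land in one slot and multi-key commands against them succeed; without tags, an `MGET` may span slots and trigger redirects.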
2) Unnecessary failover caused by long‑blocking commands
Blockers: `SAVE`, `FLUSHALL`, `FLUSHDB`.
Slow queries: `KEYS *`, large keys, O(N) operations.
Rename dangerous commands (e.g., rename `FLUSHDB`).
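Renaming is done in redis.conf; a sketch that disables the flush commands outright and hides `KEYS` behind an obscure alias (the replacement name is just an example):

```conf
# Disable a command entirely by renaming it to the empty string
rename-command FLUSHDB ""
rename-command FLUSHALL ""
# Or keep it available to operators under a non-guessable name
rename-command KEYS "ADMIN_KEYS_a1b2c3"
```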
3) IPv4/IPv6 binding issues
Specify an IPv4 bind address (e.g., `bind 0.0.0.0`) to ensure nodes join the cluster.
4) Slow data migration
Use `redis-trib.rb reshard`; versions before 3.0.6 migrate one key at a time, while later versions support batch migration.
Only one slot can migrate at a time; use `redis-trib.rb fix` if a migration is interrupted.
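On the command line the reshard and repair steps look roughly like this (the host:port is a placeholder for any reachable cluster node):

```shell
# Interactively move slots to a target node
./redis-trib.rb reshard 10.0.0.1:6379

# If a migration was interrupted, repair open/half-migrated slots
./redis-trib.rb fix 10.0.0.1:6379
```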
5) Version selection & upgrade
Vipshop runs 3.0.7 with many 3.2.0 bug fixes back‑ported.
Testing 3.2.0 shows significant memory optimizations.
3.3 Fault‑Tolerance Practices
Implement connection retry and reconnection logic; the overall retry window should exceed `cluster-node-timeout` so that a failover can complete before the client gives up.
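A minimal retry sketch under that assumption (the function name and backoff values are illustrative; in a real deployment `attempts * delay` is sized to exceed a 15 s `cluster-node-timeout`, shortened here for readability):

```python
import time

def call_with_retry(op, attempts=4, base_delay=0.05):
    """Retry a cluster operation with exponential backoff.

    Size the total window (sum of delays) so it exceeds
    cluster-node-timeout, giving an automatic failover time
    to finish before the client surfaces an error.
    """
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise                           # out of retries
            time.sleep(base_delay * (2 ** i))   # exponential backoff
```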
3.4 Development Guidelines
Monitor slow queries, avoid hot‑keys and big‑keys.
Set reasonable TTLs to prevent mass expirations.
Follow naming conventions and avoid blocking operations or transactions.
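One common way to honor the "prevent mass expirations" guideline is to add random jitter to TTLs, so keys written in the same batch do not all expire in the same instant; a sketch (the function name and 10% jitter factor are illustrative):

```python
import random

def jittered_ttl(base_ttl: int, jitter_frac: float = 0.10) -> int:
    """Spread expirations by up to +/- jitter_frac around base_ttl,
    avoiding an expiration spike when many keys share one TTL."""
    delta = int(base_ttl * jitter_frac)
    return base_ttl + random.randint(-delta, delta)
```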
3.5 Connection Pool Tuning
Limit server‑side connections.
Configure appropriate pool size and heartbeat intervals.
Release connections promptly.
Address Jedis connection creation issues (see GitHub issue #1252).
3.6 When to Use Redis, Twemproxy, or Cluster
Redis: use pipelines and multi‑key ops for efficiency.
Twemproxy: supports pipelines and some multi‑key ops.
Redis Cluster: avoid pipelines and multi‑key ops to reduce “max‑redirect” scenarios.
3.7 Parameter Adjustments
Set `vm.overcommit_memory = 1` to avoid RDB/AOF fork failures.
Configure `timeout` > 0 for idle connection cleanup.
Increase `repl-backlog-size` to 64 MB for heavy write loads.
Adjust client output buffer limits (e.g., `client-output-buffer-limit normal 256mb 128mb 60`).
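Collected as configuration fragments, using the values cited above (the `timeout` of 300 s is an illustrative choice; treat all of these as starting points, not universal defaults):

```conf
# /etc/sysctl.conf -- let fork() succeed during BGSAVE / AOF rewrite
vm.overcommit_memory = 1
```

```conf
# redis.conf
timeout 300                      # close idle client connections
repl-backlog-size 64mb           # survive brief replica disconnects
client-output-buffer-limit normal 256mb 128mb 60
```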
4. Operations Experience Summary
4.1 Automation Management
CMDB stores resource information.
Agents report hardware/software details.
Standardize OS/kernel parameters and software versions.
Puppet deploys configuration files, scheduled tasks, packages, and tools.
Self‑service resource provisioning.
4.2 Automated Monitoring
Zabbix collects monitoring data.
Real‑time performance dashboards for developers.
Deploy multiple Redis instances per host with Zabbix discovery.
Developed the DB response-time monitor “Titan” using a Flume → Kafka → Spark → HBase pipeline.
4.3 Automated Operations
One‑click cluster deployment via self‑service portal.
Monitoring data helps developers detect issues early (e.g., keys not expiring).
4.4 Open‑Source Redis Tools
Real‑time data migration tool (supports Redis, Twemproxy, Cluster) – GitHub
Redis Cluster management tool (batch parameter changes, rebalance) – GitHub
Multithreaded Twemproxy for higher throughput – GitHub
Multithreaded Redis client library – GitHub
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.