
Mastering Redis Cluster in Production: Real-World Practices from VIPShop

This article shares VIPShop's extensive production experience with Redis Cluster, covering use cases, storage architecture evolution, detailed best‑practice guidelines, common pitfalls, operational automation, monitoring strategies, and useful open‑source tools for large‑scale deployments.


Outline

The presentation covers four main parts: production use cases, storage architecture evolution, application best practices, and operations experience.

1. Production Use Cases

1.1 Business Scope

Redis Cluster is used as an in‑memory storage service for backend workloads at Vipshop, supporting real‑time recommendation/ETL, risk control, and marketing systems.

It replaces a three‑layer Twemproxy architecture, simplifying the storage stack and enabling online node scaling.

Vipshop currently runs dozens of clusters with about 2,000 instances; a single cluster can have over 250 instances.

1.2 Characteristics of Big‑Data, Risk‑Control, and Marketing Systems

Clusters store tens of GB to multiple TB of data.

Data sources include:

Kafka → Redis Cluster (real‑time Storm/Spark)

Hive → Redis Cluster (MapReduce)

MySQL → Redis Cluster (Java/C++)

High read/write volume with strict performance requirements.

Peak traffic spikes increase load severalfold, requiring many Redis instances.

Frequent schema changes and rapid business requirement shifts.

Frequent scaling during major promotional events.

1.3 Why Choose Redis Cluster

1) Fits backend production scenarios

Horizontal scaling capability.

Failover and high availability.

Backend can tolerate minor data loss after failover.

2) Simpler architecture

No central component; replicas provide redundancy and are promoted to master on failure.

Replaces Twemproxy, reducing system complexity.

Saves hardware resources; eliminates a thousand‑plus physical machines previously used for LVS + Twemproxy.

Read/write latency improves from 100‑200 µs to 50‑100 µs.

Fewer bottlenecks compared to the previous three‑layer setup.

2. Storage Architecture Evolution

2.1 Evolution Timeline

In July 2014, to prepare for a major sales event, Vipshop migrated single Redis instances to Twemproxy for sharding and scaling. Later, Twemproxy's limitations and resource waste prompted a switch to Redis Cluster.

Redis Cluster was adopted after its GA, initially using version 3.0.2, then 3.0.3, and later 3.0.7.

2.2 Twemproxy Architecture

Advantages

Transparent sharding for developers; API identical to single Redis.

Can act as cache and storage proxy (auto‑eject).

Disadvantages

Complex multi‑layer architecture (LVS, Twemproxy, Redis, Sentinel, control programs).

High management and hardware costs.

Network bottlenecks (e.g., 2 × 1 Gbps NICs max ~1.4 Mpps).

Scaling limitations of Redis layer.

2.3 Redis Cluster Architecture

Advantages

Decentralized design.

Data distributed across slots on multiple instances.

Slave replicas provide standby for automatic failover.

Gossip protocol and voting enable rapid role promotion.

Supports manual failover for upgrades and migrations.

Reduces hardware and operational costs while improving scalability and availability.

Disadvantages

Client implementation complexity; requires smart client with slot mapping.

JedisCluster is the most mature Java client, but still has issues like “max redirect” exceptions.

Immature clients can affect stability and increase development difficulty.

Nodes may be mistakenly marked offline due to long‑running commands, causing unnecessary failover.
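The "smart client" requirement above means each client keeps a local slot→node map and computes a key's slot itself. Per the Redis Cluster specification, the slot is CRC16 (XModem variant) of the key modulo 16384, with `{...}` hash tags hashed in place of the full key. A minimal sketch of that computation (class and method names are illustrative, not Jedis APIs):

```java
import java.nio.charset.StandardCharsets;

public class ClusterSlot {
    // CRC16-CCITT (XModem), the variant named in the Redis Cluster spec.
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0)
                        ? ((crc << 1) ^ 0x1021) & 0xFFFF
                        : (crc << 1) & 0xFFFF;
            }
        }
        return crc;
    }

    // Hash-tag rule: if the key contains "{...}" with a non-empty body,
    // only that substring is hashed, so related keys land on one slot.
    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // Hash-tagged keys share a slot, enabling multi-key ops on them.
        System.out.println(slot("{user1000}.following") == slot("{user1000}.followers")); // true
        // Spec reference value: CRC16("123456789") = 0x31C3.
        System.out.println(Integer.toHexString(crc16("123456789".getBytes(StandardCharsets.UTF_8)))); // 31c3
    }
}
```

A client with a stale map simply receives a MOVED redirect and refreshes; the "max redirect" exceptions mentioned above occur when redirects pile up faster than the map converges, e.g. during resharding.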

[Figure: Redis Cluster architecture diagram]

3. Application Best Practices

Cluster stability assessment.

Common pitfalls.

Development guidelines & best practices.

3.1 Stability

Clusters are very stable when not scaling.

During resharding, early Jedis versions may throw “max‑redirect” errors; retry limits may be reached.

Health‑check mechanism flaws can cause unnecessary failover when a master is slow or blocked.

Optimization Strategies

a) Increase the default `cluster-node-timeout` (15 s) as needed.

b) Avoid long‑blocking commands (e.g., `SAVE`, `FLUSHDB`) and slow `KEYS` patterns.
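Both strategies can be expressed in `redis.conf`; the values below are illustrative, not Vipshop's exact settings:

```conf
# Give a slow or briefly blocked master more headroom before peers mark it
# failing (default 15000 ms; raise cautiously, it also delays real failover).
cluster-node-timeout 30000

# Disable or rename dangerous blocking commands so application code
# cannot call them by accident.
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
```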

3.2 Common Pitfalls

1) Jedis “Max Redirect” during migration

Retry logic is needed; increase `DEFAULT_MAX_REDIRECTIONS` (default 5).

Avoid multi‑key commands like `MSET`/`MGET`, which some clients don't support.

2) Unnecessary failover caused by long‑blocking commands

Blockers: `SAVE`, `FLUSHALL`, `FLUSHDB`.

Slow queries: `KEYS *`, large keys, O(N) operations.

Rename dangerous commands (e.g., rename `FLUSHDB`).

3) IPv4/IPv6 binding issues

Specify an IPv4 bind address (e.g., `bind 0.0.0.0`) to ensure nodes join the cluster.

4) Slow data migration

Use `redis-trib.rb reshard`; versions before 3.0.6 migrate one key at a time, while later versions support batch migration.

Only one slot can migrate at a time; use `redis-trib.rb fix` if a migration is interrupted.

5) Version selection & upgrade

Vipshop runs 3.0.7 with many 3.2.0 bug fixes back‑ported.

Testing 3.2.0 shows significant memory optimizations.

3.3 Fault‑Tolerance Practices

Implement connection retry and reconnection; the retry timeout should exceed `cluster-node-timeout`.
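The rationale: a failover cannot complete faster than `cluster-node-timeout`, so a client that gives up sooner turns every failover into an application error. A generic retry sketch under that assumption (illustrative helper, not a Jedis API):

```java
import java.util.function.Supplier;

public class ClusterRetry {
    /**
     * Retries op until it succeeds or deadlineMillis elapses. The deadline
     * should exceed cluster-node-timeout so the client rides out a failover
     * window instead of surfacing transient errors to the application.
     */
    static <T> T withRetry(Supplier<T> op, long deadlineMillis, long backoffMillis) {
        long start = System.currentTimeMillis();
        RuntimeException last = null;
        while (System.currentTimeMillis() - start < deadlineMillis) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;                       // remember failure, back off, retry
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
        throw last != null ? last : new RuntimeException("retry deadline exceeded");
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulate a call that fails twice (node failing over), then succeeds.
        String v = withRetry(() -> {
            if (++attempts[0] < 3) throw new RuntimeException("node failing over");
            return "ok";
        }, 20_000, 10);
        System.out.println(v + " after " + attempts[0] + " attempts"); // ok after 3 attempts
    }
}
```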

3.4 Development Guidelines

Monitor slow queries, avoid hot‑keys and big‑keys.

Set reasonable TTLs to prevent mass expirations.

Follow naming conventions and avoid blocking operations or transactions.
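On the TTL point: data loaded in one batch (e.g., a nightly Hive import) tends to get one identical TTL, so the whole batch expires in the same instant and the expiration cycle spikes CPU. A common fix, sketched here with illustrative names, is to add random jitter to the base TTL:

```java
import java.util.concurrent.ThreadLocalRandom;

public class TtlJitter {
    /**
     * Base TTL plus a random offset of up to spreadSeconds, so keys written
     * in the same batch do not all expire (and get reloaded) at once.
     */
    static long jitteredTtl(long baseSeconds, long spreadSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextLong(spreadSeconds + 1);
    }

    public static void main(String[] args) {
        // In client code this would feed an EXPIRE call, e.g. (Jedis-style,
        // shown for context only): jedis.expire(key, jitteredTtl(3600, 600));
        long ttl = jitteredTtl(3600, 600);
        System.out.println(ttl >= 3600 && ttl <= 4200); // true
    }
}
```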

3.5 Connection Pool Tuning

Limit server‑side connections.

Configure appropriate pool size and heartbeat intervals.

Release connections promptly.

Address Jedis connection creation issues (see GitHub issue #1252).

3.6 When to Use Redis, Twemproxy, or Cluster

Redis: use pipelines and multi‑key ops for efficiency.

Twemproxy: supports pipelines and some multi‑key ops.

Redis Cluster: avoid pipelines and multi‑key ops to reduce “max‑redirect” scenarios.

3.7 Parameter Adjustments

Set `vm.overcommit_memory=1` to avoid RDB/AOF failures.

Configure `timeout` > 0 for idle connection cleanup.

Increase `repl-backlog-size` to 64 MB for heavy write loads.

Adjust client output buffer limits (e.g., `client-output-buffer-limit normal 256mb 128mb 60`).
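Collected as a config fragment (values illustrative, not Vipshop's exact settings):

```conf
# Kernel (sysctl): vm.overcommit_memory = 1 lets the fork() for RDB save /
# AOF rewrite succeed even when copy-on-write memory looks insufficient.

# redis.conf:
timeout 300                                        # reap idle client connections after 300 s
repl-backlog-size 64mb                             # larger backlog avoids full resyncs under heavy writes
client-output-buffer-limit normal 256mb 128mb 60   # hard 256 MB; soft 128 MB sustained for 60 s
```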

4. Operations Experience Summary

4.1 Automation Management

CMDB stores resource information.

Agents report hardware/software details.

Standardize OS/kernel parameters and software versions.

Puppet deploys configuration files, scheduled tasks, packages, and tools.

Self‑service resource provisioning.

4.2 Automated Monitoring

Zabbix collects monitoring data.

Real‑time performance dashboards for developers.

Deploy multiple Redis instances per host with Zabbix discovery.

Developed the DB response-time monitor "Titan" using a Flume → Kafka → Spark → HBase pipeline.

4.3 Automated Operations

One‑click cluster deployment via self‑service portal.

Monitoring data helps developers detect issues early (e.g., keys not expiring).

4.4 Open‑Source Redis Tools

Real‑time data migration tool (supports Redis, Twemproxy, Cluster) – GitHub

Redis Cluster management tool (batch parameter changes, rebalance) – GitHub

Multithreaded Twemproxy for higher throughput – GitHub

Multithreaded Redis client library – GitHub

Tags: operations, High Availability, Redis, best practices, scaling, Production, Redis Cluster
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
