CTrip Redis Governance Evolution: Horizontal Scaling and Shrinking Solution
This article describes how CTrip tackled rapid Redis cluster growth: first by moving from vertical scaling to a containerized horizontal splitting approach, then by introducing a binlog-server-based horizontal scaling solution that shortens operations, eliminates intermediate data migration, supports both expansion and shrinkage, and improves resource utilization.
Background: CTrip's Redis clusters grew quickly in size and data volume, leading to challenges with vertical scaling, oversized instances, and inefficient resource usage.
Vertical scaling reached its limits because single instances exceeding 15 GB caused operational risk, and host capacity could not be expanded indefinitely.
To control instance size, CTrip first implemented a horizontal splitting strategy based on a two‑level consistent‑hash tree, allowing large instances to be divided into smaller leaf groups.
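The two-level routing idea can be sketched as follows. This is a minimal illustration, not CTrip's implementation: the class and instance names (`TwoLevelHashTree`, `r0`, `r1a`, `r1b`) are hypothetical, and MD5 stands in for whatever hash function the real rule uses. The point is that a key is first mapped to a group, then to a leaf inside that group, so splitting one oversized instance only replaces a single leaf with a group of smaller ones.

```python
import hashlib

def _hash(key: str) -> int:
    # Stable hash so routing is deterministic across processes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class TwoLevelHashTree:
    """Illustrative two-level consistent-hash tree: a key routes to a
    group first, then to a leaf instance inside that group."""

    def __init__(self, groups):
        # groups: list of lists of leaf-instance names (hypothetical).
        # A split turns a one-leaf group into a multi-leaf group.
        self.groups = groups

    def route(self, key: str) -> str:
        group = self.groups[_hash(key) % len(self.groups)]
        # Second-level hash is salted so the two levels are independent.
        return group[_hash("leaf:" + key) % len(group)]

# Group 0 is a single instance; group 1 has been split into two leaves.
tree = TwoLevelHashTree([["r0"], ["r1a", "r1b"]])
```

Only keys whose first-level hash lands on the split group are redistributed, which is why a split on this structure does not touch the rest of the cluster.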
However, horizontal splitting had drawbacks: it took a long time, moved the data twice (a double migration), could not shrink a cluster back down, and added performance overhead.
Redis Horizontal Scaling Design: The team adopted the idea of dual-writing clusters and, inspired by cloud-native immutable infrastructure, built the solution around a kvrocks-based intermediate binlog server that acts as a slave of the old cluster and as a client of the new one.
Key steps:
Deploy binlog servers for each shard of the V1 cluster and obtain V2 hash rules.
Each binlog server replicates V1 master data (RDB files), parses commands, and writes them to V2 according to the new hash.
When synchronization is near complete, stop writes to V1, push V2 configuration, and switch applications to V2 transparently.
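The core of the steps above is the rehash-and-replay performed by each binlog server. The sketch below simulates it with in-memory dicts standing in for Redis shards; `shard_of` is a placeholder hash rule (the real V1/V2 rules come from the pushed cluster configuration), and the function name `binlog_replay` is invented for illustration.

```python
import hashlib

def shard_of(key: str, n: int) -> int:
    # Placeholder hash rule; the real rule is defined by the
    # V1/V2 cluster configuration, not by this sketch.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

def binlog_replay(v1_shards, n_v2):
    """Sketch of what the binlog servers do collectively: read each
    V1 shard's data (RDB snapshot plus replicated commands), then
    rewrite every key into the V2 cluster under the new hash rule."""
    v2_shards = [dict() for _ in range(n_v2)]
    for shard in v1_shards:            # one binlog server per V1 shard
        for key, value in shard.items():
            v2_shards[shard_of(key, n_v2)][key] = value
    return v2_shards

# Usage: "expand" 2 V1 shards into 4 V2 shards without any
# intermediate migration step — data flows V1 -> binlog server -> V2.
v1 = [{"a": 1, "b": 2}, {"c": 3}]
v2 = binlog_replay(v1, 4)
```

Because the same replay works for any target shard count, the identical mechanism covers shrinkage (fewer V2 shards than V1) as well as expansion.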
Benefits:
Significantly reduced scaling time (10 minutes for 20 GB instances, 5 minutes for <10 GB).
Only one cluster‑pointer switch, no intermediate migration, no extra memory pressure.
Supports both expansion and shrinkage, and can revert previously split clusters.
Enables rapid migration across networks (e.g., OpenStack to Cilium) without prolonged manual effort.
No performance loss after scaling.
Operational Data: Over four months, more than 200 expansions/shrinkages were performed; large traffic spikes were handled with scaling cycles under 10 minutes, and resource utilization improved by shrinking under‑utilized shards.
Pitfalls:
Very large keys (>3 GB) can trigger eviction in the new cluster; mitigations include alerting on keys larger than 512 MB and temporarily raising maxmemory on the new cluster during the split.
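The alerting half of that mitigation can be sketched as a simple threshold scan. This is a hypothetical helper, not CTrip's tooling: how the per-key sizes are collected (for example via Redis's `MEMORY USAGE` command) is outside the sketch, which only shows the thresholding.

```python
# Alert threshold from the article: keys over 512 MB get flagged
# well before they approach the 3 GB danger zone.
BIG_KEY_ALERT_BYTES = 512 * 1024 * 1024

def find_big_keys(key_sizes, threshold=BIG_KEY_ALERT_BYTES):
    """Return the keys whose serialized size exceeds the threshold.
    key_sizes maps key name -> size in bytes; gathering those sizes
    (e.g. with MEMORY USAGE or an RDB analyzer) is assumed done."""
    return [k for k, size in key_sizes.items() if size > threshold]

sizes = {"session:small": 1024, "feed:huge": 3 * 1024 ** 3}
flagged = find_big_keys(sizes)
```

Running this scan before a split gives operators a chance to break up or expire oversized keys instead of discovering them through evictions on the new cluster.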
mget latency may increase after splitting due to more shards; recommend limiting keys per mget or using hash structures with hmget.
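Capping the number of keys per mget can be sketched as client-side batching. The helper below is an illustrative wrapper, not a library API: `client` is assumed to be any object exposing an `mget(keys)` method (a redis-py client fits that shape), and `FakeClient` exists only so the example runs without a server.

```python
def chunked_mget(client, keys, batch=100):
    """Issue mget in fixed-size batches so a single request does not
    fan out across too many shards after a split. Preserves the
    order and length of the input key list."""
    values = []
    for i in range(0, len(keys), batch):
        values.extend(client.mget(keys[i:i + batch]))
    return values

class FakeClient:
    """Stand-in for a real Redis client, for a runnable example."""
    def __init__(self, data):
        self.data = data
    def mget(self, keys):
        return [self.data.get(k) for k in keys]

client = FakeClient({"k1": "v1", "k2": "v2"})
result = chunked_mget(client, ["k1", "k2", "k3"], batch=2)
```

The article's alternative, grouping related fields into a hash and fetching them with hmget, keeps each lookup on a single shard and avoids the fan-out entirely.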
Future Plans: Integrate Xpipe support for DR clusters to avoid manual post‑scaling steps, and explore persistent KV storage alternatives for use‑cases requiring disk‑based durability and advanced atomic operations.
Recommended Reading: Links to related articles on Redis governance, database release system evolution, ClickHouse log analysis, and Dubbo timeout troubleshooting.
Ctrip Technology
The official Ctrip Technology account, sharing technical practices and discussion.