
Didi's Multi-Active Redis Architecture: Design, Challenges, and Solutions

To achieve disaster-recovery and cross-data-center resilience, Didi progressed from a simple proxy double-write scheme to a sophisticated MQ-free multi-active Redis design that uses a dedicated syncer, shard-based loop prevention, op-id replay protection, conflict detection, and incremental AOF durability, ensuring low latency, no data loss, and consistent availability.

Didi Tech

To ensure disaster recovery and enable services to survive data‑center‑level failures, Didi deploys its storage services across multiple data centers. This article briefly analyzes several approaches for implementing Redis multi‑active (active‑active) deployment and shares the main problems and solutions encountered during the evolution of Didi’s Redis multi‑active architecture.

Common industry approaches

1. Master‑slave architecture: a single Redis master in one data center handles all write traffic, while Redis slaves in each data center serve read requests. A Proxy layer routes reads to the local Redis slave and forwards writes to the master, wherever it is located.

2. Proxy double‑write architecture: each data center has an independent Redis cluster; the Proxy writes locally and asynchronously forwards the write to the remote data center.

3. Data‑layer bidirectional synchronization: the Proxy does not handle data sync; a dedicated synchronization component replicates data between Redis servers.

Pros and cons of each approach

Master‑slave: simple to implement, leverages native Redis replication for consistency, but suffers from high write latency in remote data centers and requires manual master failover.

Proxy double‑write: simple at the Proxy layer and isolates failures between data centers, but cannot guarantee data consistency because asynchronous writes may be lost, and it does not support bulk data migration.

Bidirectional sync: avoids write latency and supports data migration, but adds complexity to the Redis server and introduces longer sync chains.
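The consistency gap in the proxy double‑write approach can be illustrated with a minimal sketch. Everything here is hypothetical: the `DoubleWriteProxy` class, the bounded in‑memory buffer, and the `remote_send` callback are assumptions made for illustration, not details of any real proxy. The point is only that the client is acknowledged after the local write, so whatever the proxy fails to forward later is silently missing on the remote side.

```python
from collections import deque


class DoubleWriteProxy:
    """Hypothetical proxy: writes locally, then forwards asynchronously."""

    def __init__(self, local, remote_send, max_pending=2):
        self.local = local              # dict standing in for the local Redis cluster
        self.remote_send = remote_send  # callable(key, value) -> True on success
        self.pending = deque()          # writes waiting to be forwarded
        self.max_pending = max_pending  # assumed bound on the forwarding buffer
        self.dropped = 0

    def set(self, key, value):
        self.local[key] = value         # the client is acknowledged after this step
        if len(self.pending) >= self.max_pending:
            self.pending.popleft()      # buffer full: the oldest pending write is lost
            self.dropped += 1
        self.pending.append((key, value))

    def flush(self):
        """Forward buffered writes in order; stop at the first failure."""
        while self.pending:
            key, value = self.pending[0]
            if not self.remote_send(key, value):
                return
            self.pending.popleft()


# Simulated cross-data-center link that can be switched off.
local, remote, link = {}, {}, {"up": False}

def send(key, value):
    if not link["up"]:
        return False
    remote[key] = value
    return True

proxy = DoubleWriteProxy(local, send, max_pending=2)
for i in range(3):                      # three writes arrive while the link is down
    proxy.set(f"k{i}", i)
    proxy.flush()
link["up"] = True
proxy.flush()                           # only the writes still buffered reach the remote
```

After the link recovers, the local side holds all three keys while the remote side is missing the one that was evicted from the buffer, and nothing in this design detects or repairs that divergence.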

Didi’s architecture evolution

First generation – based on Codis. The Proxy performed double writes, which delivered multi‑active capability quickly but lost data during network failures.

Second generation – based on Kedis. Incremental data is written to a local MQ; a consumer in the remote data center pulls from MQ and writes to its Redis. This design tolerates network partitions because data accumulates in MQ, but introduces new problems: MQ throttling may drop data, internal queues may discard overflow, cross‑team dependencies increase, non‑idempotent commands may be replayed, and bulk data migration is unsupported.

Third generation – introduces a dedicated syncer component that pulls incremental AOF data directly from Redis, eliminating MQ and reducing latency. The design also adds mechanisms to solve four critical issues:

Loop prevention: each Redis instance is assigned a unique shardID and every request carries opinfo containing the shard identifier. The remote side discards requests whose shardID matches its own, preventing data from looping back.

Replay protection: every request receives a unique opid. The remote Redis stores the highest opid it has applied for each source shard; a request whose opid has already been seen is ignored, ensuring at‑most‑once execution.

Conflict detection: Redis records the last write timestamp for each key. If an incoming request’s timestamp is older than the stored timestamp, the operation is flagged as a conflict for later analysis.

Incremental data durability: the AOF mechanism is enhanced to split files, perform incremental replication, and write asynchronously. During a network outage, incremental writes are persisted on disk; after recovery, the syncer retrieves the missing AOF segments and forwards them to the remote data center, guaranteeing no data loss.
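The first three checks (loop prevention, replay protection, conflict detection) can be sketched together as a single receive path. This is a simplified model, not Didi's implementation: the `Op` and `Replica` classes and their field names are hypothetical, the opid is assumed to increase monotonically per source shard, and the sketch records a conflict but still applies the write, since the text does not say how a flagged operation is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    shard_id: int      # origin shard, carried in opinfo
    op_id: int         # assumed to increase monotonically per origin shard
    timestamp: float   # write time recorded at the origin
    key: str
    value: str


@dataclass
class Replica:
    shard_id: int
    data: dict = field(default_factory=dict)
    last_write_ts: dict = field(default_factory=dict)   # key -> last write timestamp
    highest_op_id: dict = field(default_factory=dict)   # origin shard -> highest opid applied
    conflicts: list = field(default_factory=list)       # flagged ops kept for later analysis
    next_op_id: int = 0

    def write_local(self, key, value, timestamp):
        """Apply a client write and return the op that would be shipped to peers."""
        self.next_op_id += 1
        self.data[key] = value
        self.last_write_ts[key] = timestamp
        return Op(self.shard_id, self.next_op_id, timestamp, key, value)

    def apply_remote(self, op):
        # Loop prevention: an op that originated here must not be applied again.
        if op.shard_id == self.shard_id:
            return "dropped: loop"
        # Replay protection: ignore anything at or below the highest opid seen.
        if op.op_id <= self.highest_op_id.get(op.shard_id, 0):
            return "dropped: replay"
        self.highest_op_id[op.shard_id] = op.op_id
        # Conflict detection: the incoming write is older than the stored one.
        stored = self.last_write_ts.get(op.key)
        conflict = stored is not None and op.timestamp < stored
        if conflict:
            self.conflicts.append(op)
        # Assumption: the write is applied even when flagged.
        self.data[op.key] = op.value
        self.last_write_ts[op.key] = op.timestamp if stored is None else max(stored, op.timestamp)
        return "applied: conflict" if conflict else "applied"


a, b = Replica(shard_id=1), Replica(shard_id=2)
op1 = a.write_local("k", "v1", timestamp=10.0)
first = b.apply_remote(op1)        # new op from another shard
again = b.apply_remote(op1)        # same op delivered twice
echo = a.apply_remote(op1)         # op routed back to its origin
b.write_local("k", "v2", timestamp=20.0)
op2 = a.write_local("k", "v3", timestamp=15.0)
late = b.apply_remote(op2)         # older than b's own write to the same key
```

Tracking the highest opid per source shard keeps the replay check to a single integer comparison, at the cost of assuming that ops from one shard arrive in order.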

A further improvement is an asynchronous AOF writer that moves the write workload from the main Redis thread to a background I/O thread, eliminating latency spikes caused by disk I/O.
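How split AOF files, a background writer, and a resumable reader fit together can be sketched as follows. This is a toy model under stated assumptions: the `SegmentedAofWriter` class, the `read_from` helper, the `aof.NNNNNN` file naming, the tiny segment size, and the one‑command‑per‑line format are all invented for illustration (a real AOF stores commands in the Redis protocol format, and Didi's on‑disk layout is not described in the text).

```python
import os
import queue
import tempfile
import threading


class SegmentedAofWriter:
    """A background thread appends commands to size-limited segment files."""

    def __init__(self, directory, max_segment_bytes=64):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, command):
        # Called on the request path: only enqueues, never touches the disk.
        self._queue.put(command)

    def close(self):
        self._queue.put(None)           # sentinel tells the writer thread to stop
        self._thread.join()

    def _path(self, index):
        return os.path.join(self.directory, f"aof.{index:06d}")

    def _run(self):
        index, size = 0, 0
        f = open(self._path(index), "ab")
        while True:
            command = self._queue.get()
            if command is None:
                break
            line = (command + "\n").encode()
            if size and size + len(line) > self.max_segment_bytes:
                f.close()               # segment full: roll over to a new file
                index, size = index + 1, 0
                f = open(self._path(index), "ab")
            f.write(line)
            f.flush()
            size += len(line)
        f.close()


def read_from(directory, segment, offset):
    """Yield (command, segment, next_offset), resuming from a saved position."""
    names = sorted(n for n in os.listdir(directory) if n.startswith("aof."))
    for name in names:
        idx = int(name.split(".")[1])
        if idx < segment:
            continue
        with open(os.path.join(directory, name), "rb") as f:
            f.seek(offset if idx == segment else 0)
            pos = f.tell()
            for line in f:
                pos += len(line)
                yield line.decode().rstrip("\n"), idx, pos


aof_dir = tempfile.mkdtemp()
writer = SegmentedAofWriter(aof_dir, max_segment_bytes=20)
for i in range(5):
    writer.append(f"SET k{i} v{i}")
writer.close()

everything = list(read_from(aof_dir, 0, 0))
_, seg, off = everything[1]             # position after the second command
resumed = [cmd for cmd, _, _ in read_from(aof_dir, seg, off)]
```

A syncer that remembers the last `(segment, offset)` it forwarded can resume from that position after an outage, which is the property the durability mechanism relies on: as long as the segments stay on disk, nothing written during the partition is skipped.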

Overall, Didi’s multi‑active Redis solution evolved from a quick proxy‑based double‑write to a robust, MQ‑free architecture with loop prevention, replay protection, conflict detection, and reliable incremental persistence, providing high availability and data consistency across geographically distributed data centers.
