
The Evolution and Architecture of Alibaba's Data Replication Center (DRC)

This article chronicles the development, design decisions, and technical achievements of Alibaba's Data Replication Center (DRC), a real‑time heterogeneous database synchronization platform that has become a core infrastructure for multi‑active data centers, large‑scale e‑commerce, and cloud services.

Alibaba Cloud Infrastructure

Tiān Yǔ looks back with deep affection on the early days of DRC, when he and his teammates fought side by side along the most challenging road they have ever taken.

What is DRC?

DRC (Data Replication Center) is a data‑flow product independently developed by the database technology team in Alibaba's Technical Assurance department. It supports real‑time synchronization of heterogeneous databases, change‑data capture, and subscription services, providing product‑level solutions for cross‑region real‑time sync, incremental distribution, active‑active disaster recovery, and distributed databases. During the 2015 Double‑11 event, DRC peaked at parsing ten million records per second, with an average three‑node sync latency of 500 ms.

Today DRC is a fundamental infrastructure of Alibaba, supporting e‑commerce active‑active, big‑data real‑time extraction, search real‑time data, Ant Financial billing, and even the media wall for Double‑11. Hundreds of external users on Alibaba Cloud invoke DRC via the Data Transmission Service (DTS) to build their own disaster‑recovery systems. DRC is essentially the underlying fabric that moves data from production to consumption, much like a network.

How was DRC built?

DRC originated from the need to solve database pain points, driven by repeated business demands. Traditional MySQL master‑to‑slave replication was single‑threaded, causing latency for high‑write workloads such as Taobao's rating system. Hardware limitations (SAS disks) and the need for sharding further exacerbated the problem.

Taobao’s “going out of Hangzhou” project (starting 2009) required building a remote unit (Site A) for read traffic, evolving from cold standby to full‑scale active‑active by 2014. By 2011, Site A deployed dozens of servers handling the whole site’s MySQL workload, making single‑threaded replication untenable.

Two technical options existed: modify MySQL to apply logs concurrently, or build an external tool that pulls binlogs and writes to the slave concurrently. Lacking deep MySQL kernel expertise, the team chose the external approach.
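The external approach can be sketched roughly as follows. This is an illustrative stand-in, not DRC's actual code: tuples stand in for parsed binlog row events, and in-memory lists stand in for slave connections. The routing idea, hashing by (table, primary key) so that changes to the same row stay ordered on one worker while unrelated rows apply in parallel, is one common way to parallelize apply; the source does not specify DRC's exact scheme.

```python
import queue
import threading

# Hypothetical parsed binlog events: (table, primary_key, statement).
# In DRC these would come from a live binlog stream on the master.
events = [
    ("ratings", 101, "UPDATE ratings SET score=5 WHERE id=101"),
    ("ratings", 102, "UPDATE ratings SET score=4 WHERE id=102"),
    ("orders", 101, "INSERT INTO orders VALUES (101)"),
    ("ratings", 101, "UPDATE ratings SET score=3 WHERE id=101"),
]

NUM_WORKERS = 4
shards = [queue.Queue() for _ in range(NUM_WORKERS)]
applied = [[] for _ in range(NUM_WORKERS)]  # per-worker apply log

def apply_worker(idx):
    # Stand-in for a connection that replays statements on the slave.
    while True:
        ev = shards[idx].get()
        if ev is None:
            return
        applied[idx].append(ev)

threads = [threading.Thread(target=apply_worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Route by (table, primary key): updates to the same row always land on
# the same worker, preserving their order, while unrelated rows apply
# in parallel -- the property single-threaded replication lacks.
for table, pk, stmt in events:
    shards[hash((table, pk)) % NUM_WORKERS].put((table, pk, stmt))

for s in shards:
    s.put(None)  # sentinel: no more events
for t in threads:
    t.join()

total = sum(len(a) for a in applied)
print(total)  # 4
```

Note that both updates to row 101 of `ratings` land on the same worker and are applied in their original order, which is what keeps the slave consistent despite the parallelism.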

In September 2011, under the guidance of MySQL kernel expert Xi Yǔ, the first version, DRC 1.0, was released: a single process that fetched MySQL logs in real time and wrote to the slave concurrently. A subsequent release (v1.1) added multi‑threaded log pulling and regex‑based table filtering.

Although functional, the early versions were complex to configure, unstable, and lacked proper investment, leading DBAs to doubt DRC’s value. Later, MySQL kernel expert Ding Qí created a “transfer” process that mimicked a slave, pulling logs and writing them concurrently, essentially a custom slave known only to DBAs.

DRC and the transfer tool were used ad‑hoc by DBAs to mitigate latency, but the lack of a complete pipeline and the need to invalidate Tair caches via binlog‑driven mechanisms drove further development.

By late 2011, after the primary database went offline, Site A’s read traffic no longer depended on the latency‑prone O‑record, prompting large‑scale adoption of DRC for cross‑region replication.

Post‑IOE migration, the explosion of MySQL instances, frequent sharding, and numerous downstream accounts (search, data‑warehouse, business logic) created a notification nightmare for DBAs. DRC’s design was therefore refactored to separate fetching and writing into two processes communicating via IPC, enabling one‑time fetch with multiple downstream deliveries.
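The "fetch once, deliver to many" idea can be sketched in a few lines. In the real system, fetching and writing are separate processes communicating over IPC; here, in-process queues and the subscriber names (`search`, `warehouse`, `cache_invalidation`) are illustrative stand-ins.

```python
import queue

# Hypothetical downstream subscribers that previously each pulled the
# binlog themselves; the names are illustrative only.
subscribers = {
    "search": queue.Queue(),
    "warehouse": queue.Queue(),
    "cache_invalidation": queue.Queue(),
}

def fetch_once_fan_out(binlog_events):
    # The fetch side parses the binlog a single time...
    for ev in binlog_events:
        # ...and the write side delivers the same event to every
        # subscriber, instead of N downstreams each hitting the master.
        for q in subscribers.values():
            q.put(ev)

fetch_once_fan_out([("orders", "INSERT ..."), ("orders", "UPDATE ...")])
print({name: q.qsize() for name, q in subscribers.items()})
```

The operational payoff described in the text falls out of this shape: adding a downstream is registering one more queue, not asking DBAs to provision another replication account on every source instance.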

In early 2012, a plug‑in architecture was drafted to abstract source and destination differences, laying the groundwork for a unified incremental sync center.

The team identified four primary requirements: (1) disaster‑recovery for remote read sites, (2) incremental sync for search, (3) migration support for Alibaba Cloud RDS version upgrades, and (4) data import from MySQL to OceanBase.

DRC’s three technical pillars were defined as strong consistency, sub‑second real‑time (<1 s), and high robustness.

Original architecture: plug‑in based, single‑node, requiring extensive operational support.

Target architecture: cluster‑managed, automatic failover, HTTP‑API driven task creation and management.

By mid‑2012, the team introduced Zookeeper for cluster coordination and began building a ClusterManager prototype.

In November 2012, Incremental Platform v1.0 launched, offering automatic master‑slave switch, DDL support, and fine‑grained filtering. Search chose DRC because it emitted heartbeat positions regardless of write activity, allowing the search system to monitor task health.
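The value of heartbeat positions is that a consumer can tell "the source is idle" apart from "the sync task is dead". A minimal sketch of that idea follows; the class, interval, and threshold names are hypothetical, not DRC's actual API.

```python
import time

# Sketch of the heartbeat idea (all names hypothetical): the sync task
# periodically emits its current position even when the source is idle,
# so a consumer can distinguish "no data" from "task dead".
STALL_THRESHOLD = 3.0  # seconds of silence before the task looks unhealthy

class Channel:
    def __init__(self):
        self.last_emit = time.time()
        self.position = 0

    def emit(self, now, position=None):
        # Real write activity advances the position; a heartbeat
        # just re-confirms the current one.
        if position is not None:
            self.position = position
        self.last_emit = now

    def healthy(self, now):
        return (now - self.last_emit) <= STALL_THRESHOLD

ch = Channel()
t0 = time.time()
ch.emit(t0, position=42)     # a real event advances the position
ch.emit(t0 + 2.0)            # idle-period heartbeat, position unchanged
print(ch.healthy(t0 + 4.0))  # True: last heartbeat 2 s ago, under threshold
print(ch.healthy(t0 + 6.0))  # False: 4 s of silence, task looks dead
```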

DRC subsequently served Site A MySQL sync, Tair invalidation, search, and advertising incrementals. Despite moving to a clustered model, three separate process versions persisted for different scenarios.

Challenges remained: multiple downstream accounts required single‑account access, and the upcoming 2013 Taobao minimal‑unit environment demanded a unified incremental center.

Key decisions in 2012‑2013 included: (1) supporting Taobao’s minimal‑unit deployment, (2) consolidating accounts into an incremental center to reduce manual DBA notifications.

By mid‑2013, LevelDB‑based persistence and a DRC Queue were introduced, enabling durable incremental pipelines.
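A persistent queue over an ordered key-value store like LevelDB typically relies on one property: fixed-width sequence numbers sort lexicographically, so a range scan from a consumer's checkpoint replays events in order. The sketch below uses a plain dict as a stand-in for LevelDB; the key encoding is an assumption about the general technique, not DRC's actual storage format.

```python
# A plain dict stands in for LevelDB here; the trick is the same:
# 20-digit zero-padded sequence numbers sort lexicographically, so a
# scan from a consumer's checkpoint replays events in order.
store = {}
next_seq = 0

def enqueue(event):
    global next_seq
    store[f"{next_seq:020d}"] = event
    next_seq += 1

def read_from(checkpoint_seq):
    # Consumers keep their own checkpoint and replay from it after a
    # restart -- this is what makes the pipeline durable end to end.
    start = f"{checkpoint_seq:020d}"
    return [store[k] for k in sorted(store) if k >= start]

enqueue({"table": "orders", "op": "insert"})
enqueue({"table": "orders", "op": "update"})
enqueue({"table": "ratings", "op": "update"})

print(read_from(1))  # the last two events, in order
```

Because each consumer owns its checkpoint, one durable queue can feed many downstreams at different speeds, which is exactly what a one-to-many incremental pipeline needs.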

In July 2013, a unified DRC version was released, merging sync and incremental capabilities, adding features such as persistent queue engine, loop‑copy avoidance, one‑to‑many distribution, concurrent transaction replication, automatic disaster recovery, and automatic master‑slave switch.
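Of the features above, loop‑copy avoidance is worth a sketch. One common mechanism (the source does not detail DRC's exact one) is to tag each change with the region where it originated, so each applier can drop changes that started in its own region and would otherwise circulate forever in a bidirectional setup.

```python
# Sketch of origin-tagged loop-copy avoidance in bidirectional sync;
# region names and event shape are illustrative assumptions.
def apply_remote(local_region, incoming_events, applied):
    for ev in incoming_events:
        if ev["origin"] == local_region:
            continue  # originated here; applying it again would loop
        applied.append(ev)

stream_from_b = [
    {"origin": "region_b", "sql": "UPDATE ..."},  # genuine remote change
    {"origin": "region_a", "sql": "INSERT ..."},  # region_a's own change echoed back
]

applied_in_a = []
apply_remote("region_a", stream_from_b, applied_in_a)
print([e["origin"] for e in applied_in_a])  # ['region_b']
```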

DRC’s success rests on three core technical advantages: (1) patented concurrent transaction conflict algorithms ensuring high‑efficiency sync, (2) MetaBuilder that captures DDL changes in real time to maintain strong consistency, and (3) superior code efficiency in compression, bandwidth usage, latency, and throughput.

The project’s greatest asset has been the close collaboration with DBAs, whose relentless feedback and operational pressure drove continuous improvement.

From 2013 onward, DRC supported active‑active sites, search, data‑warehouse, and Alibaba Cloud RDS, marking the “infrastructure construction year.” In 2014, cross‑region active‑active and cloud integration expanded DRC’s reach, and in 2015 DRC became a cloud service via DTS.

Today DRC runs dozens of clusters, with the largest handling nearly 7,000 real‑time channels, and is evolving toward a next‑generation architecture capable of managing 50,000 tasks per cluster over the next three years.

Note: Original title – "【DRC前世今生】每个牛逼的故事背后,总有段基情故事要说" (roughly: "[DRC, Past and Present] Behind every great story, there is a tale of brotherhood to tell").

Author – Tiān Yǔ, Alibaba Technical Assurance Database Senior Expert, joined Taobao DBA team in 2008, led the migration away from IOE, built the DBFree automated DBA platform, and has overseen DRC’s evolution supporting Double‑11 and multi‑IDC active‑active architectures.
