
Weibo's Multi-Data-Center (Active‑Active) Architecture: Experience, Challenges, and Best Practices

The article details Weibo's journey in building a multi‑data‑center active‑active architecture, covering its evolution, technical challenges such as latency and data synchronization, the adopted MCQ‑based messaging solution, operational best practices, and future directions for high‑availability deployments.

High Availability Architecture

Editor’s note: This article, originally submitted by Liu Daoru, shares valuable insights on high‑availability architecture and is reproduced with permission from the ArchNotes public account.

Weibo Multi‑Data‑Center (Active‑Active) Construction Timeline

Weibo’s early data‑center footprint was concentrated in Beijing, with a small deployment in Guangzhou. In October 2010, to support rapid growth, Weibo began a dual‑active deployment using a self‑developed system, MytriggerQ, which used MySQL triggers to capture changes for cross‑region message synchronization. While the underlying MySQL machinery was mature, the approach suffered from message ordering issues and cache inconsistencies, and it was eventually replaced.

Subsequently, Weibo adopted a custom message queue, MCQ (MemcacheQ), and built the Weibo Message Broker (WMB) on top of it for cross‑region synchronization. By May 2012 the Guangzhou data center was live, and Weibo was running active‑active.

In mid‑2013, the Beijing data center was split, creating a three‑node deployment that enabled online capacity assessment, staged rollouts, and rapid traffic rebalancing, significantly improving peak‑load handling and reducing failure‑induced incidents.

Challenges Faced in Multi‑Active Deployment

Inter‑data‑center latency: links between the Beijing data centers exhibit ~1 ms latency, while Beijing–Guangzhou links reach ~40 ms, a substantial fraction of the ~120 ms average Feed request latency; performance degrades quickly if many services make cross‑region calls.

Dedicated lines: the two expensive Beijing–Guangzhou dedicated lines suffer outages roughly once a month, adding both cost and reliability concerns.

Data synchronization: Synchronizing MySQL, HBase, and various custom components across regions is complex, especially given network latency and line instability.

Service dependency deployment: Deploying numerous small services across regions incurs high migration and maintenance costs, yet avoiding deployment leads to unacceptable latency.

Supporting ecosystem: Full active‑active operation requires end‑to‑end changes in preview, release, monitoring, degradation, and traffic migration processes.

Weibo’s Multi‑Active Solution

Given the latency constraints, core services store data redundantly in each region and rely on caching to form a relatively independent service layer. Synchronization occurs at multiple levels (message, cache, and database) via the MCQ‑based WMB system. Cache updates are driven by messages; on a cache miss, the request falls back to a local MySQL replica, and a 10‑minute delayed queue re‑checks entries that may have been read from a lagging replica.
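A minimal sketch of this read path, using in‑memory dicts as stand‑ins for the cache, the MySQL replica, and the delayed queue; all names, the 600‑second constant, and the re‑check logic are illustrative assumptions, not Weibo's actual components:

```python
import time
from collections import deque

# In-memory stand-ins for the cache, the local MySQL replica, and the
# delayed queue (hypothetical; the real system uses Memcached/MCQ).
cache = {}
mysql_replica = {"status:1001": "hello from Guangzhou"}
DELAY_SECONDS = 600  # the article's 10-minute re-check window
delayed_queue = deque()  # entries: (recheck_at, key)

def read(key):
    """Read path: cache first, fall back to the local MySQL replica."""
    if key in cache:
        return cache[key]
    value = mysql_replica.get(key)  # replica may lag the master
    if value is not None:
        cache[key] = value
        # Schedule a re-check: if the replica was stale, the delayed
        # consumer below refreshes the cache once replication catches up.
        delayed_queue.append((time.time() + DELAY_SECONDS, key))
    return value

def drain_delayed_queue(now=None):
    """Delayed-queue consumer: re-read due keys from the replica."""
    now = time.time() if now is None else now
    while delayed_queue and delayed_queue[0][0] <= now:
        _, key = delayed_queue.popleft()
        fresh = mysql_replica.get(key)
        if fresh is not None:
            cache[key] = fresh  # overwrite a possibly stale entry
```

The delayed queue trades a bounded window of staleness (ten minutes) for not having to detect replica lag synchronously on the read path.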

Each data center runs an independent cache refreshed by a dedicated Processor (similar in role to Storm). This design eliminates duplicate message delivery and resolves the dirty‑cache issues observed with MytriggerQ.
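The message‑driven refresh can be sketched as a small producer/processor pair; the message shape and function names below are assumptions, with Python's `queue` module standing in for the local MCQ instance:

```python
import json
import queue

# Stand-ins for one data center's local message queue and cache
# (hypothetical names; the real components are MCQ and Memcached).
local_mq = queue.Queue()
local_cache = {}

def publish(event_type, key, value=None):
    """WMB-style producer: serialize a change event as a message."""
    local_mq.put(json.dumps({"type": event_type, "key": key, "value": value}))

def process_one():
    """One iteration of the Processor: apply a message to the local cache."""
    msg = json.loads(local_mq.get_nowait())
    if msg["type"] == "upsert":
        local_cache[msg["key"]] = msg["value"]
    elif msg["type"] == "delete":
        local_cache.pop(msg["key"], None)
```

Because each region's Processor is the only writer to that region's cache, a message applied exactly once locally cannot leave a dirty entry behind, which is the property the MytriggerQ approach lacked.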

Database replication follows a master‑slave model; despite occasional lag, the three‑year‑old setup has remained stable without service outages. HBase, introduced in 2013, currently lacks multi‑region support, but plans are underway to adopt a similar MCQ‑based dual‑region deployment.

Best Practices for Active‑Active Deployments

Key considerations include: assessing whether traffic volume actually justifies multi‑active operation; balancing resource costs against development costs; migrating whole services first when resources are limited; simplifying service dependencies; and choosing a secondary data center with low latency (under 10 ms) and reliable links.

Operationally, limit deployments to two regions until cross‑region services are fully supported, keep message payloads under 10 KB, and ensure message delivery guarantees. Use MCQ’s write‑remote‑read‑local pattern to achieve efficient, reliable synchronization.
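The write‑remote‑read‑local pattern can be illustrated with two in‑process queues standing in for per‑region MCQ instances (the region names and helper functions are hypothetical): writers fan a message out to every region, while each region's consumers read only their local queue.

```python
import queue

# One queue per region, standing in for per-region MCQ deployments
# (region names are illustrative).
regions = {"beijing": queue.Queue(), "guangzhou": queue.Queue()}

def write_remote(message):
    """Fan the message out to every region's queue, remote ones included."""
    for q in regions.values():
        q.put(message)

def read_local(region):
    """Consumers only ever read from their own region's queue."""
    try:
        return regions[region].get_nowait()
    except queue.Empty:
        return None
```

Paying the cross‑region latency once, at write time, keeps the hot read path entirely local, which is why the pattern suits a read‑heavy feed workload.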

Future Directions

From 2015 onward, Weibo plans to evolve the cross‑region messaging component into a dedicated service, abstracting synchronization complexity away from business logic. The web layer will shift to data centers near users, leveraging public‑network routing algorithms to replace costly dedicated lines.

Micro‑service encapsulation will address small‑service dependency challenges, allowing a few micro‑services to be multi‑active while the rest remain single‑region. Docker will be employed to enable rapid, minute‑level scaling of front‑end resources during traffic spikes.

These reflections aim to inspire other engineers tackling multi‑data‑center high‑availability architectures.

Tags: distributed systems, deployment, high availability, caching, messaging, multi-data center
Written by the High Availability Architecture official account.