Weibo's Multi-Active Deployment: Practices, Challenges, and Solutions
The article details Weibo's evolution toward multi‑active cross‑datacenter deployment, describing initial motivations, early synchronization attempts, the eventual MCQ‑based messaging solution, operational challenges such as latency and data consistency, and best‑practice recommendations for large‑scale distributed systems.
Weibo initially pursued multi‑active deployment to achieve disaster recovery, improve access speed for southern China and overseas users, and reduce deployment costs. Early experiments highlighted benefits like dynamic acceleration and online load testing, while also revealing increased development complexity and storage costs.
Weibo's Multi-Active Journey
The majority of Weibo's data centers were in Beijing, with a small presence in Guangzhou. In October 2010, rapid growth prompted an expansion of Guangzhou servers and a plan for active‑active deployment across regions.
The first attempt at cross‑datacenter message sync used MytriggerQ, a custom queue built on MySQL triggers. It benefited from MySQL's maturity, but because each business required its own trigger tables, it suffered from message‑ordering problems and cache inconsistencies.
After abandoning MytriggerQ, Weibo switched to MCQ (MemcacheQ) with a new component called WMB (Weibo Message Broker). By May 2012, the Guangzhou data center was live, achieving active‑active operation.
In mid‑2013, Beijing was split into two nodes, creating a three‑node deployment that enabled online capacity assessment, staged rollouts, and rapid traffic balancing, dramatically improving peak‑load handling and fault tolerance.
Challenges of Multi‑Active Deployment
Weibo identified several common challenges:
Inter‑data‑center latency (≈40 ms between Beijing and Guangzhou) significantly impacts performance for latency‑sensitive services.
Dedicated line reliability and cost; frequent outages increase operational risk.
Data synchronization for MySQL, HBase, and custom components under high latency and unstable networks.
Deploying dependent services across regions without excessive refactoring cost.
Comprehensive supporting systems (preview, release, monitoring, degradation, traffic migration) must be adapted for multi‑site operation.
Weibo's Multi‑Active Solution
To mitigate latency, core services store redundant data and use caching. Synchronization occurs at message, cache, and database layers. The MCQ‑based WMB pipeline updates caches, allowing the system to remain functional even if database sync lags.
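The message‑layer flow described above can be sketched as follows. This is a minimal illustration, not WMB's actual API: a write at the primary site is published as a message, a broker step relays it across the dedicated line (the queues stand in for MCQ), and a remote consumer applies it to that site's cache without waiting on database replication. All function and key names are assumptions.

```python
import json
import queue

def publish_update(local_q, entity_id, payload):
    """Serialize a business update as a message on the site-local queue."""
    local_q.put(json.dumps({"id": entity_id, "data": payload}))

def relay(local_q, remote_q):
    """Broker step: forward pending messages across the dedicated line."""
    while not local_q.empty():
        remote_q.put(local_q.get())

def consume(remote_q, cache):
    """Remote processor: apply each message to the remote site's cache."""
    while not remote_q.empty():
        msg = json.loads(remote_q.get())
        cache[msg["id"]] = msg["data"]

local_q, remote_q, remote_cache = queue.Queue(), queue.Queue(), {}
publish_update(local_q, "status:42", {"text": "hello"})
relay(local_q, remote_q)
consume(remote_q, remote_cache)
# The remote cache is now current even if DB sync is still lagging.
```

Because the cache is refreshed directly from messages, a lagging database replica degrades durability guarantees but not read availability, which matches the design goal stated above.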
Each site runs an independent cache refreshed by a Processor (similar to Storm) that consumes messages. A 10‑minute delayed queue repairs dirty data that cache penetration can introduce, a trade‑off acceptable for Weibo's use case.
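One plausible reading of the 10‑minute delayed queue is sketched below: when an update might race with a cache‑penetration read (which can repopulate the cache from a stale replica), the key is re‑enqueued with a delay, and on expiry the value is re‑read from the database and overwritten. The class and parameter names are illustrative, not Weibo's implementation.

```python
import heapq
import time

DELAY_SECONDS = 600  # 10 minutes, per the article; shrink for testing

class DelayedRepairQueue:
    """Schedule cache keys for a deferred re-read from the database."""

    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self.heap = []  # min-heap of (due_time, key)

    def schedule(self, key, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self.heap, (now + self.delay, key))

    def drain_due(self, cache, read_db, now=None):
        """Re-read every due key from the DB and overwrite the cache."""
        now = time.time() if now is None else now
        while self.heap and self.heap[0][0] <= now:
            _, key = heapq.heappop(self.heap)
            cache[key] = read_db(key)

db = {"status:42": "fresh"}
cache = {"status:42": "dirty"}    # stale value left by cache penetration
q = DelayedRepairQueue(delay=0.01)
q.schedule("status:42", now=0)
q.drain_due(cache, db.get, now=1)  # delay elapsed: cache is repaired
```

The delay only needs to exceed the replication lag window, which is why a fixed 10‑minute bound is tolerable for a feed‑style workload.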
Database replication relies on master‑slave sync; despite occasional lag, the approach has been stable for three years. HBase, introduced in 2013, is also being considered for multi‑site deployment using similar MCQ‑based sync.
For services with high latency sensitivity (e.g., Feed), Weibo deployed active‑active instances to keep response times within acceptable bounds.
Operational processes were refined to include cross‑site testing, monitoring, and automated failover, ensuring smooth migrations and minimal service disruption.
Best Practices for Multi‑Active Deployment
Key considerations include:
Assessing migration feasibility and the complexity of service dependencies.
Whether the business partitions cleanly by user.
Selecting a nearby secondary site with reliable dedicated links.
Limiting the deployment to two sites initially.
Synchronizing at the service (message) level, with payloads kept under 10 KB.
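The 10 KB payload guideline can be enforced mechanically; a hedged sketch is below. The idea is that sync messages carry only compact service‑level events, and oversized payloads are rejected so the receiver fetches the full object from storage instead. The threshold constant and function name are assumptions for illustration.

```python
import json

MAX_SYNC_PAYLOAD = 10 * 1024  # 10 KB guideline from the article

def build_sync_message(event_type, body):
    """Serialize a service-level event, refusing oversized payloads."""
    msg = json.dumps({"type": event_type, "body": body}).encode("utf-8")
    if len(msg) > MAX_SYNC_PAYLOAD:
        raise ValueError(f"sync payload of {len(msg)} bytes exceeds 10 KB")
    return msg

small = build_sync_message("post_created", {"id": 42, "uid": 7})
```

Keeping messages small bounds the bandwidth consumed on the dedicated line and keeps per‑message relay latency predictable.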
Future Directions
Upcoming plans focus on turning the cross‑site messaging component into a dedicated service, moving web layers closer to users via near‑edge deployment, adopting micro‑services to isolate small‑service dependencies, and leveraging Docker for rapid scaling.