Technical Challenges and Solutions for Migrating Zhihu's Self‑Managed MongoDB Cluster to Alibaba Cloud
The article analyzes the storage, sharding, backup, and operational pain points of Zhihu's self-operated MongoDB cluster, proposes cloud-based architectural and procedural solutions, and details a step-by-step migration plan that achieves zero downtime, improved stability, and better cost efficiency.
Zhihu's security anti-fraud system stores massive, complex data in a self-built MongoDB cluster, which faces four major issues: rapid storage growth that requires frequent node expansion, hotspot shards caused by an unbalanced data-distribution rule, increasingly long backup intervals that raise the risk of data loss, and heavy manual operations that threaten stability.
To address these problems, Zhihu worked with Alibaba Cloud experts on several solutions: decoupling storage and compute resources while adopting a chunk-based, proactive pre-sharding strategy; enabling elastic IOPS scaling to absorb load spikes; switching to snapshot-based backups taken at high frequency (every 15 minutes); and running the MongoDB service on ESSD cloud disks with AutoPL performance.
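The chunk-based pre-sharding idea can be illustrated with a minimal sketch. It assumes an integer, range-partitioned shard key and hypothetical shard names; the article does not state the actual key or chunk layout. The function only plans the chunk boundaries and a round-robin shard assignment. Applying such a plan to a real cluster would be done separately, for example with MongoDB's `split` and `moveChunk` admin commands.

```python
from typing import List, Tuple


def plan_presplit(lo: int, hi: int, num_chunks: int,
                  shards: List[str]) -> List[Tuple[int, int, str]]:
    """Plan evenly sized chunks over [lo, hi) and spread them across shards.

    Returns a list of (chunk_min, chunk_max, shard_name) tuples.
    The key range and shard names are illustrative assumptions.
    """
    if num_chunks < 1 or not shards or hi - lo < num_chunks:
        raise ValueError("need a non-empty shard list and a range wide enough to split")
    width = hi - lo
    # Integer boundaries; the first is lo and the last is hi, so no key is left out.
    bounds = [lo + (width * i) // num_chunks for i in range(num_chunks + 1)]
    # Round-robin assignment keeps the number of chunks per shard balanced.
    return [(bounds[i], bounds[i + 1], shards[i % len(shards)])
            for i in range(num_chunks)]


if __name__ == "__main__":
    for chunk in plan_presplit(0, 100, 4, ["shard0", "shard1"]):
        print(chunk)
```

Creating the chunks before data arrives means writes are spread across shards from the start, rather than relying on later rebalancing to break up a hotspot.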
The migration plan follows five core principles: isolating the migration from the ETL cut-over to reduce external variables, keeping the sync rate flexible and controllable through Alibaba Cloud DTS, scripting all cut-over operations for speed and repeatability, conducting thorough rehearsals with rollback plans, and confirming data integrity before cut-over.
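The principle of scripted cut-over operations with rollback plans can be sketched as a small step runner. This is an illustrative assumption about how such a script might be organized, not the tooling Zhihu used; the step names and callables are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, apply, rollback); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order; on the first failure, undo completed steps in reverse."""
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, apply, rollback in steps:
        try:
            apply()
        except Exception:
            log.append(f"failed:{name}")
            # Roll back only the steps that actually completed, newest first.
            for prev_name, prev_rollback in reversed(done):
                prev_rollback()
                log.append(f"rolled_back:{prev_name}")
            break
        log.append(f"ok:{name}")
        done.append((name, rollback))
    return log


if __name__ == "__main__":
    def fail() -> None:
        raise RuntimeError("simulated failure")

    print(run_cutover([
        ("stop_writes", lambda: None, lambda: None),
        ("switch_connection_string", fail, lambda: None),
    ]))
```

Because every step carries its own rollback, the same script can be run in rehearsals and in the real cut-over window, and a failed run leaves the system in its pre-cut-over state.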
Detailed steps include environment preparation (network, instance creation, permission setup), data synchronization (forward and reverse DTS tasks, full and incremental sync), data validation (full and incremental checks), cut-over preparation (service image slimming, script development and verification), and the final cut-over execution with post-migration monitoring and resource cleanup.
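The data validation step can be sketched as a document-level comparison between source and target. This is a minimal illustration under stated assumptions: documents are plain dictionaries keyed by `_id`, and both sides fit in memory. The article does not describe the actual validation tool, and a production check over a large collection would compare in batches.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def _digest(doc: Dict[str, Any]) -> str:
    """Stable hash of a document; keys are sorted so field order does not matter."""
    payload = json.dumps(doc, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_collections(source: Iterable[Dict[str, Any]],
                     target: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Return the _ids missing from target, extra in target, or with different content."""
    src = {doc["_id"]: _digest(doc) for doc in source}
    dst = {doc["_id"]: _digest(doc) for doc in target}
    return {
        "missing": sorted(k for k in src if k not in dst),
        "extra": sorted(k for k in dst if k not in src),
        "mismatched": sorted(k for k in src if k in dst and src[k] != dst[k]),
    }


if __name__ == "__main__":
    source_docs = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "b"}]
    target_docs = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "changed"}]
    print(diff_collections(source_docs, target_docs))
```

A full check would run this over the whole collection once; an incremental check would restrict both inputs to documents changed since the last comparison.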
Post‑migration benefits observed are significant resource cost savings, reduced operational workload thanks to a unified cloud database console, resolution of historical data‑distribution issues, and enhanced high‑availability through multi‑AZ deployment and built‑in backup/recovery features.
During the migration, several unexpected issues were identified: connection-string updates that did not take effect in services using the Go MongoDB driver, uneven pre-sharding caused by an active Balancer, and potential performance impacts of the new sharding strategy. They were resolved by redeploying the affected services, disabling the Balancer, and running performance tests, respectively.
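Disabling the Balancer before pre-sharding can be sketched as follows. The sketch assumes a PyMongo-style client connected to a `mongos` router and uses MongoDB's `balancerStop` and `balancerStatus` admin commands; the helper name is hypothetical, and the article does not say which tool was actually used.

```python
from typing import Any


def stop_balancer(client: Any) -> bool:
    """Stop the balancer and report whether it is confirmed off and idle.

    `client` is assumed to behave like pymongo.MongoClient connected to mongos,
    i.e. it exposes `client.admin.command(name)` and returns a dict-like reply.
    """
    client.admin.command("balancerStop")
    status = client.admin.command("balancerStatus")
    # balancerStatus reports "mode" ("full" or "off") and "inBalancerRound".
    return status.get("mode") == "off" and not status.get("inBalancerRound", False)
```

Checking the status after stopping guards against starting manual chunk splits and moves while a balancing round is still in progress, which is the situation that would make the pre-sharded layout uneven.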
Overall, the migration took about half a month and was completed across four critical cut-over windows with zero service disruption, demonstrating the effectiveness of the proposed architecture and processes.
Zhihu Tech Column
Sharing Zhihu tech posts and exploring community technology innovations.