
Technical Challenges and Solutions for Migrating Zhihu's Self‑Managed MongoDB Cluster to Alibaba Cloud

The article analyzes the storage, sharding, backup, and operational pain points of Zhihu's self‑operated MongoDB cluster, proposes cloud‑based architectural and procedural solutions, and details a step‑by‑step migration plan that ensures zero‑downtime, improved stability, and cost efficiency.

Zhihu Tech Column

Dec 25, 2024


Zhihu's security anti‑fraud system stores massive, complex data in a self‑built MongoDB cluster. The cluster faces four major issues: rapid storage growth that requires frequent node expansion, hotspot shards caused by unbalanced data‑distribution rules, increasingly long backup intervals that raise the risk of data loss, and heavy manual operations that threaten stability.

To address these problems, Zhihu worked with Alibaba Cloud experts and settled on several solutions: decoupling storage from compute while adopting a chunk‑based, proactive pre‑sharding strategy; enabling elastic IOPS scaling to absorb load spikes; switching to snapshot‑based backups at high‑frequency (15‑minute) intervals; and running the MongoDB service on ESSD cloud disks with AutoPL performance.
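The pre‑sharding idea can be illustrated with a minimal sketch. It assumes a hashed shard key (the article does not say which key type is used), models the hash space as a signed 64‑bit range, and uses hypothetical helper names `plan_split_points` and `assign_chunks`. It only computes a plan; applying it to a real cluster would be done with the database's own split and move commands while the Balancer is stopped.

```python
from typing import Dict, List, Tuple

# Assumption: the shard key is hashed, so key values fall in a signed 64-bit range.
HASH_MIN = -(2**63)
HASH_MAX = 2**63 - 1


def plan_split_points(num_chunks: int) -> List[int]:
    """Return num_chunks - 1 evenly spaced split points over the hash range."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    span = HASH_MAX - HASH_MIN + 1  # size of the whole range, 2**64
    return [HASH_MIN + (span * i) // num_chunks for i in range(1, num_chunks)]


def assign_chunks(
    split_points: List[int], shards: List[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Assign each [lower, upper) chunk to a shard in round-robin order."""
    if not shards:
        raise ValueError("at least one shard is required")
    bounds = [HASH_MIN] + split_points + [HASH_MAX]
    plan: Dict[str, List[Tuple[int, int]]] = {name: [] for name in shards}
    for i in range(len(bounds) - 1):
        plan[shards[i % len(shards)]].append((bounds[i], bounds[i + 1]))
    return plan


if __name__ == "__main__":
    points = plan_split_points(8)
    for shard, chunks in assign_chunks(points, ["shard-a", "shard-b"]).items():
        print(shard, len(chunks), "chunks")
```

Spreading empty chunks evenly before the data arrives means writes land on every shard from the start, rather than piling onto one shard and waiting for chunks to be rebalanced later.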

The migration plan follows five core principles: isolating migration from ETL cut‑over to reduce external variables, using flexible and controllable sync rates via Alibaba Cloud DTS, scripting all cut‑over operations for speed and repeatability, conducting thorough rehearsals with rollback plans, and confirming data integrity before cut‑over.
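The principle of scripting every cut‑over operation and pairing it with a rollback plan might look like the sketch below. It is not Zhihu's actual tooling: the `Step` type and `run_cutover` function are hypothetical, and the step bodies are placeholders for real actions such as pausing writes or switching connection strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One scripted cut-over action and the action that undoes it."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, roll back completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for prev in reversed(done):
                prev.rollback()
                log.append(f"rolled back: {prev.name}")
            return log
        done.append(step)
        log.append(f"applied: {step.name}")
    return log


if __name__ == "__main__":
    steps = [
        Step("pause writes", lambda: None, lambda: None),
        Step("switch connection string", lambda: None, lambda: None),
    ]
    print("\n".join(run_cutover(steps)))
```

Because each step carries its own undo action, the same script can be exercised repeatedly in rehearsals and behaves the same way in the real cut‑over window.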

Detailed steps include environment preparation (network, instance creation, permission setup), data synchronization (forward and reverse DTS tasks, full‑ and incremental sync), data validation (full and incremental checks), cut‑over preparation (service image slimming, script development and verification), and the final cut‑over execution with post‑migration monitoring and resource cleanup.
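The data validation step can be sketched as a digest comparison between source and target. This is an illustrative assumption, not the validation method the article describes: documents are modeled as plain dictionaries with an `_id` field, and `doc_digest` and `diff_collections` are hypothetical helpers. A production check would stream documents from both clusters in batches rather than hold them in memory.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def doc_digest(doc: Dict[str, Any]) -> str:
    """Hash a document in a field-order-independent way."""
    canonical = json.dumps(doc, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_collections(
    source: Iterable[Dict[str, Any]], target: Iterable[Dict[str, Any]]
) -> Dict[str, List[Any]]:
    """Report ids that are missing, unexpected, or different in the target."""
    src = {d["_id"]: doc_digest(d) for d in source}
    dst = {d["_id"]: doc_digest(d) for d in target}
    return {
        "missing_in_target": sorted((k for k in src if k not in dst), key=str),
        "extra_in_target": sorted((k for k in dst if k not in src), key=str),
        "mismatched": sorted(
            (k for k in src if k in dst and src[k] != dst[k]), key=str
        ),
    }


if __name__ == "__main__":
    source = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "b"}]
    target = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "changed"}]
    print(diff_collections(source, target))
```

An empty result in all three lists is the condition a cut‑over script could require before proceeding.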

Observed post‑migration benefits include significant resource cost savings, a lighter operational workload thanks to a unified cloud database console, resolution of the historical data‑distribution issues, and improved high availability through multi‑AZ deployment and built‑in backup and recovery features.

Several unexpected issues surfaced during the migration. Connection‑string updates did not take effect in services using the Go MongoDB driver, which was resolved by redeploying those services. Pre‑sharding produced an uneven chunk distribution because the Balancer was still active, which was resolved by disabling the Balancer. The potential performance impact of the new sharding strategy was assessed through performance testing.

Overall, the migration took about half a month and was completed across four critical cut‑over windows with zero service disruption, demonstrating the effectiveness of the proposed architecture and processes.
