
Technical Challenges and Solutions for Migrating Zhihu's Self‑Managed MongoDB Cluster to Alibaba Cloud

This article analyzes the storage, sharding, backup, and operational pain points of Zhihu's self-managed MongoDB cluster, proposes cloud-based architectural and procedural solutions, and details a step-by-step migration plan that delivers zero downtime, improved stability, and lower cost.

Zhihu Tech Column

Zhihu's anti-fraud security system stores massive, complex data in a self-built MongoDB cluster, which faced four major issues: rapid storage growth that required frequent node expansion; hotspot shards caused by an unbalanced data-distribution rule; increasingly long backup intervals that raised the risk of data loss; and heavy manual operations that threatened stability.

To address these problems, Zhihu worked with Alibaba Cloud experts and settled on several solutions: decoupling storage from compute and adopting a chunk-based, proactive pre-sharding strategy; enabling elastic IOPS scaling to absorb load spikes; switching to snapshot-based backups at high-frequency (15-minute) intervals; and running the MongoDB service on ESSD cloud disks with AutoPL performance.
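The article does not give the details of the pre-sharding strategy, so the following is only a minimal sketch of what proactive pre-splitting can look like, assuming a ranged integer shard key. The function name `plan_presplit`, the namespace `risk.events`, the key `bucket`, and the shard names are all hypothetical. The generated documents follow the shape of MongoDB's `split` and `moveChunk` admin commands; the sketch only builds them and does not connect to a cluster.

```python
from typing import Dict, List


def plan_presplit(namespace: str, key: str, lo: int, hi: int,
                  shards: List[str], chunks_per_shard: int) -> List[Dict]:
    """Build admin commands that pre-split [lo, hi) into equal chunks
    and spread those chunks round-robin across the given shards.

    Assumption: a ranged (not hashed) integer shard key.
    """
    total = len(shards) * chunks_per_shard
    step = (hi - lo) // total
    if step <= 0:
        raise ValueError("key range too small for the requested chunk count")

    commands: List[Dict] = []
    # One split point per interior boundary: total chunks need total - 1 splits.
    for i in range(1, total):
        commands.append({"split": namespace, "middle": {key: lo + i * step}})
    # Each chunk is addressed by a key value that falls inside it
    # (here, its lower bound) and assigned to a shard in rotation.
    for i in range(total):
        commands.append({
            "moveChunk": namespace,
            "find": {key: lo + i * step},
            "to": shards[i % len(shards)],
        })
    return commands


if __name__ == "__main__":
    for cmd in plan_presplit("risk.events", "bucket", 0, 1000,
                             ["shard-a", "shard-b"], 2):
        print(cmd)
```

Each dictionary keeps the command name as its first key, which matters because MongoDB reads the first field of a command document as the command name. Creating the chunks while the collection is still empty avoids having to migrate data between shards later.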

The migration plan follows five core principles: isolating the migration from the ETL cut-over to reduce external variables, keeping sync rates flexible and controllable through Alibaba Cloud DTS, scripting all cut-over operations for speed and repeatability, conducting thorough rehearsals with rollback plans, and confirming data integrity before cut-over.
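The principle of scripted cut-over with a rollback plan can be illustrated with a small runner. This is a sketch under assumptions, not Zhihu's actual tooling: the `run_cutover` function and the step names in the usage example are hypothetical. Each step pairs an action with its undo, and a failure triggers the undo actions of the completed steps in reverse order.

```python
from typing import Callable, List, Tuple

# A step is (name, apply, undo). All step names used below are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse.

    Returns a log of what happened so a rehearsal can be reviewed afterwards.
    """
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            for prev_name, prev_undo in reversed(done):
                prev_undo()
                log.append(f"rolled back: {prev_name}")
            return log
        done.append((name, undo))
        log.append(f"done: {name}")
    return log


if __name__ == "__main__":
    state = {"writes_paused": False, "endpoint": "self-managed"}

    def pause_writes() -> None:
        state["writes_paused"] = True

    def resume_writes() -> None:
        state["writes_paused"] = False

    def switch_endpoint() -> None:
        state["endpoint"] = "cloud"

    def restore_endpoint() -> None:
        state["endpoint"] = "self-managed"

    for line in run_cutover([
        ("pause writes", pause_writes, resume_writes),
        ("switch endpoint", switch_endpoint, restore_endpoint),
    ]):
        print(line)
```

Running the same step list in every rehearsal and in the real window keeps the procedure repeatable, and the returned log gives a record of each run.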

Detailed steps include environment preparation (network, instance creation, permission setup), data synchronization (forward and reverse DTS tasks, full‑ and incremental sync), data validation (full and incremental checks), cut‑over preparation (service image slimming, script development and verification), and the final cut‑over execution with post‑migration monitoring and resource cleanup.
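The article does not describe how the data validation was implemented. One common approach, sketched here as an assumption, is to compare documents on both sides by primary key and content digest. The function names `digest` and `compare_collections` are hypothetical, and the inputs are plain iterables of dictionaries so the sketch runs without a database.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def digest(doc: Dict) -> str:
    """Content hash of one document.

    Keys are sorted, so field order is ignored; values that JSON cannot
    encode (for example dates or ObjectIds) fall back to their str() form.
    """
    canonical = json.dumps(doc, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_collections(source: Iterable[Dict], target: Iterable[Dict],
                        id_field: str = "_id") -> Dict[str, List]:
    """Report ids missing from, unexpected in, or different on the target."""
    src = {d[id_field]: digest(d) for d in source}
    dst = {d[id_field]: digest(d) for d in target}
    return {
        "missing_in_target": sorted(k for k in src if k not in dst),
        "unexpected_in_target": sorted(k for k in dst if k not in src),
        "mismatched": sorted(k for k in src if k in dst and src[k] != dst[k]),
    }


if __name__ == "__main__":
    old = [{"_id": 1, "score": 10}, {"_id": 2, "score": 20}]
    new = [{"_id": 1, "score": 10}, {"_id": 2, "score": 25}]
    print(compare_collections(old, new))
```

For a full check the inputs would be complete scans of each collection; for an incremental check they could be restricted to documents changed since the last check. Holding all digests in memory only suits modest collection sizes, so a large dataset would need to be compared in key ranges.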

The benefits observed after the migration are significant resource cost savings, a lighter operational workload thanks to a unified cloud database console, resolution of the historical data-distribution issues, and better availability through multi-AZ deployment and built-in backup and recovery features.

During the migration, several unexpected issues were identified: connection-string updates that did not take effect in the Go MongoDB driver, uneven pre-sharding caused by an active Balancer, and the potential performance impact of the new sharding strategy. They were resolved, respectively, by redeploying the affected services, disabling the Balancer, and running performance tests.

Overall, the half-month migration was completed across four critical cut-over windows with zero service disruption, demonstrating the effectiveness of the proposed architecture and processes.
