Design and Migration of Zhihu's Read Service: High Availability, Performance, and TiDB Adoption
This article details Zhihu's read‑service architecture, covering its business requirements, high‑availability and high‑performance design goals, key components such as Proxy, Cache and Storage, extensive performance metrics, the migration from MySQL to TiDB, and the benefits brought by TiDB 3.0 features.
Zhihu operates a massive knowledge platform with billions of answers and hundreds of millions of users, requiring an efficient read service that filters already‑seen content out of the homepage feed and personalized push, while handling extremely high write and read throughput.
The system was designed around three core goals: high availability, high performance, and easy scalability. To achieve these, the architecture employs stateless proxies, layered Redis caches with slot‑based sharding, weak‑state services, and TiDB as the durable storage layer.
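The slot‑based sharding mentioned above can be pictured as a deterministic hash from a (user, content) key to a slot, and from a slot to a cache shard; because the mapping is pure computation, any stateless proxy instance can route a request identically. A minimal sketch, with illustrative slot counts and shard names that are not Zhihu's actual values:

```python
import hashlib

# Assumed, illustrative values -- the real deployment's slot count and
# shard topology are not given in the article.
NUM_SLOTS = 512
SHARDS = ["redis-a", "redis-b", "redis-c"]

def slot_for(member_id: str, content_id: str) -> int:
    """Hash a (user, content) pair to a deterministic slot."""
    key = f"{member_id}:{content_id}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_SLOTS

def shard_for(slot: int) -> str:
    """Map a slot to a cache shard; stateless proxies recompute this anywhere."""
    return SHARDS[slot % len(SHARDS)]
```

Keeping slots more numerous than shards lets slots be rebalanced across shards without rehashing every key, which is what makes the cache layer easy to scale.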
Key components include:
Proxy: a stateless layer that routes requests to the appropriate cache slots, providing session consistency and automatic failover across cache replicas.
Cache: a multi‑layered design using Bloom filters, write‑through and read‑through strategies, slot‑based buffering, and tag‑based isolation to improve hit rates and reduce database pressure.
Storage: originally MySQL with sharding and MHA, later migrated to TiDB for better scalability, fault tolerance, and cloud‑native operation.
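The Bloom filter in the cache layer exploits the fact that a negative answer is exact: if the filter says a user has never seen a document, storage need not be consulted at all, and only the (rare) positives fall through to a confirming lookup. A minimal sketch, with illustrative bit-array and hash-count sizes:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for "has this user already seen this document?".
    Sizes are illustrative, not production values."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def already_seen(bf: BloomFilter, user: str, doc: str, fetch_from_store) -> bool:
    """Read-through check: a negative filter answer is definitive and skips
    storage; a positive one may be a false positive, so confirm in storage."""
    key = f"{user}:{doc}"
    if not bf.might_contain(key):
        return False
    return fetch_from_store(user, doc)
```

Since most homepage candidates are genuinely unseen, the definitive-negative path absorbs the bulk of the read traffic before it ever reaches the database.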
Performance metrics show the service handling over 40,000 writes per second, 30,000 independent queries per second, and 12 million document reads per second, with P99 latency around 25 ms and P999 around 50 ms.
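The P99 and P999 figures above are tail percentiles over per-request latency samples. As a reminder of what they mean, the nearest-rank definition can be computed in a few lines (the sample data here is made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up latency samples in milliseconds.
latencies_ms = [5, 7, 8, 9, 10, 12, 25, 26, 48, 50]
p99 = percentile(latencies_ms, 99)
```

A P999 bound of 50 ms is the stricter promise: out of every thousand requests, at most one may exceed it.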
The migration to TiDB involved using TiDB DM for binlog capture and TiDB Lightning for bulk import of roughly 1.1 trillion records, followed by incremental sync. Post‑migration tuning, including query isolation, SQL hints, low‑precision TSO, and prepared‑statement reuse, brought latency within strict limits.
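The bulk-import phase boils down to loading data in fixed-size batches with per-batch commits, so a failure only loses the in-flight chunk. A toy sketch of that pattern, using sqlite3 purely as a stand-in target (the actual migration used TiDB Lightning, not application-level inserts; the table name, columns, and batch size here are illustrative):

```python
import sqlite3

def bulk_import(conn, records, batch_size=1000):
    """Insert records in fixed-size batches, committing after each batch so
    a mid-import failure can resume from the last committed chunk."""
    cur = conn.cursor()
    total = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        cur.executemany(
            "INSERT INTO read_records (member_id, content_id) VALUES (?, ?)",
            batch,
        )
        conn.commit()
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE read_records (member_id TEXT, content_id TEXT)")
n = bulk_import(conn, [(f"u{i}", f"c{i}") for i in range(2500)], batch_size=1000)
```

At Zhihu's scale (roughly 1.1 trillion records), Lightning additionally bypasses the SQL layer by writing sorted key-value data directly, which is what makes the initial load tractable before incremental sync takes over.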
TiDB 3.0 introduced features such as gRPC batch messages, multi‑threaded Raft stores, Plan Management, TiFlash, and the Titan storage engine, which further improved write throughput and latency for both the read‑service and an anti‑fraud system that requires extreme write rates.
In conclusion, the architecture demonstrates a cloud‑native, highly available, and scalable backend solution that can serve multiple business scenarios, emphasizing the importance of understanding workload characteristics, leveraging open‑source components, and continuously adopting new technologies.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.