
Design and Evolution of Zhihu's Read Service: From Bloom Filter to RBase and TiDB Migration

This article details the architectural design, technical challenges, and iterative evolution of Zhihu's read service, covering the early Bloom-filter solution, the RBase cache-through system, high-availability and performance optimizations, and the final migration to TiDB for cloud-native scalability.


Zhihu operates a large-scale knowledge platform with billions of read records, requiring a read service that can handle trillion-scale data volumes, high write throughput (up to 40 K writes per second), low latency (under 90 ms), and long-term storage.

The initial solution stored Bloom filters on a Redis cluster, but batch bit operations were costly and memory usage was high. The team then moved to HBase, mapping user IDs to rows and document IDs to qualifiers, which improved scalability but suffered from cache‑miss latency due to sparse access patterns.
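The membership check behind "has this user already read this document?" can be sketched as a Bloom filter. This toy Python version (the sizes, hash scheme, and key format are illustrative choices of mine, not Zhihu's production design) also shows why the Redis approach was expensive: every insert or lookup scatters k bit operations across a large bit array.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed-size bit array probed by k hash functions.

    Illustrative sketch only -- Zhihu's filters lived in a Redis cluster,
    not in-process, and used their own sizing and hashing.
    """

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from one key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely never added; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A batch check for one user against many candidate documents turns into `num_hashes × num_docs` random bit reads, which is the cost the team ran into at scale.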

To meet the growing demands, the team built RBase, a cache-through system based on a BigTable-style data model. RBase provides high availability, high performance, and easy extensibility by separating stateless proxies, layered caches, and a MySQL cluster managed by MHA. Weak-state components can be rebuilt from replicas, and the system is orchestrated with Kubernetes.

Key components include:

Proxy: load‑balances slots, binds sessions to replicas, and falls back to other slots on failure.

Cache: uses Bloom filters to keep read-state compact, write-through updates to avoid cache invalidations, and read-through with request coalescing so that concurrent misses for the same key trigger only one backend query.

MySQL (TokuDB): stores raw read records with compression, handling over a trillion rows (~13 TB).
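The cache layer's read-through-with-coalescing behavior can be sketched as follows. This is a pattern sketch, not RBase's code: `loader` stands in for the MySQL/TokuDB lookup, and all names are my own.

```python
import threading

class ReadThroughCache:
    """Read-through cache that coalesces concurrent misses for the same key.

    Sketch of the pattern described above; `loader` is assumed to be the
    (slow) backing-store lookup.
    """

    def __init__(self, loader):
        self.loader = loader
        self.cache = {}
        self.inflight = {}   # key -> Event signalling an in-progress load
        self.lock = threading.Lock()

    def get(self, key):
        while True:
            with self.lock:
                if key in self.cache:
                    return self.cache[key]
                event = self.inflight.get(key)
                if event is None:
                    # First miss for this key: we become the leader.
                    event = threading.Event()
                    self.inflight[key] = event
                    leader = True
                else:
                    leader = False
            if leader:
                value = self.loader(key)      # exactly one backend hit
                with self.lock:
                    self.cache[key] = value
                    del self.inflight[key]
                event.set()
                return value
            event.wait()                      # followers wait, then re-check

    def put(self, key, value):
        # Write-through: update the cache on write instead of invalidating,
        # so hot keys stay dense. (A real system would also persist here.)
        with self.lock:
            self.cache[key] = value
```

With this shape, a burst of identical queries during a cache miss collapses into a single backing-store read, which is what keeps P99 latency stable under sparse access patterns.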

Performance metrics in 2019 showed 40 K writes/s, 30 K QPS, and P99/P999 response times of 25 ms/50 ms.

Recognizing MySQL's operational limits, the team moved the service to a cloud-native footing by migrating to TiDB, a MySQL-compatible distributed database. Data migration used TiDB Lightning for the bulk load and DM for incremental sync, completing a 45 TB transfer in four days.

After migration, query latency initially spiked due to full‑user data reconstruction; the team introduced priority‑based SQL hints, low‑precision timestamps, and prepared‑statement reuse to keep the critical path fast. TiDB Binlog was adapted to Kafka with partitioning to avoid bottlenecks.
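A sketch of what those mitigations can look like at the SQL layer. The table and column names are hypothetical; `HIGH_PRIORITY`/`LOW_PRIORITY` statement modifiers and the `tidb_low_resolution_tso` session variable are TiDB features, but check them against your TiDB version before relying on this.

```python
def build_check_read_sql(high_priority: bool) -> str:
    """Compose a membership query with a TiDB priority modifier.

    Using %s placeholders lets the driver prepare the statement once and
    reuse it, avoiding per-query parse overhead on the critical path.
    """
    priority = "HIGH_PRIORITY" if high_priority else "LOW_PRIORITY"
    return (
        f"SELECT {priority} 1 FROM read_records "  # read_records is illustrative
        "WHERE user_id = %s AND doc_id = %s LIMIT 1"
    )

# Session setup for latency-sensitive connections: trade a little read
# freshness for speed by reusing a cached timestamp-oracle value.
SESSION_SETUP = [
    "SET @@tidb_low_resolution_tso = 1",
]
```

The idea is that user-facing lookups run at high priority while background reconstruction jobs run at low priority, so the two workloads stop competing for the same resources.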

Post‑migration metrics remained comparable to MySQL, and the service now enjoys horizontal scalability, high availability, and readiness for future traffic growth.

Lessons learned emphasize designing for high availability, performance, and extensibility from the start, abstracting reusable patterns, and embracing cloud‑native technologies.

Tags: Cloud Native, backend architecture, high availability, cache design, read service, TiDB migration
Written by

High Availability Architecture

Official account for High Availability Architecture.
