How Zhihu Scaled from 2 Engineers to 100M Users: Backend Architecture Lessons

This article recounts Zhihu's evolution from a tiny Python‑Tornado service on a single Linode to a massive, highly available backend employing custom logging, event‑driven processing, page‑render optimizations, and a service‑oriented architecture that now supports over 100 million users.

21CTO

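Zhihu is the third-largest Chinese UGC (user-generated content) community after Baidu Tieba and Douban, with over 11 million registered users, 80 million monthly users (a figure that exceeds the registered base, so it presumably counts visitors without accounts), and more than 2.2 billion page views per month.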

Initial Architecture Choices

When development began in October 2010, the team consisted of only two engineers, and it had grown to four by the December launch. Python was chosen as the primary language for its simplicity and development speed, and the Tornado framework was chosen for its asynchronous model, which suited the long-lived comet connections the product needed.

Early infrastructure relied on a 512 MB Linode VM to save costs, but rapid user growth exposed latency and reliability issues, prompting a move to self‑hosted servers in a data center and the implementation of web and database high‑availability with master‑slave replication.

The resulting architecture used a master-slave setup for both the web and database layers, separated database reads from writes, ran offline scripts on a dedicated server so they would not add latency to online requests, and upgraded the internal network, which increased throughput twenty-fold.
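Read-write separation can be illustrated with a small router. This is a minimal sketch, not Zhihu's implementation: the class name `ReadWriteRouter`, the server names, and the round-robin choice of replica are all assumptions made for illustration.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the master and spread reads across replicas."""

    # Statements that modify data must always go to the master.
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master when no replica is configured.
        self._replicas = itertools.cycle(replicas or [master])

    def pick(self, sql):
        """Return the name of the server that should run this statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._replicas)


# Hypothetical server names, for illustration only.
router = ReadWriteRouter("db-master", ["db-replica-1", "db-replica-2"])
```

With this sketch, `router.pick("INSERT ...")` returns the master, while successive `SELECT` statements alternate between the replicas. Because replication is asynchronous, a read issued right after a write may see stale data, so code that must read its own writes would still need to target the master.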

Logging System

With the opening of public registration in late 2011, the need to filter spam and ads led to the development of a distributed logging system named Kids (Kids Is Data Stream). Kids was designed for distributed collection, centralized storage, real-time processing, and log subscription, while remaining simple to operate.

Inspired by Scribe, each server runs a Kids Agent that aggregates messages and forwards them to either another Agent or the central Server. Subscribers can pull logs from the Server or from any Agent.
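The Agent and Server topology can be sketched in a few lines. This is an in-process illustration under stated assumptions, not the Kids API: the class names, the `publish` and `subscribe` methods, and the topic string are hypothetical, and networking and batching are left out.

```python
class Node:
    """Shared behaviour: deliver each message to subscribers of its topic."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers.get(topic, []):
            callback(message)


class Server(Node):
    """Central node: stores every message it receives."""

    def __init__(self):
        super().__init__()
        self.store = []

    def publish(self, topic, message):
        self.store.append((topic, message))
        super().publish(topic, message)


class Agent(Node):
    """Per-host node: serves local subscribers, then forwards upstream."""

    def __init__(self, upstream):
        super().__init__()
        self.upstream = upstream  # another Agent or the central Server

    def publish(self, topic, message):
        super().publish(topic, message)
        self.upstream.publish(topic, message)


# Topology: web host Agent -> relay Agent -> central Server.
server = Server()
relay = Agent(server)
web1 = Agent(relay)

seen_on_agent, seen_on_server = [], []
web1.subscribe("spam", seen_on_agent.append)
server.subscribe("spam", seen_on_server.append)
web1.publish("spam", "user 42 posted 30 links")
```

After the last line, both subscriber lists contain the message and `server.store` holds one entry, which mirrors the idea that a consumer can attach either to the central Server or to any Agent along the path.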

Kids also powers a web tool called Kids Explorer for inspecting live logs in real time. Kids has been open-sourced on GitHub.

Event‑Driven Architecture

As features grew, maintaining procedural update logic became untenable, so Zhihu introduced an event‑driven design. A custom message queue called Sink persists events locally before dispatching them. Beanstalkd handles task queuing, allowing parallel processing while ensuring durability.

For example, when a user answers a question, the answer is first stored in MySQL and the corresponding event is sent to Sink. Sink forwards the event through Miller to Beanstalkd, where workers pick up and process the resulting tasks.
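The flow above (persist the event, fan it out into tasks, let workers drain a queue) can be sketched with the standard library. This is an illustration under assumptions, not Zhihu's code: `queue.Queue` stands in for Beanstalkd, a temporary file stands in for Sink's local persistence, Miller is not modelled separately, and the task names in `TASKS_FOR_EVENT` are hypothetical.

```python
import json
import queue
import tempfile
import threading

# Hypothetical mapping from an event type to the tasks it triggers.
TASKS_FOR_EVENT = {
    "answer_created": ["notify_followers", "update_feed", "index_for_search"],
}


def accept_event(event, journal, tasks):
    """Persist the event first, then fan it out into one task per handler."""
    journal.write(json.dumps(event) + "\n")
    journal.flush()  # hand the bytes to the OS; a real journal would also fsync
    for name in TASKS_FOR_EVENT.get(event["type"], []):
        tasks.put((name, event))


def worker(tasks, done, lock):
    """Drain the task queue; a None item tells the worker to stop."""
    while True:
        item = tasks.get()
        if item is None:
            break
        name, event = item
        with lock:
            done.append((name, event["answer_id"]))


def run_demo(worker_count=3):
    tasks, done, lock = queue.Queue(), [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, done, lock))
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    with tempfile.TemporaryFile("w+") as journal:
        accept_event({"type": "answer_created", "answer_id": 7}, journal, tasks)
        journal.seek(0)
        persisted = [json.loads(line) for line in journal]
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    return persisted, sorted(done)
```

Calling `run_demo()` returns the journaled event together with the three tasks it produced. Writing to the journal before queueing is what lets a restarted dispatcher replay events that were accepted but never turned into tasks.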

At launch the system received about 10 messages per second, which produced 70 tasks. After scaling, it receives 100 events per second and generates 1,500 tasks, so the fan-out grew from 7 to 15 tasks per event while the event-driven pipeline continued to absorb the load.

Page Rendering Optimization

By 2013 Zhihu served millions of page views daily, making rendering both CPU- and I/O-intensive. The team introduced component-based rendering and a layered data-fetch hierarchy, in which a lower-level component skips its own request when a higher level has already fetched the data it needs.
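A minimal sketch of the layered fetch idea follows. It is an assumption about how such a hierarchy might look, not the rendering code Zhihu wrote: the `Component` base class, the `needs` and `children` attributes, and the sample components are all hypothetical.

```python
class Component:
    """A page fragment that declares the data keys it needs."""

    needs = ()
    children = ()

    def render(self, fetch, context=None):
        # Copy the inherited context so siblings do not affect each other.
        context = dict(context or {})
        for key in self.needs:
            if key not in context:  # fetch only what no ancestor has loaded
                context[key] = fetch(key)
        parts = [self.template(context)]
        parts += [child.render(fetch, context) for child in self.children]
        return "".join(parts)

    def template(self, context):
        return ""


class QuestionTitle(Component):
    needs = ("question",)

    def template(self, context):
        return "<h1>%s</h1>" % context["question"]


class AuthorBadge(Component):
    needs = ("author",)

    def template(self, context):
        return "<span>%s</span>" % context["author"]


class QuestionPage(Component):
    # The page loads both keys once; its children find them in the context.
    needs = ("question", "author")
    children = (QuestionTitle(), AuthorBadge())


calls = []


def fake_fetch(key):
    """Stand-in for a backend request; records which keys were fetched."""
    calls.append(key)
    return {"question": "Why Tornado?", "author": "alice"}[key]


html = QuestionPage().render(fake_fetch)
```

Here `fake_fetch` runs twice in total, once per key at the page level, even though each child also declares a need. The children reuse the inherited context instead of issuing their own requests.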

They also built a custom template engine called ZhihuNode. These changes reduced the question page load time from 500 ms to 150 ms and the feed page from 1 s to 600 ms.

Service‑Oriented Architecture (SOA)

To manage growing complexity, Zhihu migrated to SOA. The first RPC framework, Wish, used a strict serialization model over a custom STP protocol. As services multiplied, Wish became cumbersome, leading to Snow, which used JSON but lacked strict schema enforcement.

The third framework combined Snow’s simplicity with Apache Avro’s strict schema, offering pluggable transport (STP or binary) and serialization (JSON or Avro). A service registry enabled discovery by name, and a Zipkin‑based tracing system was built for performance monitoring.
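The combination of lookup by name, a pluggable serializer, and schema checking can be sketched as follows. Everything here is illustrative: `Registry`, `JsonSerializer`, `check_schema`, and the service name `answer.count` are hypothetical, and the hand-rolled type check only stands in for the kind of enforcement a real Avro schema provides.

```python
import json


class JsonSerializer:
    """One pluggable wire format; another class with the same two
    methods could be swapped in without changing the registry."""

    def dumps(self, obj):
        return json.dumps(obj, sort_keys=True).encode("utf-8")

    def loads(self, data):
        return json.loads(data.decode("utf-8"))


def check_schema(schema, payload):
    """Reject payloads whose fields or types differ from the schema."""
    if set(payload) != set(schema):
        raise TypeError("fields %s do not match schema" % sorted(payload))
    for field, expected in schema.items():
        if not isinstance(payload[field], expected):
            raise TypeError("%s must be %s" % (field, expected.__name__))


class Registry:
    """Maps a service name to a handler plus its request schema."""

    def __init__(self, serializer):
        self.serializer = serializer
        self.services = {}

    def register(self, name, schema, handler):
        self.services[name] = (schema, handler)

    def call(self, name, **params):
        schema, handler = self.services[name]  # discovery by name
        check_schema(schema, params)
        # Round-trip through bytes, as a network transport would.
        wire = self.serializer.dumps(params)
        result = handler(**self.serializer.loads(wire))
        return self.serializer.loads(self.serializer.dumps(result))


registry = Registry(JsonSerializer())
registry.register("answer.count", {"question_id": int},
                  lambda question_id: {"count": 3})
result = registry.call("answer.count", question_id=42)
```

Passing `question_id="42"` instead of an integer raises `TypeError` before anything is serialized, which is the kind of early failure that a schema-less JSON framework would not give.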

Services are organized into three layers (aggregation, content, and foundation) and classified as data, logic, or channel services. Data services handle specialized storage (e.g., images), logic services perform CPU-intensive tasks (e.g., answer parsing), and channel services act as pass-throughs (e.g., Sink).
