
Zhihu's Early Architecture and Evolution: Backend Development, Distributed Logging, Event‑Driven Design, and Service‑Oriented Architecture

The article chronicles Zhihu's growth from a two‑engineer startup using Python and Tornado on a single Linode server to a large‑scale backend employing high‑availability MySQL, Redis sharding, a custom distributed logging system (Kids), event‑driven processing with Sink and Beanstalkd, component‑based page rendering via ZhihuNode, and a multi‑layer SOA built on evolving RPC frameworks.

Art of Distributed System Architecture Design

Zhihu, the third-largest Chinese UGC community after Baidu Tieba and Douban, grew from nothing to more than 100 servers, serving more than 11 million registered users and over 2.2 billion page views per month.

Development began in October 2010 with only two engineers, and the team had grown to four by launch. They chose Python for its simplicity and strong community, and the Tornado framework for its asynchronous support, which suited Zhihu's need for long-lived Comet connections.

To save costs, the team initially ran on a single 512 MB Linode VM. Rapid user growth soon caused performance and latency problems, so they bought their own machines and colocated them. Early hardware failures then prompted high-availability setups for both the web and database tiers.

The architecture diagram from that period shows a master‑slave setup for both web and database layers, read‑write splitting, an offline‑script server to avoid impacting online latency, and upgraded internal networking that increased throughput twenty‑fold.
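Read-write splitting can be sketched as a small router that sends writes to the master and spreads reads over the replicas. This is only an illustration under stated assumptions, not Zhihu's code: the `ReadWriteRouter` class, the verb list, and the connection names are hypothetical, and plain strings stand in for real database connections.

```python
import itertools


class ReadWriteRouter:
    """Send write statements to the master and rotate reads over replicas."""

    # Hypothetical list of statement verbs treated as writes.
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master when no replica is configured.
        self._replicas = itertools.cycle(replicas or [master])

    def pick(self, sql):
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._replicas)


router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
target = router.pick("SELECT * FROM answer WHERE id = 1")
```

A router like this trades consistency for read capacity: a read issued right after a write may hit a replica that has not yet applied the change, so reads that must see the latest data still need to go to the master.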

By early 2011 Zhihu relied heavily on Redis for queues, search, and caching. Once a single node hit its limits, the team sharded Redis and added consistency mechanisms.
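The article does not say which sharding scheme Zhihu used. As a rough sketch of the general idea, the example below maps each key to a shard with a stable hash. The `ShardedStore` class is hypothetical, and plain dictionaries stand in for Redis nodes so that the example runs without a server.

```python
import zlib


class ShardedStore:
    """Spread keys over several stores using a stable hash of the key."""

    def __init__(self, shard_count):
        # Plain dicts stand in for individual Redis nodes (an assumption).
        self.shards = [dict() for _ in range(shard_count)]

    def shard_index(self, key):
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def set(self, key, value):
        self.shards[self.shard_index(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self.shard_index(key)].get(key, default)


store = ShardedStore(4)
store.set("user:1", "alice")
name = store.get("user:1")
```

Simple modulo sharding remaps most keys whenever the shard count changes, which is one reason production systems often prefer consistent hashing or a fixed slot table.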

The team also invested in tooling to improve efficiency: profiling, Werkzeug, Puppet, and a deployment tool called Shipit.

Zhihu needed a distributed log collection system that was real-time, centralized, and subscribable, so it built an in-house solution named Kids (Kids Is Data Stream). Kids follows Scribe's model: each server can act as an Agent or a Server. Agents gather messages from applications and forward them to other Agents or to a central Server, from which subscribers retrieve logs.

Kids also powers a web tool called Kids Explorer for real‑time log viewing, and the project has been open‑sourced on GitHub.
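The Agent and Server roles described above can be modelled in memory as follows. This is a toy sketch of the forwarding pattern only, not the Kids protocol or API: the class and method names are hypothetical, and real agents would buffer to disk and talk over the network.

```python
from collections import defaultdict


class Server:
    """Central node: delivers each message to subscribers of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers.get(topic, ()):
            callback(message)


class Agent:
    """Edge node: buffers messages and forwards them upstream on flush."""

    def __init__(self, upstream):
        self.upstream = upstream  # another Agent or a Server
        self._buffer = []

    def publish(self, topic, message):
        self._buffer.append((topic, message))

    def flush(self):
        pending, self._buffer = self._buffer, []
        for topic, message in pending:
            self.upstream.publish(topic, message)


server = Server()
received = []
server.subscribe("web.access", received.append)

relay = Agent(server)   # an Agent that forwards to the central Server
edge = Agent(relay)     # an Agent that forwards to another Agent
edge.publish("web.access", "GET /question/1")
edge.flush()
relay.flush()
```

Because Agents and the Server share the same `publish` method, an Agent does not need to know whether its upstream is another Agent or the central Server.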

As feature complexity grew, Zhihu adopted an event‑driven architecture. A custom message queue called Sink receives events, persists them locally, and then distributes them. Sink uses the Miller framework to enqueue tasks into Beanstalkd, which manages full‑cycle task processing. For example, when a user answers a question, the answer is stored in MySQL, the event is sent to Sink, which hands it to Beanstalkd, and workers process the subsequent tasks.

Initially the system handled about 10 messages per second and 70 resulting tasks; it now processes around 100 events per second and 1,500 tasks.
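The answer flow described above (store the answer, send an event to Sink, queue tasks, let workers process them) can be sketched with the standard library. Everything here is an assumption made for illustration: `queue.Queue` stands in for Beanstalkd, a list stands in for Sink's local persistence, and the `Sink` class, event names, and task names are hypothetical rather than Zhihu's or Miller's actual API.

```python
import json
import queue


class Sink:
    """Record each event, then fan it out as one task per registered handler."""

    def __init__(self, task_queue):
        self.task_queue = task_queue
        self.journal = []      # stands in for local persistence
        self._handlers = {}    # event type -> list of task names

    def on(self, event_type, task_name):
        self._handlers.setdefault(event_type, []).append(task_name)

    def emit(self, event_type, payload):
        # Persist first, so the event survives even if enqueueing fails.
        self.journal.append(json.dumps({"type": event_type, "payload": payload}))
        for task_name in self._handlers.get(event_type, []):
            self.task_queue.put((task_name, payload))


def run_worker(task_queue, tasks):
    """Drain the queue, running the function registered for each task name."""
    results = []
    while True:
        try:
            task_name, payload = task_queue.get_nowait()
        except queue.Empty:
            return results
        results.append(tasks[task_name](payload))


tasks = {
    "notify_followers": lambda p: ("notify", p["answer_id"]),
    "update_feed": lambda p: ("feed", p["answer_id"]),
}
task_queue = queue.Queue()
sink = Sink(task_queue)
sink.on("answer_created", "notify_followers")
sink.on("answer_created", "update_feed")
sink.emit("answer_created", {"answer_id": 7})
results = run_worker(task_queue, tasks)
```

One event fans out into several tasks here, which matches the article's observation that the number of tasks is much larger than the number of events.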

In 2013, with millions of daily page views, Zhihu optimized page rendering by componentizing the UI and creating a custom template engine called ZhihuNode. Fetching data hierarchically across components reduced redundant requests, cutting the answer page load time from 500 ms to 150 ms and the feed page load time from 1 s to 600 ms.
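The article does not describe ZhihuNode's internals. One plausible reading of hierarchical data fetching is that a parent component fetches data once and passes it down, so child components do not repeat the request. The sketch below illustrates that reading only; the `Component` class, its `fetch` and `view` parameters, and the sample data are all hypothetical.

```python
class Component:
    """A node in a component tree that inherits data from its parent."""

    def __init__(self, name, fetch=None, view=None, children=()):
        self.name = name
        self.fetch = fetch    # callable(data) -> dict of extra data, or None
        self.view = view      # callable(data) -> str of own content, or None
        self.children = list(children)

    def render(self, inherited=None):
        data = dict(inherited or {})
        if self.fetch:
            data.update(self.fetch(data))
        body = self.view(data) if self.view else ""
        # Children receive the parent's data instead of fetching it again.
        body += "".join(child.render(data) for child in self.children)
        return "<{0}>{1}</{0}>".format(self.name, body)


backend_calls = []


def fetch_answer(data):
    backend_calls.append("answer")  # count simulated backend requests
    return {"answer": {"author": "alice", "votes": 42}}


page = Component("page", fetch=fetch_answer, children=[
    Component("author", view=lambda d: d["answer"]["author"]),
    Component("votes", view=lambda d: str(d["answer"]["votes"])),
])
html = page.render()
```

In this sketch the page makes a single backend request even though two child components display parts of the answer.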

To manage the growing system complexity, Zhihu transitioned to a Service‑Oriented Architecture (SOA). The RPC layer evolved through three generations: Wish (strict serialization with a custom STP protocol), Snow (JSON‑based but loosely defined), and a third framework combining Snow’s simplicity with Apache Avro for flexible yet structured serialization, supporting pluggable transport layers.
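The idea of a pluggable transport layer can be sketched by keeping serialization in the client and hiding delivery behind a single `send` method. This is not Zhihu's framework: the class names are hypothetical, JSON is used in place of Avro to keep the example dependency-free, and the loopback transport simply calls a handler in the same process.

```python
import json


class LoopbackTransport:
    """Deliver request bytes straight to an in-process handler."""

    def __init__(self, handler):
        self.handler = handler

    def send(self, request_bytes):
        return self.handler(request_bytes)


class RpcClient:
    """Serialize calls and hand them to any object with send(bytes) -> bytes."""

    def __init__(self, transport):
        self.transport = transport

    def call(self, method, **params):
        request = json.dumps({"method": method, "params": params}).encode("utf-8")
        response = json.loads(self.transport.send(request).decode("utf-8"))
        if "error" in response:
            raise RuntimeError(response["error"])
        return response["result"]


def make_handler(methods):
    """Build a server-side handler from a dict of method name -> function."""
    def handle(request_bytes):
        request = json.loads(request_bytes.decode("utf-8"))
        func = methods.get(request["method"])
        if func is None:
            return json.dumps({"error": "unknown method"}).encode("utf-8")
        return json.dumps({"result": func(**request["params"])}).encode("utf-8")
    return handle


client = RpcClient(LoopbackTransport(make_handler({"add": lambda a, b: a + b})))
total = client.call("add", a=1, b=2)
```

Swapping `LoopbackTransport` for a class that sends the same bytes over HTTP or a raw socket would leave `RpcClient` unchanged, which is the property a pluggable transport is meant to provide.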

A service registry enables discovery by simple service names, and a tracing system built on Zipkin provides observability. Services are organized into three layers—aggregation, content, and foundation—and classified as data, logic, or channel services, with examples such as image storage (data), answer formatting (logic), and Sink (channel).
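Discovery by service name can be illustrated with an in-memory registry. The article gives no detail about Zhihu's registry, so the `ServiceRegistry` class, the service name, and the addresses below are hypothetical, and a real registry would also need health checks and a shared store.

```python
import random


class ServiceRegistry:
    """Map a service name to the set of addresses currently serving it."""

    def __init__(self):
        self._instances = {}

    def register(self, name, address):
        self._instances.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        self._instances.get(name, set()).discard(address)

    def lookup(self, name):
        instances = self._instances.get(name)
        if not instances:
            raise LookupError("no instance registered for %r" % name)
        # Pick one instance at random as a very simple form of load balancing.
        return random.choice(sorted(instances))


registry = ServiceRegistry()
registry.register("answer-format", "10.0.0.5:9000")
registry.register("answer-format", "10.0.0.6:9000")
address = registry.lookup("answer-format")
```

Callers only need the service name; instances can be added or removed without changing client configuration.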

Written by

Art of Distributed System Architecture Design
