Evolution of Zhihu's Application Deployment System: From Physical Machines to Cloud‑Native Kubernetes
This article details the design and evolution of Zhihu's deployment platform, covering its early physical‑machine system, the transition to container orchestration with Mesos and Kubernetes, and advanced features such as blue‑green and canary releases, pre‑deployment, and branch deployments that enable rapid, reliable continuous delivery for large‑scale internet services.
Application deployment is a critical part of software development, especially for internet companies that need fast iteration and continuous delivery while minimizing change and error costs. This article introduces the evolution of Zhihu's deployment platform from its inception to its current state, offering practical insights.
Zhihu's deployment system, built by the Engineering Efficiency team, serves almost all business services with roughly 2,000 daily deployments. With blue‑green deployment enabled, most production releases complete in under 10 seconds (excluding canary verification). Its key capabilities:
Supports container and physical‑machine deployments, covering online services, offline services, scheduled tasks, and static files.
Provides office‑network pre‑release capability.
Offers canary verification with fault detection and automatic rollback.
Enables blue‑green deployment with second‑level switch‑over and rollback.
Allows deployment of Merge Request code for debugging.
Technical Background
Before describing the deployment system, a brief overview of Zhihu's infrastructure and network topology is provided.
Zhihu Network Layout
The network is divided into three isolated parts:
Production network: external online servers, fully isolated for security.
Testing network: isolated from production; used for pre‑deployment testing.
Office network: internal staff network that can access both testing and production via jump hosts.
Traffic Management
Zhihu uses Nginx + HAProxy to route traffic. Developers configure locations in Nginx, which forwards requests to HAProxy; HAProxy maps traffic to real servers and also handles load balancing, rate limiting, and circuit breaking.
Continuous Integration
Jenkins + Docker are used for CI; the CI process generates immutable artifacts that serve as the basis for deployments.
Physical‑Machine Deployment
Initially, Zhihu relied on physical‑machine deployments with custom scripts, which were slow, risky, and hard to roll back. Around 2015, the first deployment system named nami (inspired by the One Piece character) was created.
nami used Fabric to upload CI artifacts to physical machines and extract them, then managed the service processes with Supervisor.
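The per-host steps of such a Fabric-driven deploy can be sketched as an ordered command list. All paths, the app/unit names, and the symlink convention below are illustrative assumptions, not nami's actual layout:

```python
# Sketch of the shell steps a nami-style physical-machine deploy runs on
# each target host. Paths and names are illustrative assumptions.

def build_deploy_commands(app: str, unit: str, artifact: str) -> list[str]:
    """Return the ordered shell commands executed on one host."""
    release_dir = f"/opt/{app}/releases/{artifact}"
    return [
        # 1. Unpack the immutable CI artifact into a versioned directory.
        f"mkdir -p {release_dir}",
        f"tar -xzf /tmp/{artifact}.tar.gz -C {release_dir}",
        # 2. Atomically repoint the 'current' symlink (cheap rollback).
        f"ln -sfn {release_dir} /opt/{app}/current",
        # 3. Have Supervisor restart the unit's process.
        f"supervisorctl restart {app}:{unit}",
    ]

commands = build_deploy_commands("zhihu-api", "web", "cand-42")
```

In practice Fabric would execute each command over SSH on every host in the group; the versioned-directory-plus-symlink layout is one common way to make rollback a single symlink flip.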
Application (App) and Service (Unit)
Each GitLab repository corresponds to an application, but a single codebase may run multiple services (e.g., API, scheduled tasks, Celery workers). Users configure start commands, parameters, and environment variables for each Unit via the deployment UI.
Candidate Version
Every deployment is based on a CI‑generated artifact, called a Candidate version. Typically, a Candidate corresponds to a Merge Request.
Deployment Stage
Deployments are split into multiple stages (e.g., Build, Test, Office, Canary 1, Canary 2, Production). Each stage can be set to auto‑deploy, enabling continuous deployment pipelines.
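The stage sequence above can be modeled as a simple pipeline where each stage carries an auto-deploy flag. The stage names come from the article; which stages are flagged auto is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    auto: bool  # deploy automatically once the previous stage succeeds

# Stage names from the article; the auto flags are illustrative.
PIPELINE = [
    Stage("Build", auto=True),
    Stage("Test", auto=True),
    Stage("Office", auto=False),
    Stage("Canary 1", auto=False),
    Stage("Canary 2", auto=True),
    Stage("Production", auto=False),
]

def next_auto_stages(pipeline: list[Stage], completed: str) -> list[str]:
    """Stages that proceed without human approval after `completed`."""
    i = [s.name for s in pipeline].index(completed)
    out = []
    for stage in pipeline[i + 1:]:
        if not stage.auto:
            break  # a manual stage halts the automatic chain
        out.append(stage.name)
    return out
```

With every stage flagged auto, the pipeline becomes fully continuous deployment; flipping any one flag to manual reintroduces a human gate at that point.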
Service Registration and Discovery
Before deploying to a physical machine, the host is removed from Consul; after deployment it is re‑registered. HAProxy configuration is updated via Consul‑Template, and a custom library diplomat pulls service lists from Consul for RPC and other use cases.
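The registration half of this cycle maps onto Consul's agent HTTP API (`PUT /v1/agent/service/register`, with `PUT /v1/agent/service/deregister/{id}` as its inverse). Below is a minimal sketch of the registration body; the check path mirrors the `/check_health` convention mentioned later, and the interval is an assumption:

```python
import json

def consul_registration(service: str, host: str, port: int) -> str:
    """JSON body for Consul's PUT /v1/agent/service/register endpoint.

    Before a deploy, the inverse call (PUT
    /v1/agent/service/deregister/{id}) removes the host so HAProxy,
    via Consul-Template, stops routing to it. The health-check path
    and 10s interval here are illustrative assumptions.
    """
    return json.dumps({
        "Name": service,
        "ID": f"{service}-{host}-{port}",
        "Address": host,
        "Port": port,
        "Check": {
            "HTTP": f"http://{host}:{port}/check_health",
            "Interval": "10s",
        },
    })
```

A client library like diplomat would then query Consul's catalog for the healthy instances of a service and hand them to the RPC layer.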
Container Deployment
Legacy Container System (Bay)
In late 2015, Zhihu adopted Mesos and built an initial container orchestration system called Bay, which supported rolling updates but could take up to 18 minutes for large groups.
Feature Enhancements
Health checks (/check_health) were added for HTTP/RPC services, and online/offline services were split to use rolling or full‑replace strategies.
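The rolling strategy for online services can be sketched as a batch loop that gates on the health check. `replace_one` and `healthy` stand in for the real orchestrator and `/check_health` calls; the batch size is an illustrative parameter:

```python
def rolling_update(instances, replace_one, healthy, batch_size=1):
    """Replace instances batch by batch, aborting on a failed check.

    `replace_one` swaps one container for a new-version container and
    `healthy` polls its /check_health endpoint; both are stand-ins for
    the real orchestrator calls (illustrative sketch).
    """
    updated = []
    for i in range(0, len(instances), batch_size):
        for inst in instances[i:i + batch_size]:
            new = replace_one(inst)
            if not healthy(new):
                raise RuntimeError(f"{new} failed health check; aborting")
            updated.append(new)
    return updated
```

Offline services skip this gating entirely and use a full-replace strategy, since no live traffic depends on them during the swap.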
Pre‑Release and Canary Release
Office Network Pre‑Release
Traffic from the office network is split at the Nginx layer to a dedicated HAProxy, allowing validation of changes before they reach external users.
Canary Release
Two canary stages (1% and 20% of production containers) were introduced between the office and production stages. Automated canary monitoring compares metrics against production; if anomalies are detected, the canary containers are destroyed and developers are notified. If no issues appear within six minutes, the production stage proceeds.
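A canary comparison of this kind often boils down to a thresholded ratio against the production baseline. The sketch below uses a single error-rate metric; the real system compares more metrics, and both thresholds here are assumptions:

```python
def canary_anomalous(canary_err_rate: float, prod_err_rate: float,
                     abs_floor: float = 0.01, ratio: float = 2.0) -> bool:
    """Flag the canary if its error rate is above a small absolute
    floor AND at least `ratio` times the production baseline.
    Both thresholds are illustrative assumptions; requiring the
    absolute floor avoids flagging noise in near-zero baselines."""
    return (canary_err_rate > abs_floor
            and canary_err_rate > ratio * prod_err_rate)
```

On a positive verdict the canary containers would be destroyed and developers paged; after the soak window (six minutes in the article) with no verdict, the production stage proceeds.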
New Container Deployment
To address Bay's speed and stability issues, the orchestration was migrated from Mesos to Kubernetes, resulting in the new system NewBay, which brings faster deployments and higher reliability.
Blue‑Green Deployment
NewBay implements true blue‑green deployment: new and old container groups coexist, and HAProxy switches traffic atomically, allowing second‑level rollbacks.
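The reason rollback is second-level is that switching is just a pointer flip between two warm groups. A minimal sketch (class and method names are illustrative, not NewBay's API):

```python
class BlueGreen:
    """Minimal sketch of blue-green switching: both container groups
    stay warm, and 'deploying' or 'rolling back' is the same atomic
    pointer flip that retargets HAProxy. Names are illustrative."""

    def __init__(self, blue: list[str], green: list[str]):
        self.groups = {"blue": blue, "green": green}
        self.live = "blue"

    def switch(self) -> list[str]:
        """Flip the live group; return the backends HAProxy should target."""
        self.live = "green" if self.live == "blue" else "blue"
        return self.groups[self.live]

    # Rollback is literally just switching back to the previous group.
    rollback = switch
```

The cost of this speed is capacity: during the overlap window, both the old and new groups consume resources simultaneously.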
Pre‑Deployment
During the canary phase, full‑production containers are started asynchronously so that the final production switch only needs to redirect traffic, reducing total rollout time to seconds.
Branch Deployment
Deployments can also be triggered for Merge Requests, enabling developers to test changes in isolated containers before merging to the main branch.
Platformization of the Deployment System
The entire workflow is encapsulated in Zhihu App Engine (ZAE), a developer platform that provides UI for monitoring deployment progress, logs, and common operations.
Overall, Zhihu's deployment system has matured since 2015, playing a vital role in accelerating business iteration, reducing failures, and shaping the company's product release cadence.