
Evolution of Zhuanzhuan's Test Environments: From Monolithic Setups to Docker‑Based Dynamic and Stable Environments

This article describes how Zhuanzhuan's test environments progressed from a handful of static machines to a Docker-based architecture that combines dynamic and stable environments. IP routing, then tag routing, and extensive automation addressed resource shortages, stability issues, and operational inefficiency, and brought significant reductions in resource usage, environment setup time, and user-reported problems.

转转QA

1 Test Environment Evolution

Testing environments are a core component for any software company. Zhuanzhuan’s testing environment has evolved from a few static setups to a flexible Docker‑based dynamic and stable environment system, adapting to cluster expansion and new business demands.

1.1 Monolithic Environment

In 2017, Zhuanzhuan started with five 64 GB machines forming five complete test environments, sufficient for daily needs. One machine was allocated to developers and the rest to testers, with conflicts resolved through coordination.

1.2 Dynamic + Stable Environments

As microservices multiplied, more branches were developed in parallel, and tests in shared environments interfered with one another. A new model introduced dynamic environments, which hold only the services being modified, alongside stable environments that mirror production. An environment platform managed the full lifecycle from request to reclamation. This met the need only partially.

The problem: once a request entered the stable environment, subsequent calls could not reach services in the dynamic environment. All upstream services, MQ producers, and similar components therefore had to be deployed on the test machine as well, and resource consumption rose sharply as the cluster grew.

1.3 Dynamic + Stable Environments (IP Routing)

To send traffic to the dynamic environment first and fall back to the stable environment only when necessary, IP routing was introduced, with the environment's IP acting as the lane identifier. This reduced resource usage by about 30%.
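
The lookup rule described above can be sketched as follows. This is a minimal illustration, not the platform's actual code: the `Instance` type, the `pick_instance` function, the `STABLE_LANE` marker, and every service name and IP are assumptions made for the example.

```python
from dataclasses import dataclass

STABLE_LANE = "stable"  # hypothetical marker for the stable environment


@dataclass(frozen=True)
class Instance:
    service: str
    ip: str
    lane: str  # IP of the owning dynamic environment, or STABLE_LANE


def pick_instance(service, lane_ip, registry):
    """Prefer the instance in the caller's dynamic environment, else stable."""
    candidates = [i for i in registry if i.service == service]
    for inst in candidates:
        if inst.lane == lane_ip:  # the service was deployed in this lane
            return inst
    for inst in candidates:
        if inst.lane == STABLE_LANE:  # fall back to the stable copy
            return inst
    raise LookupError(f"no instance registered for {service!r}")


# Made-up registry: "order" is under test in lane 10.0.0.5, "pay" is not.
registry = [
    Instance("order", "10.0.0.5", "10.0.0.5"),
    Instance("order", "10.0.1.1", STABLE_LANE),
    Instance("pay", "10.0.1.2", STABLE_LANE),
]
```

With this registry, a request in lane `10.0.0.5` reaches the dynamic copy of `order` but the stable copy of `pay`, so only the modified service has to be deployed in the dynamic environment.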

Despite the improvement, issues persisted as hardware shortages and scaling pressures continued.

2 Problems in Environment Usage

Three concerns had to be balanced against one another: system stability, resource cost, and usage efficiency. Limited procurement meant old machines could not be retired, which caused stability problems. Insufficient resources kept test-machine utilization high, so the stable environment could not keep its 30% memory headroom, which further hurt stability. Strict reclamation policies, in turn, degraded the user experience.

2.1 Resource Shortage

Business and cluster growth, combined with procurement delays, left the test pool at 3.8 TB with peak utilization of 80%, and machines with more than 40 GB of memory were hard to obtain.

2.2 Resource Waste

Memory allocations were fixed in size, so environments could not scale automatically. As services were updated, duplicate containers accumulated in both the dynamic and stable environments, and the resources they held were not returned to the pool.

2.3 Stability Issues

Hardware reliability: aged, out‑of‑warranty machines often failed, causing direct business impact.

Deployment complexity: the 7- to 8-step initialization process could fail at any step, and configuration replacement for databases, Redis, MQ, and ZooKeeper (ZK) was error-prone.

Manual edits to hosts files and Nginx configuration increased the chance of mistakes.

Lack of automatic scaling required manual environment recreation, raising time costs.

KVM‑based solutions had high maintenance overhead.

These issues generated roughly 25 environment-related tickets per week, consuming about 8 hours of ops time. To mitigate this, the team built tools for error analysis, VM restart, resource alerting, health monitoring, and migration.

3 Solution: Dynamic + Stable Environments (Tag Routing)

3.1 Architecture Changes

The platform was redesigned around Docker containers and the stable environment, and IP routing was replaced with tag routing. An environment is now identified by a tag and consists of multiple Docker containers, each with its own IP (e.g., environment yyy contains services B and D with IPs 192.168.5.1 and 192.168.6.1).
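
As a rough illustration of tag routing, the sketch below resolves each hop of a call chain by tag and falls back to the stable environment when the tag has no copy of the service. Environment yyy follows the example above, on the assumption that B and D map to the two IPs in the order given. The `stable` tag name, the stable IPs, services A and C, and the function names are all made up for the example.

```python
STABLE = "stable"  # hypothetical tag name for the stable environment

# (tag, service) -> container IP
ENDPOINTS = {
    ("yyy", "B"): "192.168.5.1",
    ("yyy", "D"): "192.168.6.1",
    (STABLE, "A"): "10.0.0.1",  # stable IPs are invented for illustration
    (STABLE, "B"): "10.0.0.2",
    (STABLE, "C"): "10.0.0.3",
    (STABLE, "D"): "10.0.0.4",
}


def resolve(service, tag):
    """Return (tag actually used, IP) for one hop of a tagged request."""
    ip = ENDPOINTS.get((tag, service))
    if ip is not None:
        return tag, ip
    return STABLE, ENDPOINTS[(STABLE, service)]


def trace(chain, tag):
    """Resolve every service in a call chain while carrying the same tag."""
    return [(svc, *resolve(svc, tag)) for svc in chain]


for hop in trace(["A", "B", "C", "D"], "yyy"):
    print(hop)
```

Because the tag travels with the request, a call that passes through stable service C can still return to the tagged copy of D. That return path is what the earlier dynamic + stable model could not provide.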

Image initialization and agent setup were eliminated. Environment size is no longer bounded by a single host, so one environment can hold all services if needed. With Kubernetes, a deployment brings up the new node first and then drains the old one, so deployments cause no downtime.
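
One common way to get this start-new-then-drain-old behavior on Kubernetes is a Deployment rolling update with `maxSurge: 1` and `maxUnavailable: 0`. The article does not say which workload type or settings the platform uses, so the sketch below is an assumption. The function name, the `env-tag` label key, and the image name are hypothetical, and the manifest is built as a plain Python dict rather than through a client library.

```python
def deployment_manifest(service, tag, image, replicas=1):
    """Build an apps/v1 Deployment dict that surges before it drains."""
    labels = {"app": service, "env-tag": tag}  # "env-tag" is a hypothetical label key
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{service}-{tag}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Start one extra pod first; never go below the desired count.
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": service, "image": image}]},
            },
        },
    }


manifest = deployment_manifest("order", "yyy", "registry.example/order:1.2.3")
```

In practice a readiness probe is also needed, so that the old pod is removed only after the new one can actually serve traffic.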

Engineering Standardization

Development teams upgraded their services so that test configurations follow the same conventions as production, which removed the need for platform-level configuration replacement.

Centralized Nginx

Per-environment Nginx instances were removed. A centralized Nginx now handles routing, which eliminated the errors that came from generating Nginx configuration for each environment.

Host Configuration Simplification

Unnecessary shared hosts entries were deleted, RPC calls were migrated to a service-management platform, and the remaining host names were resolved through internal DNS.

New Challenges and Mitigations

Tag routing introduced new concerns. The tag, not the IP, became an environment's unique identifier, and container IPs changed with every deployment, which affected hosts configuration, machine login, log access, and unit testing. Solutions included wildcard sub-domains (e.g., app-${tag}.zhuanzhuan.com), Whistle routing rules, webshell access, historical log queries, and a tag-based unit-test helper. An IDEA plugin later handled the changing IPs for remote debugging.
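
The wildcard sub-domain idea can be sketched as a gateway step that recovers the tag from the request's host name. The host pattern follows the `app-${tag}.zhuanzhuan.com` example above. The assumption that application names and tags contain only lowercase letters and digits, and the function name itself, are illustrative.

```python
import re

# Host pattern from the article's example: app-${tag}.zhuanzhuan.com
# Assumption: app names and tags use only lowercase letters and digits.
HOST_RE = re.compile(r"^(?P<app>[a-z0-9]+)-(?P<tag>[a-z0-9]+)\.zhuanzhuan\.com$")


def tag_from_host(host):
    """Return the environment tag encoded in a host name, or None."""
    hostname = host.split(":", 1)[0].lower()  # drop an optional port
    match = HOST_RE.match(hostname)
    return match.group("tag") if match else None


print(tag_from_host("app-yyy.zhuanzhuan.com"))  # yyy
print(tag_from_host("www.zhuanzhuan.com"))      # None
```

A host name without a tag yields `None`, which a router could treat as a request for the stable environment.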

New Operational Model

The smallest unit the platform manages shifted from a KVM host to a service within a tag. After a service under test is released, the platform syncs the latest code to the stable environment and removes the service from the test tag, reclaiming its resources automatically.
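
The promote-and-reclaim step might look like the sketch below. The article does not describe the platform's data model, so the two dictionaries, the version strings, and the `promote` function are assumptions chosen to keep the example self-contained.

```python
def promote(envs, stable, tag, service):
    """Sync a released service to stable and reclaim it from its test tag.

    envs:   {tag: {service: version}} for dynamic environments
    stable: {service: version} for the stable environment
    """
    version = envs[tag].pop(service)  # remove the service from the test tag
    stable[service] = version         # stable now runs the released code
    if not envs[tag]:                 # nothing left under this tag
        del envs[tag]                 # the whole environment is reclaimed
    return version


envs = {"yyy": {"B": "v2", "D": "v5"}}
stable = {"B": "v1", "D": "v4"}

promote(envs, stable, "yyy", "B")
print(envs)    # {'yyy': {'D': 'v5'}}
print(stable)  # {'B': 'v2', 'D': 'v4'}
```

Managing each service within a tag, rather than a whole host, is what lets resources return to the pool one service at a time.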

Results

User-reported issues dropped by 95%, and large-scale tests saw virtually no environment problems.

The time to apply for an environment fell from 28 minutes to under 5 minutes.

Resource consumption fell from 3200 GB to 1200 GB, a reduction of 62.5%.

Conclusion

After one month of design, three months of service upgrades, and a year of full rollout, Zhuanzhuan achieved substantial gains in architecture, operations, and engineering efficiency. Docker-based environments now provide testing that is available on demand and is not interrupted by deployments, with improvements in resource usage, performance, and efficiency that the team considers industry-leading.


About the author

Chen Qiu, Zhuanzhuan Engineering Efficiency Lead, responsible for configuration management and DevOps ecosystem.
