
Evolution of Zhuanzhuan Test Environment Governance: From Physical Isolation to Tag‑Based Traffic Routing

This article describes how Zhuanzhuan’s testing environment evolved through three versions: physical isolation, automatic IP‑tag routing, and manual‑tag routing. It covers the architectural background, implementation principles, advantages, drawbacks, and supporting tools. Together these changes dramatically reduced deployment time and resource consumption while introducing new operational challenges.

Zhuanzhuan Tech

1. Background and Requirements

Zhuanzhuan’s system originally used a monolithic architecture with a single web service per node behind Nginx load balancing. As concurrency grew, the architecture shifted to micro‑services, which made it harder to route a request precisely to a specific test node.

1.1 Evolution of System Architecture

In the monolithic architecture, traffic could easily be directed to a specific node by adjusting the Nginx upstream or by calling IP:port directly. In the micro‑service architecture, multiple services (A, B, C) call one another in a longer chain. Changing the Nginx upstream only controls the first hop, so it cannot target downstream services individually.

1.2 Testing Environment Requirements

Unlike production where all nodes run identical code, testing involves multiple parallel branches; each node may run different logic, requiring requests to be precisely routed to the intended service instance.

2. Traditional Solution – Physical Isolation

Physical isolation provides a completely separate test environment per requirement, containing all services, a registry, and an MQ broker. This is simple when there are only a few services, but it wastes resources once the system grows to hundreds of services.

3. Zhuanzhuan Test Environment V1 – Improved Physical Isolation

3.1 Stable Environment

A stable environment mirrors production with all services. Test environments do not use a service registry; each service is assigned a unique domain name and host file entries are manually edited. For example, the stable service A at 192.168.1.1 is mapped in every test host file as 192.168.1.1 A.zhuaninc.com .

3.2 Dynamic Environment

Each requirement gets a dynamic environment on a KVM VM (e.g., IP 192.168.2.1 ). When deploying service A in this environment, the host entry for the stable IP is overridden to 127.0.0.1 A.zhuaninc.com , ensuring traffic reaches the dynamic instance.

Request a dynamic environment (e.g., 192.168.4.1 ) and receive a full host file mapping to stable services.

Deploy service E' and write 127.0.0.1 to its host entry.

Deploy services D, C, B, A' sequentially, each writing 127.0.0.1 to the host file.

Deploy the Entry service.

Deploy Nginx and modify service A’s upstream to point only to 127.0.0.1 .

This creates a single‑branch chain from Nginx down to service E' on one machine, so host mappings alone are enough to route every call precisely.
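The host-file mechanism above can be sketched as a simple override of the stable mapping. This is an illustrative sketch only: the class and method names are hypothetical, and the entry for service B (domain and IP) is invented for the example. Only the A.zhuaninc.com mapping and the 127.0.0.1 override come from the text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HostFileSketch {
    /**
     * Starts from the stable mapping (domain -> stable IP) and points every
     * service deployed in the dynamic environment at the local machine.
     */
    static Map<String, String> buildHosts(Map<String, String> stableHosts,
                                          List<String> locallyDeployedDomains) {
        Map<String, String> hosts = new LinkedHashMap<>(stableHosts);
        for (String domain : locallyDeployedDomains) {
            hosts.put(domain, "127.0.0.1"); // override the stable IP
        }
        return hosts;
    }

    /** Renders the mapping in hosts-file order: "<ip> <domain>" per line. */
    static String render(Map<String, String> hosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : hosts.entrySet()) {
            sb.append(e.getValue()).append(' ').append(e.getKey()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> stable = new LinkedHashMap<>();
        stable.put("A.zhuaninc.com", "192.168.1.1");
        stable.put("B.zhuaninc.com", "192.168.1.2"); // hypothetical entry
        System.out.print(render(buildHosts(stable, List.of("A.zhuaninc.com"))));
    }
}
```

Services that are not deployed locally keep their stable IP, which is why every service on the call path between Nginx and the service under test has to be deployed on the same machine.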

3.3 Advantages and Disadvantages

Advantages

Strong isolation, similar to physical isolation.

Simplified link; traffic stays on one machine.

Disadvantages

Requires deploying all services from Nginx to the last tested service, leading to resource waste.

Services must be deployed in the order of their call relationships, so setup is inefficient and debugging an environment can take days.

Host file management is complex, and IP‑prefixed MQ topics are error‑prone.

Limited memory on a single machine restricts long chains.

4. Zhuanzhuan Test Environment V2 – Automatic IP‑Tag Traffic Routing

To reduce the number of services per dynamic environment (30‑60 → single‑digit) and cut setup time (hours → 30 min‑1 h), an automatic IP‑tag routing solution was introduced. Tags are derived from the VM’s IP, requiring no manual labeling.

Benefits: faster provisioning and fewer services per environment. Drawbacks: the approach still suffers from VM provisioning latency and KVM memory limits.

5. Zhuanzhuan Test Environment V3 – Manual‑Tag Traffic Routing

After dockerizing services to eliminate KVM memory constraints, IP‑based tagging became ineffective because each container has a distinct IP. Manual tags are therefore applied.

5.1 Dockerization

Services run in Docker containers, removing the need for pre‑allocated VM resources and eliminating memory caps.

5.2 Service and Traffic Tagging

When requesting an environment, a tag (e.g., yyy ) is assigned. The platform automatically adds a JVM argument -Dtag=yyy to each service. HTTP requests carry the tag via a header tag=yyy . Internal calls inherit the tag automatically.
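As a rough illustration of the two kinds of tag described above, the sketch below reads the instance tag from the -Dtag JVM argument and the traffic tag from a request header named tag. The class and method names are hypothetical, and treating a missing or blank tag as "stable / untagged" is an assumption made for the example.

```java
import java.util.Map;
import java.util.Optional;

final class TagSketch {
    // Names follow the article: JVM argument -Dtag=yyy and HTTP header tag=yyy.
    static final String TAG_KEY = "tag";

    /** Blank or missing values are treated as "no tag" (assumption). */
    static Optional<String> normalize(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(raw.trim());
    }

    /** Tag of this service instance, as set by the platform via -Dtag=... */
    static Optional<String> instanceTag() {
        return normalize(System.getProperty(TAG_KEY));
    }

    /** Tag carried by an incoming HTTP request; header names are case-insensitive. */
    static Optional<String> requestTag(Map<String, String> headers) {
        for (Map.Entry<String, String> h : headers.entrySet()) {
            if (TAG_KEY.equalsIgnoreCase(h.getKey())) {
                return normalize(h.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("instance tag: " + instanceTag().orElse("(none)"));
        System.out.println("request tag: " + requestTag(Map.of("tag", "yyy")).orElse("(none)"));
    }
}
```

Running the class with `java -Dtag=yyy TagSketch` makes the instance tag visible to the process without any change to the service's own code, which is the point of injecting it as a JVM argument.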

5.3 Target Shape

Only services that need modification are deployed in the dynamic environment; all other services remain in the stable environment.

5.4 RPC Implementation

Service Registration, Discovery, and Invocation

Services register their tag with the registry. When service A calls B, it discovers all B instances (stable, dynamic with tag yyy , dynamic with tag xxx ) and selects the one whose tag matches the current request.
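The selection rule can be sketched as follows. This is not the actual RPC framework code: the Instance type and the select method are hypothetical, and falling back to the stable instance when no dynamic instance carries the request's tag is an assumption inferred from the target shape in 5.3, where unmodified services stay in the stable environment.

```java
import java.util.List;
import java.util.Optional;

final class TagRouterSketch {
    /** A discovered instance; a null tag marks a stable instance (assumption). */
    static final class Instance {
        final String address;
        final String tag;

        Instance(String address, String tag) {
            this.address = address;
            this.tag = tag;
        }
    }

    /**
     * Prefers the instance whose tag equals the request tag; otherwise
     * falls back to the first stable instance, if any.
     */
    static Optional<Instance> select(List<Instance> instances, String requestTag) {
        Instance stable = null;
        for (Instance candidate : instances) {
            if (candidate.tag == null) {
                if (stable == null) {
                    stable = candidate;
                }
            } else if (candidate.tag.equals(requestTag)) {
                return Optional.of(candidate);
            }
        }
        return Optional.ofNullable(stable);
    }

    public static void main(String[] args) {
        List<Instance> b = List.of(
                new Instance("10.0.0.1:8080", null),   // stable
                new Instance("10.0.0.2:8080", "yyy"),  // dynamic, tag yyy
                new Instance("10.0.0.3:8080", "xxx")); // dynamic, tag xxx
        System.out.println(select(b, "yyy").map(i -> i.address).orElse("(none)"));
    }
}
```

A real client would also load-balance among several matching instances; the sketch returns the first match to keep the rule itself visible.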

Tag Propagation

The custom RPC framework transmits the tag via an attachment field.

5.5 MQ Message Implementation

Consumption Principle

Both dynamic and stable environments share the same topic but use different consumer groups. Dynamic groups prepend ${tag} , stable groups prepend test_ . The MQ client adds the prefix automatically.
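The prefixing rule might look like the sketch below. The class and method names are hypothetical, and the underscore placed between the tag and the original group name is an assumption; the text only states that dynamic groups are prefixed with the tag and stable groups with test_.

```java
final class ConsumerGroupSketch {
    static final String STABLE_PREFIX = "test_";

    /**
     * Derives the effective consumer group for an environment.
     * A null or empty tag means the stable environment.
     */
    static String groupFor(String baseGroup, String tag) {
        if (tag == null || tag.isEmpty()) {
            return STABLE_PREFIX + baseGroup;
        }
        return tag + "_" + baseGroup; // separator is an assumption
    }

    public static void main(String[] args) {
        System.out.println(groupFor("order_consumer", null));
        System.out.println(groupFor("order_consumer", "yyy"));
    }
}
```

Because each environment ends up in its own consumer group on the shared topic, every environment keeps its own consumption offsets.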

Issues

If a dynamic consumer goes offline, messages may be lost due to offset mismatches; the remedy is to replay the missed messages. Replaying can cause duplicate consumption, which is acceptable because RocketMQ only guarantees at‑least‑once delivery, so consumers already have to tolerate duplicates.

Tag Transmission

RocketMQ’s extensible headers carry the routing tag.

5.6 In‑Process Tag Transmission

ThreadLocal

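A standard ThreadLocal value is visible only to the thread that set it, so the tag is lost in new threads and thread pools. InheritableThreadLocal copies the value into a child thread when that thread is created, but pooled threads are created once and then reused, so it does not work across thread pools either.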

TransmittableThreadLocal

Alibaba’s open‑source TransmittableThreadLocal (via Java agent) enables transparent tag propagation across threads and thread pools.
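To make the thread-pool problem concrete, the plain-JDK sketch below shows a pooled worker missing the tag even with InheritableThreadLocal, and a capture-and-restore wrapper that fixes it. This illustrates the general idea of capturing context at submission time; it does not use or reproduce the TransmittableThreadLocal API, and all names here are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TagPropagationSketch {
    static final InheritableThreadLocal<String> TAG = new InheritableThreadLocal<>();

    /** Captures the submitter's tag now and restores it around the task later. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = TAG.get(); // read on the submitting thread
        return () -> {
            String previous = TAG.get();   // whatever the worker held before
            TAG.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    TAG.remove();
                } else {
                    TAG.set(previous);
                }
            }
        };
    }

    /** Returns {tag seen by a plain task, tag seen by a wrapped task}. */
    static String[] demo() {
        TAG.remove();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<String> readTag = TAG::get;
        try {
            // Warm-up: the worker thread is created before any tag is set,
            // so it inherits nothing and is reused for the later tasks.
            pool.submit(readTag).get();
            TAG.set("yyy");
            String plain = pool.submit(readTag).get();
            String wrapped = pool.submit(wrap(readTag)).get();
            return new String[] {plain, wrapped};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            TAG.remove();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("plain task saw:   " + seen[0]);
        System.out.println("wrapped task saw: " + seen[1]);
    }
}
```

Wrapping every task by hand is intrusive, which is why an agent-based approach that instruments executors transparently is attractive for a platform-wide routing tag.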

5.7 Auxiliary Facilities

Wildcard Domain Resolution

Instead of configuring host entries, domains can embed the tag (e.g., app-${tag}.test.zhuanzhuan.com ) and resolve directly to the test Nginx.
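One way the tag could be recovered from such a domain is sketched below. The text does not say where or how this extraction happens, so the parsing rule (the tag is whatever follows the last hyphen in the first label) and all names are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Optional;

final class WildcardDomainSketch {
    // Suffix taken from the example domain app-${tag}.test.zhuanzhuan.com.
    static final String SUFFIX = ".test.zhuanzhuan.com";

    /** Extracts the tag from a Host value such as "app-yyy.test.zhuanzhuan.com". */
    static Optional<String> tagFromHost(String host) {
        if (host == null) {
            return Optional.empty();
        }
        // Host names are case-insensitive, so normalize before matching.
        String h = host.trim().toLowerCase(Locale.ROOT);
        int colon = h.indexOf(':');
        if (colon >= 0) {
            h = h.substring(0, colon); // drop an optional port
        }
        if (!h.endsWith(SUFFIX)) {
            return Optional.empty();
        }
        String label = h.substring(0, h.length() - SUFFIX.length());
        int dash = label.lastIndexOf('-');
        if (label.contains(".") || dash <= 0 || dash == label.length() - 1) {
            return Optional.empty(); // no "app-tag" shape in the first label
        }
        return Optional.of(label.substring(dash + 1));
    }

    public static void main(String[] args) {
        System.out.println(tagFromHost("app-yyy.test.zhuanzhuan.com").orElse("(none)"));
    }
}
```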

Web Shell

A web‑based shell allows one‑click login to the container’s log directory without manual IP entry.

Debug Plugin

The plugin reads the service name and tag, queries the environment platform for the debug port, and automatically connects to the correct container.

6. Distributed Call Tracing System

To diagnose routing failures (e.g., logs missing for a D → E' call), a tracing system records the entry and exit points of each module, generating spans identified by a TraceId and a SpanId. Zhuanzhuan’s system uses a custom client (Radar) that pushes spans to a Collector, which writes them to Kafka; Zipkin consumes the data and provides a UI.

TraceId is injected into MDC via SLF4J, printed in logs, and returned to the front‑end via HTTP headers for quick lookup.

At key routing nodes, both the traffic tag ( global.route.context.tag ) and the instance tag ( global.route.instance.tag ) are recorded, allowing verification of correct routing.
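Given those two recorded values, a routing check could be as simple as the sketch below. The tag keys come from the text; the verdict names, the class, and the convention that a stable instance records an empty or missing instance tag are assumptions for illustration.

```java
import java.util.Map;

final class RouteCheckSketch {
    enum Verdict { MATCHED, FELL_BACK_TO_STABLE, MISROUTED }

    static final String CONTEXT_TAG = "global.route.context.tag";
    static final String INSTANCE_TAG = "global.route.instance.tag";

    /** Compares the traffic tag with the tag of the instance that served the span. */
    static Verdict check(Map<String, String> spanTags) {
        String wanted = clean(spanTags.get(CONTEXT_TAG));
        String actual = clean(spanTags.get(INSTANCE_TAG));
        if (actual.equals(wanted)) {
            return Verdict.MATCHED;
        }
        if (actual.isEmpty()) {
            // Tagged traffic served by a stable instance (assumed to have no tag).
            return Verdict.FELL_BACK_TO_STABLE;
        }
        return Verdict.MISROUTED; // served by a different environment's instance
    }

    private static String clean(String value) {
        return value == null ? "" : value.trim();
    }

    public static void main(String[] args) {
        System.out.println(check(Map.of(CONTEXT_TAG, "yyy", INSTANCE_TAG, "xxx")));
    }
}
```

A fallback to stable is expected for services that were not deployed in the dynamic environment, so only the third verdict necessarily indicates a routing bug.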

7. Summary

Zhuanzhuan’s test environment governance progressed through three versions: physical isolation, automatic IP‑tag routing, and manual‑tag routing. Physical isolation took days to set up and needed 30‑60 services per environment; automatic IP tagging reduced this to 7‑8 services and roughly 30 minutes to 1 hour of setup; manual tagging cut it further to 3‑4 services and 2‑5 minutes of setup, saving about 65% of memory.

While tag‑based routing improves efficiency and reduces resource usage, it introduces complexities such as longer link chains, changing IPs, and the need for supporting tools like tracing, wildcard DNS, web shell, and debug plugins. Overall, the project received company awards for cost reduction and operational efficiency.
