
Offline Simulation (OSim): Building Unlimited Test Environments for Large‑Scale Services

OSim (Offline Simulation) provides unlimited, production‑like test environments for large‑scale services. It combines a shared benchmark environment with branch‑specific instances, routes traffic by colored trace IDs through a sidecar gateway, and proxies Redis and MQ data. This overcomes the bottlenecks of the All‑in‑One model and improves stability, automation, and developer productivity.

Didi Tech

In software development, a testing environment is a critical component that provides a safe, isolated space for developers and QA to verify functionality, performance, and stability.

Early in a product’s lifecycle, a simple All‑in‑One container can serve the whole team, but as the number of dependent services grows to hundreds, this model becomes a bottleneck. The article describes the pain points faced by large companies like Didi when scaling test environments.

Why build an offline simulation environment? A production‑like environment isolated from live traffic enables reliable verification of changes without risking production stability.

All‑in‑One vs. Simulation – All‑in‑One packages all services into a single image, which is easy to spin up but becomes unwieldy at scale. Simulation aims to mirror production as closely as possible (except for network isolation) while keeping maintenance costs comparable to a single extra cluster.

The proposed solution, OSim (Offline Simulation), introduces a standard baseline environment (the “benchmark”) and allows unlimited branch environments for individual feature testing. Only services that change need a dedicated branch instance; all other traffic falls back to the benchmark.
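The fallback rule can be sketched in a few lines of Python. This is a minimal illustration, not OSim's actual implementation: the `EnvironmentRegistry` class, the `BENCHMARK` constant, the service names, and the addresses are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical name for the shared baseline environment.
BENCHMARK = "benchmark"


@dataclass
class EnvironmentRegistry:
    """Maps a service and an environment name to an instance address."""

    # instances[service][environment] -> address
    instances: dict[str, dict[str, str]] = field(default_factory=dict)

    def register(self, service: str, environment: str, address: str) -> None:
        self.instances.setdefault(service, {})[environment] = address

    def resolve(self, service: str, branch: str | None) -> str:
        """Return the branch instance if one is deployed, else the benchmark."""
        by_env = self.instances[service]
        if branch is not None and branch in by_env:
            return by_env[branch]
        return by_env[BENCHMARK]


registry = EnvironmentRegistry()
registry.register("order", BENCHMARK, "10.0.0.1:8080")
registry.register("pay", BENCHMARK, "10.0.0.2:8080")
# Only the changed service gets an instance in branch osim100.
registry.register("order", "osim100", "10.0.1.1:8080")

print(registry.resolve("order", "osim100"))  # branch instance
print(registry.resolve("pay", "osim100"))    # falls back to the benchmark
```

Because unchanged services are never duplicated, the cost of a new branch under this model grows with the number of changed services rather than with the size of the whole system.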

Key techniques:

Traffic coloring: embed a branch identifier (e.g., osim100) into the trace ID so every service can route requests to the appropriate branch.

Sidecar gateway: a self‑developed sidecar that intercepts traffic, reads the branch tag, and forwards requests to the correct environment (supports HTTP/1.x, Thrift, and Dubbo).

Redis proxy: adds a prefix (e.g., osim100_) to keys based on the source IP, isolating data per branch without changing client code.

MQ proxy: similar tagging for message queues to ensure branch‑specific consumption.
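Two of these techniques, traffic coloring and Redis key prefixing, can be illustrated with a short Python sketch. The article does not give the real trace ID encoding or the proxy's lookup logic, so everything here is an assumption: the "-osim<digits>" suffix format, the function names, the IP‑to‑branch table, and the choice to leave benchmark keys unprefixed.

```python
import re
from typing import Dict, Optional

# Assumption: the branch tag is appended to the trace ID after a "-"
# separator and looks like "osim<digits>". The real format is not specified.
_BRANCH_TAG = re.compile(r"-(osim\d+)$")


def color_trace_id(trace_id: str, branch: str) -> str:
    """Attach a branch tag to a trace ID (hypothetical suffix format)."""
    return f"{trace_id}-{branch}"


def branch_from_trace_id(trace_id: str) -> Optional[str]:
    """Return the branch tag carried by a trace ID, or None if uncolored."""
    match = _BRANCH_TAG.search(trace_id)
    return match.group(1) if match else None


def prefixed_key(key: str, source_ip: str, ip_to_branch: Dict[str, str]) -> str:
    """Rewrite a Redis key the way a per-branch proxy might.

    Clients whose IP belongs to a branch get "<branch>_" prepended.
    Clients with unknown IPs (assumed to be benchmark) keep the original key.
    """
    branch = ip_to_branch.get(source_ip)
    return f"{branch}_{key}" if branch else key


# Hypothetical table mapping branch instance IPs to branch names.
ip_table = {"10.0.1.5": "osim100"}

colored = color_trace_id("abc123", "osim100")
print(colored, branch_from_trace_id(colored))
print(prefixed_key("user:42", "10.0.1.5", ip_table))
print(prefixed_key("user:42", "10.0.0.9", ip_table))
```

Keying the prefix on the source IP rather than on anything the client sends is what lets the proxy isolate data without client changes: the application keeps reading and writing `user:42`, and only the proxy knows which branch it belongs to.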

[Figures: OSim architecture diagrams]

Operational metrics are defined to monitor stability: (1) automation success rate for releases, and (2) project delays caused by environment issues. Regular post‑mortems, weekly stability reports, and clear responsibility matrices (SRE, business RD, QA, environment FT) ensure continuous improvement.
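As a rough sketch of how the two metrics might be computed from release records, consider the Python below. The article names the metrics but not their exact definitions, so the `ReleaseRecord` fields and both formulas are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReleaseRecord:
    """One release attempt in the environment (hypothetical record shape)."""

    automation_passed: bool       # did the automated run succeed?
    delayed_by_environment: bool  # was the project delayed by an environment issue?


def automation_success_rate(records: Iterable[ReleaseRecord]) -> float:
    """Fraction of releases whose automated run passed (0.0 if no releases)."""
    items = list(records)
    if not items:
        return 0.0
    return sum(r.automation_passed for r in items) / len(items)


def environment_delay_count(records: Iterable[ReleaseRecord]) -> int:
    """Number of releases whose project was delayed by an environment issue."""
    return sum(r.delayed_by_environment for r in records)


week = [
    ReleaseRecord(automation_passed=True, delayed_by_environment=False),
    ReleaseRecord(automation_passed=True, delayed_by_environment=False),
    ReleaseRecord(automation_passed=False, delayed_by_environment=True),
    ReleaseRecord(automation_passed=True, delayed_by_environment=False),
]
print(automation_success_rate(week))  # 0.75
print(environment_delay_count(week))  # 1
```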

Future work includes improving local developer experience (IDE integration), visualizing traffic flows, expanding mock capabilities in the sidecar, and enhancing automation coverage.

Tags: simulation, Microservices, testing, DevOps, traffic isolation, offline environment, Sidecar