Offline Simulation (OSim): Building Unlimited Test Environments for Large‑Scale Services

OSim (Offline Simulation) creates an unlimited number of production‑like test environments for large‑scale services. It combines a shared benchmark with branch‑specific instances, routes traffic through a sidecar gateway using colored trace IDs, and proxies Redis and MQ data. This removes the All‑in‑One bottleneck and improves stability, automation, and developer productivity.

Didi Tech

Aug 29, 2023

In software development, a testing environment is a critical component that provides a safe, isolated space for developers and QA to verify functionality, performance, and stability.

Early in a product’s lifecycle, a simple All‑in‑One container can serve the whole team, but once the number of dependent services grows into the hundreds, this model becomes a bottleneck. This article describes the pain points that large companies such as Didi face when scaling test environments.

Why build an offline simulation environment? A production‑like environment isolated from live traffic enables reliable verification of changes without risking production stability.

All‑in‑One vs. Simulation – All‑in‑One packages all services into a single image, which is easy to spin up but becomes unwieldy at scale. Simulation aims to mirror production as closely as possible (except for network isolation) while keeping maintenance costs comparable to a single extra cluster.

The proposed solution, OSim (Offline Simulation), introduces a standard baseline environment (the “benchmark”) and allows an unlimited number of branch environments for individual feature testing. Only the services that change need a dedicated branch instance; all other traffic falls back to the benchmark.
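The fallback rule can be illustrated with a minimal sketch. The registry layout, the `BENCHMARK` name, the service names, and the addresses below are all hypothetical and chosen only for illustration; the source does not describe how OSim stores this mapping.

```python
from typing import Dict, Optional, Tuple

BENCHMARK = "benchmark"  # hypothetical name for the shared baseline environment

# Hypothetical registry: (environment, service) -> instance address.
# Only "pricing" has a dedicated instance in branch osim100.
REGISTRY: Dict[Tuple[str, str], str] = {
    (BENCHMARK, "order"): "10.0.0.1:8000",
    (BENCHMARK, "pricing"): "10.0.0.2:8000",
    (BENCHMARK, "dispatch"): "10.0.0.3:8000",
    ("osim100", "pricing"): "10.0.1.2:8000",
}


def resolve(service: str, branch: Optional[str]) -> str:
    """Return the branch instance if one exists, otherwise the benchmark one."""
    if branch is not None:
        addr = REGISTRY.get((branch, service))
        if addr is not None:
            return addr
    # No branch given, or the branch did not change this service.
    return REGISTRY[(BENCHMARK, service)]
```

In this sketch, a request tagged `osim100` reaches the branch copy of `pricing` but the benchmark copy of `order`, which is why a branch only needs to deploy the services it modifies.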

Key techniques:

Traffic coloring: embed a branch identifier (e.g., osim100) into the trace ID so every service can route requests to the appropriate branch.

Sidecar gateway: an in‑house sidecar that intercepts traffic, reads the branch tag, and forwards each request to the correct environment (supports HTTP/1.x, Thrift, and Dubbo).

Redis proxy: adds a prefix (e.g., osim100_) to keys based on the source IP, isolating data per branch without changing client code.

MQ proxy: applies the same kind of branch tagging to message queues so that each branch consumes only its own messages.
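Two of the techniques above, traffic coloring and Redis key prefixing, can be sketched as follows. The trace ID layout (a hex string with a `-osimNNN` suffix), the `IP_TO_BRANCH` table, and the choice to leave benchmark keys unprefixed are assumptions made for this example; the source only states that the branch identifier is embedded in the trace ID and that the Redis proxy derives the prefix from the source IP.

```python
import re
import uuid
from typing import Dict, Optional

# Assumed format: "<32 hex chars>-osim<number>"; the real layout is not given.
BRANCH_TAG = re.compile(r"-(osim\d+)$")


def color_trace_id(branch: Optional[str] = None) -> str:
    """Create a trace ID, appending the branch tag when one is given."""
    base = uuid.uuid4().hex
    return f"{base}-{branch}" if branch else base


def branch_of(trace_id: str) -> Optional[str]:
    """Extract the branch tag from a trace ID, or None for untagged traffic."""
    match = BRANCH_TAG.search(trace_id)
    return match.group(1) if match else None


# Hypothetical table the Redis proxy could use: client IP -> branch name.
IP_TO_BRANCH: Dict[str, str] = {"10.0.1.2": "osim100"}


def proxied_key(source_ip: str, key: str) -> str:
    """Prefix the key for branch clients; benchmark clients keep the raw key."""
    branch = IP_TO_BRANCH.get(source_ip)
    return f"{branch}_{key}" if branch else key
```

Because the prefix is chosen from the caller's IP inside the proxy, the client keeps issuing the same key (for example `user:42`) while the branch instance actually reads and writes `osim100_user:42`.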

Operational metrics are defined to monitor stability: (1) automation success rate for releases, and (2) project delays caused by environment issues. Regular post‑mortems, weekly stability reports, and clear responsibility matrices (SRE, business RD, QA, environment FT) ensure continuous improvement.
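As a rough illustration of the two metrics, the sketch below computes them from simple records. The record shapes and function names are hypothetical; the source does not specify how Didi collects or defines these figures in detail.

```python
from typing import List


def automation_success_rate(release_results: List[bool]) -> float:
    """Share of automated releases that succeeded (True = success)."""
    if not release_results:
        return 0.0  # assumption: report 0.0 when there were no releases
    return sum(release_results) / len(release_results)


def environment_delay_count(delay_causes: List[str]) -> int:
    """Number of project delays attributed to the test environment."""
    # "environment" is a hypothetical cause label used only in this sketch.
    return sum(1 for cause in delay_causes if cause == "environment")
```

For example, three successful releases out of four give a success rate of 0.75, and a weekly stability report could track that value alongside the delay count.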

Future work includes improving local developer experience (IDE integration), visualizing traffic flows, expanding mock capabilities in the sidecar, and enhancing automation coverage.
