
Design and Implementation of Full‑Link Load Testing at Dada Group

This article describes Dada Group’s evolution from a simple 1:1 test environment to a sophisticated machine‑labeling load‑testing solution, detailing core design, isolation techniques, custom testing platform, model construction, pre‑heat strategies, and post‑test analysis that ensure system stability during high‑traffic events.

High Availability Architecture

1. Background

Dada Group, founded in Shanghai in 2014, is a local instant‑delivery platform that has steadily grown over seven years. During the 2020 Double‑11 promotion, daily completed orders exceeded ten million. To guarantee stability for such volume, full‑link load testing plays a crucial role in Dada’s reliability assurance.

2. Core Design of Full‑Link Load Testing

2.1 Industry Practices

The traditional approach builds a 1:1 test environment mirroring production, which is simple but incurs high human and machine costs as scale grows. The industry‑wide “traffic labeling” method tags requests (HTTP, RPC, MQ) so that labels travel across services, isolating test traffic from production. Data isolation is achieved by:

Using shadow databases/tables at the DB layer.

Using shadow caches at the cache layer.

Using shadow queues at the MQ layer.
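To make the labeling idea concrete, here is a minimal Python sketch (not Dada's implementation, which rejected this approach): a test label arrives in a hypothetical `X-Test-Traffic` header, is held in request context, and routes writes to a shadow table named with an assumed `shadow_` prefix.

```python
import contextvars

# Hypothetical header name and table-prefix convention, for illustration only.
TEST_LABEL = contextvars.ContextVar("test_traffic", default=False)

def handle_request(headers: dict) -> None:
    """Read the test label from an inbound request and keep it in request context."""
    TEST_LABEL.set(headers.get("X-Test-Traffic") == "1")

def table_for(base_table: str) -> str:
    """Route labeled traffic to a shadow table; production traffic keeps the real table."""
    return f"shadow_{base_table}" if TEST_LABEL.get() else base_table

handle_request({"X-Test-Traffic": "1"})
print(table_for("orders"))  # shadow_orders
handle_request({})
print(table_for("orders"))  # orders
```

In a real system the label would also be copied into outbound RPC and MQ headers so it survives service hops, which is exactly the part that becomes hard with heterogeneous middleware.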

However, Dada's heterogeneous middleware (different ORM versions, plus services written in both Java and Python) made large‑scale traffic labeling impractical, so the approach was abandoned.

2.2 Dada’s Machine‑Labeling Solution

After analyzing Dada’s architecture, a “machine labeling” solution was developed in Q1 2019, employing shadow DB, Redis, and MQ for data isolation. The implementation process includes:

Abstracting all DB, Redis, and ES machines into individual nodes and registering node information to a service registry.

Integrating every service with a “link‑governance SDK” that can route requests based on link categories.

At runtime, the SDK registers service nodes, fetches link‑specific storage node info, and establishes connections to DB, Redis, and MQ.

The final production environment forms two parallel pipelines on the machine dimension: one for normal production traffic and one for load‑testing traffic.
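The two parallel pipelines can be sketched in Python. The registry contents, node addresses, and class name below are invented stand‑ins for Dada's actual link‑governance SDK: each storage node registers with a link category, and an SDK instance only resolves nodes in its own lane.

```python
# Illustrative registry contents: every storage node registers with a link category,
# so an SDK instance connects only to nodes in its own lane. Addresses are invented.
REGISTRY = {
    ("mysql", "production"): ["10.0.0.11:3306", "10.0.0.12:3306"],
    ("mysql", "load_test"): ["10.0.1.11:3306"],
    ("redis", "production"): ["10.0.0.21:6379"],
    ("redis", "load_test"): ["10.0.1.21:6379"],
}

class LinkGovernanceSDK:
    """Toy stand-in for the link-governance SDK: resolves storage nodes by link category."""

    def __init__(self, link_category: str):
        self.link_category = link_category

    def storage_nodes(self, kind: str) -> list:
        return REGISTRY.get((kind, self.link_category), [])

# A load-test service sees only the load-test MySQL node.
print(LinkGovernanceSDK("load_test").storage_nodes("mysql"))  # ['10.0.1.11:3306']
```

Because isolation lives in node resolution rather than per‑request tagging, application code needs no changes, which is the modification‑cost advantage cited in the comparison below.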

2.3 Load‑Testing Platform

The original solution used JMeter, which, while stable and distributed, lacked flexibility for complex scenarios. Dada therefore built a custom platform on top of the JMeter kernel. The platform consists of four core modules:

Frontend Service: Provides task creation, start/stop, and result visualization.

Task Parser: Parses and stores load‑testing tasks.

Load‑Testing Engine: Schedules tasks to executors (immediate or timed).

Result Processor: Parses responses, aggregates metrics, handles exceptions, and generates reports.

The platform offers a visual UI, real‑time metric display (TPS, latency, error rate), and allows engineers to configure performance parameters and data‑generation scripts directly.

2.4 Solution Comparison

A comparison between “traffic labeling” and “machine labeling” shows that Dada chose machine labeling due to lower system‑modification cost and better security.


3. Full‑Link Load‑Testing Execution

The execution is divided into three phases: pre‑test, during test, and post‑test.

3.1 Link Grooming

Accurate link grooming determines which services need to be deployed and monitored. Manual grooming was used initially but proved inefficient; Dada later adopted an APM tool (Pinpoint) to discover service dependencies automatically.

To detect real‑time link changes, a periodic request is sent in the development environment; any dependency change triggers an alert.
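The change‑detection step reduces to comparing two dependency snapshots; a minimal sketch (the service names below are hypothetical):

```python
def diff_dependencies(previous: set, current: set) -> dict:
    """Compare two snapshots of a service's downstream dependencies."""
    return {"added": current - previous, "removed": previous - current}

# Hypothetical snapshots taken from two consecutive probe requests.
prev = {"order-service", "rider-service", "redis:cache"}
curr = {"order-service", "rider-service", "redis:cache", "mq:dispatch"}
changes = diff_dependencies(prev, curr)
if changes["added"] or changes["removed"]:
    print(f"ALERT: link changed: +{changes['added']} -{changes['removed']}")
```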

3.2 Optimization Plans

Before testing, performance‑optimization plans are prepared, such as:

Thread/Connection Pool Saturation: Scale pools or services when CPU is not the bottleneck.

MySQL Master‑Slave Lag: Binlog tuning, hardware upgrades, vertical/horizontal sharding.

Redis Bandwidth Saturation: Automatic bandwidth scaling.

MQ Message Accumulation: Scale out consumer services.

For MySQL lag, two key parameters are tuned: binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count. Tuning them reduces replication lag but can increase write response time, so the trade‑off must be acceptable to the business.
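For reference, both parameters are set in MySQL's server configuration; the values below are illustrative only and must be tuned against the business's latency tolerance:

```ini
[mysqld]
# Wait up to 500 microseconds so more transactions join one binlog group commit...
binlog_group_commit_sync_delay = 500
# ...but flush immediately once 100 transactions are already waiting.
binlog_group_commit_sync_no_delay_count = 100
```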

3.3 Fine‑Grained Load‑Testing Model

The model evolved from a simple virtual‑rider TPS target to a realistic simulation of active riders, incorporating time and space dimensions. It consists of:

Data Model : Rider, merchant, and order data imported into shadow databases after sanitization.

Traffic Model : Order dispatch and delivery flow.

For traffic generation, Dada uses handcrafted traffic rather than pure replay, because the core business is write‑heavy. The traffic reflects three daily peaks (morning, noon, evening) with different interface load patterns.
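A time‑shaped load curve of this kind might be expressed as in the sketch below; the peak hours and multipliers are invented for illustration, whereas real coefficients would come from production traffic profiles.

```python
# Invented peak hours and multipliers; real coefficients come from production profiles.
PEAKS = [(9, 1.5), (12, 3.0), (18, 2.5)]  # (hour of day, multiplier over baseline TPS)

def target_tps(hour: int, baseline_tps: float) -> float:
    """Scale the baseline by the matching peak; hours outside any peak use the baseline."""
    for peak_hour, multiplier in PEAKS:
        if abs(hour - peak_hour) <= 1:
            return baseline_tps * multiplier
    return baseline_tps

print(target_tps(12, 1000.0))  # 3000.0 (noon peak)
print(target_tps(15, 1000.0))  # 1000.0 (off-peak)
```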

Geohash is used to divide the country into grids, counting orders and available riders per grid to reproduce hotspot distribution in the test environment.
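The grid‑counting step can be sketched with a from‑scratch geohash encoder (standard interleaved‑bisection encoding; the coordinates below are arbitrary examples, not Dada's data):

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits, base32-encode."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon, ch, bit_count, code = True, 0, 0, []
    while len(code) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # 5 bits per base32 character
            code.append(BASE32[ch])
            ch = bit_count = 0
    return "".join(code)

# Example order coordinates: two in Shanghai, one in Beijing.
orders = [(31.23, 121.47), (31.24, 121.48), (39.90, 116.40)]
grid_counts = Counter(geohash(lat, lon, 3) for lat, lon in orders)
print(grid_counts)  # the Shanghai grid is counted twice, the Beijing grid once
```

Counting riders per grid works the same way; replaying these per‑grid counts in the test environment reproduces the hotspot distribution.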

3.4 Pre‑Heat

Early tests showed higher latency because production data was hot while test data was cold. Introducing a pre‑heat phase loads a portion of data into cache, making test latency comparable to production.
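A minimal sketch of the pre‑heat idea, assuming the hot rows are pre‑ranked by activity and a plain dictionary stands in for the shadow Redis (the 20% fraction is an invented example, not Dada's figure):

```python
# Minimal pre-heat sketch: a plain dict stands in for the shadow Redis instance,
# and hot_rows is assumed pre-sorted by activity (hottest first).
cache = {}

def preheat(hot_rows: list, load_fraction: float = 0.2) -> int:
    """Load the hottest fraction of rows into cache before the test run starts."""
    n = max(1, int(len(hot_rows) * load_fraction))
    for key, value in hot_rows[:n]:
        cache[key] = value
    return n

rows = [(f"rider:{i}", {"id": i}) for i in range(100)]
print(preheat(rows))  # 20 rows loaded
```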

3.5 During Test

The test includes interface validation, pre‑heat, execution, metric observation, and issue logging. Real‑time TPS, latency, and error rates are displayed on the frontend.

3.6 Post‑Test

After the test, reports are generated, performance bottlenecks are identified, capacity is estimated, and a retrospective is performed. The retrospective compares production vs. test TPS, latency, and core middleware metrics to validate and refine the model.
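The retrospective comparison can be reduced to a per‑metric relative deviation between production and test; a small sketch with invented numbers:

```python
def model_deviation(prod: dict, test: dict) -> dict:
    """Relative deviation of load-test metrics from production, per metric."""
    return {k: (test[k] - prod[k]) / prod[k] for k in prod}

# Invented numbers: test TPS ran 4% above production, p99 latency 25% above.
print(model_deviation({"tps": 5000, "p99_ms": 120}, {"tps": 5200, "p99_ms": 150}))
```

Large deviations on a metric indicate the traffic or data model needs refinement for the next cycle.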

4. Summary and Benefits

Since Q1 2019, Dada has completed four major‑promotion load‑testing cycles, achieving three key successes:

Traffic Isolation : Machine‑labeling fully separates test traffic from production.

Data Isolation : Shadow DB, cache, and queues ensure production data safety and allow testing at any time.

Fine‑Grained Model : Time‑ and space‑aware traffic simulation closely mirrors real‑world peaks, improving test accuracy.

Benefits include:

Stability : Each promotion uncovered over ten performance issues, ensuring smooth operation.

Efficiency : Machine costs reduced by 40% and productivity increased by 65% compared to the previous isolated‑environment approach.

Future work focuses on balancing security and cost (e.g., shadow‑table vs. shadow‑DB) and exploring how full‑link testing can inform intelligent fleet scheduling.

Authors

Xu Jiankang – Senior Engineer, responsible for Dada’s 2019‑2020 promotion load tests.

Gu Baowan – Architect, responsible for micro‑service governance and high‑availability data sources.

Zhang Peng – Senior Test Engineer, responsible for testing Dada’s instant‑delivery system.

Tags: distributed systems, microservices, system reliability, traffic isolation, load testing, performance engineering