Exploring Elastic Capacity and Automated Scaling Architecture at Dada Group
This article presents Dada Group's comprehensive approach to elastic capacity management and automated scaling, detailing the challenges faced during traffic spikes, the design of a cloud‑native auto‑scaler, multi‑metric observability, decision‑making logic, execution mechanisms, extreme scaling practices, and future optimization directions.
Facing massive order surges during holidays, promotional events, and pandemic‑driven online grocery demand, Dada Group adopted an intelligent elastic scaling architecture and fine‑grained capacity management to ensure end‑to‑end delivery reliability while achieving cost savings during low‑traffic periods.
The article recounts a 2019 incident where manual scaling failed to keep up with a sudden spike in flower‑order traffic, leading to CPU saturation, service timeouts, and a 30‑minute recovery delay, highlighting the need for automated capacity planning.
To enable automated scaling, Dada introduced Apollo for configuration management, Consul for service discovery, and an OpenResty+Consul gateway for hot‑updating upstream nodes, establishing a stateless environment.
The first version of the AutoScaler was built using Falcon alerts and elastic configuration, defining a minimum instance count and scaling rules such as:
Minimum instances: default 2, adjustable.
Scale‑up: if cluster CPU >30% add 50% more instances; if >50% double instances.
Scale‑down: if cluster CPU <5% trigger StackStorm to recycle half of the instances, never dropping below the minimum.
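The v1 rule set above can be sketched as a simple threshold function. This is an illustrative assumption, not Dada's actual code; the function and constant names are invented, and the thresholds are taken from the rules in the article:

```python
import math

MIN_INSTANCES = 2  # default minimum, adjustable per service (per the article)

def desired_count(current: int, cpu_pct: float) -> int:
    """Return the target instance count for a cluster-average CPU reading."""
    if cpu_pct > 50:            # heavy load: double the cluster
        return current * 2
    if cpu_pct > 30:            # moderate load: add 50% more instances
        return math.ceil(current * 1.5)
    if cpu_pct < 5:             # idle: recycle half, never below the minimum
        return max(MIN_INSTANCES, current // 2)
    return current              # within the comfort band: no action
```

In the real system the recycle branch would hand off to StackStorm rather than return a number directly.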
Recognizing that CPU alone is insufficient, the system was extended to monitor queue backlogs, connection counts, disk I/O, error logs, and response latency, requiring a more flexible elastic architecture.
The elastic design follows a three‑stage pipeline:
Perception: ingest metrics from Falcon, InfluxDB, Loki, Prometheus, and OpenTSDB, covering CPU, memory, disk, network, and custom service metrics.
Decision: core scaling engine comprising Configuration (rule entry), Dashboard (real‑time view), Notification (WeChat and email alerts), Aggregator (metric aggregation), Collector (metric conversion), Judge (decision engine based on an HPA‑like algorithm), TSA (time‑series analysis), CMDB+Consul (service metadata), and Cache (historical data).
Execution: the Dispatch module handles concurrent scaling actions, retries, and audit logs; the Providers module abstracts scaling APIs for VM, Tars, Kubernetes, and serverless platforms.
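The Providers abstraction can be pictured as a small interface with one adapter per runtime. This is an illustrative sketch under assumed names (the article only names the module, not its API); the real Dispatch module would wrap calls like these with concurrency control, retries, and audit logging:

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Hypothetical provider interface: one adapter per runtime
    (VM, Tars, Kubernetes, serverless)."""
    @abstractmethod
    def scale(self, service: str, replicas: int) -> str:
        """Scale the service and return a description of the action taken."""

class KubernetesProvider(Provider):
    def scale(self, service: str, replicas: int) -> str:
        # In practice: patch the Deployment's replica count via the K8s API.
        return f"kubectl scale deploy/{service} --replicas={replicas}"

class VMProvider(Provider):
    def scale(self, service: str, replicas: int) -> str:
        # In practice: call the IaaS API to provision or recycle VMs.
        return f"adjust VM pool for {service} to {replicas} instances"

def dispatch(provider: Provider, service: str, replicas: int) -> str:
    """Dispatch sketch: the real module adds retries and audit logs."""
    return provider.scale(service, replicas)
```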
The scaling formula used by the decision engine is `desiredInstances = ceil(currentInstances * currentMetricValue / desiredMetricValue)`, and custom SQL queries can supply the metric, for example: `SELECT sum("value") FROM "*_queue_*" WHERE ("queue" = 'XXXX') AND time >= now() - 5m GROUP BY time(1m) fill(null)`.
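The formula is the same shape as Kubernetes HPA's target-tracking calculation. A minimal worked example (function name is illustrative): a 10-instance cluster averaging 80% CPU against a 50% target yields ceil(10 × 80 / 50) = 16 instances.

```python
import math

def desired_instances(current: int, metric: float, target: float) -> int:
    # desiredInstances = ceil(currentInstances * currentMetricValue / desiredMetricValue)
    return math.ceil(current * metric / target)
```

Note that the same formula also scales down: 4 instances at 25% CPU against a 50% target yields 2.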
Practical deployment revealed several lessons: regular scaling drills, handling multi‑AZ distribution, preventing over‑scaling with hard limits, monitoring downstream dependencies, and tracking success rate, efficiency, and cost trends.
Extreme scaling (aggressive down‑scaling during off‑peak hours) adjusts the minimum instance count for specific time windows and restores it later, requiring reliable provisioning and fast VM initialization.
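Lowering and restoring the minimum by time window can be sketched as follows; the window boundaries and all names here are illustrative assumptions, not Dada's configuration:

```python
from datetime import time

# Hypothetical off-peak window during which the instance floor is lowered.
OFF_PEAK_START = time(1, 0)   # 01:00
OFF_PEAK_END = time(6, 0)     # 06:00

def effective_min(now: time, normal_min: int, off_peak_min: int) -> int:
    """Return the minimum instance count in force at the given time:
    lowered inside the off-peak window, restored outside it."""
    if OFF_PEAK_START <= now < OFF_PEAK_END:
        return off_peak_min
    return normal_min
```

Because the floor is restored by re-provisioning instances, this only works when provisioning is reliable and VM initialization is fast, as the article notes.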
Support for multiple runtimes (VM, VM+Container, Kubernetes, serverless) was achieved with a custom init process (dinit) that manages multi‑process containers, graceful shutdown, and in‑place restarts.
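The core duties of an init process like dinit can be illustrated with a minimal supervisor sketch. This is an assumed, simplified stand-in (dinit itself is Dada's custom tool and is not shown in the article): it starts several child processes, forwards SIGTERM to them for graceful shutdown, and reaps every child before exiting.

```python
import signal
import subprocess
import sys

def run_init(commands):
    """Start one child per command list, forward SIGTERM for graceful
    shutdown, and reap all children; returns their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]

    def forward_term(signum, frame):
        # PID-1 duty: pass the shutdown signal on to every live child.
        for p in procs:
            if p.poll() is None:
                p.send_signal(signal.SIGTERM)

    signal.signal(signal.SIGTERM, forward_term)
    return [p.wait() for p in procs]  # reap children, avoiding zombies
```

A real container init would also reap orphaned grandchildren and support in-place restarts, which this sketch omits.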
After nearly 20 months of stable operation, the AutoScaler continues to evolve, with future work on predictive scaling using Facebook Prophet and enhanced anomaly detection for self‑healing capabilities.
Dada Group Technology
Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.