Capacity Estimation Methodology for Growing Services
The article presents a systematic capacity‑estimation methodology that links service traffic to order volume, uses CPU‑Idle as a primary metric, predicts traffic growth and upper‑bound limits, validates predictions with load‑testing, and provides scaling recommendations while noting limitations of the CPU‑Idle baseline.
The article introduces a systematic method for estimating service capacity limits and making scaling decisions as business volume keeps growing, replacing decisions based on experience alone.
Background : Using a Didi internal service as an example, the author shows BI monitoring data, traffic peaks, and low‑traffic periods, highlighting a long‑term upward trend in order volume and service traffic.
Key Questions addressed include: (1) How high can traffic peaks become as order volume grows? (2) What is the resource capacity ceiling as traffic increases? (3) When and how much should a service be scaled?
Capacity Estimation Idea : The hypothesis is that service traffic is proportional to business order volume. By converting the x‑axis of traffic graphs to order volume, a strong correlation can be observed, allowing traffic growth to be predicted.
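A minimal sketch of how the proportionality hypothesis could be checked: fit a least-squares line of service QPS against order volume and look at the Pearson correlation. The sample numbers and the function names `fit_line` and `pearson_r` are hypothetical and are not taken from the article.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples: orders per minute and the service QPS seen at the same time.
orders = [100, 200, 300, 400, 500]
qps = [210, 405, 590, 810, 1000]

slope, intercept = fit_line(orders, qps)
r = pearson_r(orders, qps)
print(f"qps ~= {slope:.3f} * orders + {intercept:.1f}, r = {r:.4f}")
```

A correlation close to 1 with a small intercept would support treating traffic as roughly proportional to order volume; a weak correlation would suggest the service is driven by other factors.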
Empirical analysis shows that CPU Idle (or CPU usage) correlates strongly with service traffic for most application‑type services, making it a primary metric for capacity assessment.
Evaluating Capacity Upper Bound : A service is considered to have reached its capacity limit when a key resource metric (e.g., CPU Idle) falls to a predefined baseline during traffic peaks. The baseline can be set at the level where error‑rate spikes, SLA violations, or crashes have been observed. In practice, a CPU Idle of 40% is used as a heuristic baseline.
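One way to turn the baseline into a traffic figure is sketched below: fit CPU Idle against QPS and solve for the QPS at which the fitted line reaches 40%. The linear model is an assumption made for illustration (the article only states that the two are strongly correlated), and the sample data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def qps_at_idle(baseline_idle, slope, intercept):
    """Solve baseline_idle = slope * qps + intercept for qps."""
    if slope >= 0:
        raise ValueError("CPU Idle is expected to fall as traffic rises")
    return (baseline_idle - intercept) / slope

# Hypothetical peak-hour samples: QPS and CPU Idle (%) measured together.
qps = [200, 400, 600, 800, 1000]
cpu_idle = [92, 84, 76, 68, 60]

slope, intercept = fit_line(qps, cpu_idle)
capacity_qps = qps_at_idle(40.0, slope, intercept)  # 40% is the article's heuristic baseline
print(f"estimated capacity upper bound: {capacity_qps:.0f} QPS")
```

Because the estimate extrapolates beyond observed traffic, it is only as good as the assumption that the trend stays linear up to the baseline.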
Predicting Service Traffic : Business growth forecasts (e.g., a 50% increase in order volume over six months) can be translated into traffic forecasts using the established traffic‑order relationship. The method also accounts for multi‑factor influences such as driver availability.
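As a rough illustration, a business growth forecast can be pushed through a fitted traffic‑order line to get a traffic forecast. The slope, the current order volume, and the function name are hypothetical, and this single‑factor sketch deliberately ignores other influences such as driver availability.

```python
def forecast_peak_qps(current_peak_orders, order_growth, slope, intercept=0.0):
    """Project peak QPS from a forecast order growth rate (0.5 means +50%)."""
    future_orders = current_peak_orders * (1.0 + order_growth)
    return slope * future_orders + intercept

# Hypothetical inputs: 500 orders/min at peak today, fitted slope of 2 QPS per order/min.
today = forecast_peak_qps(500, 0.0, slope=2.0)
in_six_months = forecast_peak_qps(500, 0.5, slope=2.0)  # 50% order growth
print(today, in_six_months)
```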
Algorithm Description : Continuous online data collection, preprocessing, and format conversion feed predictive models. Multiple algorithms are evaluated and selected per service, with holiday peak traffic used for secondary validation. A bootstrap 95% confidence interval provides a range rather than a single point estimate.
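The article does not say what is resampled or which bootstrap variant is used. The sketch below assumes a percentile bootstrap over paired (QPS, CPU Idle) samples with a linear fit, and reports a 95% range for the capacity estimate. All data, names, and parameters are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def capacity_qps(pairs, baseline_idle=40.0):
    """QPS at which the fitted CPU Idle line crosses the baseline, or None."""
    xs = [q for q, _ in pairs]
    ys = [idle for _, idle in pairs]
    if len(set(xs)) < 2:
        return None  # degenerate resample: slope is undefined
    slope, intercept = fit_line(xs, ys)
    if slope >= 0:
        return None  # the line never reaches the baseline
    return (baseline_idle - intercept) / slope

def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the capacity estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))  # resample pairs with replacement
        estimate = capacity_qps(sample)
        if estimate is not None:
            estimates.append(estimate)
    estimates.sort()
    low = estimates[int(len(estimates) * alpha / 2)]
    high = estimates[int(len(estimates) * (1 - alpha / 2)) - 1]
    return low, high

# Hypothetical noisy samples of (QPS, CPU Idle %).
samples = [(100, 96.8), (200, 91.5), (300, 88.3), (400, 83.1), (500, 80.6),
           (600, 75.8), (700, 72.4), (800, 67.3), (900, 64.1), (1000, 60.5)]

point = capacity_qps(samples)
low, high = bootstrap_ci(samples)
print(f"capacity ~= {point:.0f} QPS, 95% range [{low:.0f}, {high:.0f}]")
```

Reporting the range rather than the point value makes it visible how far the extrapolation can drift when only a few samples are available.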
Prediction Accuracy : Validation against load‑testing (sentinel) experiments shows approximately 89% accuracy in predicting the relationship between traffic and capacity. No sudden crashes were observed when CPU Idle reached the 40% baseline.
Case Studies :
Case 1: A 10‑node cluster with a current peak of 1000 QPS. Predicted capacity upper bound is 1500‑1700 QPS; with a 30% order increase and 10% driver increase, peak traffic could reach 2000‑2200 QPS, requiring 4‑5 additional nodes.
Case 2: A 5‑container service with a current peak of 250 QPS. Predicted capacity upper bound is 1100‑1200 QPS; with similar growth, peak traffic could be 450‑500 QPS, suggesting a reduction of 2 containers.
Case 3: Summary of multiple core services showing capacity limits, traffic trends, and recommended scaling actions.
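The arithmetic behind the scaling recommendations is not spelled out in the article. A sketch that reproduces the figures in Case 1 and Case 2 follows, assuming capacity scales linearly with instance count and that the lower end of the predicted capacity range is used as the conservative bound. The function name is hypothetical.

```python
import math

def node_delta(current_nodes, capacity_qps, forecast_peak_qps):
    """Instances to add (positive) or remove (negative) to serve the forecast peak.

    Assumes each instance contributes capacity_qps / current_nodes.
    """
    needed = math.ceil(forecast_peak_qps * current_nodes / capacity_qps)
    return needed - current_nodes

# Case 1: 10 nodes, capacity lower bound 1500 QPS, forecast peak 2000-2200 QPS.
print(node_delta(10, 1500, 2000), node_delta(10, 1500, 2200))

# Case 2: 5 containers, capacity lower bound 1100 QPS, forecast peak 450-500 QPS.
print(node_delta(5, 1100, 450), node_delta(5, 1100, 500))
```

Under these assumptions Case 1 yields 4 to 5 additional nodes and Case 2 yields a reduction of 2 containers, matching the recommendations above.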
Limitations : The method relies on CPU Idle as a baseline, which may not suit all service types. Baseline selection is based on limited data and expert experience; additional validation (full‑stack or sentinel load testing) is required for edge cases.
Performance jitter in downstream services can cause cascading failures; tolerance levels should be combined with circuit‑breaker mechanisms. The capacity estimation results serve as a reference rather than an absolute guarantee.
Didi Tech
Official Didi technology account