How Baidu’s Noah TSDB Handles Capacity Management at Scale
This article explains how Baidu’s Noah time‑series database measures, plans, and protects capacity, detailing throughput metrics, estimation and load‑testing methods, and a water‑level model that drives reliable scaling and overload mitigation for massive monitoring workloads.
Overview
Capacity management is a crucial part of system availability operations. It typically consists of three processes: capacity measurement, capacity planning, and overload protection. Measurement defines the metrics that quantify each module’s traffic‑handling ability (rated load, limit load, redundancy) to derive overall system capacity. Planning determines resource needs over time based on SLA availability requirements. Overload protection designs peripheral systems to ensure stable handling of requests that exceed the rated load while still meeting SLA.
The Noah platform’s time‑series database (TSDB) stores monitoring data for many core Baidu services, ingesting trillions of points daily and handling tens of billions of queries. In such a high‑traffic scenario, establishing a robust capacity‑management mechanism—accurately measuring capacity, planning growth, and detecting risks—is both difficult and essential. This article focuses on capacity measurement and planning; future articles will cover overload protection practices.
Capacity Measurement – The Foundation
Defining Capacity Metrics
Capacity is quantified using performance indicators, typically:
Throughput (requests per unit time, e.g., QPS or RPS).
Concurrency (number of simultaneous requests).
Latency (time to complete a request).
In highly concurrent systems, throughput ≈ concurrency ÷ latency; for single‑concurrency systems, throughput is the inverse of latency.
Choosing the appropriate metric depends on the business scenario; in most cases throughput (QPS/RPS) is used. In Noah’s monitoring context, each request carries multiple collected items, each containing many data points, so we use PPS (points per second) as the TSDB’s throughput metric.
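The two relations above (throughput ≈ concurrency ÷ latency, and PPS derived from request rate) can be sketched as follows; the numbers are purely illustrative, not Noah’s real workload figures.

```python
def estimated_throughput(concurrency: int, latency_seconds: float) -> float:
    """Approximate steady-state throughput (requests/second) of a
    highly concurrent system: throughput ~= concurrency / latency."""
    return concurrency / latency_seconds

def pps(rps: float, items_per_request: float, points_per_item: float) -> float:
    """Points-per-second throughput when each request carries several
    collected items, each containing several data points."""
    return rps * items_per_request * points_per_item

# Hypothetical example: 200 in-flight requests at 50 ms mean latency.
print(estimated_throughput(concurrency=200, latency_seconds=0.050))  # 4000.0 RPS

# Hypothetical example: 10k RPS, 20 items/request, 5 points/item.
print(pps(rps=10_000, items_per_request=20, points_per_item=5))  # 1000000 PPS
```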
Obtaining Capacity Data
Two methods are used: capacity estimation and load testing. Estimation aggregates runtime metrics of sub‑modules and applies resource‑model calculations to derive system‑wide capacity. For example, if module A receives traffic x and forwards y and z to downstream modules B and C, we estimate capacities X, Y, Z based on resource consumption and traffic conversion, then compute overall capacity as min(X, x·Y/y, x·Z/z). This approach assumes linear resource‑traffic relationships, which may not hold perfectly.
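The min(X, x·Y/y, x·Z/z) estimation rule above can be expressed directly in code. This is a minimal sketch; the traffic and capacity numbers are hypothetical.

```python
def system_capacity(x: float, y: float, z: float,
                    cap_a: float, cap_b: float, cap_c: float) -> float:
    """Estimate overall capacity of a pipeline where module A receives
    traffic x and forwards y and z to downstream modules B and C, whose
    estimated capacities are cap_a, cap_b, cap_c (in A-traffic units for
    cap_a, and in their own traffic units for cap_b and cap_c).

    Each downstream limit is converted back into A's traffic units via the
    observed conversion ratio (x/y or x/z); the bottleneck is the minimum.
    """
    return min(cap_a, x * cap_b / y, x * cap_c / z)

# Hypothetical: A handles up to 10k, B up to 6k, C up to 3k, while A
# currently sees 8k and forwards 4k to B and 2k to C.
print(system_capacity(x=8000, y=4000, z=2000,
                      cap_a=10000, cap_b=6000, cap_c=3000))  # 10000 (A is the bottleneck)
```

Note that this inherits the linearity assumption stated above: if resource consumption does not scale linearly with traffic, the converted limits will drift from reality.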
Load testing provides more accurate data and can be performed online or offline. Online testing applies real traffic to the production service, yielding high‑precision results but at higher cost and risk. Offline testing replicates the production environment (or a scaled‑down version) and extrapolates results to the live system. Baidu’s practice combines both: an online test on a hot‑standby cluster that mirrors production scale while keeping the test cluster isolated to minimize risk.
The testing approach considers three aspects:
Load generator: Baidu’s LTP platform replays captured traffic, optionally scaling it or filtering components, to generate realistic high‑concurrency pressure.
Test plan: Ensure isolation between test and production subsystems and define stop‑loss procedures (e.g., abort load generation, switch traffic, or recover clusters) in case of failures.
Data contamination: Tag test data to separate it from real user data, and delete or filter it after the test.
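The contamination-control step above can be sketched as tagging every load-test point at write time and filtering (or bulk-deleting) by that tag afterwards. The tag key `source` and value `loadtest` are illustrative, not Noah’s actual schema.

```python
# Hypothetical tag used to mark all points written by the load generator.
TEST_TAG = ("source", "loadtest")

def is_test_point(point: dict) -> bool:
    """Return True if this data point was produced by a load test."""
    key, value = TEST_TAG
    return point.get("tags", {}).get(key) == value

points = [
    {"metric": "cpu.busy", "value": 40, "tags": {"source": "loadtest"}},
    {"metric": "cpu.busy", "value": 35, "tags": {}},
]

# Keep only real user data after the test run.
real_points = [p for p in points if not is_test_point(p)]
print(len(real_points))  # 1
```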
Capacity Modeling – Building a Water‑Level Mechanism
The raw capacity value represents the system’s limit load with no redundancy; running against it cannot guarantee high availability, because any performance degradation or estimation error would leave no headroom and render capacity alerts meaningless. To mitigate this, a buffer is reserved: rated capacity = limit × (1 − buffer). This rated capacity is the basis of capacity reports and water‑level dashboards, which are reviewed regularly to spot bottlenecks; when the water level crosses a warning threshold, automatic elastic scaling is triggered to keep operation stable.
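A minimal sketch of this water‑level model follows. The 30% buffer and 0.8 warning threshold are illustrative values, not Noah’s actual settings.

```python
def rated_capacity(limit: float, buffer: float) -> float:
    """Usable capacity after reserving a redundancy buffer."""
    return limit * (1 - buffer)

def water_level(current_traffic: float, limit: float, buffer: float) -> float:
    """Fraction of rated capacity currently in use."""
    return current_traffic / rated_capacity(limit, buffer)

def should_scale(current_traffic: float, limit: float,
                 buffer: float = 0.3, warning: float = 0.8) -> bool:
    """Trigger elastic scaling once the water level crosses the warning line."""
    return water_level(current_traffic, limit, buffer) >= warning

# Hypothetical cluster: limit 1M PPS, 30% buffer -> 700k rated.
# At 600k PPS the water level is ~0.857, above the 0.8 warning line.
print(should_scale(600_000, limit=1_000_000))  # True
```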
Capacity Planning
As traffic grows, we must forecast future cluster capacity and the required expansion. Given a historical monthly traffic growth rate X, redundancy requirement R, current limit capacity C, and M current instances, the limit capacity needed after N months (taking C as the baseline traffic level and reserving the redundancy buffer, as in the water‑level model) is:

C_N = C × (1 + X)^N / (1 − R)

Assuming per‑instance capacity stays constant, the required number of instances is:

M_N = ⌈M × C_N / C⌉ = ⌈M × (1 + X)^N / (1 − R)⌉
Because actual traffic growth is rarely linear, more sophisticated non‑linear models can be applied.
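As a concrete sketch of this linear planning model, the following assumes compound monthly growth X, redundancy requirement R, current limit capacity C with M instances, and per‑instance capacity that scales linearly; all input numbers are hypothetical.

```python
import math

def required_capacity(C: float, X: float, R: float, N: int) -> float:
    """Limit capacity needed after N months so the rated portion
    (1 - R of the limit) still covers traffic grown at rate X/month."""
    return C * (1 + X) ** N / (1 - R)

def required_instances(M: int, C: float, X: float, R: float, N: int) -> int:
    """Instance count needed after N months, assuming per-instance
    capacity C / M stays constant."""
    return math.ceil(M * required_capacity(C, X, R, N) / C)

# Hypothetical: 5% monthly growth, 30% redundancy, 100 instances, 12 months.
print(required_instances(M=100, C=1_000_000, X=0.05, R=0.30, N=12))  # 257
```

A non‑linear model would replace the `(1 + X) ** N` term with a fitted forecast of actual traffic.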
Conclusion
Capacity management is indispensable for system operations, and accurate capacity data is the foundation for effective planning and optimization. The experiences shared here reflect Baidu’s practical approaches to capacity measurement, planning, and modeling for the Noah TSDB.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.