
OBC: A Cloud-Native Real-Time Computing Engine for Metrics at Didi

To replace costly, duplicated Flink jobs, Didi built Observe‑Compute (OBC), a cloud‑native, PromQL‑driven real‑time metric engine with centralized policy management, scalable containerized workers, and zero‑downtime scaling, achieving million‑RMB annual savings while handling 10 M points per second.

Didi Tech

At Didi, the observability platform's metrics data required real-time calculation, handled by multiple Flink jobs, each tailored to a different service-specific metric computation. This led to duplicated implementations of generic metric-calculation capabilities, hard-coded processing logic, operational overhead from Flink job restarts, and high platform costs.

To address these issues, Didi developed an in‑house real‑time computing engine called observe‑compute (OBC). OBC aims to provide a universal real‑time computing engine for metrics, using PromQL as the task description language, enabling flexible policy‑driven task control, calculation‑chain traceability, and cloud‑native containerized deployment that supports zero‑downtime scaling.

The engine consists of three core components: obc‑ruler (service registration/discovery and policy management), obc‑distributor (metrics ingestion from a message queue, policy matching, and forwarding to workers), and obc‑worker (the actual metric calculation unit that follows an execution plan and writes results to persistent storage).

Key features already implemented include: using PromQL to describe streaming tasks, policy configuration that takes effect in real time with human-intervenable execution plans, and cloud-native containerization that allows dynamic scaling without downtime. Policy-level calculation-chain tracing is planned but still under development.

OBC has been running stably in production for several months, with the core Flink metric jobs migrated to OBC, yielding an estimated annual cost saving of 1 million RMB. Availability mechanisms such as the cutover time concept, heartbeat‑based worker‑ruler synchronization (3 s intervals), hash‑ring version retention (up to 10 min), and graceful restart handling via SIGTERM help limit metric breaks to at most three points per worker failure while avoiding mis‑calculations.

Policy management in obc‑ruler is built in layers: loaders pull external policy sources, parsers (including a PromQL parser) produce execution‑plan trees, an optimizer merges operators, and a multi‑version manager schedules policy updates and quarantines anomalous policies. The distributor builds two‑level indexes on the __name__ and __ns__ labels to efficiently match high‑throughput metric streams (≈10 M points/s, ~12 K active policies).
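To make the two-level matching concrete, here is a minimal sketch of such an index. The class and method names are illustrative assumptions, not OBC's actual API; the point is that a lookup keyed first on `__name__` and then on `__ns__` resolves each incoming sample to its candidate policies in O(1), which is what makes matching feasible at ~10 M points/s.

```python
from collections import defaultdict

class PolicyIndex:
    """Hypothetical two-level policy index: __name__ first, then __ns__."""

    def __init__(self):
        # name -> namespace -> list of policy (plan) ids
        self._index = defaultdict(lambda: defaultdict(list))

    def add(self, name: str, ns: str, policy_id: str) -> None:
        self._index[name][ns].append(policy_id)

    def match(self, sample_labels: dict) -> list:
        # .get() avoids creating empty entries for unmatched samples
        name = sample_labels.get("__name__")
        ns = sample_labels.get("__ns__")
        return self._index.get(name, {}).get(ns, [])


idx = PolicyIndex()
idx.add("http_request_duration_seconds", "order-service", "plan-42")
matched = idx.match({"__name__": "http_request_duration_seconds",
                     "__ns__": "order-service"})
```

With ~12 K active policies, both index levels stay small, so the per-sample cost is two dictionary lookups regardless of throughput.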

The worker aligns event times to a resolution, selects a worker instance via a hash ring using plan‑id, aligned time, and label values, and executes actions derived from PromQL (functions, binary ops, aggregations) as the smallest computational units. Window sizes are set based on raw metric step (≤10 s → 25 s; >10 s → min(2*step+5, 120 s)).
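The alignment and windowing rules above can be sketched as follows. Function names (`align`, `window_size`, `pick_worker`) are assumptions for illustration; the hash-ring selection is simplified to a stable hash modulo the worker count rather than a real consistent-hash ring.

```python
import hashlib

def align(event_ts: int, resolution: int) -> int:
    """Align an event timestamp down to the resolution boundary."""
    return event_ts - event_ts % resolution

def window_size(step: int) -> int:
    """Window rule from the article (seconds):
    raw step <= 10s -> 25s; otherwise min(2*step + 5, 120)."""
    if step <= 10:
        return 25
    return min(2 * step + 5, 120)

def pick_worker(workers: list, plan_id: str, aligned_ts: int,
                label_values: list) -> str:
    # Simplified stand-in for hash-ring selection: a stable hash of
    # plan id, aligned time, and label values chooses the worker.
    key = f"{plan_id}|{aligned_ts}|{'|'.join(sorted(label_values))}"
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return workers[h % len(workers)]
```

Routing on (plan-id, aligned time, label values) keeps all samples that contribute to one output series on the same worker for a given window, so no cross-worker state is needed.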

Although OBC does not yet support PromQL range vectors or certain modifiers (offset, @, subqueries), it extends the engine with a custom percentile aggregation operator that merges per-instance histogram buckets to produce accurate cluster-level latency percentiles.
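The bucket-merge idea can be sketched as below. This is not OBC's operator but an assumed Prometheus-style reconstruction: cumulative buckets are summed across instances bucket-by-bucket, and the percentile is read off with linear interpolation inside the target bucket (the `+Inf` bucket and edge cases are omitted for brevity).

```python
def merge_buckets(per_instance_buckets: list) -> dict:
    """Sum cumulative histogram buckets (upper bound `le` -> count)
    across instances, bucket-by-bucket."""
    merged = {}
    for buckets in per_instance_buckets:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def percentile(buckets: dict, q: float) -> float:
    """Prometheus-style quantile estimate: find the bucket containing
    the target rank, then interpolate linearly within it."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        if buckets[le] >= rank:
            width = le - prev_bound
            frac = (rank - prev_count) / (buckets[le] - prev_count)
            return prev_bound + width * frac
        prev_bound, prev_count = le, buckets[le]
    return bounds[-1]


merged = merge_buckets([
    {0.1: 50, 0.5: 90, 1.0: 100},   # instance A
    {0.1: 30, 0.5: 70, 1.0: 100},   # instance B
])
p50 = percentile(merged, 0.5)
```

Merging buckets before computing the quantile is what makes the cluster-level result accurate: averaging per-instance percentiles would weight every instance equally regardless of its request volume.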

Future work includes moving preprocessing to the collection end to achieve a "collect-and-compute" pipeline, further reducing latency and computational cost.

Tags: cloud-native, observability, metrics, real-time computing, Flink alternative, OBC, PromQL
Written by Didi Tech, the official Didi technology account.