
OBC: A Cloud-Native Real-Time Computing Engine for Metrics at Didi

To replace costly, duplicated Flink jobs, Didi built Observe‑Compute (OBC), a cloud‑native, PromQL‑driven real‑time metric engine with centralized policy management, scalable containerized workers, and zero‑downtime scaling, achieving million‑RMB annual savings while handling 10 M points per second.

Didi Tech

At Didi, the observability platform's metrics data required real-time calculation, handled by multiple Flink jobs, each tailored to a different service-specific metric computation. This led to duplicated implementations of generic metric-calculation capabilities, hard-coded processing logic, operational overhead from Flink job restarts, and high platform costs.

To address these issues, Didi developed an in‑house real‑time computing engine called observe‑compute (OBC). OBC aims to provide a universal real‑time computing engine for metrics, using PromQL as the task description language, enabling flexible policy‑driven task control, calculation‑chain traceability, and cloud‑native containerized deployment that supports zero‑downtime scaling.

The engine consists of three core components: obc‑ruler (service registration/discovery and policy management), obc‑distributor (metrics ingestion from a message queue, policy matching, and forwarding to workers), and obc‑worker (the actual metric calculation unit that follows an execution plan and writes results to persistent storage).

Key features already implemented include: using PromQL to describe streaming tasks, policy configuration that takes effect in real time with human-intervenable execution plans, and cloud-native containerization that allows dynamic scaling without downtime. Policy-level calculation-chain tracing is planned but still under development.

OBC has been running stably in production for several months, with the core Flink metric jobs migrated to OBC, yielding an estimated annual cost saving of 1 million RMB. Availability mechanisms such as the cutover time concept, heartbeat‑based worker‑ruler synchronization (3 s intervals), hash‑ring version retention (up to 10 min), and graceful restart handling via SIGTERM help limit metric breaks to at most three points per worker failure while avoiding mis‑calculations.

Policy management in obc‑ruler is built in layers: loaders pull external policy sources, parsers (including a PromQL parser) produce execution‑plan trees, an optimizer merges operators, and a multi‑version manager schedules policy updates and quarantines anomalous policies. The distributor builds two‑level indexes on the __name__ and __ns__ labels to efficiently match high‑throughput metric streams (≈10 M points/s, ~12 K active policies).
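To make the two-level matching concrete, here is a minimal sketch of such an index. The class and method names are illustrative assumptions, not OBC's actual API; the point is that a lookup keyed first on `__name__` and then on `__ns__` resolves each incoming sample to its candidate policies in O(1), which is what makes matching feasible at ~10 M points/s.

```python
from collections import defaultdict

class PolicyIndex:
    """Hypothetical two-level policy index: __name__ first, then __ns__."""

    def __init__(self):
        # name -> namespace -> list of policy (plan) ids
        self._index = defaultdict(lambda: defaultdict(list))

    def add(self, name: str, ns: str, policy_id: str) -> None:
        self._index[name][ns].append(policy_id)

    def match(self, sample_labels: dict) -> list:
        # .get() avoids creating empty entries for unmatched samples
        name = sample_labels.get("__name__")
        ns = sample_labels.get("__ns__")
        return self._index.get(name, {}).get(ns, [])


idx = PolicyIndex()
idx.add("http_request_duration_seconds", "order-service", "plan-42")
matched = idx.match({"__name__": "http_request_duration_seconds",
                     "__ns__": "order-service"})
```

With ~12 K active policies, both index levels stay small, so the per-sample cost is two dictionary lookups regardless of throughput.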

The worker aligns event times to a resolution, selects a worker instance via a hash ring using plan‑id, aligned time, and label values, and executes actions derived from PromQL (functions, binary ops, aggregations) as the smallest computational units. Window sizes are set based on raw metric step (≤10 s → 25 s; >10 s → min(2*step+5, 120 s)).
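The alignment and windowing rules above can be sketched as follows. Function names (`align`, `window_size`, `pick_worker`) are assumptions for illustration; the hash-ring selection is simplified to a stable hash modulo the worker count rather than a real consistent-hash ring.

```python
import hashlib

def align(event_ts: int, resolution: int) -> int:
    """Align an event timestamp down to the resolution boundary."""
    return event_ts - event_ts % resolution

def window_size(step: int) -> int:
    """Window rule from the article (seconds):
    raw step <= 10s -> 25s; otherwise min(2*step + 5, 120)."""
    if step <= 10:
        return 25
    return min(2 * step + 5, 120)

def pick_worker(workers: list, plan_id: str, aligned_ts: int,
                label_values: list) -> str:
    # Simplified stand-in for hash-ring selection: a stable hash of
    # plan id, aligned time, and label values chooses the worker.
    key = f"{plan_id}|{aligned_ts}|{'|'.join(sorted(label_values))}"
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return workers[h % len(workers)]
```

Routing on (plan-id, aligned time, label values) keeps all samples that contribute to one output series on the same worker for a given window, so no cross-worker state is needed.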

Although OBC does not yet support PromQL range vectors or certain modifiers (offset, @, subqueries), it extends the engine with a custom percentile aggregation operator that merges per-instance histogram buckets to produce accurate cluster-level latency percentiles.
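The bucket-merge idea can be sketched as below. This is not OBC's operator but an assumed Prometheus-style reconstruction: cumulative buckets are summed across instances bucket-by-bucket, and the percentile is read off with linear interpolation inside the target bucket (the `+Inf` bucket and edge cases are omitted for brevity).

```python
def merge_buckets(per_instance_buckets: list) -> dict:
    """Sum cumulative histogram buckets (upper bound `le` -> count)
    across instances, bucket-by-bucket."""
    merged = {}
    for buckets in per_instance_buckets:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def percentile(buckets: dict, q: float) -> float:
    """Prometheus-style quantile estimate: find the bucket containing
    the target rank, then interpolate linearly within it."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        if buckets[le] >= rank:
            width = le - prev_bound
            frac = (rank - prev_count) / (buckets[le] - prev_count)
            return prev_bound + width * frac
        prev_bound, prev_count = le, buckets[le]
    return bounds[-1]


merged = merge_buckets([
    {0.1: 50, 0.5: 90, 1.0: 100},   # instance A
    {0.1: 30, 0.5: 70, 1.0: 100},   # instance B
])
p50 = percentile(merged, 0.5)
```

Merging buckets before computing the quantile is what makes the cluster-level result accurate: averaging per-instance percentiles would weight every instance equally regardless of its request volume.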

Future work includes moving preprocessing to the collection end to achieve a "collect-and-compute" pipeline, further reducing latency and computational cost.

Tags: cloud-native, observability, metrics, real-time computing, Flink alternative, OBC, PromQL
Written by Didi Tech, the official Didi technology account.