Real-Time Data Development Practices and Component Selection at Didi
This overview of Didi's unified real‑time data stack presents best‑practice component choices for four key scenarios—metric monitoring, BI analysis, online services, and feature/tag systems—detailing pipelines from source to sink, resource‑usage guidelines, and a one‑stop development platform for building stable, high‑performance streaming solutions.
With the continuous unification of Didi's internal technology stack and the integration of real‑time related components, a set of best‑practice technology selections and concrete implementation solutions has emerged for various business scenarios. However, many real‑time developers still equate real‑time data construction solely with Flink development, overlooking other essential components. This article organizes technology selections for different scenarios to help teams build efficient, stable real‑time data pipelines.
The article is divided into the following parts:
Major business scenarios of real‑time data development within the company
General solution for real‑time data development
Component selection for specific scenarios
Principles for using each component
Summary and outlook
1. Major Business Scenarios
Four typical real‑time use cases are identified:
Real‑time metric monitoring : High timeliness, time‑series‑driven, moderate analysis complexity (e.g., stability of engineering/R&D metrics, anomaly detection on business metrics, health monitoring for operations dashboards).
Real‑time BI analysis : Requires very high accuracy, tolerates slight latency, supports complex analytical queries (e.g., operational dashboards, real‑time core boards, large‑screen displays).
Real‑time data online service : Provides metrics via API, demanding high QPS, high timeliness and accuracy, moderate computational complexity.
Real‑time features and tags : Feeds machine‑learning models, recommendation systems, and labeling pipelines; needs strong state handling, external connector support, and robust real‑time compute capabilities.
2. General Solution
The end‑to‑end pipeline consists of six parts: data source, data channel, sync center, real‑time compute, real‑time storage, and real‑time application.
Data Source : Primarily MySQL binlogs captured by the open‑source Canal tool, plus public logs from business servers collected by LogAgent and shipped to Kafka.
Data Channel : DDMQ (an internal product built on RocketMQ and Kafka) and Kafka. DDMQ supports delayed and transactional messages; Kafka offers high scalability and seamless Flink integration.
Sync Center : Offline sync uses DataX to write to HDFS; real‑time sync uses Dsink to push data to Kafka and optionally to HDFS for incremental ODS tables.
Real‑time Development Platform : All real‑time jobs are built on the ShuMeng one‑stop data development platform, supporting Flink JAR and Flink SQL (≈92% SQL, 8% JAR as of June 2022). Local debugging is encouraged to reduce runtime errors. Typical workloads are ETL or lightweight aggregations, with results written to downstream sinks.
Data Sinks : Kafka, Druid, ClickHouse, MySQL, HBase, Elasticsearch, Fusion, etc. Metrics for monitoring go to Druid; BI dashboards use ClickHouse; API services may write to MySQL/HBase/ES/Fusion.
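Since most jobs on the platform are Flink SQL doing ETL or lightweight aggregation, the core pattern is a keyed, windowed rollup. Below is a minimal, language‑neutral Python sketch of that pattern (the equivalent of a `GROUP BY key, TUMBLE(...)` in Flink SQL); the event fields and names are hypothetical, not the platform's actual schema.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_s=60):
    """Bucket events into fixed (tumbling) windows per key -- the shape of a
    typical lightweight streaming aggregation before writing to a sink."""
    counts = defaultdict(int)
    for e in events:
        # Align each event timestamp to the start of its window.
        window_start = (e["ts"] // window_s) * window_s
        counts[(e["city"], window_start)] += 1
    return dict(counts)

events = [
    {"city": "beijing", "ts": 30},
    {"city": "beijing", "ts": 59},
    {"city": "beijing", "ts": 61},   # falls into the next 60 s window
    {"city": "shanghai", "ts": 10},
]
windows = tumbling_window_counts(events, window_s=60)
# windows[("beijing", 0)] == 2 and windows[("beijing", 60)] == 1
```

In a real Flink job the window results would be emitted incrementally and written to one of the sinks above rather than returned in memory.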
3. Component Selection for Specific Scenarios
Real‑time Metric Monitoring : Emphasizes timeliness and high QPS. The recommended pipeline is Canal → DDMQ → Flink → Druid, using hyperUnique metrics for UV‑type calculations and partitioning the upstream topic so Druid can ingest efficiently.
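To make the UV case concrete, here is a small Python sketch of exact per‑window unique‑visitor counting; Druid's hyperUnique metric serves the same query but replaces each per‑window set with a fixed‑size HyperLogLog sketch, trading a small error for constant memory at high cardinality. Field names here are hypothetical.

```python
from collections import defaultdict

def window_uv(events, window_s=60):
    """Exact per-window UV: one set of user ids per tumbling window.
    A hyperUnique/HLL metric approximates len(set) without storing the set."""
    seen = defaultdict(set)
    for e in events:
        window_start = (e["ts"] // window_s) * window_s
        seen[window_start].add(e["user_id"])
    return {w: len(users) for w, users in seen.items()}

events = [
    {"user_id": "u1", "ts": 5},
    {"user_id": "u1", "ts": 20},   # repeat visit in the same window
    {"user_id": "u2", "ts": 40},
    {"user_id": "u1", "ts": 70},   # next window
]
uv = window_uv(events)
# uv[0] == 2, uv[60] == 1
```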
Real‑time BI Analysis : Requires high accuracy and flexible dimension‑metric queries. The typical solution flattens dimensions in Flink, writes micro‑batches to ClickHouse local tables, and uses ReplacingMergeTree or AggregatingMergeTree engines for deduplication and pre‑aggregation.
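The deduplication that ReplacingMergeTree performs at merge time can be sketched as "last version wins per primary key." The Python below simulates that semantics for a hypothetical order table (column names are illustrative); in ClickHouse the same effect comes from the engine's background merges, which is why micro‑batched writes plus a version column give eventually accurate BI numbers.

```python
def replacing_merge(rows):
    """ReplacingMergeTree-style dedup: among rows sharing the same ORDER BY
    key, keep only the row with the highest version column."""
    latest = {}
    for row in rows:
        key = row["order_id"]
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"order_id": "o1", "amount": 10, "version": 1},
    {"order_id": "o1", "amount": 12, "version": 2},  # later update wins
    {"order_id": "o2", "amount": 7,  "version": 1},
]
deduped = replacing_merge(rows)
# two rows survive; o1 keeps amount 12
```

Note that ClickHouse merges are asynchronous, so queries that need exact results before a merge completes typically use `FINAL` or an `argMax`-style aggregation.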
Real‑time Online Service : Focuses on high QPS and accuracy. Two patterns: (1) Flink computes metrics and writes directly to MySQL/HBase for API consumption; (2) Flink performs lightweight aggregation, while downstream OLAP engines (e.g., ClickHouse) handle final metric calculations.
Real‑time Feature and Tag System : Requires high QPS and large state. Options include consuming topics directly for feature services or using HBase/Fusion for strong primary‑key updates, then exposing features/tags via a service layer.
4. Resource Usage Principles
Data Collection : Single‑source principle – reuse upstream sources for both real‑time and offline ODS layers.
DDMQ : One Flink job per consumer group; avoid sharing a group across jobs.
Kafka : Limit a single partition to ≤3 MB/s; retain data 48–72 h (≥2 days) for back‑fill.
Flink : Align source parallelism with Kafka/DDMQ partitions; default TM resources: slot = 2, taskManagerMemory = 4096 MB, vCores = 2. Adjust slots for pure ETL, increase memory for heavy operators.
Druid : Set aggregation granularity to 30 s or 1 min; retain data ~3 months; use String type for dimensions and appropriate metric types for performance.
ClickHouse : Minimum write interval 30 s; parallelism ≤10; partition by time; use the native Flink‑to‑CK connector; keep throughput ≤20 MB/s per parallel writer to ensure cluster stability.
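Several of these principles compose into a simple sizing exercise. The sketch below works through the arithmetic under the stated limits (≤3 MB/s per Kafka partition, source parallelism aligned with partition count, default 2 slots per TaskManager); it is a back‑of‑envelope aid, not a platform tool.

```python
import math

def plan_kafka_flink(throughput_mb_s, mb_per_partition=3, slots_per_tm=2):
    """Size a topic and its Flink consumer from peak throughput:
    cap each partition's load, match source parallelism to partitions,
    then derive the TaskManager count from slots per TM."""
    partitions = math.ceil(throughput_mb_s / mb_per_partition)
    parallelism = partitions  # align source parallelism with partitions
    task_managers = math.ceil(parallelism / slots_per_tm)
    return {"partitions": partitions,
            "parallelism": parallelism,
            "task_managers": task_managers}

plan = plan_kafka_flink(30)  # a topic peaking at 30 MB/s
# → 10 partitions, source parallelism 10, 5 TaskManagers
```

Heavy operators would still need the memory adjustments noted above; this only covers source‑side alignment.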
5. Summary and Outlook
The article consolidates Didi's real‑time development practices, offering a practical guide for teams transitioning from offline to streaming pipelines. By presenting four typical scenarios—metric monitoring, BI analysis, online services, and feature/tag systems—it clarifies component choices and usage principles, helping developers design cost‑effective, high‑performance real‑time solutions.
Didi Tech
Official Didi technology account