
Real-Time Data Development Practices and Component Selection at Didi

An overview of Didi's unified real‑time data stack: best‑practice component choices for four key scenarios (metric monitoring, BI analysis, online services, and feature/tag systems), pipelines from source to sink, resource‑usage guidelines, and a one‑stop development platform for building stable, high‑performance streaming solutions.

Didi Tech

With the continuous unification of Didi's internal technology stack and the integration of real‑time related components, a set of best‑practice technology selections and concrete implementation solutions has emerged for various business scenarios. However, many real‑time developers still equate real‑time data construction solely with Flink development, overlooking other essential components. This article organizes technology selections for different scenarios to help teams build efficient, stable real‑time data pipelines.

The article is divided into the following parts:

Major business scenarios of real‑time data development within the company

General solution for real‑time data development

Component selection for specific scenarios

Principles for using each component

Summary and outlook

1. Major Business Scenarios

Four typical real‑time use cases are identified:

Real‑time metric monitoring: High timeliness, time‑series‑driven, moderate analysis complexity (e.g., product/engineering metric stability, business‑side anomaly detection, operation‑dashboard health monitoring).

Real‑time BI analysis: Requires very high accuracy, tolerates slight latency, supports complex analytical queries (e.g., operational dashboards, real‑time core boards, large‑screen displays).

Real‑time data online service: Provides metrics via API, demanding high QPS, high timeliness and accuracy, moderate computational complexity.

Real‑time features and tags: Feeds machine‑learning models, recommendation systems, and labeling pipelines; needs strong state handling, external connector support, and robust real‑time compute capabilities.

2. General Solution

The end‑to‑end pipeline consists of six parts: data source, data channel, sync center, real‑time compute, real‑time storage, and real‑time application.

Data Source: Primarily MySQL binlog captured by the open‑source Canal tool, plus public logs from business servers collected by LogAgent and sent to Kafka.

Data Channel: DDMQ (an internal product built on RocketMQ and Kafka) and Kafka. DDMQ supports delayed and transactional messages; Kafka offers high scalability and seamless Flink integration.

Sync Center: Offline sync uses DataX to write to HDFS; real‑time sync uses Dsink to push data to Kafka and optionally to HDFS for incremental ODS tables.

Real‑time Development Platform: All real‑time jobs are built on the ShuMeng one‑stop data development platform, which supports Flink JAR and Flink SQL (≈92% SQL, 8% JAR as of June 2022). Local debugging is encouraged to reduce runtime errors. Typical workloads are ETL or lightweight aggregations, with results written to downstream sinks.

Data Sinks: Kafka, Druid, ClickHouse, MySQL, HBase, Elasticsearch, Fusion, etc. Monitoring metrics go to Druid; BI dashboards use ClickHouse; API services may write to MySQL/HBase/ES/Fusion.
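To make the Flink SQL workloads described above concrete, here is a minimal sketch of a lightweight windowed aggregation job. All table names, fields, and connector addresses are hypothetical; on a platform like ShuMeng these would typically come from managed catalogs rather than inline DDL:

```sql
-- Hypothetical source: order events ingested into Kafka as JSON.
CREATE TABLE order_events (
    order_id   STRING,
    city_id    INT,
    gmv        DECIMAL(18, 2),
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'order_events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Lightweight per-minute aggregation written to a downstream sink
-- (e.g., a Kafka topic feeding Druid, or a ClickHouse table).
INSERT INTO dwd_order_1min
SELECT
    city_id,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS order_cnt,
    SUM(gmv) AS gmv
FROM order_events
GROUP BY city_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```

Keeping the Flink job to ETL or light aggregation like this, and pushing heavier analysis into the sink-side engine, matches the division of labor the article describes.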

3. Component Selection for Specific Scenarios

Real‑time Metric Monitoring: Emphasizes timeliness and high QPS. The recommended pipeline is Canal → DDMQ → Flink → Druid, with hyperUnique metrics for UV‑type calculations and partitioned upstream topics for efficient Druid consumption.

Real‑time BI Analysis: Requires high accuracy and flexible dimension‑metric queries. The typical solution flattens dimensions in Flink, writes micro‑batches to ClickHouse local tables, and uses the ReplacingMergeTree or AggregatingMergeTree engine for deduplication and pre‑aggregation.

Real‑time Online Service: Focuses on high QPS and accuracy. Two patterns: (1) Flink computes metrics and writes directly to MySQL/HBase for API consumption; (2) Flink performs lightweight aggregation, while downstream OLAP engines (e.g., ClickHouse) handle final metric calculations.

Real‑time Feature and Tag System: Requires high QPS and large state. Options include consuming topics directly for feature services or using HBase/Fusion for strong primary‑key updates, then exposing features/tags via a service layer.
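As a sketch of the BI pattern above (micro‑batched writes to ClickHouse local tables with merge‑time deduplication), the following hypothetical DDL uses ReplacingMergeTree; table and column names are illustrative only:

```sql
-- ReplacingMergeTree deduplicates rows that share the same ORDER BY key,
-- asynchronously at merge time, keeping the row with the highest `ver`.
CREATE TABLE dwd_order_local
(
    order_id   String,
    city_id    UInt32,
    status     LowCardinality(String),
    gmv        Decimal(18, 2),
    event_date Date,
    ver        UInt64
)
ENGINE = ReplacingMergeTree(ver)
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (event_date, city_id, order_id);

-- Because deduplication happens only at background merges, queries that
-- need exact results should force it with FINAL (at some query cost):
SELECT city_id, count() AS order_cnt, sum(gmv) AS gmv
FROM dwd_order_local FINAL
WHERE event_date = today()
GROUP BY city_id;
```

This is why the pattern tolerates at-least-once writes from Flink: duplicate rows collapse on the same key, which is what gives the BI scenario its accuracy guarantee.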

4. Resource Usage Principles

Data Collection: Single‑source principle: reuse upstream sources for both real‑time and offline ODS layers.

DDMQ: One Flink job per consumer group; avoid sharing a group across jobs.

Kafka: Limit a single partition to ≤3 MB/s; retain data 48–72 h (≥2 days) for back‑fill.

Flink: Align source parallelism with Kafka/DDMQ partitions; default TaskManager resources: slot = 2, taskManagerMemory = 4096 MB, vCores = 2. Adjust slots for pure ETL; increase memory for heavy operators.

Druid: Set aggregation granularity to 30 s or 1 min; retain data ~3 months; use String type for dimensions and appropriate metric types for performance.

ClickHouse: Minimum write interval 30 s; parallelism ≤10; partition by time; use the native Flink‑to‑CK connector; keep throughput ≤20 MB/s per parallelism to ensure cluster stability.
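Tying the Druid principles above to the hyperUnique metrics mentioned earlier: a hyperUnique column defined at ingestion time can be queried with approximate-distinct aggregation in Druid SQL. The datasource and column names below are hypothetical:

```sql
-- Hypothetical Druid SQL query against a datasource rolled up at
-- 1-minute granularity, where uv_passenger was ingested as a
-- hyperUnique metric (an approximate distinct-count sketch).
SELECT
  TIME_FLOOR(__time, 'PT1M') AS minute_bucket,
  city_id,
  APPROX_COUNT_DISTINCT(uv_passenger) AS passenger_uv
FROM order_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1, 2
```

The sketch trades exactness for constant memory, which is what makes UV-style metrics viable at the 30 s/1 min granularities recommended above.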

5. Summary and Outlook

The article consolidates Didi's real‑time development practices, offering a practical guide for teams transitioning from offline to streaming pipelines. By presenting four typical scenarios—metric monitoring, BI analysis, online services, and feature/tag systems—it clarifies component choices and usage principles, helping developers design cost‑effective, high‑performance real‑time solutions.

Tags: Big Data, Flink, stream processing, Kafka, ClickHouse, real-time data, Druid