Real-time Data Warehouse Construction at Didi: Architecture, Practices, and Lessons
To support Didi’s fast‑growing car‑pool service, a real‑time data warehouse was built using a streamlined layered architecture—ODS, DWD, DIM, DWM, and APP—leveraging Flink‑based StreamSQL, Kafka, Druid and ClickHouse to deliver minute‑level analytics, dashboards, monitoring, and cross‑business interfaces while planning unified meta‑store integration.
Didi’s rapid business growth has driven an increasing demand for timely data. To meet this need, Didi has undertaken extensive experiments and practices in real‑time data warehouse (real‑time DW) construction, using the car‑pool (顺风车) business as a case study.
Purpose of Real‑time DW
As the internet industry enters its so-called second half, data timeliness becomes crucial for fine-grained operations. A real-time DW enables rapid extraction of valuable insights from massive daily data, supporting faster decision-making, product iteration, and operational adjustments.
Problems of Traditional DW
Traditional warehouses focus on historical data accumulation, often lagging behind real‑time business needs. Real‑time DW aims to combine warehouse theory with streaming technologies to address low data latency, improve data availability, and reduce resource waste.
Key Application Scenarios
Real‑time OLAP analysis built on Flink‑based StreamSQL, with Kafka and DDMQ for transport and Druid and ClickHouse for queries.
Real‑time dashboards for order and coupon metrics.
Real‑time business monitoring (safety, finance, complaints).
Real‑time data interface services for cross‑business collaboration.
Architecture Overview
The real‑time DW for the car‑pool business follows a layered structure similar to offline warehouses but with fewer layers to reduce latency.
Layer Details
1. ODS (Source Layer)
Data sources include order binlog, public logs, and traffic logs, ingested into Kafka or DDMQ. Naming conventions: cn-binlog-{db_name}-{table_name} for auto‑generated binlog topics, and realtime_ods_binlog_{business} / ods_log_{log_name} for custom topics.
2. DWD (Detail Layer)
Fact tables are built per business process, with selective dimension redundancy for wide tables. Data is processed via StreamSQL, stored in Kafka and optionally written to Druid for query and aggregation.
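The flattening step can be pictured with a minimal Python sketch. This is not Didi's actual StreamSQL code; the field names (order_id, city_id, and so on), the before/after binlog image layout, and the city dimension lookup are all illustrative assumptions standing in for the real parser and schema.

```python
import json

def to_dwd_order_base(binlog_msg: str, city_dim: dict) -> dict:
    """Flatten a parsed order-binlog message into a wide DWD record.

    Assumes the binlog parser emits a JSON envelope with an "after" image,
    as many CDC parsers do; all field names are hypothetical.
    """
    row = json.loads(binlog_msg)["after"]
    return {
        "order_id": row["order_id"],
        "status": row["status"],
        "city_id": row["city_id"],
        # selective dimension redundancy: denormalize the city name
        # into the fact row so downstream queries avoid a join
        "city_name": city_dim.get(row["city_id"], "unknown"),
        "event_time": row["update_time"],
    }

msg = json.dumps({"after": {"order_id": 1001, "status": "finished",
                            "city_id": 7, "update_time": "2019-07-01 10:00:00"}})
record = to_dwd_order_base(msg, {7: "Beijing"})
```

The redundant city_name column is the "selective dimension redundancy" the text describes: it trades a little storage in Kafka/Druid for join-free queries at the DWD layer.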
Naming pattern: realtime_dwd_{business}_{data_domain}_[{business_process}_]{tag} (e.g., realtime_dwd_trip_trd_order_base).
3. DIM (Dimension Layer)
Consistent dimension tables built using modeling principles, sourced from Flink‑processed ODS data and offline jobs. Storage options include MySQL, HBase, and Didi’s Fusion KV store. Naming pattern: dim_{business}_{dimension}_[{tag}] (e.g., dim_trip_dri_base).
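A common way to consume such dimension tables from a streaming job is a TTL'd local cache in front of the remote store. The sketch below is a hedged illustration, not Didi's implementation: the fetch callable stands in for whatever MySQL/HBase/Fusion client the real job uses, and the driver fields are made up.

```python
import time

class DimCache:
    """TTL'd read-through cache for dimension lookups.

    `fetch` is a stand-in for a real KV/DB client; entries are re-fetched
    after `ttl_seconds` so dimension updates eventually become visible.
    """
    def __init__(self, fetch, ttl_seconds: float = 60.0):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.cache = {}  # key -> (value, expires_at)

    def get(self, key):
        now = time.time()
        hit = self.cache.get(key)
        if hit and hit[1] > now:
            return hit[0]                 # fresh cache hit
        value = self.fetch(key)           # remote lookup on miss or expiry
        self.cache[key] = (value, now + self.ttl)
        return value

# hypothetical driver dimension, keyed by driver id
driver_dim = {"d1": {"driver_name": "Alice", "level": 3}}
cache = DimCache(lambda k: driver_dim.get(k))
enriched = {"order_id": 1, **cache.get("d1")}
```

The TTL bounds staleness: a shorter TTL tracks dimension changes more closely, at the cost of more remote lookups per key.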
4. DWM (Summary Layer)
Aggregated metrics (PV, UV, order counts) are computed per theme, with minute‑level granularity. Druid is used for UV de‑duplication, while custom aggregation logic runs in Flink. Naming pattern: realtime_dwm_{business}_{data_domain}_{primary_granularity}_{tag}_{time_period} (e.g., realtime_dwm_trip_trd_pas_bus_accum_1min).
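The minute-level PV/UV computation can be sketched in a few lines of Python. This toy version deduplicates UV exactly with an in-memory set; at Didi's scale the article offloads UV de-duplication to Druid instead, so treat this only as a model of the metric semantics.

```python
from collections import defaultdict

def aggregate_minute(events):
    """events: iterable of (timestamp 'HH:MM:SS', user_id) pairs.

    Returns per-minute PV (event count) and UV (distinct users),
    mirroring the minute-granularity DWM metrics.
    """
    pv = defaultdict(int)
    uv_sets = defaultdict(set)
    for ts, user in events:
        minute = ts[:5]            # truncate 'HH:MM:SS' to 'HH:MM'
        pv[minute] += 1            # PV: every event counts
        uv_sets[minute].add(user)  # UV: distinct users per minute
    return {m: {"pv": pv[m], "uv": len(uv_sets[m])} for m in pv}

events = [("10:00:01", "u1"), ("10:00:30", "u1"),
          ("10:00:59", "u2"), ("10:01:02", "u1")]
stats = aggregate_minute(events)
```

Note that UV is not additive across minutes (u1 appears in both windows), which is exactly why a dedicated de-duplication engine such as Druid is needed once state no longer fits in one process.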
5. APP (Application Layer)
Real‑time summary data is written to downstream systems such as Druid for dashboards, HBase for interface services, and MySQL/Redis for product features. No strict naming constraints are imposed.
StreamSQL Development
StreamSQL, built on Flink SQL, provides a descriptive language, stable interfaces, easy debugging, and batch‑stream integration. It adds DDL support for various sources/sinks, built‑in parsers for binlog and JSON, extensible UDX/UDFs, and advanced join capabilities (TTL‑based dual‑stream join, dimension table joins).
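The TTL-based dual-stream join can be modeled as two keyed buffers that expire entries after a time-to-live. The class below is a toy, single-threaded model of the idea, not Flink's state-TTL machinery; the order/coupon streams and field names are invented for illustration.

```python
class TtlJoin:
    """Toy model of a TTL-based dual-stream inner join.

    Each side buffers its latest row per key; a row that arrives on one
    side joins against a buffered row on the other side only while that
    row's TTL has not elapsed.
    """
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.left = {}    # key -> (row, insert_time)
        self.right = {}

    def _expire(self, buf: dict, now: float) -> None:
        for k in [k for k, (_, t) in buf.items() if now - t > self.ttl]:
            del buf[k]

    def on_left(self, key, row, now):
        self._expire(self.right, now)
        self.left[key] = (row, now)
        match = self.right.get(key)
        return (row, match[0]) if match else None

    def on_right(self, key, row, now):
        self._expire(self.left, now)
        self.right[key] = (row, now)
        match = self.left.get(key)
        return (match[0], row) if match else None

j = TtlJoin(ttl=10)
j.on_left("order1", {"amount": 30}, now=0)
pair = j.on_right("order1", {"coupon": 5}, now=5)      # within TTL: joins
j.on_right("order2", {"coupon": 2}, now=20)
expired = j.on_left("order2", {"amount": 9}, now=40)   # right side expired
```

The TTL is the key operational knob: it bounds state size (old rows are dropped) at the price of missing joins whose two sides arrive further apart than the TTL.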
Operational Support
IDE with SQL templates, UDF libraries, and online debugging.
Task operations: log retrieval (ES‑based), metric monitoring, alarm management, and lineage tracing.
Challenges & Future Outlook
Current challenges include initialization overhead, metric consistency between offline and real‑time, and governance of metric changes. Future work focuses on full batch‑stream integration via a unified MetaStore, enabling all engines (Hive, Spark, Presto, Flink) to share metadata and achieve seamless SQL development across batch and streaming.
Didi Tech (official Didi technology account)