
Real-time Data Warehouse Construction at Didi: Architecture, Practices, and Lessons

To support Didi’s fast‑growing car‑pool service, a real‑time data warehouse was built using a streamlined layered architecture—ODS, DWD, DIM, DWM, and APP—leveraging Flink‑based StreamSQL, Kafka, Druid and ClickHouse to deliver minute‑level analytics, dashboards, monitoring, and cross‑business interfaces while planning unified meta‑store integration.

Didi Tech

Didi’s rapid business growth has driven an increasing demand for timely data. To meet this need, Didi has undertaken extensive experiments and practices in real‑time data warehouse (real‑time DW) construction, using the carpooling (Hitch) business as a case study.

Purpose of Real‑time DW

As the internet industry enters its “second half,” where growth comes from fine‑grained operations rather than raw user acquisition, data timeliness becomes crucial. A real‑time DW enables rapid extraction of valuable insights from massive daily data, supporting faster decision‑making, product iteration, and operational adjustments.

Problems of Traditional DW

Traditional warehouses focus on accumulating historical data and often lag behind real‑time business needs. A real‑time DW combines warehouse methodology with streaming technologies to achieve low data latency, improve data availability, and reduce resource waste.

Key Application Scenarios

Real‑time OLAP analysis using Flink‑based StreamSQL, Kafka, DDMQ, Druid, ClickHouse.

Real‑time dashboards for order and coupon metrics.

Real‑time business monitoring (safety, finance, complaints).

Real‑time data interface services for cross‑business collaboration.

Architecture Overview

The real‑time DW for the car‑pool business follows a layered structure similar to offline warehouses but with fewer layers to reduce latency.

Layer Details

1. ODS (Source Layer)

Data sources include order binlog, public logs, and traffic logs, ingested into Kafka or DDMQ. Naming conventions: cn-binlog-<database>-<table> for auto‑generated topics and realtime_ods_binlog_<table> / ods_log_<log name> for custom topics.
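As a sketch, an ODS binlog topic can be registered as a streaming source in standard Flink SQL (which StreamSQL extends). The topic name, fields, and connector settings below are illustrative assumptions, not Didi’s actual schema:

```sql
-- Illustrative only: register a Kafka-backed ODS binlog topic as a source table.
CREATE TABLE realtime_ods_binlog_order (
  order_id     BIGINT,
  passenger_id BIGINT,
  city_id      INT,
  status       INT,
  update_time  TIMESTAMP(3),
  -- Event-time watermark allowing 5 seconds of out-of-orderness
  WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'realtime_ods_binlog_order',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'json'
);
```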

2. DWD (Detail Layer)

Fact tables are built per business process, with selective dimension redundancy for wide tables. Data is processed via StreamSQL, stored in Kafka and optionally written to Druid for query and aggregation.

Naming pattern: realtime_dwd_<business>_<domain>_[<sub-domain>]_<table> (e.g., realtime_dwd_trip_trd_order_base).
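A DWD job then cleanses the ODS stream into the wide fact topic. A minimal sketch in standard Flink SQL, with all table and field names hypothetical:

```sql
-- Illustrative only: cleanse an ODS binlog stream into a Kafka-backed DWD fact table.
INSERT INTO realtime_dwd_trip_trd_order_base     -- hypothetical DWD sink
SELECT order_id, passenger_id, city_id, status, update_time
FROM realtime_ods_binlog_order                   -- hypothetical ODS source
WHERE order_id IS NOT NULL;                      -- drop malformed binlog rows
```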

3. DIM (Dimension Layer)

Consistent dimension tables are built using modeling principles, sourced from Flink‑processed ODS data and offline jobs. Storage options include MySQL, HBase, and Didi’s Fusion KV store. Naming pattern: dim_<business>_<domain>_[<table>] (e.g., dim_trip_dri_base).
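For the MySQL storage option, a dimension table can be exposed to streaming jobs as a lookup table. A hedged sketch using the standard Flink JDBC connector, with illustrative names and cache settings:

```sql
-- Illustrative only: a dimension table served from MySQL for lookup joins,
-- with a lookup cache to limit load on the store.
CREATE TABLE dim_trip_dri_base (
  driver_id   BIGINT,
  driver_name STRING,
  city_id     INT,
  PRIMARY KEY (driver_id) NOT ENFORCED
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:mysql://mysql-host:3306/dim',
  'table-name' = 'dim_trip_dri_base',
  'lookup.cache.max-rows' = '10000',
  'lookup.cache.ttl'      = '10min'
);
```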

4. DWM (Summary Layer)

Aggregated metrics (PV, UV, order counts) are computed per theme, with minute‑level granularity. Druid is used for UV de‑duplication, while custom aggregation logic runs in Flink. Naming pattern: realtime_dwm_<business>_<domain>_<subject>_<metric>_<window> (e.g., realtime_dwm_trip_trd_pas_bus_accum_1min).
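A minute‑level summary can be expressed as a one‑minute tumbling‑window aggregation in standard Flink SQL. All identifiers below are hypothetical:

```sql
-- Illustrative only: per-city order PV and passenger UV in one-minute
-- tumbling windows over a hypothetical DWD fact stream.
INSERT INTO realtime_dwm_trip_trd_pas_bus_accum_1min
SELECT
  city_id,
  TUMBLE_START(update_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(order_id)              AS order_pv,
  COUNT(DISTINCT passenger_id) AS passenger_uv   -- exact dedup in Flink; Druid
                                                 -- can serve approximate UV instead
FROM realtime_dwd_trip_trd_order_base
GROUP BY city_id, TUMBLE(update_time, INTERVAL '1' MINUTE);
```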

5. APP (Application Layer)

Real‑time summary data is written to downstream systems such as Druid for dashboards, HBase for interface services, and MySQL/Redis for product features. No strict naming constraints are imposed.

StreamSQL Development

StreamSQL, built on Flink SQL, provides a declarative language, stable interfaces, easy debugging, and batch‑stream integration. It adds DDL support for various sources and sinks, built‑in parsers for binlog and JSON, extensible user‑defined functions (UDX), and advanced join capabilities (TTL‑based dual‑stream joins and dimension table joins).
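The two join styles can be sketched in standard Flink SQL terms; the table and field names below are illustrative assumptions:

```sql
-- Illustrative only: dimension-table (lookup) join, enriching each order
-- with driver attributes as of the event's processing time.
SELECT o.order_id, d.driver_name
FROM orders AS o
JOIN dim_trip_dri_base FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.driver_id = d.driver_id;

-- Illustrative only: dual-stream join with a time bound. An interval join
-- plays the role of a TTL: state for unmatched rows is dropped once the
-- window has passed.
SELECT o.order_id, p.pay_amount
FROM orders AS o
JOIN payments AS p
  ON o.order_id = p.order_id
 AND p.pay_time BETWEEN o.order_time AND o.order_time + INTERVAL '30' MINUTE;
```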

Operational Support

IDE with SQL templates, UDF libraries, and online debugging.

Task operations: log retrieval (ES‑based), metric monitoring, alarm management, and lineage tracing.

Challenges & Future Outlook

Current challenges include initialization overhead, metric consistency between offline and real‑time, and governance of metric changes. Future work focuses on full batch‑stream integration via a unified MetaStore, enabling all engines (Hive, Spark, Presto, Flink) to share metadata and achieve seamless SQL development across batch and streaming.

Tags: Flink, stream processing, data platform, real-time data warehouse, big data architecture, StreamSQL
Written by Didi Tech, the official Didi technology account.
