
Real-time Data Warehouse Construction for Didi Ride-hailing's Carpool Service

This article details Didi's end‑to‑end real‑time data warehouse design for the carpool business, covering its objectives, architecture layers from ODS to application, naming conventions, StreamSQL development, operational tooling, challenges faced, and future batch‑stream integration plans.

DataFunTalk

The rapid growth of Didi's services has heightened the need for timely data, prompting extensive experimentation and practice in real‑time technologies. Using the carpool (顺风车) business as a case study, the article explains the purpose, requirements, and benefits of building a real‑time data warehouse.

Purpose and Motivation – Real‑time data supports fine‑grained operational decisions, improves product iteration speed, and enables instant business monitoring such as safety metrics, financial indicators, and promotional effectiveness.

Key Problems with Traditional Warehouses – Offline warehouses suffer from low data freshness, real‑time development lacks standardized pipelines, and resources are wasted. A real‑time data warehouse aims to address these gaps.

Application Scenarios – Real‑time OLAP analysis, live dashboards, business monitoring (safety, finance, complaints), and real‑time data‑service APIs.

Architecture Overview – The warehouse mirrors traditional layered designs (ODS, DWD, DIM, DWM, APP) but with fewer layers and different storage choices: Kafka/DDMQ for streams, HBase/MySQL/KV stores for dimensions, Druid for OLAP queries, and ClickHouse for aggregation.

Layer Details

ODS (source layer): Ingests binlog, public logs, and traffic events into Kafka topics, following naming conventions like cn-binlog-{db}-{table} or realtime_ods_binlog_{source}.

DWD (detail layer): Builds fine‑grained fact tables through StreamSQL ETL and stream joins, then writes the results to Kafka and Druid.

DIM (dimension layer): Stores shared dimensions in MySQL, HBase, or Didi's Fusion KV store, with naming pattern dim_{biz}_{dim}[_{tag}].

DWM (summary layer): Generates aggregated metrics (PV, UV, etc.) using minute‑level StreamSQL windows, stores results in Druid or other stores, and follows naming pattern realtime_dwm_{biz}_{domain}_{grain}_{tag}_{period}.
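
To make the minute‑level aggregation concrete, here is a minimal Python sketch of what a one‑minute tumbling window computes for PV (event count) and UV (distinct users). It is not StreamSQL and not Didi's implementation: the function name minute_pv_uv, the (epoch_seconds, user_id) event shape, and the sample events are all assumptions made for illustration, and it ignores the late‑data and watermark handling a real streaming engine provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # one-minute tumbling window (assumed grain)


def minute_pv_uv(events: Iterable[Tuple[int, str]]) -> Dict[int, Tuple[int, int]]:
    """Aggregate (epoch_seconds, user_id) events into {window_start: (pv, uv)}."""
    pv = defaultdict(int)     # window start -> number of events
    users = defaultdict(set)  # window start -> distinct user ids
    for ts, user_id in events:
        window_start = ts - ts % WINDOW_SECONDS  # align to the minute boundary
        pv[window_start] += 1
        users[window_start].add(user_id)
    return {w: (pv[w], len(users[w])) for w in sorted(pv)}


# Hypothetical sample events, not real data.
events = [(0, "u1"), (10, "u1"), (59, "u2"), (60, "u3"), (125, "u1")]
result = minute_pv_uv(events)
print(result)  # {0: (3, 2), 60: (1, 1), 120: (1, 1)}
```

Each window closes independently, so the same user can count toward UV in several windows; a daily UV would need state spanning the whole day rather than a sum of minute‑level values.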

APP (application layer): Pushes final aggregates to downstream databases (MySQL, Redis, HBase) for dashboards and API services.
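
The naming patterns quoted for the ODS, DIM, and DWM layers can be captured in a few small helpers, which is one way a team might keep names consistent. The sketch below only fills in the templates given above; the function names and every example value (order_db, carpool, trade, and so on) are hypothetical.

```python
from typing import Optional


def ods_binlog_topic(db: str, table: str) -> str:
    """Kafka topic name for a binlog source: cn-binlog-{db}-{table}."""
    return f"cn-binlog-{db}-{table}"


def ods_binlog_table(source: str) -> str:
    """ODS table name: realtime_ods_binlog_{source}."""
    return f"realtime_ods_binlog_{source}"


def dim_table(biz: str, dim: str, tag: Optional[str] = None) -> str:
    """Dimension table name: dim_{biz}_{dim}[_{tag}], where the tag is optional."""
    name = f"dim_{biz}_{dim}"
    return f"{name}_{tag}" if tag else name


def dwm_table(biz: str, domain: str, grain: str, tag: str, period: str) -> str:
    """Summary table name: realtime_dwm_{biz}_{domain}_{grain}_{tag}_{period}."""
    return f"realtime_dwm_{biz}_{domain}_{grain}_{tag}_{period}"


# Hypothetical example values, chosen only to show the shape of each name.
print(ods_binlog_topic("order_db", "order_base"))  # cn-binlog-order_db-order_base
print(dim_table("carpool", "city"))                # dim_carpool_city
print(dwm_table("carpool", "trade", "order", "pv", "1min"))
```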

StreamSQL Platform – Didi's StreamSQL extends Flink SQL with richer DDL, built‑in parsers for binlog, business logs, and JSON, and supports TTL‑based joins, dimension joins, and custom UDX/UDFs, enabling batch‑stream convergence.
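
The idea behind a TTL‑based join can be shown without any streaming framework: each side of the join buffers its rows by key, and buffered rows are dropped once they are older than the TTL, so state stays bounded. The Python sketch below is an illustration under assumptions, not StreamSQL's actual mechanism: the TtlJoin class, its method names, and the sample rows are hypothetical, it keeps only the latest row per key on each side, and it assumes events arrive in non‑decreasing timestamp order.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
State = Dict[str, Tuple[int, Row]]  # key -> (arrival timestamp, row)


class TtlJoin:
    """Inner join of two keyed streams; buffered rows expire after ttl_seconds."""

    def __init__(self, ttl_seconds: int) -> None:
        self.ttl_seconds = ttl_seconds
        self.left: State = {}
        self.right: State = {}

    def _expire(self, now: int) -> None:
        # Drop buffered rows on both sides that are older than the TTL.
        for state in (self.left, self.right):
            expired = [k for k, (ts, _) in state.items() if now - ts > self.ttl_seconds]
            for key in expired:
                del state[key]

    def _on_event(self, own: State, other: State, key: str, ts: int, row: Row) -> List[Row]:
        self._expire(ts)
        own[key] = (ts, row)
        if key not in other:
            return []  # no live match yet; the row waits in state until it expires
        # Both sides now hold the key: emit the merged row.
        return [{**self.left[key][1], **self.right[key][1]}]

    def on_left(self, key: str, ts: int, row: Row) -> List[Row]:
        return self._on_event(self.left, self.right, key, ts, row)

    def on_right(self, key: str, ts: int, row: Row) -> List[Row]:
        return self._on_event(self.right, self.left, key, ts, row)


# Hypothetical order and payment events with a 10-minute TTL.
join = TtlJoin(ttl_seconds=600)
first = join.on_left("o1", 0, {"order_id": "o1", "city": "beijing"})
matched = join.on_right("o1", 100, {"order_id": "o1", "fee": 12})
join.on_left("o2", 200, {"order_id": "o2", "city": "shanghai"})
too_late = join.on_right("o2", 900, {"order_id": "o2", "fee": 30})
print(first, matched, too_late)
```

In this run the o1 rows arrive 100 seconds apart and join, while the o2 payment arrives 700 seconds after its order, past the TTL, and produces no output. Choosing the TTL is a trade‑off between state size and the risk of missing slow matches.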

Operational Tooling – Includes a StreamSQL IDE with templates, UDF libraries, online debugging, task monitoring (log search, metric dashboards, alerts), and lineage tracing to diagnose pipeline issues.

Challenges and Solutions – Standardizing real‑time development processes, ensuring consistency between offline and online data, handling initialization overhead, and improving notification mechanisms for metric changes.
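
One common way to check consistency between offline and real‑time results is to compare the same metric from both pipelines and flag keys that differ by more than a tolerance. The article does not describe how Didi performs this check, so the sketch below is only an assumed approach: the function name find_mismatches, the 1% relative tolerance, and the sample values are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Pair = Tuple[Optional[float], Optional[float]]  # (real-time value, offline value)


def find_mismatches(
    realtime: Dict[str, float],
    offline: Dict[str, float],
    rel_tol: float = 0.01,  # assumed tolerance: 1% relative difference
) -> Dict[str, Pair]:
    """Return keys whose values are missing on one side or differ beyond rel_tol."""
    mismatches: Dict[str, Pair] = {}
    for key in sorted(set(realtime) | set(offline)):
        rt = realtime.get(key)
        off = offline.get(key)
        if rt is None or off is None:
            mismatches[key] = (rt, off)  # present in only one pipeline
            continue
        denom = max(abs(off), 1e-9)  # treat offline as the baseline; avoid dividing by zero
        if abs(rt - off) / denom > rel_tol:
            mismatches[key] = (rt, off)
    return mismatches


# Hypothetical daily order counts from each pipeline.
realtime = {"d1": 1000, "d2": 1200, "d3": 900}
offline = {"d1": 1005, "d2": 1300, "d4": 50}
report = find_mismatches(realtime, offline)
print(report)  # {'d2': (1200, 1300), 'd3': (900, None), 'd4': (None, 50)}
```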

Future Outlook – Pursue full batch‑stream integration via a unified MetaStore for Hive, Kafka, HBase, etc., allowing the same SQL to run on both batch and streaming engines.
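
A rough way to picture the unified MetaStore idea: a logical table name is registered once with both a batch and a streaming binding, and the engine resolves the name according to its execution mode, so the query text itself does not change. The Python sketch below is a toy model of that lookup only; TableBinding, MetaStore, the mode strings, and the example table and source names are assumptions, not the planned design.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableBinding:
    batch_source: str   # e.g. a Hive table (hypothetical name below)
    stream_source: str  # e.g. a Kafka topic (hypothetical name below)


class MetaStore:
    """Maps one logical table name to its physical source for each execution mode."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableBinding] = {}

    def register(self, name: str, binding: TableBinding) -> None:
        self._tables[name] = binding

    def resolve(self, name: str, mode: str) -> str:
        binding = self._tables[name]  # raises KeyError for unknown tables
        if mode == "batch":
            return binding.batch_source
        if mode == "stream":
            return binding.stream_source
        raise ValueError(f"unknown mode: {mode!r}")


store = MetaStore()
store.register(
    "dwd_order",
    TableBinding(batch_source="hive://dw.dwd_order", stream_source="kafka://dwd_order_topic"),
)

query = "SELECT city, COUNT(*) FROM dwd_order GROUP BY city"  # same text in both modes
print(store.resolve("dwd_order", "batch"))   # hive://dw.dwd_order
print(store.resolve("dwd_order", "stream"))  # kafka://dwd_order_topic
```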

Team and Author Information – The work involves Didi's Cloud Platform Big Data Architecture, Platform, and Real‑time Data Warehouse teams, with authors experienced in data modeling, warehouse construction, and streaming technologies.

Tags: Big Data, Flink, Stream Processing, Real-time Data Warehouse, Didi, carpool