Didi’s Real‑Time Computing Practices with Apache Flink and StreamSQL

Didi has unified its real‑time computing on Apache Flink and built StreamSQL, an enhanced SQL service with extended DDL, built‑in message format parsers, and extended UDX. The platform runs on thousand‑node clusters, hosts more than 3,000 streaming jobs, and processes trillions of records per day. This article covers the platform, its open challenges (state management, high availability, multi‑language UDFs), and plans for real‑time machine learning and a real‑time data warehouse.

Didi Tech

Apache Flink is a distributed big‑data processing engine that can perform stateful computations on both bounded and unbounded data streams. It can be deployed in various cluster environments and handles data of any scale efficiently.

Didi has heavily optimized Apache Flink and added many features such as extended DDL, built‑in message format parsing, and extended UDX, enabling Flink to serve a wide range of Didi business scenarios. In this article, Liang Li‑Yin, senior technical expert and real‑time computing leader at Didi, shares the application and practice of Flink at Didi.

Main topics covered:

Service overview

StreamSQL practice

Platform construction

Challenges and future plans

1. Service Overview

Didi has built a comprehensive big‑data ecosystem spanning offline and real‑time systems, including the HBase ecosystem, Elasticsearch for data retrieval, and Kafka as the message queue. On top of Flink, Didi’s main development effort is StreamSQL, which is described in detail below.

The evolution of Didi’s stream computing shows a transition from heterogeneous self‑built clusters (Storm, Spark Streaming, Samza, etc.) before 2017 to a unified, service‑oriented Flink‑based platform after 2017. By 2019, more than 50% of Didi’s streaming tasks were built on Flink via StreamSQL.

Didi’s real‑time computing services now support more than 50 business lines. They run on thousand‑node clusters, host more than 3,000 streaming jobs, and process trillions of records per day.

2. StreamSQL Practice

StreamSQL is an enhanced product built on top of Flink SQL, offering several advantages:

Declarative language – business users can describe logic without dealing with low‑level implementation.

Stable interface – SQL syntax remains unchanged across Flink version upgrades.

Easy troubleshooting – errors are reported against the SQL itself, so they are easy to locate.

Unified batch‑stream processing – shared syntax with HiveSQL and Spark SQL.

Low entry barrier – easy for developers to adopt.

Key improvements over vanilla Flink SQL include richer DDL support, built‑in message format parsers (e.g., for binlog, JSON), extended UDX libraries, and advanced join capabilities such as TTL‑based dual‑stream joins and dimension‑table joins.
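To make the idea of a TTL‑based dual‑stream join concrete, here is a minimal Python sketch. It is not Didi's implementation and it does not use any Flink API: the class name `TtlStreamJoin`, the `process` method, and the choice to expire state by event timestamp are all assumptions made for illustration. Each side buffers its records per join key, and a buffered record stops matching once it is older than the TTL.

```python
from collections import defaultdict


class TtlStreamJoin:
    """Toy inner join of two keyed streams whose buffered state expires after a TTL.

    Hypothetical illustration only; names and semantics are assumptions.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        # One buffer per side: join key -> list of (timestamp, record).
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def _expire(self, side, key, now):
        # Drop buffered records older than the TTL relative to the incoming record.
        buf = self.state[side]
        kept = [(ts, rec) for ts, rec in buf[key] if now - ts <= self.ttl]
        if kept:
            buf[key] = kept
        else:
            buf.pop(key, None)

    def process(self, side, key, ts, record):
        """Buffer one record and return the (left, right) pairs it produces."""
        other = "right" if side == "left" else "left"
        self._expire(other, key, ts)
        self._expire(side, key, ts)
        self.state[side][key].append((ts, record))
        matches = [rec for _, rec in self.state[other].get(key, [])]
        if side == "left":
            return [(record, m) for m in matches]
        return [(m, record) for m in matches]


if __name__ == "__main__":
    join = TtlStreamJoin(ttl_seconds=60)
    print(join.process("left", "order-1", 0, {"order": "created"}))
    print(join.process("right", "order-1", 30, {"pay": "ok"}))
    print(join.process("right", "order-1", 100, {"pay": "late"}))
```

In the demo, the second call joins with the buffered order because it arrives within 60 seconds, while the third finds nothing because the order has already expired. The TTL bounds how much state each key can hold, which is the point of using it for joins over unbounded streams.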

3. Platform Construction

Didi built a StreamSQL IDE that provides an editor, SQL templates, UDF documentation, syntax checking, debugging (including local data upload), and version management. All streaming jobs are submitted through a web portal that manages the full lifecycle: submission, stopping, upgrading, and rollback, with on‑the‑fly parameter tuning (e.g., task manager memory).

Operational tooling includes log collection into Elasticsearch, external metric dashboards, alarms with configurable thresholds, and lineage tracing across multi‑stage pipelines (source → stream → sink).
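A threshold‑based alarm of the kind mentioned above could be sketched as follows. This is a hypothetical illustration, not the platform's actual alarm system: the names `ThresholdRule` and `AlarmEvaluator`, the `consumer_lag` metric, and the rule of requiring several consecutive breaches before firing are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str            # hypothetical metric name, e.g. "consumer_lag"
    threshold: float       # a sample above this value counts as a breach
    consecutive: int = 3   # breaches in a row required before alerting


class AlarmEvaluator:
    """Tracks consecutive breaches of one rule and decides when to fire."""

    def __init__(self, rule):
        self.rule = rule
        self.breaches = 0

    def observe(self, value):
        """Feed one metric sample; return True when the alarm should fire."""
        if value > self.rule.threshold:
            self.breaches += 1
        else:
            # Any healthy sample resets the streak.
            self.breaches = 0
        return self.breaches >= self.rule.consecutive


if __name__ == "__main__":
    rule = ThresholdRule(metric="consumer_lag", threshold=1000, consecutive=2)
    evaluator = AlarmEvaluator(rule)
    for sample in [500, 1500, 2000, 800, 1200]:
        print(sample, evaluator.observe(sample))
```

Requiring consecutive breaches is one simple way to avoid alerting on a single noisy sample; a real system would likely add time windows and notification routing.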

4. Challenges and Future Plans

Challenges faced by Didi’s Flink deployment include:

Large state management – checkpoint I/O overhead and lack of transparent state diagnostics.

Business high availability – need for seamless upgrades, rapid issue diagnosis, and elastic scaling during peak traffic.

Multi‑language support – most business teams use Go or Python, so Didi aims to provide multi‑language UDF development.

Future directions focus on:

Providing highly available streaming services for all online business.

Real‑time machine learning – moving from 10‑15 minute model updates to sub‑second updates.

Real‑time data warehouse – achieving real‑time reporting while keeping consistency with offline data, and gradually migrating older data to offline stores.

Source: Flink Forward ASIA (original article published by DataFunTalk).
