
Practical Implementation of Flink on Kubernetes for Data Integration at Li Auto

This article details Li Auto's end‑to‑end data integration practice using Flink on Kubernetes, covering the evolution of their integration platform, architectural design, cloud‑native deployment, operational challenges, and future roadmap, while highlighting unified batch‑stream processing and resource elasticity.

DataFunTalk

The presentation introduces Li Auto's data integration journey, which progressed through four stages: an offline DataX‑based exchange in July 2020, a Flink real‑time platform in July 2021, a first integration pipeline (Kafka → Hive) in September 2022, and a unified batch‑stream capability added in April 2023.

Early on, the company faced fragmented data products, multiple heterogeneous engines (DataX, Flink, Spark, etc.), and difficulties sharing resources, leading to low utilization and complex development across different languages and platforms.

To address these pain points, three core requirements were defined: a unified platform that abstracts heterogeneous sources, a single compute engine handling both batch and streaming, and separation of compute and storage for independent elastic scaling.

Flink was chosen as the compute engine because its batch‑stream integration and Kubernetes‑native features enable seamless switching and elastic resource management. The platform now supports various sources (TiDB, MySQL, Oracle, Kafka, etc.) and sinks (Hive, StarRocks, MongoDB) through standardized connectors.

The architecture consists of a storage layer (JuiceFS + BOS) and a compute layer built on Flink Operator, which packages Flink into a container image, manages deployments via a custom resource definition (CRD), and provides a history server for post-run job analysis.
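As a rough sketch of what a CRD-managed deployment looks like, the fragment below follows the open-source flink-kubernetes-operator's `FlinkDeployment` resource; the image name, jar path, and resource figures are illustrative assumptions, not Li Auto's actual values:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: kafka-to-hive-sync            # hypothetical job name
spec:
  image: registry.example.com/flink:1.17   # assumed in-house Flink image
  flinkVersion: v1_17
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/integration-job.jar  # hypothetical connector-plugin jar
    parallelism: 4
    upgradeMode: savepoint            # stateful upgrades via savepoints
```

Applying this manifest with `kubectl apply` hands the full job lifecycle (submission, restarts, upgrades) to the operator rather than to a user-run client.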

In the platform's design, data integration is modeled as source-to-sink plugins: users define only the source, the transformation, and the target sink, without writing code. Typical scenarios include offline table synchronization, handling large Oracle full-load jobs with size-based parallelism, and simplifying real-time pipelines by configuring sources and sinks.
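To illustrate the "configure source and sink, no code" idea, a minimal sketch in Flink SQL (the table, topic, and server names are hypothetical; the article does not state that Li Auto's platform exposes raw SQL):

```sql
-- Hypothetical Kafka -> Hive pipeline expressed declaratively.
CREATE TABLE kafka_source (
  vin        STRING,
  event_time TIMESTAMP(3),
  payload    STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'vehicle-events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- The Hive sink is resolved through a registered HiveCatalog
-- rather than per-table connector options.
INSERT INTO hive_catalog.ods.vehicle_events
SELECT vin, event_time, payload FROM kafka_source;
```

A platform layer can generate such statements from a form-based job definition, which is how heterogeneous sources and sinks stay behind standardized connectors.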

For cloud‑native deployment, Flink Operator handles lifecycle management, status monitoring, and checkpointing. It integrates with Prometheus for metrics and alerts, and uses JuiceFS for shared storage, enabling checkpoint persistence and stateful upgrades.

Future plans involve adding more heterogeneous data sources, improving elastic scaling beyond current Kubernetes capabilities, enhancing massive data transfer performance, and addressing Flink's lack of predicate push‑down for filtered batch jobs.

Tags: cloud-native, big data, Flink, Kubernetes, batch processing, streaming, data integration
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
