
JD's Flink Journey: Evolution, Optimizations, and Future Directions

This article details JD's adoption of Flink for real‑time computing: the evolution from Storm to Flink on Kubernetes, the platform architecture, major optimization techniques such as preview topology, backpressure quantification, dynamic rebalance, and checkpoint‑as‑savepoint, and future plans including stream‑batch integration, stability improvements, intelligent operations, and AI integration.

DataFunTalk

Introduction

On the first day of 2022, DataFunTalk presents a technical sharing session by JD expert Fu Haitao, focusing on JD's practical experience with Flink, a popular streaming engine used in real‑time data warehousing, risk control, and recommendation scenarios.

1. Evolution and Application

JD started with Storm in 2014, moved to Spark Streaming in 2017, and adopted Flink with Kubernetes in 2018 to achieve low latency, high throughput, stateful computation, and exactly‑once semantics. By 2019 the entire real‑time platform ran on Kubernetes, and JD built a new SQL platform based on Flink 1.8. In 2020‑2021 JD unified engines, added intelligent diagnosis and elastic scaling, and extended the platform to support batch processing, moving toward a unified stream‑batch architecture.

The platform architecture places Flink at its core, deployed on a K8s cluster, with state stored in HDFS and high availability provided by ZooKeeper. It integrates JD's proprietary message queue JDQ and can write results to Hive, HBase, and other storage systems.

2. Flink Optimization Improvements

2.1 Preview Topology

Users can preview the job topology after submission, adjust operator parallelism, group slots, and view network resource usage, enabling convenient task tuning.

2.2 Backpressure Quantification

JD identified limitations of the default Flink UI backpressure view (incomplete collection, no history, non‑intuitive impact) and enhanced monitoring by exposing backpressure metrics (position, time, count) to Prometheus, allowing historical analysis.
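JD's internal metric code is not public, but the core idea (sampling how long an operator is blocked on downstream buffers and exporting a ratio that can be queried over time) can be sketched as follows. The class and method names here are illustrative, not JD's or Flink's API:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BackpressureTracker:
    """Hypothetical per-operator tracker: keeps a sliding window of samples
    so backpressure can be analysed historically (e.g. scraped by Prometheus),
    unlike the default UI view which only shows the current state."""
    window: int = 60
    samples: deque = field(default_factory=deque)

    def record(self, timestamp: float, blocked_ms: float) -> None:
        """Record one sample: how many ms the operator spent blocked."""
        self.samples.append((timestamp, blocked_ms))
        while len(self.samples) > self.window:
            self.samples.popleft()

    def ratio(self) -> float:
        """Fraction of recent samples in which the operator was blocked
        (0.0 = no backpressure, 1.0 = fully backpressured)."""
        if not self.samples:
            return 0.0
        blocked = sum(1 for _, ms in self.samples if ms > 0)
        return blocked / len(self.samples)
```

A real exporter would publish `ratio()` as a gauge per subtask; the point is that retaining samples makes backpressure quantifiable and queryable after the fact.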

2.3 File System Multi‑Configuration Support

To allow shared clusters while isolating configurations for different services, JD introduced schema‑based isolation for file system configurations, enabling seamless reads and writes across multiple OSS stores.
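A minimal sketch of scheme-based isolation: the filesystem configuration is chosen by the URI scheme of the path, so one job can read from one OSS store and write to another. `FS_CONFIGS` and `config_for` are hypothetical names; real entries would hold Flink/Hadoop filesystem properties:

```python
from urllib.parse import urlparse

# Hypothetical per-scheme configuration table. Each scheme maps to an
# independent set of credentials/endpoints, isolating services that
# share the same cluster.
FS_CONFIGS = {
    "oss-a": {"endpoint": "oss-a.example.com", "access_key": "KEY_A"},
    "oss-b": {"endpoint": "oss-b.example.com", "access_key": "KEY_B"},
}


def config_for(path: str) -> dict:
    """Resolve the filesystem configuration from the path's URI scheme."""
    scheme = urlparse(path).scheme
    try:
        return FS_CONFIGS[scheme]
    except KeyError:
        raise ValueError(f"no filesystem configured for scheme '{scheme}'")
```

With this lookup in place, a path like `oss-a://bucket/input` and a sink at `oss-b://bucket/output` resolve to different credential sets transparently.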

2.4 Data Distribution Optimization

When upstream and downstream parallelism differ, the default rebalance can be sub‑optimal. JD implemented two improvements:

Dynamic rebalance based on downstream load, directing data to the fastest downstream tasks, achieving up to a 2× performance boost in unbalanced scenarios.

Using rescale instead of rebalance when parallelism ratios are proportional, reducing network buffers and improving fault‑tolerance via Flink's region mechanism.
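The two strategies above can be illustrated with a small sketch. `select_channel` and `rescale_targets` are hypothetical helpers, not Flink APIs (in Flink, load-aware routing would be implemented as a custom ChannelSelector):

```python
import random


def select_channel(loads: list[int]) -> int:
    """Dynamic rebalance sketch: route the next record to the downstream
    subtask with the smallest queued load instead of blind round-robin."""
    min_load = min(loads)
    # Break ties randomly so equally idle subtasks share the traffic.
    candidates = [i for i, load in enumerate(loads) if load == min_load]
    return random.choice(candidates)


def rescale_targets(upstream_index: int, upstream: int, downstream: int) -> list[int]:
    """Rescale sketch: with proportional parallelism, each upstream subtask
    connects only to its own group of downstream subtasks, cutting network
    channels (and buffers) from upstream*downstream to downstream, and
    shrinking the failover region."""
    assert downstream % upstream == 0, "rescale assumes proportional parallelism"
    group = downstream // upstream
    return list(range(upstream_index * group, (upstream_index + 1) * group))
```

For example, with 2 upstream and 4 downstream subtasks, rescale gives each upstream subtask exactly two local targets instead of fanning out to all four.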

2.5 Last Checkpoint as Savepoint

JD developed a feature that persists the last checkpoint as a savepoint, allowing jobs to resume from the most recent checkpoint after a stop, and automatically cleaning up the previous checkpoint after a successful new checkpoint.
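The retention policy reads roughly like this sketch (a hypothetical in-memory model; a real implementation would delete checkpoint directories on the state backend's filesystem):

```python
class CheckpointRetention:
    """Sketch of the policy: always retain the most recent completed
    checkpoint so it can serve as the savepoint after a stop; once a newer
    checkpoint completes, the previous one is cleaned up."""

    def __init__(self):
        self.retained = None   # path of the last completed checkpoint
        self.deleted = []      # stands in for filesystem deletes

    def on_checkpoint_complete(self, path: str) -> None:
        """Called when a checkpoint finishes successfully."""
        if self.retained is not None:
            self.deleted.append(self.retained)  # clean up the older one
        self.retained = path

    def restore_point(self):
        """Path the stopped job resumes from, savepoint-style."""
        return self.retained
```

The invariant is that exactly one checkpoint survives at any time, so a stopped job always has a fresh restore point without the cost of triggering an explicit savepoint.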

2.6 Other Optimizations

- HDFS small‑file merging and buffer tuning.
- ZooKeeper debounce to avoid unnecessary restarts under network jitter.
- Task‑level failover for localized recovery.
- Cluster task isolation to prevent cross‑job interference.
- Enhanced logging with dynamic level control.
- SQL extensions (incremental window, offset support).
- Intelligent diagnosis that automatically analyses jobs and provides recommendations.
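As one example, the ZooKeeper debounce can be modeled as a grace period on session loss: a brief disconnect caused by network jitter is ignored, and only an outage longer than the grace period triggers a restart. This is a sketch under assumed names; JD's actual change lives inside Flink's HA services:

```python
class ZkDebounce:
    """Hypothetical debounce: treat a ZooKeeper disconnect as fatal only if
    the session stays lost longer than `grace_s` seconds, so transient
    network jitter does not restart healthy jobs."""

    def __init__(self, grace_s: float = 10.0):
        self.grace_s = grace_s
        self.disconnected_at = None

    def on_disconnect(self, now: float) -> None:
        """Record when the session was lost; do nothing yet."""
        self.disconnected_at = now

    def on_reconnect(self, now: float) -> bool:
        """Return True if the outage exceeded the grace period and the
        job should go through its normal failover path."""
        if self.disconnected_at is None:
            return False
        outage = now - self.disconnected_at
        self.disconnected_at = None
        return outage > self.grace_s
```

In practice the timeout check would run on a timer rather than waiting for the reconnect event, but the decision rule is the same: short blips are absorbed, sustained outages still fail over.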

3. Future Plans

JD aims to pursue four directions:

Stream‑batch integration to support low‑latency streaming and high‑performance batch processing within a single engine.

Improving stability by reducing the overhead of Flink's recovery mechanisms in containerized environments.

Intelligent operations that automatically adjust parameters and perform elastic scaling based on runtime diagnostics.

Exploring AI integration to enable real‑time, intelligent AI scenarios on top of Flink.

Thank you for listening.

Tags: Optimization, Big Data, Flink, Kubernetes, Streaming, Real-Time Computing, JD
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
