
Jingdong's Flink Real‑Time Computing Platform: Containerization, Optimizations, and Future Roadmap

This article details Jingdong's evolution from Storm to Flink, the architecture of its Kubernetes‑based real‑time computing platform, extensive containerization practices, performance and stability optimizations, and the future plan to unify batch‑stream processing while expanding SQL support and intelligent operations.

DataFunTalk

Introduction

Flink is a popular stream‑processing engine known for high throughput and low latency. Since 2018, Jingdong has built a high‑performance, stable, and easy‑to‑use real‑time computing platform on top of Flink + Kubernetes, supporting major sales events such as 618 and Double‑11.

Real‑Time Computing Evolution

Initially (2014) Jingdong used Storm for stream processing, adding cgroup isolation, network transport compression, and topology master sharding; by 2016 Storm had become the main engine. In 2017 Spark Streaming was introduced for high-throughput scenarios that could tolerate its micro-batch latency. In 2018 Flink was adopted to meet low-latency and high-throughput requirements simultaneously, with extensive customizations for performance, stability, usability, and functionality, plus a new SQL platform to lower the development barrier.

By 2020 the Flink‑K8s platform was mature, and batch processing support began to emerge, moving toward a unified batch‑stream architecture and AI‑driven real‑time analytics.

Platform Architecture

The Jingdong Real‑Time Computing (JRC) platform centers on a customized Flink running on Kubernetes. State is stored in an HDFS cluster, and Zookeeper ensures high availability. It ingests data from JDQ (a Kafka‑based real‑time data bus) and Hive, writes results to JimDB (an in‑memory DB), Elasticsearch, HBase, and the internal OLAP system. Jobs can be submitted as SQL statements or as ordinary JAR packages, with full configuration, deployment, debugging, monitoring, and log handling capabilities.

Business Scenarios & Scale

More than 70 primary business units use the platform for real‑time data warehouses, dashboards, recommendation, reporting, risk control, and monitoring, among other use cases. Over 3,000 streaming jobs run on a cluster of more than 5,000 physical machines, processing tens of billions of events per minute.

Flink Containerization Practice

Containerization started in 2018, achieving full containerization of compute units by early 2019. In 2020 the solution was upgraded to native Kubernetes, enabling elastic resource scaling. Benefits include better resource isolation, self‑healing, and easier elastic scheduling, as well as a new SQL platform that simplifies application development.

Original Containerization Scheme

The initial approach used Kubernetes Deployments to launch a standalone Flink session cluster, pre‑allocating JobManager and TaskManager resources. Zookeeper provided HA, HDFS stored state, Prometheus collected metrics, Grafana visualized them, and Elasticsearch stored logs.

Problems & Countermeasures

JobManager/TaskManager automatic recovery using liveness probes and pod restart policies.

Minimizing pod‑restart impact through failover strategies and selective task state reconstruction.

Performance tuning covering container-aware JVM CPU/memory settings, load-aware data dispatch, and optional host-network usage.

Ensuring business stability via high‑availability, multi‑zone deployment, and resource pooling.

Zookeeper debounce: upgrading Curator and setting SessionConnectionStateErrorPolicy to avoid unnecessary leader loss.

Log separation: loading log‑related jars in the user classloader and using logback MDC for framework‑provided sinks.
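The "minimizing pod-restart impact" item above hinges on region-based failover: when a task fails, only the tasks in its connected pipelined region are restarted, not the whole job. A minimal sketch of that idea (the graph and helper names are illustrative, not JRC's actual code):

```python
from collections import defaultdict, deque

def failover_region(edges, failed_task):
    """Return the set of tasks to restart: the connected component
    (treating pipelined edges as undirected) containing the failed task."""
    adj = defaultdict(set)
    for upstream, downstream in edges:
        adj[upstream].add(downstream)
        adj[downstream].add(upstream)
    region, queue = {failed_task}, deque([failed_task])
    while queue:
        task = queue.popleft()
        for neighbor in adj[task]:
            if neighbor not in region:
                region.add(neighbor)
                queue.append(neighbor)
    return region

# Two independent pipelines; a failure in one leaves the other untouched.
edges = [("src-1", "map-1"), ("map-1", "sink-1"),
         ("src-2", "map-2"), ("map-2", "sink-2")]
print(sorted(failover_region(edges, "map-1")))  # only the first pipeline restarts
```

In a real Flink job the regions are derived from the execution graph's pipelined data exchanges; the sketch only shows why a pod restart need not touch unrelated tasks.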

Native Kubernetes Upgrade

The upgraded scheme lets JRC request a JobManager Deployment from the K8s master, then dynamically allocate TaskManager pods via JDResourceManager after a user submits a job through a REST service. This decouples the platform from specific container runtimes and preserves slot‑allocation semantics by pre‑estimating required resources.
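Pre-estimating resources before any TaskManager pod exists can be as simple as deriving the pod count from the slots the job needs. A hypothetical sketch (the function and numbers are illustrative; with Flink's default slot sharing, a job's required slots equal its widest operator's parallelism):

```python
import math

def estimate_taskmanagers(operator_parallelisms, slots_per_tm):
    """With slot sharing, the slots a job needs is the maximum operator
    parallelism; divide by slots per TaskManager to get the pod count."""
    required_slots = max(operator_parallelisms)
    return math.ceil(required_slots / slots_per_tm)

# A job whose widest operator has parallelism 10, with 4 slots per TaskManager:
print(estimate_taskmanagers([2, 10, 5], slots_per_tm=4))  # -> 3 pods
```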

Flink Optimizations

Four major improvement areas:

Performance

Stability

Usability

Feature extensions

Preview Topology

A visual topology preview allows users to adjust parallelism, slot grouping, and chaining without repeatedly editing command lines, ensuring stable operator IDs via UID hashing or deterministic graph‑based hashes.
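Stable operator IDs matter because Flink maps checkpointed state back to operators by ID; if the IDs drift when the graph is edited in the preview, state cannot be restored. A minimal sketch of deterministic ID generation from an operator's name and its upstream chain (illustrative only, not Flink's actual StreamGraphHasher):

```python
import hashlib

def operator_uid(name, upstream_uids=()):
    """Derive a deterministic operator ID from the operator's name and the
    IDs of its inputs, so re-submitting the same graph yields the same IDs."""
    h = hashlib.sha256(name.encode())
    for uid in upstream_uids:
        h.update(uid.encode())
    return h.hexdigest()[:16]

src = operator_uid("kafka-source")
mapped = operator_uid("enrich-map", [src])
# Re-hashing the same graph reproduces the same IDs:
assert mapped == operator_uid("enrich-map", [operator_uid("kafka-source")])
```

In practice Flink users can also pin IDs explicitly with `uid(...)` on each operator, which the preview can then leave untouched.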

Backpressure Quantification

Two methods were evaluated: UI‑based observation (limited by missing data, lack of history, and high‑parallelism overhead) and metric‑based collection (issues with version‑specific semantics and analysis complexity). Jingdong’s solution records backpressure location, time, and count, stores them, and correlates with the runtime topology for precise diagnosis.
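Recording where and when backpressure occurs, rather than sampling it live in the UI, can be sketched as a simple event log aggregated per operator (class and field names are illustrative, not JRC's schema):

```python
from collections import Counter

class BackpressureRecorder:
    """Accumulate backpressure events as (operator, timestamp) records and
    expose per-operator counts for later correlation with the topology."""
    def __init__(self):
        self.events = []
        self.counts = Counter()

    def record(self, operator, timestamp):
        self.events.append((operator, timestamp))
        self.counts[operator] += 1

    def hotspots(self, top_n=3):
        """Operators ranked by how often they reported backpressure."""
        return self.counts.most_common(top_n)

rec = BackpressureRecorder()
for ts in (100, 105, 110):
    rec.record("join-operator", ts)
rec.record("sink-operator", 120)
print(rec.hotspots())  # join-operator is the dominant bottleneck
```

Keeping the raw (operator, timestamp) pairs is what enables the historical analysis the UI-based approach lacks.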

HDFS Optimizations

To alleviate RPC latency and small‑file pressure, checkpoints are limited to a minimum interval (~1 min), small files are merged, RPC traffic during checkpoint creation/deletion is reduced, and the HDFS cluster is balanced across multiple name‑services.
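The minimum-interval rule can be sketched as a gate that refuses a checkpoint trigger until enough time has passed since the last accepted one (the interval and names are illustrative; Flink exposes a related knob as `CheckpointConfig.setMinPauseBetweenCheckpoints`):

```python
class CheckpointGate:
    """Allow a checkpoint trigger only if at least min_interval_s seconds
    have elapsed since the previous accepted trigger."""
    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_trigger_s = None

    def try_trigger(self, now_s):
        if (self.last_trigger_s is not None
                and now_s - self.last_trigger_s < self.min_interval_s):
            return False  # too soon: skip to cut HDFS RPC and small files
        self.last_trigger_s = now_s
        return True

gate = CheckpointGate(min_interval_s=60)
print(gate.try_trigger(0))   # True  -> checkpoint runs
print(gate.try_trigger(30))  # False -> suppressed
print(gate.try_trigger(61))  # True
```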

Network Dispatch Optimization

A load‑aware dynamic rebalance selects the downstream channel with the smallest load, achieving up to a 2× performance boost in certain scenarios.
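Instead of round-robin, a load-aware rebalance inspects the downstream channels' backlog and routes each record to the least-loaded one. A simplified sketch, with queue sizes standing in for whatever backlog metric the real partitioner samples:

```python
def select_channel(channel_loads):
    """Pick the index of the downstream channel with the smallest backlog;
    ties resolve to the lowest index."""
    return min(range(len(channel_loads)), key=lambda i: channel_loads[i])

loads = [42, 7, 19]           # pending records per downstream channel
print(select_channel(loads))  # -> 1, the least-loaded channel
```

The gain over plain rebalance comes from skewed downstreams: slow channels stop accumulating work while fast ones sit idle.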

Zookeeper Debounce & Log Separation

Upgrading Curator and adjusting error policies mitigates unnecessary task restarts. Log separation isolates task logs from framework logs by loading log jars in the user classloader and using MDC for framework sinks.
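The debounce idea: a brief SUSPENDED connection state (say, during a ZooKeeper rolling restart or a GC pause) should not immediately count as leadership loss; only a LOST state, or a SUSPENDED state that outlives the session timeout, should trigger failover. A simplified decision sketch (timings and names are illustrative, not Curator's implementation):

```python
def should_revoke_leadership(state, suspended_for_s, session_timeout_s):
    """Treat a short SUSPENDED as a transient blip; revoke leadership only
    on LOST or when SUSPENDED has outlasted the session timeout."""
    if state == "LOST":
        return True
    if state == "SUSPENDED":
        return suspended_for_s >= session_timeout_s
    return False  # CONNECTED / RECONNECTED

print(should_revoke_leadership("SUSPENDED", 5, 60))   # False: debounced
print(should_revoke_leadership("SUSPENDED", 90, 60))  # True
print(should_revoke_leadership("LOST", 0, 60))        # True
```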

Future Roadmap

Four focus areas:

Unify the compute engine by migrating all Storm workloads to Flink.

Expand SQL job support and lower the entry barrier for users.

Intelligent operations with self‑adaptive diagnostics to improve robustness.

Deepen batch‑stream integration, providing a unified platform with code reuse and reduced user cost.

The presentation concludes with thanks and a call for audience interaction.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
