Flink Stream‑Batch Integration: Layered Architecture, Unified SDK, DAG Scheduler, Shuffle, and Fault‑Tolerance
This article explains how Apache Flink has evolved into a unified stream‑batch engine. It introduces a three‑layer architecture, a unified DataStream SDK, a pipeline‑region‑based DAG scheduler, a common shuffle framework, and enhanced fault‑tolerance mechanisms, together addressing the efficiency, consistency, and resource‑utilisation challenges of real‑time big‑data processing.
Apache Flink has become the de facto standard for real‑time big‑data computation, used by companies such as Alibaba, ByteDance, Tencent, Netflix, and Uber. The growing demand for real‑time analytics drives the need for a unified stream‑batch platform that can handle both unbounded and bounded data with a single codebase.
Layered Architecture – Flink’s core engine is divided into three layers: the SDK layer (Relational SQL/Table and Physical DataStream SDKs), the execution‑engine layer (a unified DAG that is transformed into tasks by a unified DAG scheduler and communicates via a pluggable shuffle), and the state‑storage layer (RocksDB, Memory, and Batch state backends).
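The layering above can be modelled as a small sketch. This is illustrative Python, not Flink code: all class and method names here are hypothetical, and each class stands in for one of the three layers described (SDK layer producing a unified DAG, execution‑engine layer scheduling it, pluggable state‑storage layer underneath).

```python
# Hypothetical model of the three-layer split: SDK -> unified DAG ->
# scheduler -> tasks, with a pluggable state backend. Not real Flink APIs.

from dataclasses import dataclass, field

@dataclass
class UnifiedDag:
    """Logical job graph produced by either the SQL/Table or DataStream SDK."""
    vertices: list = field(default_factory=list)

class StateBackend:
    """State-storage layer: RocksDB, memory, or a batch backend plugs in here."""
    name = "memory"

class DagScheduler:
    """Execution-engine layer: turns the unified DAG into runnable tasks."""
    def __init__(self, backend: StateBackend):
        self.backend = backend

    def schedule(self, dag: UnifiedDag) -> list:
        # One task per logical vertex; real scheduling is far richer.
        return [f"task:{v}" for v in dag.vertices]

dag = UnifiedDag(vertices=["source", "map", "sink"])
tasks = DagScheduler(StateBackend()).schedule(dag)
print(tasks)  # ['task:source', 'task:map', 'task:sink']
```

The point of the separation is that the SDK layer can change (SQL vs. DataStream) without touching the scheduler, and the state backend can be swapped per workload.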
Unified Physical SDK – To reduce development and operational costs, Flink consolidates its Physical SDKs into a single Unified DataStream SDK, which supports both bounded and unbounded inputs while preserving low‑level control over state, timers, and operators.
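The "one pipeline, two modes" idea can be sketched as follows. This is a hedged, hypothetical Python model (not the real Flink DataStream API): the same keyed‑aggregation logic runs in a streaming mode that emits incremental updates per record, or in a batch mode that, because the input is bounded, emits only one final result per key.

```python
# Hypothetical sketch: one pipeline definition serving both execution modes,
# as the Unified DataStream SDK aims to do. Function and mode names are ours.

from collections import defaultdict

def keyed_sum(records, mode):
    """records: iterable of (key, value) pairs.
    'streaming' emits every intermediate total (unbounded input, low latency);
    'batch' emits one final total per key (bounded input)."""
    totals = defaultdict(int)
    out = []
    for key, value in records:
        totals[key] += value
        if mode == "streaming":
            out.append((key, totals[key]))   # incremental update per record
    if mode == "batch":
        out.extend(sorted(totals.items()))   # single final result per key
    return out

data = [("a", 1), ("b", 2), ("a", 3)]
print(keyed_sum(data, "streaming"))  # [('a', 1), ('b', 2), ('a', 4)]
print(keyed_sum(data, "batch"))      # [('a', 4), ('b', 2)]
```

A single user‑facing definition with mode‑dependent execution is what lets one codebase cover both unbounded and bounded inputs.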
Unified DAG Scheduler – By introducing the concept of Pipeline Regions, the scheduler can allocate resources and schedule tasks at a granularity that works for both streaming and batch jobs, combining the low‑latency benefits of streaming pipelines with the fault‑tolerance of batch blocking shuffles.
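A Pipeline Region is, in essence, a connected component of the job graph over its pipelined edges: vertices linked by pipelined data exchange must run together, while blocking edges cut the graph into independently schedulable pieces. A minimal sketch, with a hypothetical graph and our own function names:

```python
# Compute pipeline regions as connected components over 'pipelined' edges,
# using union-find. 'blocking' edges separate regions. Illustrative only.

from collections import defaultdict

def pipeline_regions(vertices, edges):
    """edges: list of (src, dst, kind) where kind is 'pipelined' or 'blocking'.
    Returns regions as sorted lists of vertex names."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst, kind in edges:
        if kind == "pipelined":
            parent[find(src)] = find(dst)  # union endpoints of pipelined edges

    groups = defaultdict(list)
    for v in vertices:
        groups[find(v)].append(v)
    return sorted(sorted(g) for g in groups.values())

verts = ["source", "map", "agg", "sink"]
edges = [("source", "map", "pipelined"),
         ("map", "agg", "blocking"),      # blocking shuffle splits the job
         ("agg", "sink", "pipelined")]
print(pipeline_regions(verts, edges))  # [['agg', 'sink'], ['map', 'source']]
```

Scheduling region by region is what lets one scheduler cover both cases: a pure streaming job is one big region deployed at once, while a batch job with blocking shuffles becomes a sequence of smaller regions that can be scheduled, and restarted, independently.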
Unified Shuffle Architecture – Flink abstracts shuffle functionality into three components (Shuffle Master, Shuffle Writer, Shuffle Reader) that serve both streaming (in‑memory, low‑latency) and batch (disk‑based, durable) scenarios, eliminating duplicated development effort.
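The writer/reader split can be sketched as one interface with two backends: an in‑memory buffer standing in for the streaming path and a file standing in for the durable batch path. This is a hypothetical Python model; the real Shuffle Master/Writer/Reader components handle partitioning, credit‑based flow control, and much more.

```python
# Hedged sketch of a pluggable shuffle: one write/read contract, two
# implementations (memory for streaming, disk for batch). Names are ours.

import json
import os
import tempfile
from abc import ABC, abstractmethod

class Shuffle(ABC):
    @abstractmethod
    def write(self, record): ...
    @abstractmethod
    def read(self): ...

class MemoryShuffle(Shuffle):
    """Streaming flavor: records stay in memory for low-latency handoff."""
    def __init__(self):
        self.buffer = []
    def write(self, record):
        self.buffer.append(record)
    def read(self):
        return list(self.buffer)

class DiskShuffle(Shuffle):
    """Batch flavor: records persisted so downstream tasks can start later
    or re-read the data after a failure."""
    def __init__(self, path):
        self.path = path
    def write(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
    def read(self):
        with open(self.path) as f:
            return [json.loads(line) for line in f]

mem = MemoryShuffle()
disk = DiskShuffle(os.path.join(tempfile.mkdtemp(), "shuffle.jsonl"))
for sink in (mem, disk):
    sink.write({"k": "a", "v": 1})
print(mem.read(), disk.read())
```

Because producers and consumers code against the same contract, the engine can pick the backend per edge (low‑latency vs. durable) without duplicating operator logic.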
Fault‑Tolerance Strategies – New mechanisms such as Pipeline Region failover and JM failover (using operation logs) improve recovery for batch jobs while preserving exactly‑once guarantees for streaming jobs.
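The operation‑log idea behind JM failover is essentially event sourcing: state changes are appended to a durable log, and a restarted master rebuilds its view by replaying them. A deliberately simplified, hypothetical sketch:

```python
# Event-sourcing sketch of master recovery via an operation log.
# Op kinds and task names are invented for illustration.

def apply(state, op):
    """op: ('deploy', task) or ('finish', task). Returns new task-state map."""
    kind, task = op
    state = dict(state)  # keep it pure so replay is side-effect free
    state[task] = "RUNNING" if kind == "deploy" else "FINISHED"
    return state

# The live master appends each operation to a persisted log as it acts.
log = [("deploy", "t1"), ("deploy", "t2"), ("finish", "t1")]

# After a crash, the new master replays the log to recover the same view,
# so finished work (t1) is not re-executed.
recovered = {}
for op in log:
    recovered = apply(recovered, op)
print(recovered)  # {'t1': 'FINISHED', 't2': 'RUNNING'}
```

The payoff for batch jobs is that completed stages survive a master restart, while streaming jobs keep their separate checkpoint‑based exactly‑once recovery path.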
Future Outlook – Ongoing work focuses on next‑generation streaming architecture, adaptive scheduling, and broader community contributions to keep Flink at the forefront of unified big‑data processing.
DataFunTalk