Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee
Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.
Flink is a benchmark stream‑processing engine that guarantees exactly‑once semantics through Checkpoint and State. In production, Shopee encountered many Checkpoint problems and explored the community’s Unaligned Checkpoint (UC) mechanism, eventually extending it internally and contributing the improvements back to the Flink project.
This article introduces the issues with traditional (aligned) Checkpoint, the principle of Unaligned Checkpoint, Shopee’s enhancements, community contributions, and practical deployment experience.
1. Problems with Checkpoint
Severe back‑pressure often causes Checkpoint timeout failures. During high‑traffic periods, external query or write bottlenecks, CPU limits, and data skew can lead to continuous Checkpoint failures.
Consequences include:
Repeated consumption of lagged data, wasting resources.
Potential dead‑loops when the tolerated‑failed‑checkpoints count is low.
Connector‑specific issues (e.g., Kafka producer transaction timeouts) when buffers cannot be flushed.
Large amounts of duplicate data after job restarts.
2. Unaligned Checkpoint Principle
When data flow is slow, an Aligned Checkpoint (AC) stalls because the Barrier queues in the data stream. UC allows the Barrier to “overtake” data buffers, reaching the sink quickly.
UC Core Idea: If the data stream is slow, the Barrier bypasses buffers so it can travel from source to sink without waiting for data to be processed.
3. Detailed UC Flow
When a Task receives a Barrier on any InputChannel, it immediately starts the UC synchronization phase, without waiting for other channels or processing the buffered data.
During the synchronization phase the Task:
Barriers overtake all input and output buffers.
Snapshots the overtaken buffers.
Calls the operator’s snapshotState method.
Flink engine snapshots the operator’s internal State.
After synchronization, the Task continues processing data while the asynchronous phase writes the snapshot and buffers to HDFS.
4. Practical Issues and Shopee’s Solutions
4.1 Overdraft Buffer
When a Task needs multiple buffers for a single record, it may still block. Shopee proposed an “overdraft buffer” where the Task can borrow buffers from the TaskManager if the local pool is empty.
Configuration example:
taskmanager.network.memory.max-overdraft-buffers-per-gate=5This allows up to five borrowed buffers per gate, preventing UC stalls in multi‑buffer scenarios.
4.2 Legacy Source Improvement
Legacy Sources push data downstream without checking OutputBufferPool, causing UC failures. Shopee modified Legacy Sources to check for free buffers before emitting data, similar to the newer pull‑based sources.
4.3 Aligned Checkpoint Timeout
To avoid the risk of always using UC, an AC timeout mechanism was introduced: start with AC, and if it does not finish within a configured time (e.g., 1 minute), switch to UC.
aligned-checkpoint-timeout=1minThis reduces UC‑related risks while still benefiting from UC under heavy back‑pressure.
4.4 Output Buffer Switching
Output buffers previously could not switch from AC to UC. Shopee added a timer‑based approach: if a Barrier remains in the output buffer beyond the timeout, it is moved to the buffer head and snapshot together with the overtaken buffers.
4.5 Small‑File Merging
UC can generate a large number of small HDFS files (one per sub‑task). Shopee introduced file sharing:
channel-state.number-of-tasks-share-file=5Five Tasks share a single UC file, reducing the file count by fivefold. Thread‑safety is ensured by locking the shared file stream.
4.6 Network Buffer Deadlock Fix
A deadlock in network‑buffer reclamation (FLINK‑22946) was resolved, further stabilizing UC deployments.
5. Production Practice at Shopee
Shopee sets aligned-checkpoint-timeout to 1 minute, enabling AC by default and falling back to UC only when necessary. The platform provides a UI switch for UC, and hundreds of Flink jobs now run with UC enabled, showing stable behavior even during spikes.
6. Future Plans
After months of stable operation, Shopee aims to enable UC for all jobs under the AC‑timeout policy and to allocate dedicated network memory for overdraft buffers.
Authors: Rui and Guichao, Shopee Data Infrastructure Team.
Shopee Tech Team
How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.