Big Data 18 min read

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.

Shopee Tech Team
Shopee Tech Team
Shopee Tech Team
Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Flink is a benchmark stream‑processing engine that guarantees exactly‑once semantics through Checkpoint and State. In production, Shopee encountered many Checkpoint problems and explored the community’s Unaligned Checkpoint (UC) mechanism, eventually extending it internally and contributing the improvements back to the Flink project.

This article introduces the issues with traditional (aligned) Checkpoint, the principle of Unaligned Checkpoint, Shopee’s enhancements, community contributions, and practical deployment experience.

1. Problems with Checkpoint

Severe back‑pressure often causes Checkpoint timeout failures. During high‑traffic periods, external query or write bottlenecks, CPU limits, and data skew can lead to continuous Checkpoint failures.

Consequences include:

Repeated consumption of lagged data, wasting resources.

Potential dead‑loops when the tolerated‑failed‑checkpoints count is low.

Connector‑specific issues (e.g., Kafka producer transaction timeouts) when buffers cannot be flushed.

Large amounts of duplicate data after job restarts.

2. Unaligned Checkpoint Principle

When data flow is slow, an Aligned Checkpoint (AC) stalls because the Barrier queues in the data stream. UC allows the Barrier to “overtake” data buffers, reaching the sink quickly.

UC Core Idea: If the data stream is slow, the Barrier bypasses buffers so it can travel from source to sink without waiting for data to be processed.

3. Detailed UC Flow

When a Task receives a Barrier on any InputChannel, it immediately starts the UC synchronization phase, without waiting for other channels or processing the buffered data.

During the synchronization phase the Task:

Barriers overtake all input and output buffers.

Snapshots the overtaken buffers.

Calls the operator’s snapshotState method.

Flink engine snapshots the operator’s internal State.

After synchronization, the Task continues processing data while the asynchronous phase writes the snapshot and buffers to HDFS.

4. Practical Issues and Shopee’s Solutions

4.1 Overdraft Buffer

When a Task needs multiple buffers for a single record, it may still block. Shopee proposed an “overdraft buffer” where the Task can borrow buffers from the TaskManager if the local pool is empty.

Configuration example:

taskmanager.network.memory.max-overdraft-buffers-per-gate=5

This allows up to five borrowed buffers per gate, preventing UC stalls in multi‑buffer scenarios.

4.2 Legacy Source Improvement

Legacy Sources push data downstream without checking OutputBufferPool, causing UC failures. Shopee modified Legacy Sources to check for free buffers before emitting data, similar to the newer pull‑based sources.

4.3 Aligned Checkpoint Timeout

To avoid the risk of always using UC, an AC timeout mechanism was introduced: start with AC, and if it does not finish within a configured time (e.g., 1 minute), switch to UC.

aligned-checkpoint-timeout=1min

This reduces UC‑related risks while still benefiting from UC under heavy back‑pressure.

4.4 Output Buffer Switching

Output buffers previously could not switch from AC to UC. Shopee added a timer‑based approach: if a Barrier remains in the output buffer beyond the timeout, it is moved to the buffer head and snapshot together with the overtaken buffers.

4.5 Small‑File Merging

UC can generate a large number of small HDFS files (one per sub‑task). Shopee introduced file sharing:

channel-state.number-of-tasks-share-file=5

Five Tasks share a single UC file, reducing the file count by fivefold. Thread‑safety is ensured by locking the shared file stream.

4.6 Network Buffer Deadlock Fix

A deadlock in network‑buffer reclamation (FLINK‑22946) was resolved, further stabilizing UC deployments.

5. Production Practice at Shopee

Shopee sets aligned-checkpoint-timeout to 1 minute, enabling AC by default and falling back to UC only when necessary. The platform provides a UI switch for UC, and hundreds of Flink jobs now run with UC enabled, showing stable behavior even during spikes.

6. Future Plans

After months of stable operation, Shopee aims to enable UC for all jobs under the AC‑timeout policy and to allocate dedicated network memory for overdraft buffers.

Authors: Rui and Guichao, Shopee Data Infrastructure Team.

Flinkstream processingBig DataShopeeCheckpoint OptimizationUnaligned Checkpoint
Shopee Tech Team
Written by

Shopee Tech Team

How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.