Stability Optimization Practices for Flink Jobs at Tencent
This article presents Tencent's practical experience in improving Flink job stability. It covers the Oceanus platform, the stability challenges Flink jobs face, and concrete optimization techniques for reducing failures, minimizing their impact, accelerating recovery, and detecting issues proactively, followed by a summary and outlook.
Speaker: Qiu Congxian, Senior Development Engineer, Tencent
Organizer: DataFunTalk
Overview: The session focuses on practical stability optimization for Flink jobs, including the following topics:
Flink usage at Tencent
Introduction to Flink stability
Stability optimization practices
Summary & outlook
01 Flink Usage at Tencent
Flink is deployed on Tencent's internal one‑stop real‑time computing platform, Oceanus. Oceanus supports Jar, SQL, and canvas jobs; runs on Yarn and Kubernetes; stores state and data in HDFS; and reads input from message queues (MQ) while writing results back to MQ and other downstream services.
Typical application scenarios of Flink at Tencent are shown in the following diagrams.
02 Flink Stability Overview
The Flink runtime consists of a Master (scheduling), a resource pool (Yarn/Kubernetes), HA managed by Zookeeper, and Tasks that execute user logic. During checkpointing, task state is backed up to HDFS.
Master handles job scheduling
Resource pool: Yarn and K8s
HA: Zookeeper
Task runs user logic
Checkpoint copies state to HDFS
Stability can be affected by third‑party systems (Yarn, HDFS, Zookeeper) and Flink itself (control chain, data chain, back‑pressure, bugs, user logic errors).
Third‑party factors: Yarn container count & resources, HDFS storage & access, Zookeeper connections
Flink factors: Master‑Worker communication, checkpoint flow, data hotspots, back‑pressure, bugs, user errors
03 Stability Optimization Practices
Optimization is divided into three aspects:
Reduce failures – e.g., improve Zookeeper HA protocol, merge small checkpoint files to reduce HDFS RPCs.
Lower impact – limit the number of affected tasks (single‑task restart, Master failover without full job restart) and reduce downtime (accelerate job start).
Fast detection & recovery – automatic diagnosis system, proactive monitoring.
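The small-file merging idea above can be sketched in a few lines. This is an illustrative model, not Tencent's or Flink's actual implementation: many small checkpoint blobs are appended into one container file with an in-memory offset index, so HDFS sees one create/close RPC pair instead of one pair per blob.

```python
# Sketch (assumption, not the real implementation): merge many small
# checkpoint blobs into one physical file to cut HDFS RPC traffic.

class MergedCheckpointFile:
    """Append many small state blobs into a single physical file."""

    def __init__(self):
        self.data = bytearray()   # stands in for one HDFS file
        self.index = {}           # blob name -> (offset, length)
        self.rpc_calls = 2        # one create + one close, amortized

    def write_blob(self, name: str, blob: bytes) -> None:
        self.index[name] = (len(self.data), len(blob))
        self.data.extend(blob)    # no extra RPC per blob

    def read_blob(self, name: str) -> bytes:
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


def rpc_calls_unmerged(num_blobs: int) -> int:
    # Without merging, each small file costs its own create + close RPC.
    return 2 * num_blobs


merged = MergedCheckpointFile()
for i in range(1000):
    merged.write_blob(f"task-{i}", b"state-bytes")

print(rpc_calls_unmerged(1000), "->", merged.rpc_calls)  # 2000 -> 2
```

The key design point is that the reduction is purely in metadata operations: the total bytes written are unchanged, but the NameNode handles two RPCs instead of thousands.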
(1) Reduce failures – Optimizing Zookeeper HA
Before optimization, every TaskManager kept a long‑lived connection to Zookeeper, causing high connection counts. After optimization, only the Master maintains a Zookeeper connection; TaskManagers disconnect, dramatically reducing Zookeeper load.
The master‑failover process is illustrated below.
After the new Master registers with Zookeeper, TaskManagers detect the heartbeat timeout, briefly reconnect to Zookeeper to obtain the new Master address, and then drop their Zookeeper connections again.
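A toy model (assumed formula, not Oceanus code) makes the connection-count reduction concrete: before the optimization every TaskManager holds a long-lived Zookeeper session, afterwards only each job's Master does.

```python
# Toy model of Zookeeper session counts before/after the HA change.
# The cluster shape (jobs, TaskManagers per job) is an assumption.

def zk_connections_before(num_jobs: int, tms_per_job: int) -> int:
    # Every TaskManager plus the Master of each job holds a session.
    return num_jobs * (tms_per_job + 1)


def zk_connections_after(num_jobs: int) -> int:
    # Only the Master keeps a session; TaskManagers learn the Master
    # address during failover and then drop their connection.
    return num_jobs


print(zk_connections_before(100, 50))  # 5100
print(zk_connections_after(100))       # 100
```

For a cluster with many large jobs, the session count shrinks from O(jobs × TaskManagers) to O(jobs), which is what relieves the Zookeeper load described above.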
(2) Lower impact – Single‑task restart
When a task fails, only that task is restarted while other tasks continue running. This avoids global failover and reduces data loss. Diagrams show the workflow and latency differences between global and single‑task restarts.
Experimental results indicate that single‑task restart reduces the perceived data interruption from 138 seconds to near zero, though it is not yet fully loss‑less: a small amount of in‑flight data may be lost during the restart.
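Tencent's single‑task restart is a custom refinement, but open‑source Flink ships a coarser built‑in analogue: region failover, which restarts only the failed pipelined region rather than the whole job. It can be enabled in `flink-conf.yaml` (shown here as a reference point, not as Tencent's implementation):

```yaml
# flink-conf.yaml — restart only the failed pipelined region instead of
# the entire job graph. This is open-source Flink's built-in strategy;
# single-task restart as described in the talk is finer-grained still.
jobmanager.execution.failover-strategy: region
```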
(3) Fast recovery – Startup acceleration
Three bottlenecks are identified: Master RPC handling, downloading of container files, and container provisioning. Optimizations include eliminating unnecessary RPCs, merging small dependency files, and keeping extra standby containers pre‑provisioned. Job recovery time drops from roughly 200 seconds to 48 seconds.
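A back-of-the-envelope cost model shows why attacking all three bottlenecks matters. All numbers below are illustrative placeholders, not the measured figures from the talk:

```python
# Illustrative startup-time model: the three bottlenecks sum, so each
# optimization removes the bulk of one term. Numbers are assumptions.

def startup_seconds(rpc_count: int, rpc_ms: float,
                    file_count: int, fetch_ms: float,
                    container_wait_s: float) -> float:
    """Master RPC handling + dependency fetching + container wait."""
    return (rpc_count * rpc_ms / 1000
            + file_count * fetch_ms / 1000
            + container_wait_s)


# Before: many redundant RPCs, hundreds of small jars, cold containers.
before = startup_seconds(rpc_count=20000, rpc_ms=5,
                         file_count=500, fetch_ms=100,
                         container_wait_s=30)

# After: RPCs deduplicated, jars merged into a few bundles,
# standby containers already provisioned.
after = startup_seconds(rpc_count=2000, rpc_ms=5,
                        file_count=5, fetch_ms=100,
                        container_wait_s=0)

print(f"{before:.1f}s -> {after:.1f}s")
```

The point of the model is that no single fix suffices: with three additive terms of similar magnitude, only reducing all three yields the several-fold speedup reported.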
(4) Proactive issue detection
Traditional reactive alerts (heartbeat timeout, checkpoint expiry) require manual log and metric inspection. Tencent built an integrated diagnostics system combining logs, metrics, and traces to automatically pinpoint root causes (e.g., OOM killer, slow synchronous snapshot) and suggest remedies.
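The core of such a diagnoser can be sketched as a rule table that maps log patterns to a root cause and a suggested remedy. The rules and remedies below are hypothetical examples, not Tencent's actual rule set:

```python
# Hypothetical rule-based diagnoser sketch: patterns over log lines map
# to (root cause, suggested fix). Rules here are illustrative only.

RULES = [
    ("Killed process",
     "Container killed by the OS OOM killer",
     "Increase container memory or lower the managed-memory fraction"),
    ("Checkpoint expired",
     "Checkpoint timed out, likely due to back-pressure",
     "Inspect back-pressured operators and data hotspots"),
    ("sync part of the snapshot",
     "Slow synchronous snapshot phase",
     "Consider an incremental/asynchronous state backend configuration"),
]


def diagnose(log_lines):
    """Return (root_cause, suggestion) pairs for matching log lines."""
    findings = []
    for line in log_lines:
        for pattern, cause, fix in RULES:
            if pattern in line:
                findings.append((cause, fix))
    return findings


logs = ["2024-05-01 12:00:01 WARN Killed process 4321 (java)"]
for cause, fix in diagnose(logs):
    print(cause, "->", fix)
```

A production system would additionally correlate metrics and traces with the log evidence, but the log-pattern table already captures the shift from "alert, then read logs by hand" to "alert with a root cause attached".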
04 Summary & Outlook
We addressed stability from three angles: reducing failures, lowering impact, and fast detection & recovery. Future work includes loss‑less single‑task restart, rapid state recovery for large states, and further automation of diagnosis and remediation.
Thank you for attending.